[ 
https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866080#comment-13866080
 ] 

Lewis John McGibbney edited comment on TIKA-1208 at 1/9/14 12:35 AM:
---------------------------------------------------------------------

Hi [~p_ansell], I have been working on a patch for this issue... which I did 
not wish to push to Jira... however I've been taken off course by bugs in a 
Gora branch and I would like us all (the Any23 team) to propose this... if 
possible.

I attach a patch for migrating the Any23 mime package to Tika which retains the 
Purifier concept of cleaning documents prior to them being processed for 
mime/mediaType detection. I've not touched the Tika API or the Detect API 
within this implementation as (I personally think) it would be harder 
to succeed in the code migration if we attempt to change the well-known and 
well-designed 'detect' and base 'Tika' APIs, e.g. by adding a Purifier 
parameter to method signatures.

This therefore means that if we are to retain the concept of the Purifier 
interface, then implementations are detector-specific... right now all we can 
offer (from Any23) is the WhiteSpacePurifier, which is OK... but implementing 
the functionality in this manner is NOT configurable, e.g. someone cannot 
pass a custom Purifier as a parameter to detect(InputStream, Metadata, 
Purifier). I personally think that if other Purifiers were to be introduced 
then we could revisit this issue and possibly propose a change to various Tika 
interfaces so that detectors accept a Purifier as a parameter.
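
To make the trade-off concrete, here is a minimal sketch of the two API shapes 
discussed above. Everything in it is hypothetical and simplified for 
illustration: the Purifier and SimpleDetector interfaces work on plain strings 
and are NOT the Any23 or Tika types, and the "@prefix" rule is a toy stand-in 
for real detection.

```java
import java.util.Objects;

public class PurifierApiSketch {

    /** Hypothetical stand-in for the Purifier concept (string form only). */
    interface Purifier {
        String purify(String content);
    }

    /** Hypothetical, simplified detector; not Tika's Detector interface. */
    interface SimpleDetector {
        String detect(String content);
    }

    /** Toy detection rule used only for illustration. */
    static final SimpleDetector PREFIX_DETECTOR = content ->
            content.startsWith("@prefix") ? "text/turtle" : "application/octet-stream";

    /** Option A: the purifier is fixed inside the detector, so detect() keeps its signature. */
    static final class PurifyingDetector implements SimpleDetector {
        private final Purifier purifier;
        private final SimpleDetector delegate;

        PurifyingDetector(Purifier purifier, SimpleDetector delegate) {
            this.purifier = Objects.requireNonNull(purifier);
            this.delegate = Objects.requireNonNull(delegate);
        }

        @Override
        public String detect(String content) {
            // Clean first, then hand the cleaned content to the wrapped detector.
            return delegate.detect(purifier.purify(content));
        }
    }

    /** Option B: a hypothetical overload that lets the caller pick the purifier per call. */
    static String detect(SimpleDetector detector, String content, Purifier purifier) {
        return detector.detect(purifier.purify(content));
    }

    public static void main(String[] args) {
        Purifier stripLeadingBlanks = s -> s.replaceFirst("^\\s+", "");
        String doc = "\n\n  @prefix ex: <http://example.org/> .";

        System.out.println(PREFIX_DETECTOR.detect(doc)); // application/octet-stream
        System.out.println(new PurifyingDetector(stripLeadingBlanks, PREFIX_DETECTOR).detect(doc)); // text/turtle
        System.out.println(detect(PREFIX_DETECTOR, doc, stripLeadingBlanks)); // text/turtle
    }
}
```

Option A leaves the existing method signatures untouched but ties the 
Purifier to the detector; Option B is configurable per call but requires an 
interface change.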

Apart from that, this (WIP) patch introduces an Any23Detector which basically 
stems from the TikaMIMETypeDetector we maintained in Any23... please comment on 
this as I am not sure if this is the right way to proceed... there are most 
likely issues with the implementation I have coded.

THIS PATCH IS MERELY A START... I would really appreciate input from the Any23 
team to see if I am 'attempting' to implement the Any23 mime code in the 
correct way that we think is suitable for migration to tika-core.

It should also be noted that the last time I ran this patch with Tika trunk 
there were issues with detection of 'semantic' mime types. 

Hopefully this is a start which we can build from. I am committed to getting 
this code suitable for proposal to Tika.

N.B. This patch also addresses ALL the Java elements that cause warnings 
within the entire codebase, so it looks like a lot more than it actually 
is.

Any comments are VERY much appreciated.   


was (Author: lewismc):
Hi [~p_ansell], I have been working on a patch for this issue... which I did 
not wish to push to Jira... however I've been taken off course by bugs in a 
Gora branch.

I attach a patch for migrating the Any23 mime package to Tika which retains the 
Purifier concept of cleaning documents prior to them being processed for 
mime/mediaType detection. I've not touched the Tika API or the Detect API 
within this implementation as (I personally think) it would be harder 
to succeed in the code migration if we attempt to change the well-known and 
well-designed 'detect' and base 'Tika' APIs.

This therefore means that Purifier implementations are detector-specific... 
right now all we can offer is the WhiteSpacePurifier, which is OK... but it is 
NOT configurable, e.g. if someone wished to pass a Purifier as a parameter to 
detect(InputStream, Metadata, Purifier)... and I think that if other 
Purifiers were to be introduced then we could revisit this issue.

Apart from that, this (WIP) patch introduces an Any23Detector which basically 
stems from the Tika detector we maintained in Any23... please comment on this 
as I am not sure if this is the right way to proceed...

THIS PATCH IS MERELY A START... I need input from the Any23 team to see if I am 
'attempting' to implement the Any23 mime code in the correct way.

It should also be noted that the last time I ran this patch with Tika trunk 
there were issues with detection of 'semantic' mime types.

Hopefully this is a start which we can build from. I am committed to getting 
this code suitable for proposal to Tika.

N.B. This patch also addresses ALL the Java elements that cause warnings 
within the entire codebase. 

Any comments are VERY much appreciated.

> Migrate Any23 mime contributions to Tika
> ----------------------------------------
>
>                 Key: TIKA-1208
>                 URL: https://issues.apache.org/jira/browse/TIKA-1208
>             Project: Tika
>          Issue Type: Sub-task
>          Components: mime
>            Reporter: Lewis John McGibbney
>             Fix For: 1.5
>
>         Attachments: TIKA-1208.patch
>
>
> We begin with one of the most obvious areas in which there
> is overlap.
> In short, the appeal of this package is the addition of detection 
> for the following types:
>  - text/n3
>  - text/rdf+n3
>  - application/n3
>  - text/x-nquads
>  - text/rdf+nq
>  - text/nq
>  - application/nq
>  - text/turtle
>  - application/x-turtle
>  - application/turtle
>  - application/trix
>  
> Therefore, although both Tika and Any23 perform Mimetype-related
> tasks, there is a contribution to be made. This involves the transfer of
> code pertaining to pattern recognition, Mimetype XML definitions within 
> tika-mimetypes.xml and a Purifier implementation that removes any 
> blank characters at the head of a file that might 
> prevent its MIME Type detection.
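
As an illustration of the Purifier idea described in the issue, the following 
is a minimal sketch of skipping leading blank bytes on a stream before 
detection. It is not the Any23 WhiteSpacePurifier code: the class and method 
names are hypothetical, it assumes a stream that supports mark/reset and an 
ASCII-compatible encoding, and the "@prefix" check is a toy stand-in for real 
MIME detection.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LeadingBlankPurifierSketch {

    /**
     * Advances the stream past leading whitespace bytes so that the next
     * read() returns the first non-blank byte. Requires mark/reset support.
     */
    static void skipLeadingBlanks(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("InputStream must support mark/reset");
        }
        while (true) {
            in.mark(1);                 // remember the position before the next byte
            int b = in.read();
            if (b == -1) {
                return;                 // stream was empty or entirely blank
            }
            if (!Character.isWhitespace(b)) {
                in.reset();             // put the first non-blank byte back
                return;
            }
        }
    }

    /** Toy check: does the stream start with the given ASCII prefix? Leaves the position unchanged. */
    static boolean startsWith(InputStream in, String magic) throws IOException {
        byte[] expected = magic.getBytes(StandardCharsets.US_ASCII);
        byte[] actual = new byte[expected.length];
        in.mark(expected.length);
        int n = in.readNBytes(actual, 0, actual.length);
        in.reset();
        return n == expected.length && Arrays.equals(expected, actual);
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = "\n \t@prefix ex: <http://example.org/> ."
                .getBytes(StandardCharsets.US_ASCII);
        InputStream in = new ByteArrayInputStream(doc);

        System.out.println(startsWith(in, "@prefix")); // false: blanks hide the prefix
        skipLeadingBlanks(in);
        System.out.println(startsWith(in, "@prefix")); // true: prefix is now at the head
    }
}
```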



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
