[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

Dave Meikle (JIRA) Wed, 19 Nov 2014 02:22:08 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217685#comment-14217685
 ]


Dave Meikle commented on TIKA-1445:
-----------------------------------

bq. Hey Guys, to be honest, the way I see that we solve the ServiceLoading 
problem is somehow to move away from it. Relying on the JVM to implicitly 
decide which parser to load based on ClassLoading is not scalable IMO. At 
worst, even capturing an ordered preference file that isn't ServiceLoading is 
1000x better IMO than relying on the JVM and the classpath. We need somehow to 
bring this logic into Tika (still thinking about how and will try to prototype 
something).

+1 - I think this is an example of something we will probably hit more and 
more as we further extend Tika, i.e. wanting multiple parsers to register an 
interest in, and then parse, content of the same mime type. Moving away from 
the re-ordering approach seems like the only way to go here.

_ServiceLoading_ per se is not a problem; indeed, it is a nice way to make it 
simple for external providers to be added. But I think we need to think about 
Parsers as a pipeline and allow users to customise which parsers participate 
in the pipeline through explicit exclusions via config.
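
As a rough sketch of the pipeline idea (not Tika API: the NamedParser 
interface, the parser names and the exclusion set are all hypothetical 
stand-ins for whatever the config would supply), selection could look 
something like this:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ParserPipelineSketch {

    /** Hypothetical stand-in for a parser; NOT org.apache.tika.parser.Parser. */
    interface NamedParser {
        String name();
        boolean supports(String mimeType);
    }

    /**
     * Returns, in the configured order, every parser that supports the
     * mime type and has not been excluded by name.
     */
    static List<NamedParser> select(List<NamedParser> ordered,
                                    Set<String> excluded,
                                    String mimeType) {
        List<NamedParser> selected = new ArrayList<>();
        for (NamedParser p : ordered) {
            if (p.supports(mimeType) && !excluded.contains(p.name())) {
                selected.add(p);
            }
        }
        return selected;
    }

    /** Small factory for illustrative parsers (names and types are made up). */
    static NamedParser parser(final String name, final String... types) {
        final Set<String> supported = new HashSet<>(Arrays.asList(types));
        return new NamedParser() {
            public String name() { return name; }
            public boolean supports(String mimeType) {
                return supported.contains(mimeType);
            }
        };
    }

    public static void main(String[] args) {
        List<NamedParser> ordered = Arrays.asList(
                parser("ocr", "image/png", "image/jpeg"),
                parser("image-metadata", "image/png"),
                parser("jpeg-metadata", "image/jpeg"));
        // Assumed to come from user config: parsers to leave out of the pipeline.
        Set<String> excluded = new HashSet<>(Arrays.asList("ocr"));
        for (NamedParser p : select(ordered, excluded, "image/png")) {
            System.out.println(p.name());
        }
    }
}
```

The point of the sketch is that every interested parser stays in the 
pipeline unless the user excludes it, so nothing depends on classpath order.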

The above is a big change, and I think if we went with something like this it 
would need to be a 2.X release of Tika.

I suspect the problem with clashing Metadata entries is not really there, as 
most parsers look for different keys, or, in cases where they process common 
ones (e.g. title, size, description, etc.), they should hopefully be getting 
the same value anyway. I think we could send the same Metadata object through 
the 'pipeline', adding any unique new value in for a key.
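
To make the "add any unique new value" idea concrete, here is a minimal 
sketch. It assumes a simple multi-valued map rather than the real Tika 
Metadata class; SimpleMetadata, addIfNew and the keys used are illustrative 
names only:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SharedMetadataSketch {

    /** Hypothetical multi-valued holder; NOT org.apache.tika.metadata.Metadata. */
    static class SimpleMetadata {
        private final Map<String, List<String>> values = new LinkedHashMap<>();

        /** Adds the value under the key unless that exact value is already there. */
        void addIfNew(String key, String value) {
            List<String> existing = values.computeIfAbsent(key, k -> new ArrayList<>());
            if (!existing.contains(value)) {
                existing.add(value);
            }
        }

        /** Returns the values recorded for the key (empty if none). */
        List<String> get(String key) {
            return values.getOrDefault(key, new ArrayList<>());
        }
    }

    public static void main(String[] args) {
        // One object is shared by every parser in the pipeline.
        SimpleMetadata shared = new SimpleMetadata();

        // First (hypothetical) parser contributes its keys.
        shared.addIfNew("title", "Scan 42");
        shared.addIfNew("width", "1024");

        // Second (hypothetical) parser sees the same title and adds a new key.
        shared.addIfNew("title", "Scan 42");
        shared.addIfNew("language", "eng");

        System.out.println("title=" + shared.get("title"));
        System.out.println("language=" + shared.get("language"));
    }
}
```

If two parsers agree on a common key, nothing is duplicated; if they 
disagree, both values are kept so the caller can see the difference.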

Will join the party and try to flesh out thoughts on a branch.

bq. 3) It is a good idea to identify which parser produced each content with a 
<div> tag.

+1 - this will be really helpful.
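
A minimal sketch of what that tagging could look like, assuming the parser 
name goes in a class attribute. The attribute choice and the wrap/escape 
helpers are my assumptions, not anything agreed on this issue:

```java
public class ParserDivSketch {

    /** Escapes the characters that are unsafe in XHTML text and attributes. */
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    /** Wraps one parser's content in a div labelled with that parser's name. */
    static String wrap(String parserName, String content) {
        return "<div class=\"" + escape(parserName) + "\">"
                + escape(content) + "</div>";
    }

    public static void main(String[] args) {
        // "ocr" and "image-metadata" are illustrative parser names.
        System.out.println(wrap("ocr", "Text recognised in the image"));
        System.out.println(wrap("image-metadata", "width < 2000"));
    }
}
```

Consumers could then pick out, or ignore, the output of a given parser by 
looking at the div label.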

> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.8
>
>         Attachments: TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, 
> TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
