Hi all,

I need some guidance on how to drive TIKA-2970 to conclusion.  I’ve created a 
unit test demonstrating that when you configure Tesseract via tika-config 
properties, the TesseractOCRConfig is ignored by other Parsers when Tika runs 
on the command line, but not when Tika runs as a Server process ;-)

I’m hoping to get this fix in for 1.23, since having everything in one config 
file, without the separate .properties file, would make my deployment life much 
simpler!

1) The approach I took loosely mimics the extractInlineImagesFromPDFS() 
method: I added another check next to it:
https://github.com/apache/tika/pull/291/files#diff-6cece27f460f1f26d0bda270557e2bc5R200
Is this the best way?  I feel like one of the initialization methods should 
have worked, but I could never seem to get access to the context object 
to put my custom config into it.
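
To make question 1 concrete, here is a rough, self-contained sketch of the kind 
of check I mean. It is not the code from the PR. OcrConfig and Context are 
hypothetical stand-ins made up for illustration (they are not Tika's 
TesseractOCRConfig or ParseContext), and applyConfiguredDefault is an invented 
name. The assumption is that the check boils down to "if nothing has put a 
config into the context yet, register the one that came from the config file":

```java
import java.util.HashMap;
import java.util.Map;

public class ContextSketch {

    /** Hypothetical stand-in for an OCR config object (not Tika's class). */
    static class OcrConfig {
        final String language;
        OcrConfig(String language) { this.language = language; }
    }

    /** Hypothetical stand-in for a class-keyed parse context (not Tika's class). */
    static class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) { entries.put(key, value); }

        // Returns null when nothing was registered under this key.
        <T> T get(Class<T> key) { return key.cast(entries.get(key)); }
    }

    /**
     * The extra check: only register the config loaded from the config file
     * when the caller has not already supplied one in the context.
     */
    static void applyConfiguredDefault(Context context, OcrConfig fromConfigFile) {
        if (context.get(OcrConfig.class) == null && fromConfigFile != null) {
            context.set(OcrConfig.class, fromConfigFile);
        }
    }

    public static void main(String[] args) {
        OcrConfig fromConfigFile = new OcrConfig("fra");

        // Nothing set by the caller: the configured value is picked up.
        Context empty = new Context();
        applyConfiguredDefault(empty, fromConfigFile);
        System.out.println(empty.get(OcrConfig.class).language);  // prints "fra"

        // Caller already supplied a config: it is left untouched.
        Context preset = new Context();
        preset.set(OcrConfig.class, new OcrConfig("deu"));
        applyConfiguredDefault(preset, fromConfigFile);
        System.out.println(preset.get(OcrConfig.class).language); // prints "deu"
    }
}
```

The open question is whether a check like this belongs at each call site, or 
whether an initialization hook should populate the context once.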

2) The unit test actually runs the Tesseract process.  Any thoughts on how to 
improve it so that it is less of an integration test?

3) I coded against the master branch rather than branch_1x.  Is that the right 
way to do this?

4) Lastly, would we want to support the fillMetadata logic 
(https://github.com/apache/tika/blob/master/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java#L307)
in the command line version as well?  I don’t need it, and it feels like it 
might complicate the parameters even more, but I’m happy to take a stab at it 
if we want.

Thanks!

Eric


_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | 
My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 
<https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
    
This e-mail and all contents, including attachments, is considered to be 
Company Confidential unless explicitly stated otherwise, regardless of whether 
attachments are marked as such.
