[jira] [Comment Edited] (TIKA-2434) Language detection slow, cpu intensive, CLI interrupts work

Tim Allison (JIRA) Tue, 01 Aug 2017 09:46:28 -0700

    [ https://issues.apache.org/jira/browse/TIKA-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16109266#comment-16109266 ]


Tim Allison edited comment on TIKA-2434 at 8/1/17 4:45 PM:
-----------------------------------------------------------

1) Great!  [~chrismattmann], do you have recommendations for adding headless 
mode to the brew script?  Can anyone see any fallout from running Tika in 
headless mode?  I should probably run Tika headless against our regression 
corpus to see if there are any diffs.

2) In TIKA-2374, [~gagravarr] requested that this be added for the -z option.  
However, I thought it would be bizarre for a user to be able to extract all 
images but then not get text via OCR on those images.  [~gagravarr], should I 
back off and do just this: extract inline images only for -z, but not for text 
extraction?  Or should we leave this as is?  

So that I understand, you want to run OCR on regular "attachment" images inside 
PDFs but not on their inline images?
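
For reference on point 1, here is a minimal sketch of what "headless" means at 
the JVM level. It assumes the standard java.awt.headless system property and 
is not taken from Tika or from the brew script; the class name HeadlessCheck 
and the method enableHeadless are hypothetical names used only for 
illustration. The same property can also be passed on the java command line as 
-Djava.awt.headless=true.

```java
import java.awt.GraphicsEnvironment;

// Hypothetical helper, not part of Tika: shows how a JVM is switched to
// headless mode before any AWT code runs.
public class HeadlessCheck {

    // Sets the standard java.awt.headless property and reports whether
    // AWT now considers the environment headless. The property has to be
    // set before AWT is first queried, because the value is cached.
    static boolean enableHeadless() {
        System.setProperty("java.awt.headless", "true");
        return GraphicsEnvironment.isHeadless();
    }

    public static void main(String[] args) {
        System.out.println("headless = " + enableHeadless());
    }
}
```

Whether this is enough to stop the focus stealing described in this issue would 
still need to be confirmed on OS X.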


was (Author: [email protected]):
1) Great!  [~chrismattmann], recommendations for adding headless to the brew 
script?  Can anyone see any fall-out from running tika in headless mode?  I 
should probably run tika headless against our regression corpus to see if there 
are any diffs.

2) In TIKA-2374, [~gagravarr] requested that this be added for -z option.  
However, I thought it would be bizarre for a user to be able to extract all 
images, but then not get text via OCR on those images.  [~gagravarr], should I 
back-off and do just this: extract inline images only for -z but not for text 
extraction?  Or, should we leave this as is?  

So that I understand, you want to run OCR on the PDFs but not on their inline 
images?

> Language detection slow, cpu intensive, CLI interrupts work
> -----------------------------------------------------------
>
>                 Key: TIKA-2434
>                 URL: https://issues.apache.org/jira/browse/TIKA-2434
>             Project: Tika
>          Issue Type: Bug
>          Components: cli
>    Affects Versions: 1.16
>         Environment: OS X 10.11.6, JRE 1.8.0_25
>            Reporter: Stefan Karner
>
> Since version 1.16, when using tika -l FILE, it takes a lot longer than e.g. 
> 1.15.
> Also, when batch processing a bunch of files in the background, the Java 
> runtime icon pops up when processing the next file, stealing the input focus 
> from whatever other application I'm currently working on, thus constantly 
> interrupting my work.
> Also, the Java runtime uses from 100% to 400% CPU when executing Tika.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
