[jira] [Updated] (TIKA-93) OCR support

2015-04-06 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated TIKA-93:
-
Labels: memex  (was: )

 OCR support
 ---

 Key: TIKA-93
 URL: https://issues.apache.org/jira/browse/TIKA-93
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: memex
 Fix For: 1.7

 Attachments: Petr_tika-config.xml, TIKA-93.patch, TIKA-93.patch, 
 TIKA-93.patch, TIKA-93.patch, TesseractOCRParser.patch, 
 TesseractOCRParser.patch, TesseractOCR_Tyler.patch, 
 TesseractOCR_Tyler_v2.patch, TesseractOCR_Tyler_v3.patch, 
 TesseractOCR_Tyler_v4.patch, testOCR.docx, testOCR.pdf, testOCR.pptx


 I don't know of any decent open source pure Java OCR libraries, but there are 
 command line OCR tools like Tesseract 
 (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
 extract text content (where available) from image files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-93) OCR support

2014-09-18 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-93:

Attachment: TesseractOCR_Tyler_v4.patch

Thank you for the input! I attached a new patch (v4) which uses `junit.Assume` 
to ignore the tests if Tesseract is not installed and cleans up some of the 
Exception throwing. 

 OCR support
 ---

 Key: TIKA-93
 URL: https://issues.apache.org/jira/browse/TIKA-93
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.7

 Attachments: Petr_tika-config.xml, TIKA-93.patch, TIKA-93.patch, 
 TIKA-93.patch, TIKA-93.patch, TesseractOCRParser.patch, 
 TesseractOCRParser.patch, TesseractOCR_Tyler.patch, 
 TesseractOCR_Tyler_v2.patch, TesseractOCR_Tyler_v3.patch, 
 TesseractOCR_Tyler_v4.patch, testOCR.docx, testOCR.pdf, testOCR.pptx




[jira] [Updated] (TIKA-93) OCR support

2014-09-15 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-93:

Assignee: Chris A. Mattmann  (was: Tyler Palsulich)

 OCR support
 ---

 Key: TIKA-93
 URL: https://issues.apache.org/jira/browse/TIKA-93
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.7

 Attachments: Petr_tika-config.xml, TIKA-93.patch, TIKA-93.patch, 
 TIKA-93.patch, TIKA-93.patch, TesseractOCRParser.patch, 
 TesseractOCRParser.patch, TesseractOCR_Tyler.patch, 
 TesseractOCR_Tyler_v2.patch, testOCR.docx, testOCR.pdf, testOCR.pptx




[jira] [Updated] (TIKA-93) OCR support

2014-09-15 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-93:

Attachment: TesseractOCR_Tyler_v3.patch

Updated patch which passes all tests whether Tesseract is installed or not. I 
updated the review board, too. See https://reviews.apache.org/r/22402/.

Also, whoops, I hit a hotkey to assign the issue to me.

 OCR support
 ---

 Key: TIKA-93
 URL: https://issues.apache.org/jira/browse/TIKA-93
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.7

 Attachments: Petr_tika-config.xml, TIKA-93.patch, TIKA-93.patch, 
 TIKA-93.patch, TIKA-93.patch, TesseractOCRParser.patch, 
 TesseractOCRParser.patch, TesseractOCR_Tyler.patch, 
 TesseractOCR_Tyler_v2.patch, TesseractOCR_Tyler_v3.patch, testOCR.docx, 
 testOCR.pdf, testOCR.pptx




[jira] [Updated] (TIKA-93) OCR support

2014-08-22 Thread Petr Vas (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Petr Vas updated TIKA-93:
-

Attachment: Petr_tika-config.xml

Sure, here is the config.
The source code that I am currently using can be found here: 
https://github.com/datanav/tika/tree/ocr-tika-server (a fork of the 
Apache repo with a custom branch)

 OCR support
 ---

 Key: TIKA-93
 URL: https://issues.apache.org/jira/browse/TIKA-93
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.7

 Attachments: Petr_tika-config.xml, TIKA-93.patch, TIKA-93.patch, 
 TIKA-93.patch, TIKA-93.patch, TesseractOCRParser.patch, 
 TesseractOCRParser.patch, TesseractOCR_Tyler.patch, 
 TesseractOCR_Tyler_v2.patch, testOCR.docx, testOCR.pdf, testOCR.pptx




[jira] [Updated] (TIKA-93) OCR support

2014-06-09 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-93:


Attachment: TesseractOCR_Tyler_v2.patch

Minor updates to the patch: Moved the OCRParser to tika-parsers (unless others 
think it should be in tika-core?), moved the files from test-documents/ocr to 
just test-documents.
In PDFParserTest, I added testOCR.pdf to the list of known metadataDiff, since 
the PDF version is different for the NonSeq and Seq PDFBox parsers.

In tika-server TikaMimeTypesTest, I changed testGetJSON() -- will someone look 
at this part? Something seems weird about it.

There still needs to be a check for whether Tesseract is installed, and where. 
I looked a bit at the ExternalParser code -- it seems useful, but I'm not sure 
how to combine TesseractOCRParser and ExternalParser. Can someone else chime 
in? At this point, I don't think we need more than a call to 
ExternalParser.check(). But, I could be wrong.

In my opinion, we should just require that Tesseract be on the user's path. 
It's an uncommon program. So, if a user installs it, it will probably be *for* 
Tika OCR. So, it's not a big deal for them to put it on their path.
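
For illustration, a "is it on the path?" check can be sketched with the JDK alone. This is not the ExternalParser.check() mechanism mentioned above, only a minimal stand-in; the class name ExecutableLocator and the method findOnPath are hypothetical, and the sketch does not handle Windows extensions such as ".exe".

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not part of Tika.
class ExecutableLocator {

    /**
     * Returns the first executable regular file called {@code name} found in
     * the directories listed in {@code pathVar} (PATH syntax), if any.
     */
    static Optional<Path> findOnPath(String name, String pathVar) {
        if (pathVar == null || pathVar.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathVar.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty PATH entries
            }
            try {
                Path candidate = Paths.get(dir, name);
                if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                    return Optional.of(candidate);
                }
            } catch (InvalidPathException e) {
                // Malformed PATH entry: ignore it and keep looking.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> hit = findOnPath("tesseract", System.getenv("PATH"));
        System.out.println(hit.map(p -> "Found: " + p).orElse("tesseract is not on the PATH"));
    }
}
```

A parser or a test could consult such a check once and skip OCR (or skip the OCR tests) when nothing is found.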

I put up a review: https://reviews.apache.org/r/22402/. I don't think this is 
ready yet, but I'd like to get it moving.

 OCR support
 ---

 Key: TIKA-93
 URL: https://issues.apache.org/jira/browse/TIKA-93
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.7

 Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, 
 TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, 
 TesseractOCR_Tyler.patch, TesseractOCR_Tyler_v2.patch, testOCR.docx, 
 testOCR.pdf, testOCR.pptx




[jira] [Updated] (TIKA-93) OCR support

2014-05-29 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-93:


Attachment: TesseractOCR_Tyler.patch

Awesome! I attached another patch which includes TesseractOCRParser.patch with 
unit tests for the parser (PDF, PPTX, and DOCX files with embedded images with 
text). We could use more tests for images with no text, blurry text, and so on. 
But, I don't know how good Tesseract is.

Steps to apply this patch: install Tesseract \[1\], apply the patch, move the 
test files into tika-parsers/src/test/resources/test-documents/ocr. Run the 
tests with {{mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest 
-DfailIfNoTests=false}}.

What needs to happen from here? How should we include Tesseract in the sources? 
How should we handle timeouts (give the user a warning that OCR can be 
slow/timed out)?

\[1\] - [https://code.google.com/p/tesseract-ocr/wiki/ReadMe]

 OCR support
 ---

 Key: TIKA-93
 URL: https://issues.apache.org/jira/browse/TIKA-93
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.6

 Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, 
 TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, 
 TesseractOCR_Tyler.patch, testOCR.docx, testOCR.pdf, testOCR.pptx




[jira] [Updated] (TIKA-93) OCR support

2014-02-23 Thread Luis Filipe Nassif (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Filipe Nassif updated TIKA-93:
---

Attachment: TesseractOCRParser.patch

Patch with first version of a tesseract-ocr based OCRParser, with simple 
timeout control.

 OCR support
 ---

 Key: TIKA-93
 URL: https://issues.apache.org/jira/browse/TIKA-93
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.6

 Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, 
 TIKA-93.patch, TesseractOCRParser.patch, testOCR.docx, testOCR.pdf, 
 testOCR.pptx




[jira] [Updated] (TIKA-93) OCR support

2014-02-23 Thread Luis Filipe Nassif (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Filipe Nassif updated TIKA-93:
---

Attachment: TesseractOCRParser.patch

Better timeout control using FutureTask
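
The patch itself is not reproduced in this thread, so the following is only a sketch of the general FutureTask timeout pattern, not the code in TesseractOCRParser.patch. The class name TimedCall, the method callWithTimeout, and the "ocr-worker" thread name are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not taken from the attached patch.
class TimedCall {

    /**
     * Runs {@code task} on a daemon worker thread and returns its result, or
     * {@code fallback} if it does not finish within {@code timeoutMillis}.
     */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        FutureTask<T> future = new FutureTask<>(task);
        Thread worker = new Thread(future, "ocr-worker");
        worker.setDaemon(true); // a stuck task must not keep the JVM alive
        worker.start();
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker thread
            return fallback;
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return fallback;
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        String fast = callWithTimeout(() -> "recognized text", 1000, "");
        String slow = callWithTimeout(() -> { Thread.sleep(5000); return "too late"; }, 100, "");
        System.out.println("fast=[" + fast + "] slow=[" + slow + "]");
    }
}
```

Note that cancelling the future only interrupts the Java thread; when the task wraps an external process, that process would also need to be destroyed explicitly.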

 OCR support
 ---

 Key: TIKA-93
 URL: https://issues.apache.org/jira/browse/TIKA-93
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.6

 Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, 
 TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, 
 testOCR.docx, testOCR.pdf, testOCR.pptx




[jira] [Updated] (TIKA-93) OCR support

2014-02-09 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated TIKA-93:


Attachment: testOCR.pptx
testOCR.pdf
testOCR.docx
TIKA-93.patch

Not sure if this is progress or not...  

The testOCR.* files need to go in the parsers/src/test/resources/test-documents 
directory.

Things that changed:
# Moved config to ParseContext instead of a one-off implementation in 
PDFParserConfig.
# Used the existing ParseContext for passing in the OCRParser instead of 
separate handling.
# Added some more test files.  Will upload them.

Things I could use help on:
# Trying to get this integrated into the Office stuff.  I see the 
DELEGATING_PARSER capabilities for embedded extraction, but not quite sure 
about how to best leverage that.  See JavaOCRParserTest.testOCR for some 
attempts at setting up the test
# Overall, my biggest lack of understanding is around how to configure this 
stuff.  As I see it, we need to be able to set 2 things: 
## The OCRParser or DelegatingParser.  I'm not sure how embedded contexts are 
used in practice.  Note, some of the OCRParser implementations will require 
configuration/training before they can be used.
## Whether or not to actually use the OCRParser (a boolean flag), as OCR is 
expensive and not everyone will want it for every doc, etc.
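
One way to picture the two settings above is a context object keyed by class that carries an on/off flag, with OCR off unless the caller opts in. The sketch below is self-contained and does not use Tika's ParseContext; Context, OcrSettings, and shouldRunOcr are all hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these types exist in Tika.
class OcrContextSketch {

    /** Minimal type-keyed context: at most one value per class. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    /** Hypothetical settings object holding the boolean OCR flag. */
    static final class OcrSettings {
        final boolean enabled;

        OcrSettings(boolean enabled) {
            this.enabled = enabled;
        }
    }

    /** Default used when the caller supplied no settings: OCR stays off. */
    static final OcrSettings DISABLED = new OcrSettings(false);

    /** A parser would ask this before doing any expensive OCR work. */
    static boolean shouldRunOcr(Context context) {
        return context.get(OcrSettings.class, DISABLED).enabled;
    }

    public static void main(String[] args) {
        Context context = new Context();
        System.out.println("default: " + shouldRunOcr(context));
        context.set(OcrSettings.class, new OcrSettings(true));
        System.out.println("opted in: " + shouldRunOcr(context));
    }
}
```

Defaulting to "off" reflects the concern that OCR is expensive; whether that is the right default is a design decision for the issue, not something this sketch settles.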

 OCR support
 ---

 Key: TIKA-93
 URL: https://issues.apache.org/jira/browse/TIKA-93
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
 Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, 
 TIKA-93.patch, testOCR.docx, testOCR.pdf, testOCR.pptx




[jira] [Updated] (TIKA-93) OCR support

2014-02-08 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated TIKA-93:


Attachment: TIKA-93.patch

Here is a _very_ early stage patch that creates a JavaOCR parser.  It is not 
integrated into any of the other parsers, yet.

I also added Jacoco code coverage to the Parent POM so that we can now generate 
coverage reports.  For example:
# mvn verify  (from the top level)

Or, after running mvn test
# mvn jacoco:check

Once done, check the target/site/jacoco directory to see the reports.

Not sure on Tika workflow for JIRA, but if someone wants to Assign this Issue 
to me, I'll take it the next few steps. 

 OCR support
 ---

 Key: TIKA-93
 URL: https://issues.apache.org/jira/browse/TIKA-93
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting
Priority: Minor
 Attachments: TIKA-93.patch




[jira] [Updated] (TIKA-93) OCR support

2014-02-08 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated TIKA-93:


Attachment: TIKA-93.patch

Tests for the JavaOCRParser.  Next step is to start integrating into various 
other parsers.

 OCR support
 ---

 Key: TIKA-93
 URL: https://issues.apache.org/jira/browse/TIKA-93
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting
Priority: Minor
 Attachments: TIKA-93.patch, TIKA-93.patch




[jira] [Updated] (TIKA-93) OCR support

2014-02-08 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated TIKA-93:


Attachment: TIKA-93.patch

This shows what I am thinking for integration with PDFParser.  Not sure if it 
fits with what others have in mind when it comes to how the OCRParser gets 
integrated.

 OCR support
 ---

 Key: TIKA-93
 URL: https://issues.apache.org/jira/browse/TIKA-93
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
 Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch

