[jira] [Updated] (TIKA-93) OCR support
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated TIKA-93:

Labels: memex (was: )

OCR support
---
Key: TIKA-93
URL: https://issues.apache.org/jira/browse/TIKA-93
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
Labels: memex
Fix For: 1.7
Attachments: Petr_tika-config.xml, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, TesseractOCR_Tyler.patch, TesseractOCR_Tyler_v2.patch, TesseractOCR_Tyler_v3.patch, TesseractOCR_Tyler_v4.patch, testOCR.docx, testOCR.pdf, testOCR.pptx

I don't know of any decent open source pure Java OCR libraries, but there are command line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to extract text content (where available) from image files.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-93) OCR support
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich updated TIKA-93:

Attachment: TesseractOCR_Tyler_v4.patch

Thank you for the input! I attached a new patch (v4) which uses `junit.Assume` to skip the tests if Tesseract is not installed, and which cleans up some of the exception throwing.

OCR support
---
Key: TIKA-93
URL: https://issues.apache.org/jira/browse/TIKA-93
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.7
Attachments: Petr_tika-config.xml, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, TesseractOCR_Tyler.patch, TesseractOCR_Tyler_v2.patch, TesseractOCR_Tyler_v3.patch, TesseractOCR_Tyler_v4.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
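The idea in the v4 comment above, skipping the OCR tests when Tesseract is missing, could be sketched roughly as follows. This is not the code from the patch: `ExternalToolProbe` and `canRun` are hypothetical names, the `tesseract -v` invocation is an assumption about the local install, and the sketch assumes Java 9 or later for `ProcessBuilder.Redirect.DISCARD`.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: reports whether an external command can be started. */
final class ExternalToolProbe {

    private ExternalToolProbe() {}

    /**
     * Tries to launch the command and waits up to ten seconds for it to exit.
     * Returns false if the executable cannot be started or does not exit in time.
     * The exit code is ignored: being able to start the tool is treated as "installed".
     */
    static boolean canRun(String... command) {
        Process process;
        try {
            process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    // Discard output so the child never blocks on a full pipe (Java 9+).
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
        } catch (IOException e) {
            // Typically means the executable was not found on the PATH.
            return false;
        }
        try {
            boolean exited = process.waitFor(10, TimeUnit.SECONDS);
            if (!exited) {
                process.destroyForcibly();
            }
            return exited;
        } catch (InterruptedException e) {
            process.destroyForcibly();
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

// Possible use inside a JUnit 4 test method (assumed invocation, not from the patch):
//   org.junit.Assume.assumeTrue(ExternalToolProbe.canRun("tesseract", "-v"));
```

With `Assume.assumeTrue`, a false result makes JUnit 4 report the test as skipped instead of failed, which matches the behavior the comment describes.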
[jira] [Updated] (TIKA-93) OCR support
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich updated TIKA-93:

Assignee: Chris A. Mattmann (was: Tyler Palsulich)

OCR support
---
Key: TIKA-93
URL: https://issues.apache.org/jira/browse/TIKA-93
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.7
Attachments: Petr_tika-config.xml, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, TesseractOCR_Tyler.patch, TesseractOCR_Tyler_v2.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
[jira] [Updated] (TIKA-93) OCR support
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich updated TIKA-93:

Attachment: TesseractOCR_Tyler_v3.patch

Updated patch which passes all tests whether Tesseract is installed or not. I updated the review board, too. See https://reviews.apache.org/r/22402/.

Also, whoops, I hit a hotkey to assign the issue to me.

OCR support
---
Key: TIKA-93
URL: https://issues.apache.org/jira/browse/TIKA-93
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.7
Attachments: Petr_tika-config.xml, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, TesseractOCR_Tyler.patch, TesseractOCR_Tyler_v2.patch, TesseractOCR_Tyler_v3.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
[jira] [Updated] (TIKA-93) OCR support
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Petr Vas updated TIKA-93:

Attachment: Petr_tika-config.xml

Sure, here is the config. The source code that I am currently using can be found here: https://github.com/datanav/tika/tree/ocr-tika-server (a forked version of Apache's repo with a custom branch).

OCR support
---
Key: TIKA-93
URL: https://issues.apache.org/jira/browse/TIKA-93
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.7
Attachments: Petr_tika-config.xml, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, TesseractOCR_Tyler.patch, TesseractOCR_Tyler_v2.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
[jira] [Updated] (TIKA-93) OCR support
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich updated TIKA-93:

Attachment: TesseractOCR_Tyler_v2.patch

Minor updates to the patch: I moved the OCRParser to tika-parsers (unless others think it should be in tika-core?) and moved the files from test-documents/ocr to just test-documents.

In PDFParserTest, I added testOCR.pdf to the list of known metadataDiff, since the PDF version is different for the NonSeq and Seq PDFBox parsers. In the tika-server TikaMimeTypesTest, I changed testGetJSON(). Will someone look at this part? Something seems weird about it.

There still needs to be a check for whether Tesseract is installed, and where. I looked a bit at the ExternalParser code. It seems useful, but I'm not sure how to combine TesseractOCRParser and ExternalParser. Can someone else chime in? At this point, I don't think we need more than a call to ExternalParser.check(). But I could be wrong.

In my opinion, we should just require that Tesseract be on the user's path. It's an uncommon program, so if a user installs it, it will probably be *for* Tika OCR, and it's not a big deal for them to put it on their path.

I put up a review: https://reviews.apache.org/r/22402/. I don't think this is ready yet, but I'd like to get it moving.

OCR support
---
Key: TIKA-93
URL: https://issues.apache.org/jira/browse/TIKA-93
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.7
Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, TesseractOCR_Tyler.patch, TesseractOCR_Tyler_v2.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
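The open question in the v2 comment above, whether Tesseract is installed "and where", could be answered without launching a process by scanning the directories on the PATH. The following is only a sketch under stated assumptions: `PathLookup` and its methods are hypothetical names, it is not how ExternalParser.check() works, and it ignores Windows executable suffixes such as `.exe`.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical helper: locates an executable by scanning PATH-style directory lists. */
final class PathLookup {

    private PathLookup() {}

    /**
     * Returns the first regular, executable file named {@code executable}
     * found in the directories listed in {@code pathValue}, if any.
     */
    static Optional<Path> find(String executable, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        // File.pathSeparator is ":" on Unix-like systems and ";" on Windows.
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            try {
                Path candidate = Paths.get(dir, executable);
                if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                    return Optional.of(candidate);
                }
            } catch (InvalidPathException e) {
                // Skip PATH entries that are not valid paths on this platform.
            }
        }
        return Optional.empty();
    }

    /** Convenience wrapper that reads the PATH environment variable. */
    static Optional<Path> findOnSystemPath(String executable) {
        return find(executable, System.getenv("PATH"));
    }
}
```

A caller could use `PathLookup.findOnSystemPath("tesseract")` and treat an empty result as "OCR unavailable". Requiring the tool to be on the path, as the comment proposes, keeps this lookup the only configuration step.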
[jira] [Updated] (TIKA-93) OCR support
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich updated TIKA-93:

Attachment: TesseractOCR_Tyler.patch

Awesome! I attached another patch which includes TesseractOCRParser.patch along with unit tests for the parser (PDF, PPTX, and DOCX files with embedded images that contain text). We could use more tests for images with no text, blurry text, and so on. But I don't know how good Tesseract is.

Steps to apply this patch: install Tesseract [1], apply the patch, and move the test files into tika-parsers/src/test/resources/test-documents/ocr. Run the tests with {{mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest -DfailIfNoTests=false}}.

What needs to happen from here? How should we include Tesseract in the sources? How should we handle timeouts (for example, warn the user that OCR can be slow or that it timed out)?

[1] - https://code.google.com/p/tesseract-ocr/wiki/ReadMe

OCR support
---
Key: TIKA-93
URL: https://issues.apache.org/jira/browse/TIKA-93
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.6
Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, TesseractOCR_Tyler.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
[jira] [Updated] (TIKA-93) OCR support
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luis Filipe Nassif updated TIKA-93:

Attachment: TesseractOCRParser.patch

Patch with first version of a tesseract-ocr based OCRParser, with simple timeout control.

OCR support
---
Key: TIKA-93
URL: https://issues.apache.org/jira/browse/TIKA-93
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.6
Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TesseractOCRParser.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
[jira] [Updated] (TIKA-93) OCR support
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luis Filipe Nassif updated TIKA-93:

Attachment: TesseractOCRParser.patch

Better timeout control using FutureTask.

OCR support
---
Key: TIKA-93
URL: https://issues.apache.org/jira/browse/TIKA-93
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.6
Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
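Timeout control with FutureTask, as mentioned in the comment above, generally follows the pattern below. This is a minimal sketch of the general technique and not the code from the attached patch: `TimedCall`, `callWithTimeout`, and the fallback-value convention are all assumptions made for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: runs a task with an upper bound on how long the caller waits. */
final class TimedCall {

    private TimedCall() {}

    /**
     * Runs {@code work} on a separate daemon thread and returns its result,
     * or {@code fallback} if the time limit passes or the caller is interrupted.
     * A failure inside {@code work} is rethrown as an IllegalStateException.
     */
    static <T> T callWithTimeout(Callable<T> work, long timeout, TimeUnit unit, T fallback) {
        FutureTask<T> task = new FutureTask<>(work);
        Thread worker = new Thread(task, "timed-call");
        worker.setDaemon(true); // do not keep the JVM alive for a stuck task
        worker.start();
        try {
            return task.get(timeout, unit);
        } catch (TimeoutException e) {
            // cancel(true) interrupts the worker thread. An OCR parser would
            // also need to destroy the external process it started.
            task.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } catch (InterruptedException e) {
            task.cancel(true);
            Thread.currentThread().interrupt();
            return fallback;
        }
    }
}
```

For example, `TimedCall.callWithTimeout(() -> runOcr(image), 120, TimeUnit.SECONDS, "")` would return an empty string if a hypothetical `runOcr` call took longer than two minutes.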
[jira] [Updated] (TIKA-93) OCR support
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated TIKA-93:

Attachment: testOCR.pptx
            testOCR.pdf
            testOCR.docx
            TIKA-93.patch

Not sure if this is progress or not... The testOCR.* files need to go in the parsers/src/test/resources/test-documents directory.

Things that changed:
# Moved the config to ParseContext instead of a one-off implementation in PDFParserConfig.
# Used the existing ParseContext for passing in the OCRParser instead of separate handling.
# Added some more test files. Will upload them.

Things I could use help on:
# Trying to get this integrated into the Office stuff. I see the DELEGATING_PARSER capabilities for embedded extraction, but I am not quite sure how best to leverage them. See JavaOCRParserTest.testOCR for some attempts at setting up the test.
# Overall, my biggest lack of understanding is around how to configure this stuff. As I see it, we need to be able to set 2 things:
## The OCRParser or DelegatingParser. I'm not sure how embedded contexts are used in practice. Note that some of the OCRParser implementations will require configuration/training before they can be used.
## Whether or not to actually use the OCRParser (a boolean flag), as OCR is expensive and not everyone will want it for every doc, etc.

OCR support
---
Key: TIKA-93
URL: https://issues.apache.org/jira/browse/TIKA-93
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
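The two settings discussed in the comment above (which OCR component to use, and whether to run it at all) can both travel through a type-keyed context object. The sketch below only illustrates that pattern: `TypedContext` is a hypothetical stand-in and not Tika's ParseContext, and `OcrSettings`, `EmbeddedImageHandler`, and their fields are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a context that stores one object per type. */
final class TypedContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    /** Stores {@code value} under its type; a null value removes the entry. */
    <T> void set(Class<T> key, T value) {
        if (value == null) {
            entries.remove(key);
        } else {
            entries.put(key, value);
        }
    }

    /** Returns the stored object for {@code key}, or {@code defaultValue} if absent. */
    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key);
        return value == null ? defaultValue : key.cast(value);
    }
}

/** Hypothetical per-parse OCR settings; the field names are illustrative only. */
final class OcrSettings {
    final boolean enabled;
    final String language;

    OcrSettings(boolean enabled, String language) {
        this.enabled = enabled;
        this.language = language;
    }

    /** Default used when the caller did not opt in: OCR stays off. */
    static final OcrSettings DISABLED = new OcrSettings(false, "eng");
}

/** Hypothetical consumer showing how a parser might consult the context. */
final class EmbeddedImageHandler {
    private EmbeddedImageHandler() {}

    /** Describes what would happen to an embedded image for the given context. */
    static String describe(TypedContext context, String imageName) {
        OcrSettings settings = context.get(OcrSettings.class, OcrSettings.DISABLED);
        if (!settings.enabled) {
            return "skipped " + imageName;
        }
        return "would OCR " + imageName + " with language " + settings.language;
    }
}
```

Defaulting to a disabled settings object means callers that never touch the context pay no OCR cost, which addresses the concern that OCR is expensive and not wanted for every document.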
[jira] [Updated] (TIKA-93) OCR support
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated TIKA-93:

Attachment: TIKA-93.patch

Here is a _very_ early stage patch that creates a JavaOCR parser. It is not integrated into any of the other parsers yet.

I also added Jacoco code coverage to the Parent POM so that we can now generate coverage reports. For example:
# mvn verify (from the top level)
Or, after running mvn test:
# mvn jacoco:check
Once done, check the target/site/jacoco directory to see the reports.

Not sure about the Tika workflow for JIRA, but if someone wants to assign this issue to me, I'll take it through the next few steps.

OCR support
---
Key: TIKA-93
URL: https://issues.apache.org/jira/browse/TIKA-93
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Jukka Zitting
Priority: Minor
Attachments: TIKA-93.patch
[jira] [Updated] (TIKA-93) OCR support
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated TIKA-93:

Attachment: TIKA-93.patch

Tests for the JavaOCRParser. Next step is to start integrating into various other parsers.

OCR support
---
Key: TIKA-93
URL: https://issues.apache.org/jira/browse/TIKA-93
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Jukka Zitting
Priority: Minor
Attachments: TIKA-93.patch, TIKA-93.patch
[jira] [Updated] (TIKA-93) OCR support
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated TIKA-93:

Attachment: TIKA-93.patch

This shows what I am thinking for integration with PDFParser. Not sure if it fits with what others have in mind when it comes to how the OCRParser gets integrated.

OCR support
---
Key: TIKA-93
URL: https://issues.apache.org/jira/browse/TIKA-93
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch