[jira] [Created] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
Hoss Man created TIKA-1526:
--
Summary: ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
Key: TIKA-1526
URL: https://issues.apache.org/jira/browse/TIKA-1526
Project: Tika
Issue Type: Wish
Reporter: Hoss Man

The JDK has numerous pain points regarding the Turkish locale, posix_spawn lowercasing being one of them...

https://bugs.openjdk.java.net/browse/JDK-8047340
https://bugs.openjdk.java.net/browse/JDK-8055301

As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is enabled and configured by default in Tika, and uses ExternalParser.check to see if tesseract is available -- but because of the JDK bug, this means that Tika fails fast for Turkish users on BSD/UNIX variants (including MacOSX) like so...

{noformat}
[junit4] Throwable #1: java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
[junit4]    at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
[junit4]    at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
[junit4]    at java.security.AccessController.doPrivileged(Native Method)
[junit4]    at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:92)
[junit4]    at java.lang.ProcessImpl.start(ProcessImpl.java:130)
[junit4]    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
[junit4]    at java.lang.Runtime.exec(Runtime.java:620)
[junit4]    at java.lang.Runtime.exec(Runtime.java:485)
[junit4]    at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
[junit4]    at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
[junit4]    at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
[junit4]    at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4]    at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
[junit4]    at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
[junit4]    at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4]    at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
[junit4]    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
[junit4]    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
{noformat}

...unless they go out of their way to whitelist only the parsers they need/want so TesseractOCRParser (and any other ExternalParsers) will never even be check()ed.

It would be nice if Tika's ExternalParser class added a similar hack/workaround to what was done in SOLR-6387 to trap these types of errors. In Solr we just propagate a better error explaining why Java hates the Turkish language...

{code}
} catch (Error err) {
  if (err.getMessage() != null && (err.getMessage().contains("posix_spawn") || err.getMessage().contains("UNIXProcess"))) {
    log.warn("Error forking command due to JVM locale bug (see https://issues.apache.org/jira/browse/SOLR-6387): " + err.getMessage());
    return "(error executing: " + cmd + ")";
  }
}
{code}

...but with Tika, it might be better for all ExternalParsers to just opt out, as if they don't recognize the filetype, when they detect this type of error from the check method (or perhaps it would be better if AutoDetectParser handled this? ... I'm not really sure how it would best fit into Tika's architecture)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
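The "opt out" idea in the message above could look roughly like the following. This is only a sketch under assumptions, not Tika's or Solr's actual code: the class SafeExternalCheck and both method names are hypothetical, and only the two message substrings are taken from the Solr snippet quoted above. The availability check treats the locale-triggered Error as "tool not available" instead of letting it escape to the caller.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper (not part of Tika): traps the JDK locale bug during a tool check. */
public class SafeExternalCheck {

    /** True if the throwable looks like the posix_spawn/UNIXProcess locale failure. */
    public static boolean isLocaleSpawnError(Throwable t) {
        String msg = t.getMessage();
        return t instanceof Error
                && msg != null
                && (msg.contains("posix_spawn") || msg.contains("UNIXProcess"));
    }

    /** Returns true only if the command could be launched and exited with status 0. */
    public static boolean check(String... command) {
        try {
            Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
            // Drain the output so the child cannot block on a full pipe.
            try (InputStream in = p.getInputStream()) {
                byte[] buf = new byte[1024];
                while (in.read(buf) != -1) {
                    // output is discarded on purpose
                }
            }
            return p.waitFor() == 0;
        } catch (IOException e) {
            return false; // e.g. the program is not installed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (Error err) {
            if (isLocaleSpawnError(err)) {
                return false; // opt out quietly instead of failing the whole parse
            }
            throw err; // unrelated errors are not swallowed
        }
    }

    public static void main(String[] args) {
        // The command line here is illustrative only.
        System.out.println("available: " + check("tesseract", "--version"));
    }
}
```

Whether the trap belongs in ExternalParser itself or higher up (for example in AutoDetectParser) is exactly the open question raised in the issue; the sketch only shows the trapping logic.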
[jira] [Commented] (TIKA-1511) Create a parser for SQLite3
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14285621#comment-14285621 ]

Luis Filipe Nassif commented on TIKA-1511:
--
No problems, the desing looks good!

Create a parser for SQLite3
---
Key: TIKA-1511
URL: https://issues.apache.org/jira/browse/TIKA-1511
Project: Tika
Issue Type: New Feature
Components: parser
Affects Versions: 1.6
Reporter: Luis Filipe Nassif
Fix For: 1.8
Attachments: TIKA-1511v1.patch, TIKA-1511v2.patch, TIKA-1511v3.patch, testSQLLite3b.db, testSQLLite3b.db

I think it would be very useful, as sqlite is used as data storage by a wide range of applications. Opening the ticket to track it.
[jira] [Commented] (TIKA-1511) Create a parser for SQLite3
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14285568#comment-14285568 ]

Tim Allison commented on TIKA-1511:
---
{quote}
A) I think it will work, as the patch works now. But I think an inputStream that can not be read is a bit strange.
{quote}
Agreed. The new proposal is to make the InputStream readable, but the regular use case of an AutoDetectParser sent in via ParseContext won't bother to read the InputStream; rather, it will read the table object and use the user-supplied ContentHandler.

{quote}
B) Could it be better to send a xHTML inputStream with markup to client instead of simple UTF-8 encoded CSV?
{quote}
We could, but there are other ways of getting that...RecursiveParserWrapper or custom recursive embedded parser handler or even just sending in the plain AutoDetectParser as the EmbeddedDocumentExtractor/Parser in ParseContext. The idea behind this is to support a ParserContainerExtractor that would normally pull just the bytes from embedded documents...because there are no bytes for a table object (i.e. it never exists as an actual standalone file), I propose a csv proxy.

{quote}
C) I agree, but it will work only if he adds the correct parser (eg TableParser or CompositeParser) to ParseContext, right?
{quote}
The user will have to add an AutoDetectParser to the ParseContext, and we will need to add
org.apache.tika.parser.jdbc.SQLite3Parser
and
org.apache.tika.parser.jdbc.JDBCTableParser
to the parser services file.

I have a draft of this proposal working. The current downside is that if the client resets and rereads the InputStream, the blobs/clobs are processed twice via the EmbeddedDocumentExtractor.

Any problems with the above? Recommendations for an alternate design?
Create a parser for SQLite3
---
Key: TIKA-1511
URL: https://issues.apache.org/jira/browse/TIKA-1511
Project: Tika
Issue Type: New Feature
Components: parser
Affects Versions: 1.6
Reporter: Luis Filipe Nassif
Fix For: 1.8
Attachments: TIKA-1511v1.patch, TIKA-1511v2.patch, testSQLLite3b.db

I think it would be very useful, as sqlite is used as data storage by a wide range of applications. Opening the ticket to track it.
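The "csv proxy" proposed in the comment above can be pictured with a small sketch. It is an illustration only and does not reproduce the draft patch: TableCsvProxy, escape, toCsv and toCsvStream are hypothetical names, and the in-memory row lists stand in for whatever the parser would actually read over JDBC. A table that never exists as a standalone file is rendered to UTF-8 CSV and exposed as a readable InputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of a CSV stand-in for a database table. */
public class TableCsvProxy {

    /** Quotes a cell if it contains a comma, a double quote, or a line break. */
    public static String escape(String cell) {
        if (cell == null) {
            return "";
        }
        boolean needsQuotes = cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r");
        String escaped = cell.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    /** Renders rows as CSV text, one line per row. */
    public static String toCsv(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        for (List<String> row : rows) {
            for (int i = 0; i < row.size(); i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append(escape(row.get(i)));
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    /** Exposes the CSV text as a readable, UTF-8 encoded stream. */
    public static InputStream toCsvStream(List<List<String>> rows) {
        return new ByteArrayInputStream(toCsv(rows).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("id", "name"),
                Arrays.asList("1", "Smith, J."));
        InputStream in = toCsvStream(rows);
        int b;
        while ((b = in.read()) != -1) {
            System.out.write(b);
        }
        System.out.flush();
    }
}
```

An in-memory stream like this can be reset and reread without side effects; the double processing of blobs/clobs mentioned in the comment would arise only if rereading also re-triggered the embedded-document handling, which this sketch leaves out.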
[jira] [Updated] (TIKA-1511) Create a parser for SQLite3
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1511:
--
Attachment: TIKA-1511v3.patch
            testSQLLite3b.db

Slightly modified test document. Updated patch.
[jira] [Comment Edited] (TIKA-1511) Create a parser for SQLite3
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14285621#comment-14285621 ]

Luis Filipe Nassif edited comment on TIKA-1511 at 1/21/15 3:14 PM:
---
No problems, the design looks good!

was (Author: lfcnassif):
No problems, the desing looks good!