[jira] [Created] (TIKA-1526) ExternalParser should trap/ignore/workaround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers

2015-01-21 Thread Hoss Man (JIRA)
Hoss Man created TIKA-1526:
--

 Summary: ExternalParser should trap/ignore/workaround JDK-8047340
 JDK-8055301 so Turkish Tika users can still use non-external parsers
 Key: TIKA-1526
 URL: https://issues.apache.org/jira/browse/TIKA-1526
 Project: Tika
  Issue Type: Wish
Reporter: Hoss Man


The JDK has numerous pain points regarding the Turkish locale, posix_spawn 
lowercasing being one of them...

https://bugs.openjdk.java.net/browse/JDK-8047340
https://bugs.openjdk.java.net/browse/JDK-8055301

As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is enabled 
and configured by default in Tika, and uses ExternalParser.check to see if 
tesseract is available. Because of the JDK bug, this means that Tika 
fails fast for Turkish users on BSD/UNIX variants (including MacOSX) like so...

{noformat}
  [junit4] Throwable #1: java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
  [junit4] at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
  [junit4] at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
  [junit4] at java.security.AccessController.doPrivileged(Native Method)
  [junit4] at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:92)
  [junit4] at java.lang.ProcessImpl.start(ProcessImpl.java:130)
  [junit4] at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
  [junit4] at java.lang.Runtime.exec(Runtime.java:620)
  [junit4] at java.lang.Runtime.exec(Runtime.java:485)
  [junit4] at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
  [junit4] at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
  [junit4] at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
  [junit4] at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
  [junit4] at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
  [junit4] at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
  [junit4] at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
  [junit4] at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
  [junit4] at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
  [junit4] at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
{noformat}

...unless they go out of their way to whitelist only the parsers they 
need/want, so that TesseractOCRParser (and any other ExternalParsers) will 
never even be check()ed.

It would be nice if Tika's ExternalParser class added a hack/workaround 
similar to what was done in SOLR-6387 to trap these types of errors.  
In Solr we just propagate a better error explaining why Java hates the Turkish 
language...

{code}
} catch (Error err) {
  if (err.getMessage() != null && (err.getMessage().contains("posix_spawn")
      || err.getMessage().contains("UNIXProcess"))) {
    log.warn("Error forking command due to JVM locale bug (see "
        + "https://issues.apache.org/jira/browse/SOLR-6387): " + err.getMessage());
    return "(error executing: " + cmd + ")";
  }
}
{code}

...but with Tika, it might be better for all ExternalParsers to just opt out, 
as if they don't recognize the file type, when they detect this type of error 
from the check method (or perhaps it would be better if AutoDetectParser 
handled this? I'm not really sure how it would best fit into Tika's 
architecture).
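A minimal sketch of that opt-out idea follows. It is not Tika's actual code: the class name LocaleSafeCheck, the Launcher interface, and the check signature are all hypothetical, introduced only so the error handling can be exercised without forking a real process. The sketch treats the posix_spawn/UNIXProcess Error as "external tool not available" and rethrows any other Error.

```java
import java.io.IOException;

// Hypothetical helper, not part of Tika: illustrates trapping the JVM
// locale bug so an external-tool check reports "unavailable" instead of
// failing the whole parse.
final class LocaleSafeCheck {

    // Hypothetical seam so the launch step can be replaced in tests.
    interface Launcher {
        Process launch(String[] cmd) throws IOException;
    }

    // Same message test as the SOLR-6387 snippet quoted above.
    static boolean isLocaleSpawnBug(Error err) {
        String msg = err.getMessage();
        return msg != null
                && (msg.contains("posix_spawn") || msg.contains("UNIXProcess"));
    }

    // Returns true only if the command could be started and exited with 0.
    // Note: output is not drained here, so this suits only short check
    // commands.
    static boolean check(String[] cmd, Launcher launcher) {
        try {
            Process process = launcher.launch(cmd);
            return process.waitFor() == 0;
        } catch (IOException e) {
            // Command missing or not executable: treat as unavailable.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (Error err) {
            if (isLocaleSpawnBug(err)) {
                // JDK-8047340 / JDK-8055301: opt out quietly.
                return false;
            }
            throw err; // unrelated Errors must still propagate
        }
    }

    // Convenience overload using the real process launcher.
    static boolean check(String[] cmd) {
        return check(cmd, c -> Runtime.getRuntime().exec(c));
    }
}
```

With something like this in place, a parser whose check fails would simply report no supported types, and the remaining parsers would stay usable.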



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-21 Thread Luis Filipe Nassif (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14285621#comment-14285621 ]

Luis Filipe Nassif commented on TIKA-1511:
--

No problems, the desing looks good!

 Create a parser for SQLite3
 ---

 Key: TIKA-1511
 URL: https://issues.apache.org/jira/browse/TIKA-1511
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.6
Reporter: Luis Filipe Nassif
 Fix For: 1.8

 Attachments: TIKA-1511v1.patch, TIKA-1511v2.patch, TIKA-1511v3.patch, 
 testSQLLite3b.db, testSQLLite3b.db


 I think it would be very useful, as sqlite is used as data storage by a wide 
 range of applications. Opening the ticket to track it. 





[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-21 Thread Tim Allison (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14285568#comment-14285568 ]

Tim Allison commented on TIKA-1511:
---

{quote}
A) I think it will work, as the patch works now. But I think an inputStream 
that can not be read is a bit strange.
{quote}
Agreed.  The new proposal is to make the InputStream readable, but in the 
regular use case, an AutoDetectParser sent in via ParseContext won't bother to 
read the InputStream; rather, it will read the table object and use the 
user-supplied ContentHandler.

{quote}
B) Could it be better to send a xHTML inputStream with markup to client instead 
of simple UTF-8 encoded CSV?
{quote}
We could, but there are other ways of getting that: RecursiveParserWrapper, a 
custom recursive embedded parser handler, or even just sending in the plain 
AutoDetectParser as the EmbeddedDocumentExtractor/Parser in ParseContext.  The 
idea behind this is to support a ParserContainerExtractor that would normally 
pull just the bytes from embedded documents.  Because there are no bytes for a 
table object (i.e. it never exists as an actual standalone file), I propose a 
csv proxy.
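To make the "csv proxy" idea concrete, here is a rough sketch under stated assumptions: the class CsvProxySketch and its methods are hypothetical and are not taken from the TIKA-1511 patches. It only shows how a table's rows, which have no bytes of their own, could be presented as a UTF-8 encoded CSV InputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical illustration only; not the code in the attached patches.
final class CsvProxySketch {

    // Quote a cell when it contains a comma, quote, or line break,
    // doubling any embedded double quotes.
    static String escape(String cell) {
        if (cell == null) {
            return "";
        }
        boolean needsQuotes = cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r");
        String escaped = cell.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Render rows as CSV text, one line per row.
    static String toCsv(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        for (List<String> row : rows) {
            for (int i = 0; i < row.size(); i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append(escape(row.get(i)));
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    // The "proxy": a readable stream standing in for the table's bytes.
    static InputStream toCsvStream(List<List<String>> rows) {
        return new ByteArrayInputStream(
                toCsv(rows).getBytes(StandardCharsets.UTF_8));
    }
}
```

A real implementation would presumably stream rows from the JDBC ResultSet rather than buffer the whole table in memory.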

{quote}
C) I agree, but it will work only if he adds the correct parser (eg TableParser 
or CompositeParser) to ParseContext, right?
{quote}
The user will have to add an AutoDetectParser to the ParseContext, and we will 
need to add org.apache.tika.parser.jdbc.SQLite3Parser and 
org.apache.tika.parser.jdbc.JDBCTableParser 
to the parser services file. 

I have a draft of this proposal working.  The current downside is that if the 
client resets and rereads the InputStream, the blobs/clobs are processed twice 
via the EmbeddedDocumentExtractor.  

Any problems with the above?  Recommendations for an alternate design?

 Create a parser for SQLite3
 ---

 Key: TIKA-1511
 URL: https://issues.apache.org/jira/browse/TIKA-1511
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.6
Reporter: Luis Filipe Nassif
 Fix For: 1.8

 Attachments: TIKA-1511v1.patch, TIKA-1511v2.patch, testSQLLite3b.db


 I think it would be very useful, as sqlite is used as data storage by a wide 
 range of applications. Opening the ticket to track it. 





[jira] [Updated] (TIKA-1511) Create a parser for SQLite3

2015-01-21 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1511:
--
Attachment: TIKA-1511v3.patch
testSQLLite3b.db

Slightly modified test document.  Updated patch.






[jira] [Comment Edited] (TIKA-1511) Create a parser for SQLite3

2015-01-21 Thread Luis Filipe Nassif (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14285621#comment-14285621 ]

Luis Filipe Nassif edited comment on TIKA-1511 at 1/21/15 3:14 PM:
---

No problems, the design looks good!


was (Author: lfcnassif):
No problems, the desing looks good!



