Tika 715 (invalid xhtml output)
Ref: https://issues.apache.org/jira/browse/TIKA-715

I'm using tika-app 1.4 (in server mode) in a stand-alone document processing pipeline, and have discovered that much of the XHTML Tika produces is invalid. I subsequently found TIKA-715, which appears to cover exactly this. Because of this issue, I cannot use my preferred XML parsing library to extract metadata and text from the XHTML output. As a workaround, I have tried using an HTML parser instead; this works, but requires far more resources (CPU time and memory). Is there hope of a fix for this issue in the near future, or should I just concentrate on improving my code for working with the HTML format?
[jira] [Created] (TIKA-1205) Allow PDFParser to fallback to other parser if there is an exception
Tim Allison created TIKA-1205:

Summary: Allow PDFParser to fallback to other parser if there is an exception
Key: TIKA-1205
URL: https://issues.apache.org/jira/browse/TIKA-1205
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial
Fix For: 1.5

With TIKA-1201, there is now an option to use PDFBox's NonSequentialPDFParser instead of the traditional parser for parsing PDF files. Following the description in PDFBOX-1199, it would be useful to allow fallback to the classic parser if NonSequentialPDFParser encounters an IOException.

For the sake of symmetry, I propose a boolean useParserFallbackOnException parameter. If this parameter is true and Tika's PDFParser is using the classic parser, Tika will fall back to the NonSequentialPDFParser if there is an IOException; if this parameter is true and Tika's PDFParser is using the NonSequentialPDFParser, it will fall back to the classic parser if there is an IOException.

Many thanks to Hong-Thai for championing the addition of the NonSequentialPDFParser capability in TIKA-1201, and many thanks to Timo for PDFBox's NonSequentialPDFParser (PDFBOX-1199)!

-- This message was sent by Atlassian JIRA (v6.1.4#6159)
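The symmetric fallback proposed here might look roughly like the sketch below. Only the flag name useParserFallbackOnException comes from the proposal; the PdfTextParser interface and the parseWithFallback method are hypothetical names invented for illustration and are not Tika or PDFBox API.

```java
import java.io.IOException;

class FallbackSketch {

    /** Hypothetical stand-in for a PDF parser; not a Tika or PDFBox type. */
    interface PdfTextParser {
        String parse(byte[] pdf) throws IOException;
    }

    /**
     * Tries the configured parser first. If it throws an IOException and the
     * flag is set, retries once with the other parser. Which parser counts as
     * "primary" is up to the caller, which is what makes the scheme symmetric.
     */
    static String parseWithFallback(PdfTextParser primary,
                                    PdfTextParser other,
                                    boolean useParserFallbackOnException,
                                    byte[] pdf) throws IOException {
        try {
            return primary.parse(pdf);
        } catch (IOException first) {
            if (!useParserFallbackOnException) {
                throw first; // fallback disabled: report the original failure
            }
            try {
                return other.parse(pdf);
            } catch (IOException second) {
                // Both parsers failed: keep the first failure for diagnosis
                second.addSuppressed(first);
                throw second;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        PdfTextParser failing = pdf -> { throw new IOException("cannot parse"); };
        PdfTextParser working = pdf -> "extracted text";
        System.out.println(parseWithFallback(failing, working, true, new byte[0]));
    }
}
```

Note that the sketch retries only once, so two failing parsers cannot bounce a document back and forth indefinitely.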
[jira] [Commented] (TIKA-1205) Allow PDFParser to fallback to other parser if there is an exception
[ https://issues.apache.org/jira/browse/TIKA-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845398#comment-13845398 ]

Hong-Thai Nguyen commented on TIKA-1205:

Just a (newbie) question: why limit this to PDFParser rather than allowing it for any parser? I agree that fallback is necessary when there is an exception. But the worst case is an infinite loop while parsing a document. To serve both purposes, could we generalize this into a wrapper that handles exceptions and timeouts properly?
[jira] [Commented] (TIKA-1205) Allow PDFParser to fallback to other parser if there is an exception
[ https://issues.apache.org/jira/browse/TIKA-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845429#comment-13845429 ]

Tim Allison commented on TIKA-1205:

Thank you for your feedback! TIKA-456 is the existing issue for general timeout capability. I agree that it would be great to add. TIKA-1205 is a very narrowly defined improvement for PDFParser.
[jira] [Commented] (TIKA-1121) Socket server text parsing error on large text files
[ https://issues.apache.org/jira/browse/TIKA-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845569#comment-13845569 ]

Mane commented on TIKA-1121:

I've tested with the Tika JAX-RS server, and it works in cases where I was previously getting a server error (I tried PDF files that still return Server Error on version 1.4). But it still hangs on a very large HTML file when I use tika-app in server mode (socket server).

Summary: Socket server text parsing error on large text files
Key: TIKA-1121
URL: https://issues.apache.org/jira/browse/TIKA-1121
Project: Tika
Issue Type: Bug
Components: cli
Affects Versions: 1.4
Environment: Ubuntu 10.04, 10.10, 12.04.02
Reporter: Dave Meikle
Assignee: Dave Meikle

As reported on the user list [1], when using the tika-app socket server command with the -t switch to parse text, the process hangs on large text files. This occurs on Ubuntu 10.04, 10.10 and 12.04.02.

[1] http://mail-archives.apache.org/mod_mbox/tika-user/201305.mbox/%3ccagxbzufxsj4h5jwdeux9hhd2fxttq1vsbm7u-vfsyge9vmr...@mail.gmail.com%3E
[jira] [Comment Edited] (TIKA-1121) Socket server text parsing error on large text files
[ https://issues.apache.org/jira/browse/TIKA-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845571#comment-13845571 ]

Mane edited comment on TIKA-1121 at 12/11/13 5:43 PM:

Also worth mentioning: I have a gibberish.txt file containing 2 MB of garbage text, which the Tika socket server is unable to parse; it hangs. (I downloaded this file from one of the Tika-related tickets I found; I no longer have the link to it.)

was (Author: mane_genius):

Also worth mentioning: I have a gibberish.txt file containing 2 MB of garbage text, which the Tika socket server is unable to parse and return the text for. (I downloaded this file from one of the Tika-related tickets I found; I no longer have the link to it.)
Re: [jira] [Comment Edited] (TIKA-1121) Socket server text parsing error on large text files
I've been reading through some of the emails referenced, and it looks like the problem might be in the code on the client side. In one of the emails from May 2013, the client-side code tries to write the entire file to Tika, and then to read the extracted text back. I had a similar problem with some files, and discovered that, for certain files, Tika started to write back extracted text before the entire file had been written. At some point, a deadlock situation arose where each side was waiting for the other to read what had been written to the socket. I solved this by running the read part on the client side in a separate thread. This appears to work fine – I have seen no strange hangs even after feeding close to a million files in sizes up to 100MB through a single Tika process.
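A minimal sketch of the "read in a separate thread" approach, for illustration only. It assumes the server reads the document from the socket until the client half-closes its output and writes its reply on the same connection; that protocol detail, and all class and method names below, are assumptions rather than something taken from the emails. A loopback echo server stands in for Tika so that the example runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.atomic.AtomicReference;

class ConcurrentSocketClient {

    /** Sends doc to the server while a second thread drains the reply into sink. */
    static void exchange(Socket socket, InputStream doc, OutputStream sink)
            throws IOException, InterruptedException {
        AtomicReference<IOException> readError = new AtomicReference<>();
        // Start reading before writing, so the server never blocks on a full
        // send buffer while this side is still busy writing the document.
        Thread reader = new Thread(() -> {
            try {
                copy(socket.getInputStream(), sink);
            } catch (IOException e) {
                readError.set(e);
            }
        });
        reader.setDaemon(true);
        reader.start();

        copy(doc, socket.getOutputStream());
        socket.shutdownOutput(); // half-close: tells the server the document is complete
        reader.join();           // wait until the server has closed its side
        if (readError.get() != null) {
            throw readError.get();
        }
    }

    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        out.flush();
    }

    /** Loopback stand-in for the server: echoes bytes back as they arrive. */
    static int loopbackDemo(int size) {
        try (ServerSocket server = new ServerSocket(0)) {
            Thread echo = new Thread(() -> {
                try (Socket s = server.accept()) {
                    copy(s.getInputStream(), s.getOutputStream());
                } catch (IOException ignored) {
                    // demo server: the client will notice a truncated reply
                }
            });
            echo.setDaemon(true);
            echo.start();
            try (Socket client = new Socket(InetAddress.getLoopbackAddress(),
                                            server.getLocalPort())) {
                ByteArrayOutputStream sink = new ByteArrayOutputStream();
                exchange(client, new ByteArrayInputStream(new byte[size]), sink);
                return sink.size(); // number of reply bytes received
            }
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(loopbackDemo(4 * 1024 * 1024));
    }
}
```

With a payload larger than the socket buffers, a client that writes everything before reading anything could stall against a server like this echo stand-in, which is the deadlock described above. Draining the reply concurrently avoids it.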