Tika 715 (invalid xhtml output)

2013-12-11 Thread Raymond Wiker
Ref: https://issues.apache.org/jira/browse/TIKA-715

I'm using tika-app 1.4 (in server mode) in a stand-alone document
processing pipeline, and have discovered that much of the XHTML output from
Tika is invalid. Subsequently, I found TIKA-715, which appears to cover
exactly this.

Because of this issue, I cannot use my preferred XML parsing library to
extract metadata and text from the XHTML output. As a workaround, I have
tried using an HTML parser instead; this works, but requires far more
resources (CPU time and memory).

Is there hope for a fix for this issue in the near future, or should I just
concentrate on improving my code for working with the HTML format?


[jira] [Created] (TIKA-1205) Allow PDFParser to fallback to other parser if there is an exception

2013-12-11 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1205:
-

 Summary: Allow PDFParser to fallback to other parser if there is 
an exception
 Key: TIKA-1205
 URL: https://issues.apache.org/jira/browse/TIKA-1205
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial
 Fix For: 1.5


With TIKA-1201, there is now an option to use PDFBox's NonSequentialPDFParser 
instead of the traditional parser for parsing PDF files.  Following the 
description in PDFBOX-1199, it would be useful to allow fallback to the classic 
parser if NonSequentialPDFParser encounters an IOException.  For the sake of 
symmetry, I propose a boolean useParserFallbackOnException parameter.  If this 
parameter is true, and if Tika's PDFParser is using the classic parser, Tika 
will fall back to the NonSequentialPDFParser if there is an IOException; if 
this parameter is true and if Tika's PDFParser is using the 
NonSequentialPDFParser it will fall back to the classic parser if there is an 
IOException.
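
The symmetric fallback described above can be sketched roughly as follows. This is only an illustration under stated assumptions: PdfStrategy and FallbackPdfParser are hypothetical names invented for the sketch, not Tika or PDFBox types, and only the useParserFallbackOnException flag comes from the proposal itself.

```java
import java.io.IOException;

// Hypothetical stand-in for "one way of parsing a PDF"; not a Tika or PDFBox type.
interface PdfStrategy {
    String parse(byte[] pdf) throws IOException;
}

// Hypothetical wrapper: tries the configured parser first and, if the flag is
// set, retries once with the other parser when an IOException is thrown.
final class FallbackPdfParser {
    private final PdfStrategy primary;
    private final PdfStrategy secondary;
    private final boolean useParserFallbackOnException;

    FallbackPdfParser(PdfStrategy primary, PdfStrategy secondary,
                      boolean useParserFallbackOnException) {
        this.primary = primary;
        this.secondary = secondary;
        this.useParserFallbackOnException = useParserFallbackOnException;
    }

    String parse(byte[] pdf) throws IOException {
        try {
            return primary.parse(pdf);
        } catch (IOException first) {
            if (!useParserFallbackOnException) {
                throw first; // fallback disabled: surface the original failure
            }
            try {
                return secondary.parse(pdf);
            } catch (IOException second) {
                // Keep the first failure visible for diagnostics.
                second.addSuppressed(first);
                throw second;
            }
        }
    }
}
```

Because the wrapper does not care which strategy is "classic" and which is non-sequential, swapping the two constructor arguments gives the other direction of the fallback.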

Many thanks to Hong-Thai for championing the addition of the 
NonSequentialPDFParser capability in TIKA-1201, and many thanks to Timo for 
PDFBox's NonSequentialPDFParser (PDFBOX-1199)!



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (TIKA-1205) Allow PDFParser to fallback to other parser if there is an exception

2013-12-11 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845398#comment-13845398
 ] 

Hong-Thai Nguyen commented on TIKA-1205:


Just a (newbie) question: why limit this to PDFParser only, rather than 
offering it for any parser?
I agree that a fallback is necessary when an exception occurs. But the worst 
case is an infinite loop while parsing a document.

To cover both cases, could we generalize this and handle exceptions and 
timeouts properly in a wrapper?
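
A generic wrapper of the kind asked about here could look something like the sketch below. It uses only java.util.concurrent from the JDK; GuardedParse and callWithTimeout are hypothetical names for illustration and are not part of Tika.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs any parse task on a worker thread and gives up
// after a deadline, so the caller sees either a result, the task's own
// exception, or a TimeoutException.
final class GuardedParse {
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "guarded-parse");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; it stops only if it honours interrupts
            throw e;
        } catch (ExecutionException e) {
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause; // rethrow the task's own exception
            }
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

One caveat with this design: a loop that never checks the interrupt flag keeps running on its worker thread after the timeout, so the caller is unblocked but the CPU is not reclaimed. Fully containing such a parser generally means running it in a separate process.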



[jira] [Commented] (TIKA-1205) Allow PDFParser to fallback to other parser if there is an exception

2013-12-11 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845429#comment-13845429
 ] 

Tim Allison commented on TIKA-1205:
---

Thank you for your feedback!  TIKA-456 is the existing issue for general 
timeout capability.  I agree that it would be great to add.  TIKA-1205 is a 
very narrowly defined improvement for PDFParser.



[jira] [Commented] (TIKA-1121) Socket server text parsing error on large text files

2013-12-11 Thread Mane (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845569#comment-13845569
 ] 

Mane commented on TIKA-1121:


I've tested with the Tika JAX-RS server, and it now works in cases where it 
was previously sending me a server error (I tried PDF files that still return 
Server Error on version 1.4). 

However, it still hangs on a very large HTML file when I use tika-app in 
server mode (socket server). 

 Socket server text parsing error on large text files
 

 Key: TIKA-1121
 URL: https://issues.apache.org/jira/browse/TIKA-1121
 Project: Tika
  Issue Type: Bug
  Components: cli
Affects Versions: 1.4
 Environment: Ubuntu 10.04, 10.10, 12.04.02
Reporter: Dave Meikle
Assignee: Dave Meikle

 As reported on the user list[1], when using the tika-app socket server 
 command with the -t switch to parse text, the process hangs on large text 
 files.
 This occurs on Ubuntu 10.04, 10.10 and 12.04.02.
 [1]http://mail-archives.apache.org/mod_mbox/tika-user/201305.mbox/%3ccagxbzufxsj4h5jwdeux9hhd2fxttq1vsbm7u-vfsyge9vmr...@mail.gmail.com%3E





[jira] [Comment Edited] (TIKA-1121) Socket server text parsing error on large text files

2013-12-11 Thread Mane (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845571#comment-13845571
 ] 

Mane edited comment on TIKA-1121 at 12/11/13 5:43 PM:
--

Also worth mentioning: I have a gibberish.txt file, a 2 MB file of garbage 
text, that the Tika socket server is unable to parse; it hangs (I downloaded 
this file from one of the Tika-related tickets I found, and I no longer have 
the link to it).


was (Author: mane_genius):
Also worth mentioning: I have a gibberish.txt file, a 2 MB file of garbage 
text, that the Tika socket server is unable to parse and return the text for 
(I downloaded this file from one of the Tika-related tickets I found, and I 
no longer have the link to it).



Re: [jira] [Comment Edited] (TIKA-1121) Socket server text parsing error on large text files

2013-12-11 Thread Raymond Wiker
I've been reading through some of the emails referenced, and it looks like the 
problem might be in the code on the client side.

In one of the emails from May 2013, the client-side code tries to write the 
entire file to Tika, and then to read the extracted text back. I had a similar 
problem with some files, and discovered that, for certain files, Tika started 
to write back extracted text before the entire file had been written. At some 
point, a deadlock situation arose where each side was waiting for the other to 
read what had been written to the socket.

I solved this by running the read part on the client side in a separate thread. 
This appears to work fine – I have seen no strange hangs even after feeding 
close to a million files in sizes up to 100MB through a single Tika process.
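
For anyone hitting the same hang, the approach described above can be sketched roughly like this. It is not the actual client code: DuplexSocketClient and exchange are hypothetical names, and the sketch assumes the server treats the end of the input stream as the end of the document and closes the connection once it has sent all extracted text.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical client: reads the response on a separate thread while the
// calling thread is still writing, so neither side blocks on a full socket buffer.
final class DuplexSocketClient {
    static byte[] exchange(String host, int port, byte[] document)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService readerPool = Executors.newSingleThreadExecutor();
        try (Socket socket = new Socket(host, port)) {
            InputStream in = socket.getInputStream();

            // Start draining the response before any of the document is sent.
            Future<byte[]> response = readerPool.submit(() -> {
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                return buffer.toByteArray();
            });

            OutputStream out = socket.getOutputStream();
            out.write(document);
            out.flush();
            socket.shutdownOutput(); // assumption: end of stream marks end of document

            return response.get(); // wait until the server closes its side
        } finally {
            readerPool.shutdownNow();
        }
    }
}
```

A client that writes the whole document first and only then starts reading can stall once the server's output fills the socket buffers in both directions, which matches the deadlock described above.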