[
https://issues.apache.org/jira/browse/TIKA-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383118#comment-14383118
]
Tim Allison edited comment on TIKA-1512 at 3/27/15 1:21 AM:
------------------------------------------------------------
I looked at a handful of docs from govdocs that have this exception. I haven't
yet seen [~Genstr]'s issue, but that seems fairly straightforward to fix. The
rest have no ending double quote.
They seem either to end in \r (which is sneaky with println!), or they appear
to be truncated at 244 characters (5 examples of 244-character truncation
across 2 documents).
For the short ones, a hex editor shows that there is an ending double quote
after the \u000D=\r, but POI is apparently not capturing the ".
For the cases with 244 characters, when you add in HYPERLINK, spaces and a
dquote, the text lengths are 257, a number curiously similar to a rather
familiar number. With a hex editor, you can see that the long hyperlinks just
end, then there is a \u0020 and \u0014 (special text boundary markers, I
assume?), and the regular text starts up again.
For 040044.doc, this is what I get when I print text before the exception:
{noformat}
HYPERLINK "http://www.nib.org\r
{noformat}
For 046839.doc, I get this:
{noformat}
HYPERLINK "http://web23.epnet.com/citation.asp?tb=1&_ug=dbs+0+ln+en%2Dus+sid+A7E7DA92%2D0BCF%2D42B2%2D807F%2D81A01D77748E%40sessionmgr3%2Dsessionmgr4+B360&_us=bs+TX++perceived++And+++inhibitors++And+++mathematics+db+0+ds+TX++perceived++And+++inhibitors++A
{noformat}
At the Tika level, I think we should be more defensive about calling substring.
[~gagravarr], if you'd be able to take a look at the POI level to see if
something is going wrong there, that'd be great!
[~tuxbox], can you tell us if the original link happened to end in \r or if
the link was really long?
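To make "more defensive about calling substring" concrete, here is a minimal sketch of the kind of guard I have in mind. It is not the actual WordParser code: the class name HyperlinkFieldText and the method extractUrl are hypothetical, and the assumption is that the field text looks like the samples above (HYPERLINK, whitespace, an opening double quote, the target, and possibly no closing quote). Every index is checked before substring is called, and a missing closing quote falls back to the end of the text instead of yielding a negative index.

```java
/** Hypothetical helper for illustration only; not part of Tika or POI. */
final class HyperlinkFieldText {
    private static final String KEYWORD = "HYPERLINK";

    private HyperlinkFieldText() {
    }

    /**
     * Returns the link target from field text such as
     * HYPERLINK "http://example.com", or null if none can be recovered.
     * A missing closing quote does not cause an exception.
     */
    static String extractUrl(String fieldText) {
        if (fieldText == null) {
            return null;
        }
        int keyword = fieldText.indexOf(KEYWORD);
        if (keyword < 0) {
            return null;
        }
        int open = fieldText.indexOf('"', keyword + KEYWORD.length());
        if (open < 0) {
            return null; // no opening quote, so nothing to extract
        }
        int start = open + 1;
        int close = fieldText.indexOf('"', start);
        // Missing closing quote (trailing \r or truncated text):
        // take everything up to the end of the field text.
        int end = (close < 0) ? fieldText.length() : close;
        // trim() also drops a trailing \r, as seen in 040044.doc.
        String url = fieldText.substring(start, end).trim();
        return url.isEmpty() ? null : url;
    }
}
```

With this, the 040044.doc text (HYPERLINK "http://www.nib.org\r) would give http://www.nib.org. A truncated target like the one in 046839.doc would be returned as-is, so the caller would still have to decide whether a truncated link is worth emitting.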
> WordParser fails on many Word files
> -----------------------------------
>
> Key: TIKA-1512
> URL: https://issues.apache.org/jira/browse/TIKA-1512
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.5, 1.6, 1.7, 1.8
> Environment: Linux 64bit
> OpenJDK Runtime Environment (IcedTea 2.4.4) (suse-24.13.5-x86_64)
> OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)
> and
> java version "1.6.0"
> Java(TM) SE Runtime Environment
> IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 Linux amd64-64 (JIT enabled, AOT
> enabled)
> Reporter: F Seid
> Assignee: Jukka Zitting
> Attachments: 016723.doc, 040044.doc, 046839.doc, TIKA-1512.doc
>
>
> WordParser fails on some Word files. A negative value is passed to substring.