[ 
https://issues.apache.org/jira/browse/TIKA-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383118#comment-14383118
 ] 

Tim Allison edited comment on TIKA-1512 at 3/27/15 1:21 AM:
------------------------------------------------------------

I looked at a handful of docs from govdocs that have this exception.  I haven't 
yet seen [~Genstr]'s issue, but that seems fairly straightforward to fix.  The 
rest have no ending double quote.

They seem either to end in \r (which is easy to miss when debugging with 
println!), or they appear to be truncated at 244 characters (5 examples of 
244-character truncation across 2 documents). 
For the short ones, a hex editor shows that there is an ending double quote 
after the \u000D (\r), but POI apparently is not capturing the ".
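
To make a trailing \r or other control character visible while debugging, 
something like the sketch below could be used instead of a bare println. The 
names ControlCharEscaper and escape are hypothetical and are not part of Tika 
or POI; this is only an illustration of the debugging technique.

```java
/** Hypothetical debugging helper; not part of Tika or POI. */
final class ControlCharEscaper {
    private ControlCharEscaper() {}

    /**
     * Returns a copy of s in which control characters (below U+0020, plus
     * U+007F) are written as visible escapes such as a backslash followed
     * by r, n, t, or a four-digit hex code unit.
     */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\r': sb.append("\\r"); break;
                case '\n': sb.append("\\n"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20 || c == 0x7F) {
                        // Any other control character: print its hex code unit.
                        sb.append(String.format("\\u%04X", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }
}
```

Printing ControlCharEscaper.escape(text) rather than text keeps a carriage 
return from silently moving the cursor back to the start of the line.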

For the cases with 244 characters: when you add in HYPERLINK, spaces and a 
dquote, the text lengths come to 257, a number curiously close to a rather 
familiar one (presumably 256).  With a hex editor, you can see that the long 
hyperlinks just end, then there is a \u0020 and a \u0014 (special text 
boundary markers, I assume?), and the regular text starts up again.

For 040044.doc, this is what I get when I print text before the exception:
{noformat}
HYPERLINK "http://www.nib.org\r
{noformat}

For 046839.doc, I get this:
{noformat}
HYPERLINK 
"http://web23.epnet.com/citation.asp?tb=1&_ug=dbs+0+ln+en%2Dus+sid+A7E7DA92%2D0BCF%2D42B2%2D807F%2D81A01D77748E%40sessionmgr3%2Dsessionmgr4+B360&_us=bs+TX++perceived++And+++inhibitors++And+++mathematics+db+0+ds+TX++perceived++And+++inhibitors++A
 
{noformat}

At the Tika level, I think we should be more defensive about calling substring. 
 [~gagravarr], if you'd be able to take a look at the POI level to see if 
something is going wrong there, that'd be great!
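
As a rough illustration of what "more defensive" could mean, here is a minimal 
sketch. It assumes the exception comes from an indexOf result of -1 (no 
closing quote) ending up in a substring call. HyperlinkFieldParser and 
extractTarget are hypothetical names, and this is not the actual Tika code.

```java
/** Hypothetical sketch; not the actual Tika WordParser code. */
final class HyperlinkFieldParser {
    private static final String PREFIX = "HYPERLINK";

    private HyperlinkFieldParser() {}

    /**
     * Extracts the link target from field text such as
     * HYPERLINK "http://example.org". Returns null when no target can be
     * found, and tolerates a missing closing double quote.
     */
    static String extractTarget(String fieldCode) {
        if (fieldCode == null) {
            return null;
        }
        // trim() drops leading/trailing chars <= U+0020, including \r and U+0014.
        String text = fieldCode.trim();
        if (!text.startsWith(PREFIX)) {
            return null;
        }
        int open = text.indexOf('"', PREFIX.length());
        if (open < 0) {
            return null; // no opening quote at all
        }
        int close = text.indexOf('"', open + 1);
        // Check for -1 before calling substring: with no closing quote,
        // fall back to everything after the opening quote.
        String url = (close < 0)
                ? text.substring(open + 1)
                : text.substring(open + 1, close);
        url = url.trim();
        return url.isEmpty() ? null : url;
    }
}
```

With this approach the text from 040044.doc would yield http://www.nib.org, 
and a truncated link like the one in 046839.doc would come back as the partial 
URL rather than throwing.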

[~tuxbox],  can you tell us if the original link happened to end in \r or if 
the link was really long?


> WordParser fails on many Word files
> -----------------------------------
>
>                 Key: TIKA-1512
>                 URL: https://issues.apache.org/jira/browse/TIKA-1512
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.5, 1.6, 1.7, 1.8
>         Environment: Linux 64bit
> OpenJDK Runtime Environment (IcedTea 2.4.4) (suse-24.13.5-x86_64)
> OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)
> and
> java version "1.6.0"
> Java(TM) SE Runtime Environment
> IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 Linux amd64-64 (JIT enabled, AOT 
> enabled)
>            Reporter: F Seid
>            Assignee: Jukka Zitting
>         Attachments: 016723.doc, 040044.doc, 046839.doc, TIKA-1512.doc
>
>
> WordParser fails on some Word files. A negative value is passed to substring.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
