[ 
https://issues.apache.org/jira/browse/TIKA-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383118#comment-14383118
 ] 

Tim Allison commented on TIKA-1512:
-----------------------------------

I looked at a handful of docs from govdocs that have this exception.  I haven't 
yet seen [~Genstr]'s issue, but that seems fairly straightforward to fix.  The 
rest have no ending double quote.

They either end in \r (which is sneaky with println!) or appear to be 
truncated at 244 characters (5 examples of 244-character truncation across 
2 documents).

For 040044.doc, this is what I get when I print the text before the exception:
{noformat}
HYPERLINK "http://www.nib.org\r
{noformat}

For 046839.doc, I get this:
{noformat}
HYPERLINK 
"http://web23.epnet.com/citation.asp?tb=1&_ug=dbs+0+ln+en%2Dus+sid+A7E7DA92%2D0BCF%2D42B2%2D807F%2D81A01D77748E%40sessionmgr3%2Dsessionmgr4+B360&_us=bs+TX++perceived++And+++inhibitors++And+++mathematics+db+0+ds+TX++perceived++And+++inhibitors++A
 
{noformat}

At the Tika level, I think we should be more defensive about calling substring. 
[~gagravarr], if you'd be able to take a look at the POI level to see if 
something is going wrong there, that'd be great!
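
To make "more defensive" concrete, here is a minimal sketch of the kind of guard I have in mind. It is not the actual WordParser code: the class name HyperlinkFieldSketch and the method extractUrl are hypothetical, and it assumes the field text arrives as a plain String starting with HYPERLINK. The idea is only to check every indexOf result before handing it to substring, and to fall back to the end of the string when the closing quote is missing.

```java
/** Hypothetical helper for illustration; not Tika's actual implementation. */
final class HyperlinkFieldSketch {

    private static final String PREFIX = "HYPERLINK";

    /**
     * Returns the URL from field text such as: HYPERLINK "http://example.org"
     * Tolerates a missing closing quote and a trailing \r.
     * Returns null when no usable URL can be found.
     */
    static String extractUrl(String fieldText) {
        if (fieldText == null) {
            return null;
        }
        // trim() drops leading/trailing whitespace, including \r and \n
        String text = fieldText.trim();
        if (!text.startsWith(PREFIX)) {
            return null;
        }
        int start = text.indexOf('"', PREFIX.length());
        if (start < 0) {
            return null; // no opening quote, so nothing to extract
        }
        start++; // step past the opening quote
        int end = text.indexOf('"', start);
        if (end < 0) {
            // No closing quote (truncated or \r-terminated field):
            // take the rest instead of passing -1 to substring.
            end = text.length();
        }
        String url = text.substring(start, end).trim();
        return url.isEmpty() ? null : url;
    }
}
```

With a guard like this, the 040044.doc text above would yield http://www.nib.org, and a truncated link would come back as whatever survived the truncation instead of throwing. Whether a truncated URL should be emitted at all or dropped is a separate decision.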

[~tuxbox],  can you tell us if the original link happened to end in \r or if 
the link was really long?

> WordParser fails on many Word files
> -----------------------------------
>
>                 Key: TIKA-1512
>                 URL: https://issues.apache.org/jira/browse/TIKA-1512
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.5, 1.6, 1.7, 1.8
>         Environment: Linux 64bit
> OpenJDK Runtime Environment (IcedTea 2.4.4) (suse-24.13.5-x86_64)
> OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)
> and
> java version "1.6.0"
> Java(TM) SE Runtime Environment
> IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 Linux amd64-64 (JIT enabled, AOT 
> enabled)
>            Reporter: F Seid
>            Assignee: Jukka Zitting
>         Attachments: 016723.doc, TIKA-1512.doc
>
>
> WordParser fails on some Word files. A negative value is passed to substring.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
