[ 
https://issues.apache.org/jira/browse/TIKA-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997216#comment-15997216
 ] 

Pascal Essiembre commented on TIKA-2352:
----------------------------------------

I had time to look further at one of the files in the lists: 
"govdocs1\318\318891.wp".  It puzzles me, and I feel I must be missing something 
obvious.  LibreOffice opens it fine.

It is read just fine until the last page, where there is an isolated "1" in the 
middle of the page.  The sequence of interest is "31 02 02 DA D0 04 D0", which 
can be broken down as follows:

31 - The number "1"
02 - Control character indicating to print a page number
02 - Control character indicating to print a page number
DA - Variable-length function (218) for a "box group"
D0 - Subfunction code 208.  INVALID, possible values range from 0 to 6.
04 D0 - Function length 53252 (0xD004; two bytes, low byte first).  INVALID, 
greater than the number of bytes remaining.
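
The breakdown above can be reproduced with a short sketch.  The header layout 
(function code, subfunction, 16-bit length with the low byte first) and the 
valid subfunction range 0 to 6 are taken from this comment, not from a verified 
WordPerfect specification, and the function names are hypothetical:

```python
import struct

# Byte sequence quoted in the comment.
SEQUENCE = bytes.fromhex("31 02 02 DA D0 04 D0")

# Assumption from the comment: subfunctions of function 0xDA range from 0 to 6.
MAX_BOX_GROUP_SUBFUNCTION = 6


def decode_variable_length_header(data, offset):
    """Read <function code> <subfunction> <16-bit little-endian length>."""
    function_code = data[offset]
    subfunction = data[offset + 1]
    (length,) = struct.unpack_from("<H", data, offset + 2)
    return function_code, subfunction, length


def header_is_plausible(subfunction, length, bytes_left):
    """Apply the two sanity checks described in the comment."""
    return subfunction <= MAX_BOX_GROUP_SUBFUNCTION and length <= bytes_left


# The 0xDA byte sits at offset 3, after "31 02 02".
function_code, subfunction, length = decode_variable_length_header(SEQUENCE, 3)
bytes_left = len(SEQUENCE) - (3 + 4)  # bytes after the 4-byte header in this excerpt

print(f"function    = 0x{function_code:02X} ({function_code})")
print(f"subfunction = 0x{subfunction:02X} ({subfunction})")
print(f"length      = {length}")
print(f"plausible   = {header_is_plausible(subfunction, length, bytes_left)}")
```

In the real file, bytes_left would be the distance from the end of the header 
to the end of the file rather than to the end of this seven-byte excerpt.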

So I do not know why this invalid function code is there, or how LibreOffice 
manages to interpret it.  The 0x02 bytes may also be throwing things off, since 
this is the only place they appear in the document and parsing goes wrong right 
after them.
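
To double-check where a given byte occurs in a file, a minimal sketch like the 
following could be used.  The helper name is hypothetical, and the sample input 
is only the seven-byte sequence quoted in this comment, not the full file:

```python
def find_byte_offsets(data, value):
    """Return every offset at which the byte `value` occurs in `data`."""
    return [offset for offset, byte in enumerate(data) if byte == value]


# Sample input: only the excerpt from the comment, not the real document.
excerpt = bytes.fromhex("31 02 02 DA D0 04 D0")
print(find_byte_offsets(excerpt, 0x02))
```

For a real file, the bytes could be loaded with open(path, "rb").read() and 
passed to the same helper.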

In other contexts (non-WP documents), the ASCII standard defines 0x02 as "STX 
-> Start of Text -> First character of message text", which may be used to 
terminate the message heading.

Since there is a page number in the middle, could it be that the page/document 
ends there and a new one is appended?  If so, I am not sure how 0x02 should 
be treated in relation to that.

> Incorrect EOF exception in WordPerfect parser
> ---------------------------------------------
>
>                 Key: TIKA-2352
>                 URL: https://issues.apache.org/jira/browse/TIKA-2352
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Trivial
>             Fix For: 2.0, 1.15
>
>         Attachments: 462321.wp, reports.zip
>
>
> We have a few EOF exceptions in WordPerfect files that are likely not 
> truncated.  The example I'll attach shortly is able to be opened without 
> complaint by LibreOffice.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
