Robert Fromholz created TIKA-4204:
-------------------------------------

             Summary: ChmExtractor unable to decompress file
                 Key: TIKA-4204
                 URL: https://issues.apache.org/jira/browse/TIKA-4204
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 3.0.0-BETA, 2.9.1
            Reporter: Robert Fromholz


ChmExtractor fails with error: "TikaException: can't copy beyond array length" 
when calling extractChmEntry on any non-empty entry. 

Upon inspection this turns out to be caused by lzxBlockOffset being incorrectly 
set.

This is caused by the method ChmExtractor#getIndexOfContent returning the wrong 
entry.

This is because ChmCommons#indexOf(List, String) returns the first entry with a 
name containing the string "Content". The file I am trying to parse contains a 
file with the name Content.css, which is the entry returned by #indexOf(...), 
instead of the actual content entry.

To fix the issue, ChmCommons#indexOf(...) should be more strict in how it 
detects the content entry.

According to [http://www.russotto.net/chm/chmformat.html], the name of the 
content entry always starts with "::DataSpace/Storage/", so #indexOf(...) 
could require that prefix in order to find the correct entry.
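
A minimal sketch of the difference, assuming entries are represented only by 
their names. The class ContentEntryLookup, both methods, and the sample entry 
names are hypothetical and for illustration only; this is not Tika's actual 
code, and the prefix is the one quoted above from the format description.

```java
import java.util.List;

// Hypothetical stand-in for the lookup described in this report; not Tika's code.
public class ContentEntryLookup {

    // Assumed prefix, as quoted from the format description linked above.
    public static final String STORAGE_PREFIX = "::DataSpace/Storage/";

    // Loose lookup: returns the first name that merely contains the pattern.
    public static int indexOfLoose(List<String> names, String pattern) {
        for (int i = 0; i < names.size(); i++) {
            if (names.get(i).contains(pattern)) {
                return i;
            }
        }
        return -1;
    }

    // Stricter lookup: the name must also start with the storage prefix.
    public static int indexOfStrict(List<String> names, String pattern) {
        for (int i = 0; i < names.size(); i++) {
            String name = names.get(i);
            if (name.startsWith(STORAGE_PREFIX) && name.contains(pattern)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Illustrative names: a stylesheet listed before the real content entry.
        List<String> names = List.of(
                "/Content.css",
                "::DataSpace/Storage/MSCompressed/Content");
        System.out.println(indexOfLoose(names, "Content"));  // index of Content.css
        System.out.println(indexOfStrict(names, "Content")); // index of the storage entry
    }
}
```

With these sample names the loose lookup picks the stylesheet, while the 
prefix check skips it and selects the entry under "::DataSpace/Storage/".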



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
