Robert Fromholz created TIKA-4204:
-------------------------------------
Summary: ChmExtractor unable to decompress file
Key: TIKA-4204
URL: https://issues.apache.org/jira/browse/TIKA-4204
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 3.0.0-BETA, 2.9.1
Reporter: Robert Fromholz
ChmExtractor fails with error: "TikaException: can't copy beyond array length"
when calling extractChmEntry on any non-empty entry.
Upon inspection this turns out to be caused by lzxBlockOffset being incorrectly
set.
This is caused by the method ChmExtractor#getIndexOfContent returning the wrong
entry.
This is because ChmCommons#indexOf(List, String) returns the first entry with a
name containing the string "Content". The file I am trying to parse contains a
file with the name Content.css, which is the entry returned by #indexOf(...),
instead of the actual content entry.
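The failure mode can be illustrated in isolation. The following is a minimal sketch, not Tika's actual code: the class LooseLookupDemo and the method looseIndexOf are hypothetical, entries are modelled as plain strings instead of directory listing entries, and the two entry names are assumed for illustration.

```java
import java.util.List;

public class LooseLookupDemo {
    // Hypothetical stand-in for a substring-based lookup: returns the index of
    // the first name that contains the pattern, or -1 if none does.
    static int looseIndexOf(List<String> names, String pattern) {
        for (int i = 0; i < names.size(); i++) {
            if (names.get(i).contains(pattern)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Assumed entry names: an ordinary stylesheet listed before the
        // compressed content entry.
        List<String> names = List.of(
                "/Content.css",
                "::DataSpace/Storage/MSCompressed/Content");
        // Prints 0: the stylesheet matches before the real content entry.
        System.out.println(looseIndexOf(names, "Content"));
    }
}
```

With a lookup of this shape, any file whose name happens to contain "Content" and is listed earlier shadows the real content entry.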
To fix the issue, ChmCommons#indexOf(...) should be more strict in how it
detects the content entry.
According to [http://www.russotto.net/chm/chmformat.html], the name of the
content entry always starts with "::DataSpace/Storage/", which could be
used to restrict the lookup to the correct entry.
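One way the stricter lookup could look is sketched below. This is an illustration under assumptions, not a patch against Tika: the class StrictLookupDemo, the method strictIndexOf, and the constant STORAGE_PREFIX are hypothetical names, and entries are again modelled as plain strings.

```java
import java.util.List;

public class StrictLookupDemo {
    // Prefix taken from the format description cited in this report.
    static final String STORAGE_PREFIX = "::DataSpace/Storage/";

    // Hypothetical stricter lookup: a name only qualifies if it starts with
    // the storage prefix and also contains the pattern. Returns -1 if no
    // name qualifies.
    static int strictIndexOf(List<String> names, String pattern) {
        for (int i = 0; i < names.size(); i++) {
            String name = names.get(i);
            if (name.startsWith(STORAGE_PREFIX) && name.contains(pattern)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Assumed entry names, for illustration only.
        List<String> names = List.of(
                "/Content.css",
                "::DataSpace/Storage/MSCompressed/Content");
        // Prints 1: the stylesheet is skipped because it lacks the prefix.
        System.out.println(strictIndexOf(names, "Content"));
    }
}
```

Checking the prefix first keeps the existing "Content" match but rules out ordinary files in the archive, which cannot be named with that prefix if the cited description holds.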