[ https://issues.apache.org/jira/browse/TIKA-4204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-4204.
-------------------------------
Fix Version/s: 2.9.2, 3.0.0
Resolution: Fixed
> ChmExtractor unable to decompress file
> --------------------------------------
>
> Key: TIKA-4204
> URL: https://issues.apache.org/jira/browse/TIKA-4204
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.1, 3.0.0-BETA
> Environment: The file I am trying to parse is attached; the entry
> being found as the content file is "/CSS/ABBContent.css"
> Reporter: Robert Fromholz
> Assignee: Tim Allison
> Priority: Blocker
> Fix For: 2.9.2, 3.0.0
>
> Attachments: 3HAC050917_TRM_RAPID_RW_6-en.chm
>
>
> ChmExtractor fails with error: "TikaException: can't copy beyond array
> length" when calling extractChmEntry on any non-empty entry.
> Upon inspection, this turns out to be caused by lzxBlockOffset being
> set incorrectly, which in turn is caused by the method
> ChmExtractor#getIndexOfContent returning the wrong entry.
> This is because ChmCommons#indexOf(List, String) returns the first entry with
> a name containing the string "Content". The file I am trying to parse
> contains a file with the name Content.css, which is the entry returned by
> #indexOf(...), instead of the actual content entry.
> To fix the issue, ChmCommons#indexOf(...) should be more strict in how it
> detects the content entry.
> According to: [http://www.russotto.net/chm/chmformat.html], the name of the
> content entry will always start with "::DataSpace/Storage/", which could be
> used to restrict it to find the correct entry.
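
The difference between the loose match described above and a prefix-restricted
match can be sketched as follows. This is only an illustration under stated
assumptions, not the change that was committed for this issue: the class and
method names are hypothetical, plain strings stand in for Tika's directory
listing entries, and the second sample entry name is an assumed example of a
name under the "::DataSpace/Storage/" prefix.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the lookup discussed in the report; not Tika code.
class ContentEntryLookup {
    // Prefix quoted in the issue description.
    static final String STORAGE_PREFIX = "::DataSpace/Storage/";

    // Loose match as described in the report: first name containing the pattern.
    static int looseIndexOf(List<String> names, String pattern) {
        for (int i = 0; i < names.size(); i++) {
            if (names.get(i).contains(pattern)) {
                return i;
            }
        }
        return -1;
    }

    // Stricter match (illustrative): the name must also start with the prefix.
    static int strictIndexOf(List<String> names, String pattern) {
        for (int i = 0; i < names.size(); i++) {
            String name = names.get(i);
            if (name.startsWith(STORAGE_PREFIX) && name.contains(pattern)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // First name is taken from the report; the second is an assumed example.
        List<String> names = Arrays.asList(
                "/CSS/ABBContent.css",
                "::DataSpace/Storage/MSCompressed/Content");
        System.out.println(looseIndexOf(names, "Content"));  // prints 0
        System.out.println(strictIndexOf(names, "Content")); // prints 1
    }
}
```

With the loose match, the stylesheet entry is picked up first; with the prefix
check, the lookup skips it and lands on the entry under the storage prefix.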