[ 
https://issues.apache.org/jira/browse/TIKA-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bin Hawking updated TIKA-1430:
------------------------------
    Description: 
Tika extracts partially wrong text from CHM files, including the test files in 
tika-parsers/src/test/resources/test-documents/testChm*.chm

I tried versions 1.6 and 1.5; both show the same problem. I wonder why no one 
has reported it before.

I checked the source code, and the cause is clear:

When Tika decompresses LZX data, the first block is decoded correctly, but for 
the second and later blocks Tika uses the previous block's content as the 
compressed data. See org.apache.tika.parser.chm.lzx.ChmLzxBlock:

"""
                if (prevBlock != null
                        && prevBlock.getState().getBlockLength() > prevBlock
                                .getState().getBlockRemaining())
                    setChmSection(new ChmSection(prevBlock.getContent()));
//                   NOTE: the dataSegment to be decompressed is not kept
                else
                    setChmSection(new ChmSection(dataSegment));
"""

My fix:
1.      Add a prevcontent member variable to the ChmSection class, so that it 
keeps both dataSegment and prevBlock.getContent().
2.      In ChmLzxBlock.extractContent(), when invoking decompressXXXXBlock(), 
pass ChmSection.prevcontent if it exists, instead of ChmSection.data.
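
A minimal sketch of the two steps above, not the actual patch. The names 
ChmSectionSketch, forBlock and bufferForDecompress are hypothetical and are not 
part of Tika's API; the blockLength/blockRemaining condition from the quoted 
snippet is reduced here to a simple null check on the previous content.

```java
import java.util.Arrays;

/**
 * Hypothetical stand-in for ChmSection (not Tika's actual class): it keeps the
 * compressed segment and, when available, the previous block's content.
 */
final class ChmSectionSketch {
    private final byte[] data;        // compressed dataSegment, always kept
    private final byte[] prevcontent; // previous block's content, or null

    ChmSectionSketch(byte[] data, byte[] prevcontent) {
        this.data = Arrays.copyOf(data, data.length);
        this.prevcontent = prevcontent == null
                ? null
                : Arrays.copyOf(prevcontent, prevcontent.length);
    }

    byte[] getData() {
        return data;
    }

    byte[] getPrevcontent() {
        return prevcontent;
    }

    /** Step 2: prevcontent if it exists, otherwise data. */
    byte[] bufferForDecompress() {
        return prevcontent != null ? prevcontent : data;
    }

    /** Step 1: both buffers go into the section; dataSegment is not dropped. */
    static ChmSectionSketch forBlock(byte[] dataSegment, byte[] prevBlockContent) {
        return new ChmSectionSketch(dataSegment, prevBlockContent);
    }

    public static void main(String[] args) {
        byte[] segment = {16, 32, 48}; // made-up bytes, for illustration only
        byte[] previous = {97, 98};    // made-up previous block content

        ChmSectionSketch first = forBlock(segment, null);
        ChmSectionSketch later = forBlock(segment, previous);

        System.out.println(Arrays.toString(first.bufferForDecompress()));
        System.out.println(Arrays.toString(later.bufferForDecompress()));
        // The compressed segment is still reachable for the later block.
        System.out.println(Arrays.toString(later.getData()));
    }
}
```

The point of the sketch is only that a later block's section still carries its 
own dataSegment next to the previous content, which the quoted snippet loses.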

With this change, I tried several CHM files and got the correct text.

BTW, the unit tests should be stricter: with this bug, a small amount of text 
(the first block) is still decompressed correctly.
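
One possible shape for a stricter check, as a sketch only: compare the 
extracted text against phrases taken from the beginning, middle and end of the 
document, so that a correct first block alone cannot satisfy it. The class 
ExtractionCheckSketch, the helper missingPhrases and the sample strings are 
hypothetical; no real test document content is assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ExtractionCheckSketch {

    /** Returns the expected phrases that do not occur in the extracted text. */
    static List<String> missingPhrases(String extractedText, List<String> expectedPhrases) {
        List<String> missing = new ArrayList<>();
        for (String phrase : expectedPhrases) {
            if (!extractedText.contains(phrase)) {
                missing.add(phrase);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up stand-in for parser output where only the start is intact.
        String extracted = "Chapter one introduction followed by damaged text";
        // Made-up phrases; a real test would take them from early and late pages.
        List<String> expected = Arrays.asList("Chapter one", "closing appendix");

        // A non-empty result means some expected text was not extracted.
        System.out.println(missingPhrases(extracted, expected));
    }
}
```

A test built this way fails whenever any expected phrase is missing, including 
phrases that come from later blocks.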


  was:
Tika extracts partially wrong text from CHM files, including the test files in 
tika-parsers/src/test/resources/test-documents/testChm*.chm

I tried versions 1.6 and 1.5; both show the same problem. I wonder why no one 
has reported it before.

I checked the source code, and the cause is clear:

When Tika decompresses LZX data, the first block is decoded correctly, but for 
the second and later blocks Tika uses the previous block's content as the 
compressed data. See org.apache.tika.parser.chm.lzx.ChmLzxBlock:

"""
                if (prevBlock != null
                        && prevBlock.getState().getBlockLength() > prevBlock
                                .getState().getBlockRemaining())
                    setChmSection(new ChmSection(prevBlock.getContent()));
//                   NOTE: the dataSegment to be decompressed is not kept
                else
                    setChmSection(new ChmSection(dataSegment));
"""

My fix:
1.      Add a prevcontent member variable to the ChmSection class, so that it 
keeps both dataSegment and prevBlock.getContent().
2.      In ChmLzxBlock.extractContent(), when invoking 
decompressVerbatimBlock(), pass ChmSection.prevcontent if it exists, instead of 
ChmSection.data.

With this change, I tried several CHM files and got the correct text.

BTW, the unit tests should be stricter: with this bug, a small amount of text 
(the first block) is still decompressed correctly.



> CHM parser gets faulty text (fix found)
> ---------------------------------------
>
>                 Key: TIKA-1430
>                 URL: https://issues.apache.org/jira/browse/TIKA-1430
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.5, 1.6
>         Environment: Windows 7; JDK 7 or 8
>            Reporter: Bin Hawking
>            Priority: Critical
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
