Hi everyone,

I would like to talk about a ZIP compression method that Tika cannot parse
yet, but could.  The method is called 'implode', and there is a patch for
Apache Commons Compress that handles it, which Tika does not yet take
advantage of.

There is a unit test in Tika, in ZipParserTest.java, which demonstrates that
only the names of the entries get extracted when the zip is compressed with
the 'implode' method.  The file moby.zip in the test data is compressed this
way:



    /**
     * Test case for the ability of the ZIP parser to extract the name of
     * a ZIP entry even if the content of the entry is unreadable due to an
     * unsupported compression method.
     *
     * @see <a href="https://issues.apache.org/jira/browse/TIKA-346">TIKA-346</a>
     */
    @Test
    public void testUnsupportedZipCompressionMethod() throws Exception {
        String content = new Tika().parseToString(
                ZipParserTest.class.getResourceAsStream(
                        "/test-documents/moby.zip"));
        assertTrue(content.contains("README"));
    }
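
To check which compression method a given entry actually uses, one can read
the method field of the ZIP local file header.  Below is a minimal sketch of
that, not anything taken from Tika or Commons Compress: the class name
ZipMethodSniffer and its methods are made up for illustration, and the ten
header bytes in main() are hand-written rather than read from moby.zip.  In
the ZIP format, method id 6 means "imploded", 8 means "deflated" and 0 means
"stored".

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, for illustration only.
public class ZipMethodSniffer {
    // Local file header signature "PK\3\4", read as a little-endian int.
    private static final int LOCAL_HEADER_SIGNATURE = 0x04034b50;

    /** Returns the compression method id stored in a ZIP local file header. */
    public static int compressionMethod(byte[] header) {
        if (header.length < 10) {
            throw new IllegalArgumentException("need at least 10 header bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt(0) != LOCAL_HEADER_SIGNATURE) {
            throw new IllegalArgumentException("not a ZIP local file header");
        }
        // Bytes 4-5: version needed, 6-7: general purpose flags, 8-9: method.
        return buf.getShort(8) & 0xFFFF;
    }

    /** Maps a few well-known method ids to readable names. */
    public static String describe(int method) {
        switch (method) {
            case 0: return "stored";
            case 6: return "imploded";
            case 8: return "deflated";
            default: return "other (" + method + ")";
        }
    }

    public static void main(String[] args) {
        // Hand-written start of a local file header with method id 6.
        byte[] header = {0x50, 0x4b, 0x03, 0x04, 0x0a, 0x00, 0x00, 0x00, 0x06, 0x00};
        System.out.println(describe(compressionMethod(header)));
    }
}
```

Running main() on the sample bytes prints "imploded".  Against a real file,
the first ten bytes of the archive could be passed in instead, assuming the
archive starts with a local file header.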


Implode is an old proprietary compression method that PKZIP used back in
the 1980s.

It uses Shannon-Fano coding, which has fallen out of favor because Huffman
coding always produces an optimal prefix code, while Shannon-Fano does not.
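
For anyone not familiar with Shannon-Fano coding, here is a minimal sketch of
the textbook version: sort the symbols by frequency, split them into two
groups with totals as close as possible, give one group a 0 bit and the other
a 1 bit, and recurse.  This is only meant to illustrate the idea; it is not
the PKZIP implode format or the Commons Compress code, and the class name
ShannonFano and the sample frequencies are made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Textbook Shannon-Fano coding, for illustration only.
public class ShannonFano {

    /** Builds a prefix code from symbol frequencies. */
    public static Map<Character, String> buildCodes(Map<Character, Integer> freq) {
        List<Map.Entry<Character, Integer>> symbols = new ArrayList<>(freq.entrySet());
        // Most frequent first; ties are broken by symbol so the result is deterministic.
        symbols.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : Character.compare(a.getKey(), b.getKey());
        });
        Map<Character, String> codes = new LinkedHashMap<>();
        if (symbols.size() == 1) {
            // A single symbol still needs one bit.
            codes.put(symbols.get(0).getKey(), "0");
            return codes;
        }
        split(symbols, 0, symbols.size(), "", codes);
        return codes;
    }

    // Assigns codes to symbols[from, to), all of which share the given prefix.
    private static void split(List<Map.Entry<Character, Integer>> symbols,
                              int from, int to, String prefix,
                              Map<Character, String> codes) {
        if (to - from == 0) {
            return;
        }
        if (to - from == 1) {
            codes.put(symbols.get(from).getKey(), prefix);
            return;
        }
        long total = 0;
        for (int i = from; i < to; i++) {
            total += symbols.get(i).getValue();
        }
        // Find the cut that makes the two halves' totals as equal as possible.
        long left = 0;
        long bestDiff = Long.MAX_VALUE;
        int cut = from + 1;
        for (int i = from; i < to - 1; i++) {
            left += symbols.get(i).getValue();
            long diff = Math.abs(total - 2 * left);
            if (diff < bestDiff) {
                bestDiff = diff;
                cut = i + 1;
            }
        }
        split(symbols, from, cut, prefix + "0", codes);
        split(symbols, cut, to, prefix + "1", codes);
    }

    public static void main(String[] args) {
        Map<Character, Integer> freq = new LinkedHashMap<>();
        freq.put('A', 15);
        freq.put('B', 7);
        freq.put('C', 6);
        freq.put('D', 6);
        freq.put('E', 5);
        System.out.println(buildCodes(freq));
    }
}
```

With these sample frequencies the sketch gives A=00, B=01, C=10, D=110 and
E=111, which is 89 bits in total for the 39 symbols.  A Huffman code for the
same frequencies gives A one bit and the other four symbols three bits each,
for 87 bits, which is the kind of difference I mean above.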

The point being: Tika 1.5 uses Apache Commons Compress 1.5.  According to
the Commons Compress JIRA ticket below, Commons Compress can handle this
compression method in version 1.7 or later.  I was wondering: if I wrote a
patch for this, could I contribute it to Tika, and is this worth opening
as an issue?




https://issues.apache.org/jira/browse/COMPRESS-115

http://en.wikipedia.org/wiki/Shannon%E2%80%93Fano_coding

