Suman Moorthy created TIKA-2714:
-----------------------------------

             Summary: Tika Parse Errors for certain attachments
                 Key: TIKA-2714
                 URL: https://issues.apache.org/jira/browse/TIKA-2714
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.9
            Reporter: Suman Moorthy


Tika fails to parse certain attachments that our customers send to our 
application.

We received a sample RAR file from a customer that fails to parse. It contains 
only simple PDF files, and we were able to reproduce the issue with it.

However, if we create a new RAR file from the same contents (using WinRAR) 
and parse it, parsing succeeds. 

The customer's RAR file is neither encrypted nor corrupted. We are not sure why 
their RAR file fails to parse while a new RAR file with the same contents succeeds.
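
One unconfirmed possibility is that the two archives were written in different 
RAR format versions. A minimal sketch for comparing them, assuming the only 
thing checked is the documented signature at the start of the file (7 bytes 
for RAR 4.x and earlier, 8 bytes for RAR 5.0). The class and method names below 
are hypothetical and not part of Tika or junrar:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper (not part of Tika or junrar): reports which RAR
// signature, if any, a file starts with.
class RarSignatureCheck {

    // RAR 1.5 to 4.x signature: "Rar!" 0x1A 0x07 0x00
    private static final byte[] RAR4 = {0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x00};
    // RAR 5.0 signature: "Rar!" 0x1A 0x07 0x01 0x00
    private static final byte[] RAR5 = {0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x01, 0x00};

    // Classifies the leading bytes of a file.
    static String describe(byte[] head) {
        if (startsWith(head, RAR5)) return "RAR 5.0";
        if (startsWith(head, RAR4)) return "RAR 4.x or earlier";
        // Self-extracting archives carry the signature later in the file,
        // so this result does not prove the file is not a RAR archive.
        return "no RAR signature at offset 0";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Reads up to 8 bytes from the file and classifies them.
    static String describe(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[RAR5.length];
            int n = 0;
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) break; // file shorter than the signature
                n += r;
            }
            return describe(Arrays.copyOf(head, n));
        }
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            System.out.println(arg + ": " + describe(Paths.get(arg)));
        }
    }
}
```

Running it against both the customer's file and the re-created file would show 
whether they differ in format version. It says nothing about whether the 
headers beyond the signature are well-formed.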

Can you please provide a solution or feedback on this problem?

 

Below is the exception thrown when we try to parse the rar file attachment from 
our customer:

 

Aug 02, 2018 5:04:09 AM com.github.junrar.Archive setFile
WARNING: exception in archive constructor maybe file is encrypted or currupt
com.github.junrar.exception.RarException: badRarArchive
     at com.github.junrar.Archive.readHeaders(Archive.java:250)
     at com.github.junrar.Archive.setFile(Archive.java:136)
     at com.github.junrar.Archive.setVolume(Archive.java:581)
     at com.github.junrar.Archive.<init>(Archive.java:108)
     at com.github.junrar.Archive.<init>(Archive.java:113)
     at org.apache.tika.parser.pkg.RarParser.parse(RarParser.java:72)
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
     at com.actiance.platform.sfab.cis.etl.documentProcessor.internal.DocumentProcessorImpl.getExtractedContent(DocumentProcessorImpl.java:160)
     at test.TikaParserAPIExample.main(TikaParserAPIExample.java:31)

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pkg.RarParser@1372ed45
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
05:04:09.488 [main] DEBUG com.actiance.platform.commons.spi.FileReaderUtils - Deleted Temp File - 0a44423c-6fad-47e6-943b-7b56178b0b7f.tmp
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
     at com.actiance.platform.sfab.cis.etl.documentProcessor.internal.DocumentProcessorImpl.getExtractedContent(DocumentProcessorImpl.java:160)
     at test.TikaParserAPIExample.main(TikaParserAPIExample.java:31)
Caused by: java.lang.NullPointerException: mainheader is null
     at com.github.junrar.Archive.isEncrypted(Archive.java:206)
     at org.apache.tika.parser.pkg.RarParser.parse(RarParser.java:74)
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
     ... 4 more
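
In the trace above, the root cause is a NullPointerException whose top frame is 
inside com.github.junrar. As a possible application-side stopgap, the sketch 
below recognizes that pattern in a caught exception so the attachment could be 
flagged as an unreadable RAR rather than treated as an unexpected failure. It 
uses only the JDK. The class and method names are hypothetical, and matching on 
the package prefix is an assumption based solely on this one trace:

```java
// Hypothetical triage helper (not part of Tika or junrar).
class ParseFailureTriage {

    // Walks the cause chain down to the innermost exception.
    static Throwable rootCause(Throwable t) {
        Throwable cur = t;
        while (cur.getCause() != null) {
            cur = cur.getCause();
        }
        return cur;
    }

    // True when the top stack frame of t belongs to the given package prefix.
    static boolean thrownInside(Throwable t, String packagePrefix) {
        StackTraceElement[] frames = t.getStackTrace();
        return frames.length > 0
                && frames[0].getClassName().startsWith(packagePrefix);
    }

    // Matches the shape seen in this report: an NPE raised inside junrar,
    // possibly wrapped in other exceptions such as TikaException.
    static boolean looksLikeJunrarNullHeader(Throwable failure) {
        Throwable root = rootCause(failure);
        return root instanceof NullPointerException
                && thrownInside(root, "com.github.junrar.");
    }
}
```

A caller would pass the exception caught around the parse call to 
`looksLikeJunrarNullHeader`. This only classifies the failure and does not make 
the archive parseable.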

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
