Suman Moorthy created TIKA-2714:
-----------------------------------
Summary: Tika Parse Errors for certain attachments
Key: TIKA-2714
URL: https://issues.apache.org/jira/browse/TIKA-2714
Project: Tika
Issue Type: Bug
Affects Versions: 1.9
Reporter: Suman Moorthy
Tika fails to parse certain attachments that our customers send to our
application.
We got a sample RAR file from a customer that fails parsing. It contains only
simple PDF files, and we were able to reproduce the issue.
However, if we create a new RAR file from the same contents (using WinRAR)
and try to parse it, parsing succeeds.
The customer's RAR file is neither encrypted nor corrupted. We are not sure why
their file fails parsing while a new RAR file with the same contents succeeds.
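One difference worth checking between the two archives is the format version recorded in their leading signature bytes: RAR 1.5 to 4.x archives start with "Rar!" 0x1A 0x07 0x00, while RAR 5.x archives start with "Rar!" 0x1A 0x07 0x01 0x00. Whether this is what distinguishes the customer's file has not been verified here. The sketch below is only a diagnostic aid, not a fix. The class name RarSignatureCheck and its methods are hypothetical, and it assumes the signature sits at offset 0, which does not hold for self-extracting archives.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper (not part of Tika or junrar): reports which RAR
// signature, if any, a file starts with.
public class RarSignatureCheck {
    // RAR 1.5-4.x signature: "Rar!" 0x1A 0x07 0x00
    private static final byte[] RAR4 = {0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x00};
    // RAR 5.x signature: "Rar!" 0x1A 0x07 0x01 0x00
    private static final byte[] RAR5 = {0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x01, 0x00};

    // Classifies the leading bytes of a file as "RAR4", "RAR5", or "UNKNOWN".
    static String detect(byte[] head) {
        if (startsWith(head, RAR5)) return "RAR5";
        if (startsWith(head, RAR4)) return "RAR4";
        return "UNKNOWN";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Reads at most the first 8 bytes of the file and classifies them.
    static String detect(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[RAR5.length];
            int n = 0;
            int r;
            while (n < head.length && (r = in.read(head, n, head.length - n)) != -1) {
                n += r;
            }
            return detect(Arrays.copyOf(head, n));
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the paths of the archives to compare as arguments.
        for (String arg : args) {
            System.out.println(arg + ": " + detect(Paths.get(arg)));
        }
    }
}
```

Running it against both the customer's archive and the re-created one would show whether the two were written in different RAR format versions.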
Can you please provide a solution or feedback on this problem?
Below is the exception thrown when we try to parse the rar file attachment from
our customer:
Aug 02, 2018 5:04:09 AM com.github.junrar.Archive setFile
WARNING: exception in archive constructor maybe file is encrypted or currupt
com.github.junrar.exception.RarException: badRarArchive
at com.github.junrar.Archive.readHeaders(Archive.java:250)
at com.github.junrar.Archive.setFile(Archive.java:136)
at com.github.junrar.Archive.setVolume(Archive.java:581)
at com.github.junrar.Archive.<init>(Archive.java:108)
at com.github.junrar.Archive.<init>(Archive.java:113)
at org.apache.tika.parser.pkg.RarParser.parse(RarParser.java:72)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at com.actiance.platform.sfab.cis.etl.documentProcessor.internal.DocumentProcessorImpl.getExtractedContent(DocumentProcessorImpl.java:160)
at test.TikaParserAPIExample.main(TikaParserAPIExample.java:31)
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.pkg.RarParser@1372ed45
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
05:04:09.488 [main] DEBUG com.actiance.platform.commons.spi.FileReaderUtils -
Deleted Temp File - 0a44423c-6fad-47e6-943b-7b56178b0b7f.tmp
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at com.actiance.platform.sfab.cis.etl.documentProcessor.internal.DocumentProcessorImpl.getExtractedContent(DocumentProcessorImpl.java:160)
at test.TikaParserAPIExample.main(TikaParserAPIExample.java:31)
Caused by: java.lang.NullPointerException: mainheader is null
at com.github.junrar.Archive.isEncrypted(Archive.java:206)
at org.apache.tika.parser.pkg.RarParser.parse(RarParser.java:74)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
... 4 more