[ 
https://issues.apache.org/jira/browse/TIKA-2714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057022#comment-17057022
 ] 

Nick Burch commented on TIKA-2714:
----------------------------------

From [https://www.rarlab.com/technote.htm]
----
h3. RAR 5.0 signature

RAR 5.0 signature consists of 8 bytes: 0x52 0x61 0x72 0x21 0x1A 0x07 0x01 0x00. 
You need to search for this signature in supposed archive from beginning and up 
to maximum SFX module size. Just for comparison this is RAR 4.x 7 byte length 
signature: 0x52 0x61 0x72 0x21 0x1A 0x07 0x00.
----
Not sure if we want to scan for it all the way to the end of the possible 
self-extracting block (1 MB), but checking for it at offset 0 should be fine.
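
As a rough illustration of the offset-0 check only, here is a minimal Java sketch. It is not Tika's or junrar's actual detection code: the class, enum, and method names are hypothetical, and the signature bytes are the ones quoted from the technote above. It deliberately does not scan through a self-extracting stub.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Tika or junrar.
final class RarSignatureSketch {

    // Signature bytes as quoted from the RAR technote.
    private static final byte[] RAR4_SIGNATURE =
            {0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x00};
    private static final byte[] RAR5_SIGNATURE =
            {0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x01, 0x00};

    enum Version { RAR4, RAR5, NOT_RAR }

    // Checks only the first bytes of the stream (offset 0).
    // An archive preceded by an SFX module would be reported as NOT_RAR.
    static Version detectAtOffsetZero(byte[] head) {
        if (startsWith(head, RAR5_SIGNATURE)) {
            return Version.RAR5;
        }
        if (startsWith(head, RAR4_SIGNATURE)) {
            return Version.RAR4;
        }
        return Version.NOT_RAR;
    }

    // True if data is at least as long as prefix and begins with it.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

A caller would read the first 8 bytes of the file and pass them to detectAtOffsetZero. The two signatures differ at the seventh byte (0x00 vs 0x01), so a 4.x-only reader can be kept away from 5.0 archives before it tries to read headers.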

> Tika Parse Errors for certain attachments
> -----------------------------------------
>
>                 Key: TIKA-2714
>                 URL: https://issues.apache.org/jira/browse/TIKA-2714
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.9
>            Reporter: Suman Moorthy
>            Priority: Major
>
> Tika fails to parse certain attachments that our customers send to our 
> application.
> We got a sample rar file from our customer that fails parsing. It only has 
> simple pdf files in it, and we were able to reproduce the issue.
> However, if we create a new rar file out of the same contents (using winrar) 
> and try to parse it, that succeeds. 
> The rar file that our customer used is not encrypted or corrupted. Not sure 
> why their rar file fails parsing, but a new rar file with same contents 
> succeeds.
> Can you please provide a solution or feedback to this problem?
>  
> Below is the exception thrown when we try to parse the rar file attachment 
> from our customer:
>  
> Aug 02, 2018 5:04:09 AM com.github.junrar.Archive setFile
> WARNING: exception in archive constructor maybe file is encrypted or currupt
> com.github.junrar.exception.RarException: badRarArchive
>      at com.github.junrar.Archive.readHeaders(Archive.java:250)
>      at com.github.junrar.Archive.setFile(Archive.java:136)
>      at com.github.junrar.Archive.setVolume(Archive.java:581)
>      at com.github.junrar.Archive.<init>(Archive.java:108)
>      at com.github.junrar.Archive.<init>(Archive.java:113)
>      at org.apache.tika.parser.pkg.RarParser.parse(RarParser.java:72)
>      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>      at com.actiance.platform.sfab.cis.etl.documentProcessor.internal.DocumentProcessorImpl.getExtractedContent(DocumentProcessorImpl.java:160)
>      at test.TikaParserAPIExample.main(TikaParserAPIExample.java:31)
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pkg.RarParser@1372ed45
>      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
> 05:04:09.488 [main] DEBUG com.actiance.platform.commons.spi.FileReaderUtils - Deleted Temp File - 0a44423c-6fad-47e6-943b-7b56178b0b7f.tmp
>      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>      at com.actiance.platform.sfab.cis.etl.documentProcessor.internal.DocumentProcessorImpl.getExtractedContent(DocumentProcessorImpl.java:160)
>      at test.TikaParserAPIExample.main(TikaParserAPIExample.java:31)
> Caused by: java.lang.NullPointerException: mainheader is null
>      at com.github.junrar.Archive.isEncrypted(Archive.java:206)
>      at org.apache.tika.parser.pkg.RarParser.parse(RarParser.java:74)
>      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>      ... 4 more
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
