[ 
https://issues.apache.org/jira/browse/TIKA-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379112#comment-16379112
 ] 

Tim Allison commented on TIKA-2591:
-----------------------------------

[~schmiddc], thank you for identifying this and sharing it with us.  Is a tiff 
actually a sub-type of tar, or is this a useful hack to get the proper 
behavior?  I asked for help from our colleagues on Commons Compress on their 
dev list.  It isn't immediately clear to me if we should handle this at the 
Tika level or at the Commons Compress level.



> Some tiffs (Big Endian with fax compression) are showing up as x-tar
> --------------------------------------------------------------------
>
>                 Key: TIKA-2591
>                 URL: https://issues.apache.org/jira/browse/TIKA-2591
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.16
>         Environment: Tika, running in a java application and a unit-test 
> (windows and mac environments)
>            Reporter: daniel schmidt
>            Priority: Major
>              Labels: newbie
>             Fix For: 1.18
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I have found that a certain tiff that we manage is now reporting 
> application/x-tar in Tika where it previously reported as a tiff 
> (image/tiff). 
> Observe this code in ArchiveStreamFactory, detect method.
>   // COMPRESS-117 - improve auto-recognition
>         if (signatureLength >= TAR_HEADER_SIZE) {
>             TarArchiveInputStream tais = null;
>             try {
>                 tais = new TarArchiveInputStream(new ByteArrayInputStream(tarHeader));
>                 // COMPRESS-191 - verify the header checksum
>                 if (tais.getNextTarEntry().isCheckSumOK()) {
>                     return TAR;
>                 }
>             } catch (final Exception e) { // NOPMD // NOSONAR
>                 // can generate IllegalArgumentException as well
>                 // as IOException
>                 // autodetection, simply not a TAR
>                 // ignored
>             } finally {
>                 IOUtils.closeQuietly(tais);
>             }
> What I find is that most TIFFs fail with an exception when they reach 
> tais.getNextTarEntry() (i.e., they fall into the "simply not a TAR" case). 
> However, this TIFF does NOT fail here. This somewhat makes sense, as the 
> internal structure of a fax-compressed TIFF has a tar-like structure.
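
For context on the checksum test in the snippet above, here is a minimal, self-contained sketch of the standard tar header checksum rule: the 512 header bytes are summed as unsigned values, with the 8-byte checksum field at offset 148 counted as spaces, and the result is compared to the octal number stored in that field. The class and method names (TarChecksumSketch, computeChecksum, storedChecksum, looksLikeTarHeader, buildHeader) are hypothetical and this is not the Commons Compress implementation, which performs additional parsing. It only illustrates that the check is a plain additive sum, so it is a weak signal on its own; whether that is what lets the reported TIFF through is not established here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the tar header checksum rule; not Commons Compress code.
public class TarChecksumSketch {
    static final int HEADER_SIZE = 512;
    static final int CHECKSUM_OFFSET = 148;
    static final int CHECKSUM_LENGTH = 8;

    /** Sums all header bytes as unsigned values, counting the checksum field as spaces. */
    public static long computeChecksum(byte[] header) {
        long sum = 0;
        for (int i = 0; i < HEADER_SIZE; i++) {
            boolean inChecksumField =
                    i >= CHECKSUM_OFFSET && i < CHECKSUM_OFFSET + CHECKSUM_LENGTH;
            sum += inChecksumField ? ' ' : (header[i] & 0xFF);
        }
        return sum;
    }

    /** Reads the octal digits stored in the checksum field; returns -1 if there are none. */
    public static long storedChecksum(byte[] header) {
        int i = CHECKSUM_OFFSET;
        int end = CHECKSUM_OFFSET + CHECKSUM_LENGTH;
        while (i < end && header[i] == ' ') {
            i++; // skip leading spaces
        }
        long value = 0;
        int digits = 0;
        while (i < end && header[i] >= '0' && header[i] <= '7') {
            value = value * 8 + (header[i] - '0');
            digits++;
            i++;
        }
        return digits == 0 ? -1 : value;
    }

    /** True when the stored checksum matches the computed one. */
    public static boolean looksLikeTarHeader(byte[] header) {
        return header.length >= HEADER_SIZE
                && storedChecksum(header) == computeChecksum(header);
    }

    /** Builds an otherwise empty header with the given name and a matching checksum. */
    public static byte[] buildHeader(String name) {
        byte[] header = new byte[HEADER_SIZE];
        byte[] nameBytes = name.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(nameBytes, 0, header, 0, Math.min(nameBytes.length, 100));
        // Six octal digits followed by NUL and a space fill the 8-byte field.
        String octal = String.format("%06o", computeChecksum(header));
        byte[] field = (octal + "\0 ").getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(field, 0, header, CHECKSUM_OFFSET, CHECKSUM_LENGTH);
        return header;
    }

    public static void main(String[] args) {
        byte[] header = buildHeader("a.txt");
        System.out.println("valid header accepted: " + looksLikeTarHeader(header));
        // Swapping two bytes leaves the sum unchanged, so the check still passes.
        byte tmp = header[0];
        header[0] = header[1];
        header[1] = tmp;
        System.out.println("after swapping two bytes: " + looksLikeTarHeader(header));
    }
}
```

Because the sum ignores byte order, reordering bytes outside the checksum field does not change the result, which is one reason a checksum match alone is not strong evidence of a tar archive.
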
> Note, the CompositeDetector class eventually does recognize it as a proper 
> tiff as it loops through its detectors in its detect method. It is detected 
> as tiff in the MimeTypes class, which is one of the implementations of the 
> Detector interface
>  
>     public MediaType detect(InputStream input, Metadata metadata)
>             throws IOException {
>         MediaType type = MediaType.OCTET_STREAM;
>         for (Detector detector : getDetectors()) {
>             //short circuit via OverrideDetector
>             //can't rely on ordering because subsequent detector may
>             //change Override's to a specialization of Override's
>             if (detector instanceof OverrideDetector &&
>                     metadata.get(TikaCoreProperties.CONTENT_TYPE_OVERRIDE) != null) {
>                 return detector.detect(input, metadata);
>             }
>             MediaType detected = detector.detect(input, metadata);
>             if (registry.isSpecializationOf(detected, type)) {
>                 type = detected;
>             }
>         }
>         return type;
> However, since image/tiff isn't a specialization of application/x-tar, it 
> does not replace the type with tiff.
> My fix was to add a "<sub-class-of type="application/x-tar"/>" to the 
> definition for image/tiff in the tika-mimetypes.xml file
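
The effect of the specialization check and of the proposed sub-class-of entry can be sketched without Tika on the classpath. The class below is a hypothetical stand-in (SpecializationSketch, declareSubClassOf, and the String-based types are illustrative and not Tika's API). It assumes, as a simplification, that every type without a declared parent is a child of application/octet-stream, and that the tar result arrives before the tiff result, as the description implies.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the registry and detection loop; not Tika's API.
public class SpecializationSketch {
    static final String OCTET_STREAM = "application/octet-stream";

    // child type -> declared parent type (the role of a sub-class-of entry)
    private final Map<String, String> parents = new HashMap<>();

    public void declareSubClassOf(String child, String parent) {
        parents.put(child, parent);
    }

    // Assumption: undeclared types fall back to application/octet-stream.
    private String supertype(String type) {
        if (parents.containsKey(type)) {
            return parents.get(type);
        }
        return OCTET_STREAM.equals(type) ? null : OCTET_STREAM;
    }

    /** True when base appears somewhere above candidate in the parent chain. */
    public boolean isSpecializationOf(String candidate, String base) {
        for (String s = supertype(candidate); s != null; s = supertype(s)) {
            if (s.equals(base)) {
                return true;
            }
        }
        return false;
    }

    /** Mirrors the loop above: a later result wins only if it specializes the current one. */
    public String detect(List<String> detectorResults) {
        String type = OCTET_STREAM;
        for (String detected : detectorResults) {
            if (isSpecializationOf(detected, type)) {
                type = detected;
            }
        }
        return type;
    }

    public static void main(String[] args) {
        SpecializationSketch registry = new SpecializationSketch();
        List<String> results = Arrays.asList("application/x-tar", "image/tiff");
        System.out.println("without the entry: " + registry.detect(results));
        registry.declareSubClassOf("image/tiff", "application/x-tar");
        System.out.println("with the entry:    " + registry.detect(results));
    }
}
```

In this sketch the tiff result is discarded until image/tiff is declared a child of application/x-tar, which matches the behavior described above. It does not settle whether that declaration is semantically right or only a workaround.
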
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
