[
https://issues.apache.org/jira/browse/TIKA-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378621#comment-16378621
]
daniel schmidt commented on TIKA-2591:
--------------------------------------
I forgot to add: I would like to provide the example image, but I am not
permitted to, as it contains protected health information. If I find one that
does not, I will include it.
> Some tiffs (Big Endian with fax compression) are showing up as x-tar
> --------------------------------------------------------------------
>
> Key: TIKA-2591
> URL: https://issues.apache.org/jira/browse/TIKA-2591
> Project: Tika
> Issue Type: Bug
> Components: core
> Affects Versions: 1.16
> Environment: Tika, running in a java application and a unit-test
> (windows and mac environments)
> Reporter: daniel schmidt
> Priority: Major
> Labels: newbie
> Fix For: 1.18
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> I have found that a certain tiff that we manage is now reported as
> application/x-tar by Tika, where it was previously reported as a tiff
> (image/tiff).
> Observe this code in the detect method of ArchiveStreamFactory:
>     // COMPRESS-117 - improve auto-recognition
>     if (signatureLength >= TAR_HEADER_SIZE) {
>         TarArchiveInputStream tais = null;
>         try {
>             tais = new TarArchiveInputStream(new ByteArrayInputStream(tarHeader));
>             // COMPRESS-191 - verify the header checksum
>             if (tais.getNextTarEntry().isCheckSumOK()) {
>                 return TAR;
>             }
>         } catch (final Exception e) { // NOPMD // NOSONAR
>             // can generate IllegalArgumentException as well
>             // as IOException
>             // autodetection, simply not a TAR
>             // ignored
>         } finally {
>             IOUtils.closeQuietly(tais);
>         }
>     }
> What I find is that most tiffs, when they get to tais.getNextTarEntry(), fail
> with an exception (i.e. fall into the "simply not a TAR" case). However, this
> tiff actually does NOT fail here. This somewhat makes sense, as the internal
> structure of a fax-compressed tiff is tar-like.
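
To illustrate why a header checksum alone is a weak signal, here is a minimal
sketch of the standard tar header checksum rule: the 512 header bytes are summed
as unsigned values, with the 8-byte checksum field at offset 148 counted as
spaces, and the result is compared with the octal number stored in that field.
This is NOT the Commons Compress implementation; the class and method names
(TarChecksumSketch, looksLikeTar, stampChecksum) are hypothetical, and the
"TIFF-like" buffer is contrived, since the real image could not be shared.

```java
import java.nio.charset.StandardCharsets;

final class TarChecksumSketch {
    static final int HEADER_SIZE = 512;
    static final int CHKSUM_OFFSET = 148;
    static final int CHKSUM_LENGTH = 8;

    /** Sums the header bytes as unsigned values, counting the checksum field as spaces. */
    static long computeChecksum(byte[] header) {
        long sum = 0;
        for (int i = 0; i < HEADER_SIZE; i++) {
            boolean inField = i >= CHKSUM_OFFSET && i < CHKSUM_OFFSET + CHKSUM_LENGTH;
            sum += inField ? ' ' : (header[i] & 0xFF);
        }
        return sum;
    }

    /** Parses the octal digits in the checksum field; returns -1 if there are none. */
    static long storedChecksum(byte[] header) {
        long value = 0;
        boolean seenDigit = false;
        for (int i = CHKSUM_OFFSET; i < CHKSUM_OFFSET + CHKSUM_LENGTH; i++) {
            int b = header[i] & 0xFF;
            if (b >= '0' && b <= '7') {
                value = value * 8 + (b - '0');
                seenDigit = true;
            } else if (seenDigit || b != ' ') {
                break; // terminator after the digits, or a non-octal byte
            }
        }
        return seenDigit ? value : -1;
    }

    /** Simplified check: true when the stored and computed checksums agree. */
    static boolean looksLikeTar(byte[] header) {
        return header.length >= HEADER_SIZE
                && storedChecksum(header) == computeChecksum(header);
    }

    /** Test helper: writes the checksum as six octal digits, a NUL and a space. */
    static void stampChecksum(byte[] header) {
        byte[] digits = String.format("%06o", computeChecksum(header))
                .getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(digits, 0, header, CHKSUM_OFFSET, digits.length);
        header[CHKSUM_OFFSET + 6] = 0;
        header[CHKSUM_OFFSET + 7] = ' ';
    }

    public static void main(String[] args) {
        // Contrived buffer: big-endian TIFF magic ("MM", 0, 42) followed by zeros.
        byte[] data = new byte[HEADER_SIZE];
        data[0] = 'M';
        data[1] = 'M';
        data[3] = 42;
        System.out.println("before stamping: " + looksLikeTar(data));
        stampChecksum(data); // pretend bytes 148..155 happen to hold a matching sum
        System.out.println("after stamping:  " + looksLikeTar(data));
    }
}
```

The point of the sketch is only that the check depends on a handful of bytes
agreeing with a simple sum, so non-tar data can satisfy it by coincidence.
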
> Note that the CompositeDetector class eventually does recognize it as a proper
> tiff as it loops through its detectors in its detect method. It is detected
> as tiff in the MimeTypes class, which is one of the implementations of the
> Detector interface.
>
>     public MediaType detect(InputStream input, Metadata metadata)
>             throws IOException {
>         MediaType type = MediaType.OCTET_STREAM;
>         for (Detector detector : getDetectors()) {
>             //short circuit via OverrideDetector
>             //can't rely on ordering because subsequent detector may
>             //change Override's to a specialization of Override's
>             if (detector instanceof OverrideDetector &&
>                     metadata.get(TikaCoreProperties.CONTENT_TYPE_OVERRIDE) != null) {
>                 return detector.detect(input, metadata);
>             }
>             MediaType detected = detector.detect(input, metadata);
>             if (registry.isSpecializationOf(detected, type)) {
>                 type = detected;
>             }
>         }
>         return type;
>     }
> However, since image/tiff isn't a specialization of application/x-tar, it does
> not replace the type with tiff.
> My fix was to add a "<sub-class-of type="application/x-tar"/>" to the
> definition for image/tiff in the tika-mimetypes.xml file.
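
The effect of the sub-class-of workaround on the detection loop can be sketched
with a toy registry. This is a simplified model, not Tika's MediaTypeRegistry:
the class SpecializationSketch and its methods are hypothetical, types are plain
strings, the parent declarations are assumed to be acyclic, and every type
without a declared parent is assumed to descend directly from
application/octet-stream.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SpecializationSketch {
    static final String OCTET_STREAM = "application/octet-stream";

    /** Hypothetical stand-in for the registry: child type -> declared parent type. */
    private final Map<String, String> parents = new HashMap<>();

    /** Models a sub-class-of entry in the mime types definition. */
    void declareSubClassOf(String child, String parent) {
        parents.put(child, parent);
    }

    /** True when base is a strict ancestor of candidate in the toy hierarchy. */
    boolean isSpecializationOf(String candidate, String base) {
        String current = candidate;
        while (!current.equals(OCTET_STREAM)) {
            current = parents.getOrDefault(current, OCTET_STREAM);
            if (current.equals(base)) {
                return true;
            }
        }
        return false;
    }

    /** Mirrors the loop in the quoted detect method: keep only more specific results. */
    String pick(List<String> detectedInOrder) {
        String type = OCTET_STREAM;
        for (String detected : detectedInOrder) {
            if (isSpecializationOf(detected, type)) {
                type = detected;
            }
        }
        return type;
    }

    public static void main(String[] args) {
        // Order assumed from the report: the tar guess arrives before the tiff guess.
        List<String> results = Arrays.asList("application/x-tar", "image/tiff");

        SpecializationSketch withoutFix = new SpecializationSketch();
        System.out.println("without fix: " + withoutFix.pick(results));

        SpecializationSketch withFix = new SpecializationSketch();
        withFix.declareSubClassOf("image/tiff", "application/x-tar");
        System.out.println("with fix:    " + withFix.pick(results));
    }
}
```

In this model the first non-generic result wins unless a later result is
declared as its descendant, which is why declaring image/tiff a sub-class of
application/x-tar lets the tiff result replace the tar one.
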
>
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)