[
https://issues.apache.org/jira/browse/TIKA-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379137#comment-16379137
]
daniel schmidt edited comment on TIKA-2591 at 2/27/18 9:04 PM:
---------------------------------------------------------------
It is a bit of a "useful hack", as they say.
But it's also kind of weird: the code is written to depend on
TarArchiveInputStream throwing an exception to mean "not a tar". In this case,
for these images, "success" is essentially failure?
It does seem odd to declare tiff a sub-type of tar, but that is where the code
led me, since Tika sees it as a tar, but then later sees it as a .tiff.
Another option I considered was, in the ArchiveStreamFactory class (see code
below), actually guarding the construction of TarArchiveInputStream with a
conditional that checked whether the tarHeader variable started with one
of the TIFF magic numbers (II: 49 49 2A 00 / MM: 4D 4D 00 2A).
For tiffs, they are there in the tarHeader, so you can check them and go
straight to the "simply not a tar" case without relying on the
TarArchiveInputStream constructor or the getNextTarEntry method throwing an
exception. That also seemed a little goofy, but it also worked.
// add magic number checks here to this if statement, skip right to exception throw:
if (signatureLength >= TAR_HEADER_SIZE) {
    TarArchiveInputStream tais = null;
    try {
        tais = new TarArchiveInputStream(new ByteArrayInputStream(tarHeader));
        // COMPRESS-191 - verify the header checksum
        if (tais.getNextTarEntry().isCheckSumOK()) {
            return TAR;
        }
    } catch (final Exception e) { // NOPMD // NOSONAR
        // can generate IllegalArgumentException as well
        // as IOException
        // autodetection, simply not a TAR
        // ignored
    } finally {
        IOUtils.closeQuietly(tais);
    }
}
throw new ArchiveException("No Archiver found for the stream signature");
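As a rough sketch of the guard described above (not the actual Commons Compress code), the check could be a small helper that looks at the first four bytes of the header buffer. The class name TiffMagicGuard and the method startsWithTiffMagic are made up for illustration; only the two byte signatures come from the comment above.

```java
final class TiffMagicGuard {
    // Little-endian TIFF: "II" followed by 42 stored low byte first.
    private static final byte[] TIFF_LITTLE_ENDIAN = {0x49, 0x49, 0x2A, 0x00};
    // Big-endian TIFF: "MM" followed by 42 stored high byte first.
    private static final byte[] TIFF_BIG_ENDIAN = {0x4D, 0x4D, 0x00, 0x2A};

    private TiffMagicGuard() {
    }

    /** Returns true if the buffer begins with either TIFF byte-order signature. */
    static boolean startsWithTiffMagic(byte[] header) {
        return startsWith(header, TIFF_LITTLE_ENDIAN)
                || startsWith(header, TIFF_BIG_ENDIAN);
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical 512-byte buffer standing in for tarHeader.
        byte[] header = new byte[512];
        header[0] = 0x4D;
        header[1] = 0x4D;
        header[2] = 0x00;
        header[3] = 0x2A;
        System.out.println("TIFF magic present: " + startsWithTiffMagic(header));
    }
}
```

With a helper like this, the if statement could become something like "signatureLength >= TAR_HEADER_SIZE && !startsWithTiffMagic(tarHeader)", so a TIFF falls through to the ArchiveException without the tar parser ever being constructed.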
was (Author: schmiddc):
It is a bit of a "useful hack", as they say.
But it's also kind of weird: the code is written to depend on
TarArchiveInputStream throwing an exception to mean "not a tar". In this case,
for these images, "success" is essentially failure?
It does seem odd to declare tiff a sub-type of tar, but that is where the code
led me, since Tika sees it as a tar, but then later sees it as a .tiff.
Another option I considered was actually guarding the construction of
TarArchiveInputStream with a conditional that checked the header for the TIFF
magic numbers (II: 49 49 2A 00 / MM: 4D 4D 00 2A). They are there, and you can
check them and go to the "simply not a tar" case without even throwing an
exception. That also seemed a little goofy, but it also worked.
try {
    tais = new TarArchiveInputStream(new ByteArrayInputStream(tarHeader));
    // COMPRESS-191 - verify the header checksum
    if (tais.getNextTarEntry().isCheckSumOK()) {
        return TAR;
    }
} catch (final Exception e) { // NOPMD // NOSONAR
    // can generate IllegalArgumentException as well
    // as IOException
    // autodetection, simply not a TAR
    // ignored
}
> Some tiffs (Big Endian with fax compression) are showing up as x-tar
> --------------------------------------------------------------------
>
> Key: TIKA-2591
> URL: https://issues.apache.org/jira/browse/TIKA-2591
> Project: Tika
> Issue Type: Bug
> Components: core
> Affects Versions: 1.16
> Environment: Tika, running in a java application and a unit-test
> (windows and mac environments)
> Reporter: daniel schmidt
> Priority: Major
> Labels: newbie
> Fix For: 1.18
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> I have found that a certain tiff that we manage is now reporting
> application/x-tar in Tika where it previously reported as a tiff
> (image/tiff).
> Observe this code in ArchiveStreamFactory, detect method.
> // COMPRESS-117 - improve auto-recognition
> if (signatureLength >= TAR_HEADER_SIZE) {
>     TarArchiveInputStream tais = null;
>     try {
>         tais = new TarArchiveInputStream(new ByteArrayInputStream(tarHeader));
>         // COMPRESS-191 - verify the header checksum
>         if (tais.getNextTarEntry().isCheckSumOK()) {
>             return TAR;
>         }
>     } catch (final Exception e) { // NOPMD // NOSONAR
>         // can generate IllegalArgumentException as well
>         // as IOException
>         // autodetection, simply not a TAR
>         // ignored
>     } finally {
>         IOUtils.closeQuietly(tais);
>     }
> }
> What I find is that most tiffs, when they get to tais.getNextTarEntry(), fail
> with an exception (i.e. fall into the "simply not a tar" case). However, this
> tiff actually does NOT fail here. This somewhat makes sense, as the internal
> structure of a fax-compressed tiff has a tar-like structure.
> Note, the CompositeDetector class eventually does recognize it as a proper
> tiff as it loops through its detectors in its detect method. It is detected
> as tiff in the MimeTypes class, which is one of the implementations of the
> Detector interface.
>
> public MediaType detect(InputStream input, Metadata metadata)
>         throws IOException {
>     MediaType type = MediaType.OCTET_STREAM;
>     for (Detector detector : getDetectors()) {
>         //short circuit via OverrideDetector
>         //can't rely on ordering because subsequent detector may
>         //change Override's to a specialization of Override's
>         if (detector instanceof OverrideDetector &&
>                 metadata.get(TikaCoreProperties.CONTENT_TYPE_OVERRIDE) != null) {
>             return detector.detect(input, metadata);
>         }
>         MediaType detected = detector.detect(input, metadata);
>         if (registry.isSpecializationOf(detected, type)) {
>             type = detected;
>         }
>     }
>     return type;
> }
> However, since image/tiff isn't a specialization of application/x-tar, the
> loop does not replace the detected type with tiff.
> My fix was to add a "<sub-class-of type="application/x-tar"/>" to the
> definition for image/tiff in the tika-mimetypes.xml file.
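To make the effect of that sub-class-of entry concrete, here is a toy model of the detection loop. It is only a sketch: SpecializationSketch, declareSubClassOf and the string-based type names are invented stand-ins, not Tika's MediaTypeRegistry, and it assumes every type without a declared parent descends directly from application/octet-stream.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SpecializationSketch {
    static final String OCTET_STREAM = "application/octet-stream";

    // child type -> declared parent type (stand-in for sub-class-of entries)
    private final Map<String, String> parents = new HashMap<>();

    void declareSubClassOf(String child, String parent) {
        parents.put(child, parent);
    }

    /** True if candidate is a strict descendant of current in this toy hierarchy. */
    boolean isSpecializationOf(String candidate, String current) {
        if (candidate.equals(current)) {
            return false;
        }
        if (current.equals(OCTET_STREAM)) {
            return true; // assumption: octet-stream is the root of everything
        }
        String parent = parents.get(candidate);
        while (parent != null) {
            if (parent.equals(current)) {
                return true;
            }
            parent = parents.get(parent);
        }
        return false;
    }

    /** Mirrors the loop above: keep a result only if it specializes the current type. */
    String detect(List<String> detectorResults) {
        String type = OCTET_STREAM;
        for (String detected : detectorResults) {
            if (isSpecializationOf(detected, type)) {
                type = detected;
            }
        }
        return type;
    }

    public static void main(String[] args) {
        // One detector reports tar first, a later one reports tiff.
        List<String> results = Arrays.asList("application/x-tar", "image/tiff");

        SpecializationSketch withoutFix = new SpecializationSketch();
        System.out.println("without fix: " + withoutFix.detect(results));

        SpecializationSketch withFix = new SpecializationSketch();
        withFix.declareSubClassOf("image/tiff", "application/x-tar");
        System.out.println("with fix: " + withFix.detect(results));
    }
}
```

In this model the first non-root result wins unless a later result is declared as its descendant, which is why the tar answer sticks until tiff is registered as a sub-class of tar.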
>
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)