[
https://issues.apache.org/jira/browse/TIKA-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13145299#comment-13145299
]
PNS edited comment on TIKA-697 at 11/7/11 9:42 AM:
---------------------------------------------------
Detection of Unix AR archive types (see http://en.wikipedia.org/wiki/Ar_(Unix))
is very simple and can indeed be done by checking for the 8 "magic"
bytes (0x21, 0x3C, 0x61, 0x72, 0x63, 0x68, 0x3E, 0x0A), i.e. the ASCII string
"!<arch>" followed by a line feed.
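As a minimal sketch of that check in isolation: the class and method names below are hypothetical and not part of Tika, and the stream is assumed to support mark/reset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not an existing Tika class.
final class ArSignature {
    // "!<arch>\n", the 8-byte global header at the start of an AR archive.
    private static final byte[] AR_HEADER =
            {0x21, 0x3C, 0x61, 0x72, 0x63, 0x68, 0x3E, 0x0A};

    private ArSignature() {
    }

    /**
     * Returns true if the stream starts with the AR signature.
     * The stream position is restored before returning.
     */
    static boolean startsWithArHeader(InputStream input) throws IOException {
        if (input == null || !input.markSupported()) {
            return false;
        }
        input.mark(AR_HEADER.length);
        try {
            for (byte expected : AR_HEADER) {
                // read() yields -1 at end of stream, which never matches.
                if (input.read() != expected) {
                    return false;
                }
            }
            return true;
        } finally {
            input.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "!<arch>\ndebian-binary".getBytes(StandardCharsets.US_ASCII);
        System.out.println(startsWithArHeader(new ByteArrayInputStream(sample)));
    }
}
```

Because the helper resets the stream, a caller could run it before any other detection step without consuming input.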
What needs to be changed in the Tika code is at least the TextDetector.detect()
method, so that it returns an AR media type if the first 8 bytes of the archive
are the AR signature.
The AR MediaType needs to be added in class org.apache.tika.mime.MediaType and
it will probably be a custom one, since apparently there is no IANA-registered
MIME type for AR (see http://en.wikipedia.org/wiki/List_of_archive_formats and
http://www.iana.org/assignments/media-types/index.html).
Assuming the existence of a statement like
{code}
public static final MediaType APPLICATION_AR = application("x-ar");
{code}
in class *org.apache.tika.mime.MediaType*, the following is a quick
implementation of the proposed changes in the *TextDetector.detect()* method:
{code}
// Code immediately after the static initialization block of the
// IS_CONTROL_BYTE[] array
private static final byte[] AR_HEADER = new byte[]
    {0x21, 0x3C, 0x61, 0x72, 0x63, 0x68, 0x3E, 0x0A}; // "!<arch>\n"

@Override
public MediaType detect(InputStream input, Metadata metadata)
        throws IOException {
    if (input == null) {
        return MediaType.OCTET_STREAM;
    }
    input.mark(NUMBER_OF_BYTES_TO_TEST);
    // Kept local (not a field) so that concurrent detect() calls on the
    // same detector instance do not share state
    boolean checkArHeader = true;
    try {
        for (int i = 0; i < NUMBER_OF_BYTES_TO_TEST; i++) {
            int ch = input.read();
            if (ch == -1) {
                if (i > 0) {
                    return MediaType.TEXT_PLAIN;
                } else {
                    // See https://issues.apache.org/jira/browse/TIKA-483
                    return MediaType.OCTET_STREAM;
                }
            } else if (ch < IS_CONTROL_BYTE.length &&
                    IS_CONTROL_BYTE[ch]) {
                return MediaType.OCTET_STREAM;
            } else if (checkArHeader) {
                // See https://issues.apache.org/jira/browse/TIKA-697
                if ((i > 7) || (AR_HEADER[i] != ch)) {
                    // Mismatch: stop comparing against the signature
                    checkArHeader = false;
                } else if (i == 7) {
                    // All 8 signature bytes matched
                    return MediaType.APPLICATION_AR;
                }
            }
        }
        return MediaType.TEXT_PLAIN;
    } finally {
        input.reset();
    }
}
{code}
Essentially, the additions are just the new MediaType.APPLICATION_AR constant,
the 2 new variables (AR_HEADER, checkArHeader) and the "else if
(checkArHeader)" control block.
I have tested the above with numerous combinations of files and it works as
expected.
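For anyone wanting to reproduce such checks without a Tika build, the following is a stand-alone sketch of the same loop. It is not the Tika code: the class name is hypothetical, media types are returned as plain strings, and both the 512-byte window and the control-byte table are assumptions that only approximate what TextDetector uses.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical stand-alone harness; mirrors the proposed loop, not Tika itself.
final class ArDetectSketch {
    // Assumed look-ahead window; the real constant lives in TextDetector.
    private static final int NUMBER_OF_BYTES_TO_TEST = 512;

    private static final byte[] AR_HEADER =
            {0x21, 0x3C, 0x61, 0x72, 0x63, 0x68, 0x3E, 0x0A};

    // Assumed control-byte table: everything below 0x20 counts as a
    // control byte except tab, LF, FF, CR and ESC.
    private static final boolean[] IS_CONTROL_BYTE = new boolean[0x20];
    static {
        Arrays.fill(IS_CONTROL_BYTE, true);
        for (int allowed : new int[] {0x09, 0x0A, 0x0C, 0x0D, 0x1B}) {
            IS_CONTROL_BYTE[allowed] = false;
        }
    }

    private ArDetectSketch() {
    }

    static String detect(byte[] data) throws IOException {
        InputStream input = new ByteArrayInputStream(data);
        boolean checkArHeader = true;
        for (int i = 0; i < NUMBER_OF_BYTES_TO_TEST; i++) {
            int ch = input.read();
            if (ch == -1) {
                // Empty input is treated as binary, short input as text.
                return i > 0 ? "text/plain" : "application/octet-stream";
            } else if (ch < IS_CONTROL_BYTE.length && IS_CONTROL_BYTE[ch]) {
                return "application/octet-stream";
            } else if (checkArHeader) {
                if (AR_HEADER[i] != ch) {
                    checkArHeader = false;
                } else if (i == AR_HEADER.length - 1) {
                    return "application/x-ar";
                }
            }
        }
        return "text/plain";
    }
}
```

Inputs worth trying include a real AR prefix, ordinary text, an empty array, bytes containing NUL, and text that begins like the signature but diverges (for example "!<archive>").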
> Tika reports the content type of AR archives as "text/plain"
> ------------------------------------------------------------
>
> Key: TIKA-697
> URL: https://issues.apache.org/jira/browse/TIKA-697
> Project: Tika
> Issue Type: Bug
> Environment: Linux (CentOS 5.6)
> Reporter: PNS
> Priority: Trivial
>
> The Tika.detect(InputStream) method returns "text/plain" for AR archives
> created with the Linux "Create Archive" option of Nautilus (available via
> right-clicking on a file).
> The Apache Commons Compress "autodetection" code of the ArchiveStreamFactory
> looks at the first 12 bytes of the stream and correctly identifies the type
> as AR.