[
https://issues.apache.org/jira/browse/TIKA-3833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eduardas Kazakas updated TIKA-3833:
-----------------------------------
Description:
Hello, I'm having a bit of a problem when using the tika-core module (v2.4.1).
I am trying to detect the MIME type of a bzip2 file and, instead of
application/x-bzip2, I am getting application/x-bzip. I believe it has
something to do with the mime-type definitions in the
tika-mimetypes.xml file.
{code:xml}
<mime-type type="application/x-bzip">
  <magic priority="40">
    <match value="BZh" type="string" offset="0"/>
  </magic>
  <glob pattern="*.bz"/>
  <glob pattern="*.tbz"/>
</mime-type>
<mime-type type="application/x-bzip2">
  <sub-class-of type="application/x-bzip"/>
  <_comment>Bzip 2 UNIX Compressed File</_comment>
  <magic priority="40">
    <match value="\x42\x5a\x68\x39\x31" type="string" offset="0"/>
  </magic>
  <glob pattern="*.bz2"/>
  <glob pattern="*.tbz2"/>
  <glob pattern="*.boz"/>
</mime-type>
{code}
Both definitions have their magic priority set to 40. I believe the
priority of application/x-bzip2 should be higher, because the string
value "BZh" and the first three bytes of the hex value, "\x42\x5a\x68",
are the same (\x42\x5a\x68 = BZh), so any content that matches
application/x-bzip2 also matches application/x-bzip.
Maybe I am missing something here? Does this look like a bug, or does
it work as intended? Is there some sort of hint I can provide to the
default detector?
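To make the overlap concrete, here is a minimal sketch that uses only the Scala standard library and does not call Tika at all. The object name, the helper `matchesAtZero`, and the sample header bytes are made up for illustration; the two magic values are copied from the definitions above.

```scala
import java.nio.charset.StandardCharsets.US_ASCII

object MagicOverlap {
  // Magic values copied from the two mime-type definitions above.
  val bzipMagic: Array[Byte]  = "BZh".getBytes(US_ASCII)
  val bzip2Magic: Array[Byte] = Array(0x42, 0x5a, 0x68, 0x39, 0x31).map(_.toByte)

  // True when `header` begins with `magic` at offset 0.
  def matchesAtZero(header: Array[Byte], magic: Array[Byte]): Boolean =
    header.take(magic.length).sameElements(magic)

  def main(args: Array[String]): Unit = {
    // Hypothetical sample: leading bytes of a bzip2 stream starting with "BZh91".
    val header = "BZh91AY&SY".getBytes(US_ASCII)
    println(s"x-bzip magic matches:  ${matchesAtZero(header, bzipMagic)}")
    println(s"x-bzip2 magic matches: ${matchesAtZero(header, bzip2Magic)}")
    println(s"x-bzip magic is a prefix of x-bzip2 magic: ${matchesAtZero(bzip2Magic, bzipMagic)}")
  }
}
```

All three checks print `true`, so with equal priorities the magic values alone do not separate the two types for such a file.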
A small example in Scala:
{code:scala}
import org.apache.tika.config.TikaConfig
import org.apache.tika.detect.DefaultProbDetector
import org.apache.tika.metadata.Metadata
import java.io.{BufferedInputStream, File, FileInputStream}

object AAA {
  def main(args: Array[String]): Unit = {
    val config = TikaConfig.getDefaultConfig
    val file = new File("/home/ekazakas/test.csv.bz2")
    val detector = new DefaultProbDetector()
    // detect() expects a stream that supports mark/reset, hence the BufferedInputStream.
    val stream = new BufferedInputStream(new FileInputStream(file))
    try {
      // The Metadata is empty, so no file name is passed to the detector.
      val mediaType = detector.detect(stream, new Metadata)
      val mimeType = config.getMimeRepository.forName(mediaType.toString)
      println(mimeType)
    } finally stream.close()
  }
}
{code}
This prints `application/x-bzip` instead of `application/x-bzip2`.
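On the question of a hint: one thing that could be tried is passing the file name in the Metadata under `TikaCoreProperties.RESOURCE_NAME_KEY`, so that the `*.bz2` glob has a chance to take part in detection. The sketch below assumes the detector consults the resource name; the object and method names are hypothetical, and I have not confirmed that this changes the result.

```scala
import org.apache.tika.detect.DefaultProbDetector
import org.apache.tika.metadata.{Metadata, TikaCoreProperties}
import org.apache.tika.mime.MediaType
import java.io.{BufferedInputStream, File, FileInputStream}

object DetectWithNameHint {
  // Hypothetical helper: same detection as above, plus the file name as a hint.
  def detect(file: File): MediaType = {
    val metadata = new Metadata
    // Assumption: the detector uses the resource name for glob matching (*.bz2).
    metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, file.getName)
    val stream = new BufferedInputStream(new FileInputStream(file))
    try new DefaultProbDetector().detect(stream, metadata)
    finally stream.close()
  }

  def main(args: Array[String]): Unit =
    println(detect(new File("/home/ekazakas/test.csv.bz2")))
}
```

This only helps when a file name is available; for anonymous streams the magic definitions would still decide the outcome.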
> bzip2 MIME type is detected as bzip instead when using tika-core
> ----------------------------------------------------------------
>
> Key: TIKA-3833
> URL: https://issues.apache.org/jira/browse/TIKA-3833
> Project: Tika
> Issue Type: Bug
> Components: core
> Affects Versions: 2.4.1
> Reporter: Eduardas Kazakas
> Priority: Major