Shuai Liu created TIKA-1517:
-------------------------------
Summary: MIME type selection with probability
Key: TIKA-1517
URL: https://issues.apache.org/jira/browse/TIKA-1517
Project: Tika
Issue Type: Improvement
Components: mime
Affects Versions: 1.6, 1.5, 1.4, 1.3, 1.2, 1.1, 1.0, 0.10, 0.9, 0.8, 0.7,
0.6, 0.5, 0.4, 0.3, 0.2, 0.1-incubating
Reporter: Shuai Liu
Problem and intuition
The current MIME type detection in Tika is not very flexible: it relies
heavily on the outcome of magic-bytes detection. If magic bytes are applicable
to a file, Tika follows the file type they indicate.
The proposed approach borrows from Bayes' theorem. Tika currently has 3 methods
for identifying a MIME type (magic bytes, file extension, and the metadata
content-type hint), and users would assign each method a weight, expressed as a
probability, to control how much each one contributes to the final decision.
With these weights, users choose which method they trust most, although the
magic-bytes method is usually the most reliable of the 3. The benefit shows in
situations where file type identification is sensitive and an incorrect
identification is intolerable: some users might want every MIME type
identification method to arrive at the same file type before they start
processing those files.
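To make the idea concrete, here is a minimal sketch of how per-method weights
could be combined into a score for each candidate type. It is an illustration
only, not Tika's API: the class WeightedMimeVote, its methods, and the detector
names "magic", "extension" and "metadata" are all hypothetical. It assumes each
weight is the probability, strictly between 0 and 1, that the method is right,
that the methods are independent, and that only the types actually reported by
some method are candidates.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of weighted MIME type selection; not part of Tika. */
final class WeightedMimeVote {

    /**
     * @param votes detector name -> MIME type it reported (abstaining detectors omitted)
     * @param trust detector name -> assumed probability in (0, 1) that the detector is right
     * @return candidate MIME type -> score normalized over all candidates
     */
    static Map<String, Double> posterior(Map<String, String> votes,
                                         Map<String, Double> trust) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (String candidate : votes.values()) {
            if (scores.containsKey(candidate)) {
                continue; // already scored this type
            }
            double score = 1.0;
            for (Map.Entry<String, String> vote : votes.entrySet()) {
                // A detector without a configured weight is treated as neutral (0.5).
                double p = trust.getOrDefault(vote.getKey(), 0.5);
                // Agreeing detectors contribute p, disagreeing ones 1 - p.
                score *= vote.getValue().equals(candidate) ? p : 1.0 - p;
            }
            scores.put(candidate, score);
        }
        double total = 0.0;
        for (double s : scores.values()) {
            total += s;
        }
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            e.setValue(e.getValue() / total);
        }
        return scores;
    }

    /** Returns the top-scoring type, or empty if its score is below the threshold. */
    static Optional<String> best(Map<String, Double> posterior, double threshold) {
        String bestType = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Double> e : posterior.entrySet()) {
            if (e.getValue() > bestScore) {
                bestType = e.getKey();
                bestScore = e.getValue();
            }
        }
        return bestScore >= threshold ? Optional.ofNullable(bestType) : Optional.empty();
    }

    public static void main(String[] args) {
        String docx =
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
        Map<String, String> votes = new LinkedHashMap<>();
        votes.put("magic", "application/zip");
        votes.put("extension", docx);
        votes.put("metadata", docx);

        Map<String, Double> trust = new LinkedHashMap<>();
        trust.put("magic", 0.9);
        trust.put("extension", 0.6);
        trust.put("metadata", 0.5);

        Map<String, Double> result = posterior(votes, trust);
        System.out.println(result);
        System.out.println(best(result, 0.5));
    }
}
```

With a high weight on "magic", the magic-bytes answer wins even when the other
two methods disagree with it; lowering that weight and raising the others flips
the outcome. A threshold of 1.0 is only reached when every reporting method
names the same type, which covers the strict case where all methods must agree
before processing starts.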