[Tika Wiki] Update of "BaysianMimeTypeSelector" by Lukeliush

Apache Wiki Sun, 26 Apr 2015 15:17:03 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.


The "BaysianMimeTypeSelector" page has been changed by Lukeliush:
https://wiki.apache.org/tika/BaysianMimeTypeSelector?action=diff&rev1=1&rev2=2

  
  [[https://issues.apache.org/jira/browse/TIKA-1517|TIKA-1517 [MIME type 
selection with probability]]]
  
- The motivation is that the current implemenation within MimeTypes for 
detecting mime types in Tika is a bit stiff and less flexible(at the time the 
article is being written, the current version of MimeTypes which has 3 
detection approaches to identify mime types is implemented with a fall-back 
strategy), the detection highly depends on the magic byte detection and the 
last two approaches (i.e. extension and metatdatahint matching) are subsidiary 
and auxiliary in the final decsion. In other words, the decision that comes 
from the last two approach will not only be considered when there is a tie to 
break in the magic bytes detection as there might be multiple mime types 
estimated by magic bytes method, in this situation extension and metadatahint 
will be used. It is possible that in some situation the type given by extension 
and metadata hint matching are more specialized than magic bytes method, then 
the most specialized or specific type gets returned. This implementation seems 
to exhibt a bit inflexibilities when users want to
+ The motivation is that the current implementation of mime type detection 
within MimeTypes in Tika is somewhat rigid. At the time of writing, MimeTypes 
combines 3 detection approaches with a fall-back strategy: detection depends 
heavily on the magic bytes, and the last two approaches (i.e. file extension 
and metadata hint matching) are only auxiliary in the final decision. In other 
words, the results of the last two approaches are considered only when there 
is a tie to break in the magic bytes detection, since the magic bytes method 
may estimate multiple mime types; in that situation the file extension and 
metadata hint are used. It is also possible that the type given by file 
extension or metadata hint matching is more specialized than the one given by 
the magic bytes method, in which case the most specific type is returned. This 
implementation is inflexible in situations where users prefer a particular 
type of detection, e.g. they might only trust their file extensions. In the 
future, more probabilistic mime detection algorithms might be considered for 
Tika, and from this perspective the current implementation leaves little room 
for extending Tika with more detection methods.
  
+ Therefore, it would be useful to have a feature that lets users assign 
weights or preferences to the methods they want to use for detecting mime 
types.
  
+ More background on probabilistic mime type selection can be found in 
TIKA-1517. The algorithm behind this feature is a simple (naive) Bayesian 
rule: we compute 3 scores, one per detection method, and the type with the 
maximum score (posterior probability) is returned as the final detected type.
  
+ The idea can also be illustrated with the SamIam Bayesian network toolkit, 
which is available for download at 
http://reasoning.cs.ucla.edu/samiam/index.php
  
+ The following screenshots illustrate the idea with the toolkit; please note 
that, under the hood, the same Bayesian network is implemented in this 
feature.
+ 
+ The important thing to note is that each detection method needs to be 
assigned a conditional probability and a prior, both of which can be regarded 
as a measure of trust.
+ 
+ Initially
+ 
+ The prior for the type detected by each detection method is 0.5, which keeps 
our prior estimate neutral. The virtue of the Bayesian rule is that it 
incorporates the notion of conditional probability, which needs to be 
specified by the user, either from their own preference or from domain 
knowledge where they know for certain which method they want to trust most.
+ 
+ The following shows the initial conditional probability values used in the 
default setting of the feature in Tika.
+ 
+ // conditional probability: probability of the type estimated by the Magic 
test given that the type is the type predicted by the magic test.
+ 
+ private static final float DEFAULT_MAGIC_TRUST = 0.9f;
+ 
+ // conditional probability: probability of the type estimated by the metadata 
hint test given that the type is the type predicted by the metadata hint test.
+ 
+ private static final float DEFAULT_META_TRUST = 0.8f;
+ 
+ // conditional probability: probability of the type estimated by the 
extension test given that the type is the type predicted by the extension test.
+ 
+ private static final float DEFAULT_EXTENSION_TRUST = 0.8f;
+ 
+ // conditional probability: probability of the type Not estimated by the 
Magic test given that the type is not the type predicted by the magic test.
+ 
+ 0.9f
+ 
+ // conditional probability: probability of the type Not estimated by the 
metadata test given that the type is not the type predicted by the metadata test.
+ 
+ 0.8f
+ 
+ // conditional probability: probability of the type Not estimated by the 
extension test given that the type is not the type predicted by the extension 
test.
+ 
+ 0.8f
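+ 
+ As a hedged sketch of how these values combine under a naive Bayesian rule: the symbols below are assumptions introduced for illustration and do not appear in the Tika code. Here t is a candidate type, o_i is what test i reports about t, P(t) is the prior, and a test that returns no usable type is assumed to contribute no factor.

```latex
P(t \mid o_1, o_2, o_3)
  = \frac{P(t)\,\prod_{i} P(o_i \mid t)}
         {P(t)\,\prod_{i} P(o_i \mid t) \;+\; P(\neg t)\,\prod_{i} P(o_i \mid \neg t)}

% Worked case: only the magic test reports t, with prior 0.5 and both
% conditional probabilities of the magic test equal to 0.9.
P(t \mid \text{magic reports } t)
  = \frac{0.5 \times 0.9}{0.5 \times 0.9 + 0.5 \times (1 - 0.9)}
  = \frac{0.45}{0.50}
  = 0.90
```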
+ 
+ 
{{https://lh4.googleusercontent.com/wUrj6keimRl2xjZjxutax0jXyhIbHBkOtNZTwLoHmU2q_VFq_fVxLBP1YRv29teUZdbdvXunj7xiTtC3GLtCErsg_yz2NaftfVO9YF9rmdD18QRgaHoYJ21k-kEb2WrOw6NST8U||height="381px;",width="563px;"}}
+ 
+ The above shows the case where the file extension and contentMetadata 
matching methods fail to detect a type (i.e. they both return byte-stream as 
the type) and only the magic test returns a non-byte-stream type; the type 
estimated by the magic test then has a posterior probability of 90%.
+ 
+ 
{{https://lh5.googleusercontent.com/FO5VfTBjcTuogGvGz0XecZ_3oSABCNdswjFum-ILO1_37VPK6WiSD8vXMggPNPvGRagyXvrymeQU9iM9Ta69vVJyLT7guSE-6srV5efQMHKHSDdO0H_ZU4tjHaga2wlY_c_QBEE||height="364px;",width="562px;"}}
+ 
+ The above screenshot shows the case where the magic test and the file 
extension method do not agree on the same type; the posterior probability of 
the type estimated by the magic test is 69.23%.
+ 
+ 
{{https://lh4.googleusercontent.com/3Q68oP4-IFoAP5woGiqiaSLN_JwN11ZBylS2l9NlmPQg16ZxGezBK6vxBxl4sEz5xOgy_A-2pfAMgz2bmy4gEPu829tYAnnWG03rQaV-tLKs_j_B-WZ5kwGwF_oOSJeGDwF6SwU||height="368px;",width="595px;"}}
+ 
+ This one shows the posterior probability of the type estimated by the 
extension matching method when the magic test does not agree with it on the 
same type; the type predicted by the extension (glob) test has the lower 
posterior probability, 30.77%.
+
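+ The three posterior values quoted above (90%, 69.23%, 30.77%) can be reproduced with a small standalone sketch. This is not the Tika implementation: the class PosteriorSketch, the Vote enum and the posterior method are hypothetical names, doubles are used instead of floats, and it assumes that a test returning byte-stream (or not mentioned in a screenshot) abstains and contributes no evidence.

```java
import java.util.Locale;

public class PosteriorSketch {
    /** What one detection test says about a candidate type (hypothetical helper). */
    enum Vote { AGREES, DISAGREES, ABSTAINS }

    /**
     * Naive Bayes posterior that the candidate type is the true type.
     * trust[i]    = P(test i reports the type | it is the type)
     * negTrust[i] = P(test i does not report the type | it is not the type)
     */
    static double posterior(double prior, Vote[] votes, double[] trust, double[] negTrust) {
        double isType = prior;
        double notType = 1.0 - prior;
        for (int i = 0; i < votes.length; i++) {
            switch (votes[i]) {
                case AGREES:
                    isType *= trust[i];
                    notType *= 1.0 - negTrust[i];
                    break;
                case DISAGREES:
                    isType *= 1.0 - trust[i];
                    notType *= negTrust[i];
                    break;
                default: // ABSTAINS: assumed to add no evidence
                    break;
            }
        }
        return isType / (isType + notType);
    }

    static String percent(double p) {
        return String.format(Locale.ROOT, "%.2f%%", p * 100.0);
    }

    public static void main(String[] args) {
        // Order of tests: magic, metadata hint, extension (default values from the page).
        double[] trust = {0.9, 0.8, 0.8};
        double[] negTrust = {0.9, 0.8, 0.8};
        double prior = 0.5;

        // Only the magic test reports the type.
        Vote[] magicOnly = {Vote.AGREES, Vote.ABSTAINS, Vote.ABSTAINS};
        // Candidate is the magic type; the extension test reports a different type.
        Vote[] magicType = {Vote.AGREES, Vote.ABSTAINS, Vote.DISAGREES};
        // Candidate is the extension type; the magic test reports a different type.
        Vote[] extensionType = {Vote.DISAGREES, Vote.ABSTAINS, Vote.AGREES};

        System.out.println(percent(posterior(prior, magicOnly, trust, negTrust)));     // 90.00%
        System.out.println(percent(posterior(prior, magicType, trust, negTrust)));     // 69.23%
        System.out.println(percent(posterior(prior, extensionType, trust, negTrust))); // 30.77%
    }
}
```

+ Under these assumptions the magic type wins in the conflicting case (69.23% against 30.77%); raising the extension values above the magic values in the two arrays would reverse that outcome, which is the kind of preference the feature is meant to expose.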
