The "BaysianMimeTypeSelector" page has been changed by Lukeliush:
https://wiki.apache.org/tika/BaysianMimeTypeSelector?action=diff&rev1=1&rev2=2

  [[https://issues.apache.org/jira/browse/TIKA-1517|TIKA-1517 [MIME type selection with probability]]]

- The motivation is that the current implementation of MIME type detection within MimeTypes in Tika is somewhat rigid. At the time of writing, MimeTypes has three detection approaches, implemented as a fall-back strategy. Detection depends heavily on magic-byte detection, and the last two approaches (i.e. extension and metadata hint matching) are subsidiary and auxiliary in the final decision. In other words, the decisions from the last two approaches are considered only when there is a tie to break in magic-byte detection: the magic-byte method may estimate multiple MIME types, and in that situation the extension and metadata hint are used. It is possible that the type given by extension or metadata hint matching is more specialized than the one from the magic-byte method, in which case the most specialized type is returned. This implementation seems to exhibit some inflexibility when users want to
+ The motivation is that the current implementation of MIME type detection within MimeTypes in Tika is somewhat rigid. At the time of writing, MimeTypes has three detection approaches, implemented as a fall-back strategy. Detection depends heavily on magic-byte detection, and the last two approaches (i.e. extension and metadata hint matching) are subsidiary and auxiliary in the final decision.
In other words, the decisions from the last two approaches are considered only when there is a tie to break in magic-byte detection: the magic-byte method may estimate multiple MIME types, and in that situation the file extension and metadata hint are used. It is also possible that the type given by file extension or metadata hint matching is more specialized than the one from the magic-byte method, in which case the most specialized type is returned. This implementation is inflexible in situations where users prefer a particular detection method, e.g. they might trust only their file extensions. In the future, more probabilistic MIME detection algorithms might be considered for deployment in Tika, and from this perspective the current implementation leaves little room for extending Tika with more detection methods.
+ Therefore, it would be great to have a feature that lets users add weights or preferences to the methods they want to use for detecting MIME types.
+ More information on the intuition behind probabilistic MIME type selection can be found in TIKA-1517. The algorithm behind this feature is a simple (naive) Bayesian rule: we compute 3 scores, one per detection method, and the type with the maximum score (posterior probability) is returned as the final detected type.
+ The idea can also be illustrated with the SamIam Bayesian network toolkit, which is available for download at http://reasoning.cs.ucla.edu/samiam/index.php
+ The following screenshots, with some examples, illustrate the intuition and idea with the toolkit. Please note that, under the hood, the same Bayesian network is implemented in this feature.
+
+ The important thing to note is that each detection method needs to be assigned a conditional probability and a prior, both of which can be regarded as values of trust.
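The naive Bayesian rule described above can be written out explicitly. The block below is only one reading of it, not a formula taken from TIKA-1517 or the Tika source. The symbols H, p, e_i, t_i and n_i are notation introduced here for illustration.

```latex
% Requires amsmath for the cases environment.
% H   : the candidate type is the true type; p = P(H) is its prior.
% e_i : 1 if detection method i reported the candidate type, 0 if it reported another type.
% t_i = P(e_i = 1 | H)       (trust that method i reports the type when it is correct)
% n_i = P(e_i = 0 | \neg H)  (trust that method i does not report the type when it is wrong)
\[
P(H \mid e_1, \dots, e_k)
  = \frac{p \prod_{i} P(e_i \mid H)}
         {p \prod_{i} P(e_i \mid H) + (1 - p) \prod_{i} P(e_i \mid \neg H)}
\]
\[
P(e_i \mid H) =
  \begin{cases} t_i & e_i = 1 \\ 1 - t_i & e_i = 0 \end{cases}
\qquad
P(e_i \mid \neg H) =
  \begin{cases} 1 - n_i & e_i = 1 \\ n_i & e_i = 0 \end{cases}
\]
```

Under this reading, a method that returns no usable type simply contributes no factor to either product.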
+
+ Initially, the prior for the type detected by each detection method is 0.5, which forces our prior estimate to be stochastic. The virtue of the Bayesian rule is that it incorporates the notion of conditional probability, which needs to be specified by the user, either from their own preference or from domain knowledge where they know for certain which method they want to trust most.
+
+ The following shows the initial conditional probability values used in the default setting of the feature in Tika.
+
+ // conditional probability: probability of the type estimated by the magic test given that the type is the type predicted by the magic test.
+ private static final float DEFAULT_MAGIC_TRUST = 0.9f;
+
+ // conditional probability: probability of the type estimated by the metadata hint test given that the type is the type predicted by the metadata hint test.
+ private static final float DEFAULT_META_TRUST = 0.8f;
+
+ // conditional probability: probability of the type estimated by the extension test given that the type is the type predicted by the extension test.
+ private static final float DEFAULT_EXTENSION_TRUST = 0.8f;
+
+ // conditional probability: probability of the type not estimated by the magic test given that the type is not the type predicted by the magic test.
+ 0.9f
+
+ // conditional probability: probability of the type not estimated by the metadata test given that the type is not the type predicted by the metadata test.
+ 0.8f
+
+ // conditional probability: probability of the type not estimated by the extension test given that the type is not the type predicted by the extension test.
+ 0.8f
+
+ {{https://lh4.googleusercontent.com/wUrj6keimRl2xjZjxutax0jXyhIbHBkOtNZTwLoHmU2q_VFq_fVxLBP1YRv29teUZdbdvXunj7xiTtC3GLtCErsg_yz2NaftfVO9YF9rmdD18QRgaHoYJ21k-kEb2WrOw6NST8U||height="381px;",width="563px;"}}
+
+ The screenshot above shows the case where the file extension and content metadata matching methods fail to detect a type (i.e. they both return byte-stream as the type) and only the magic test returns a non-byte-stream type. In that case the type estimated by the magic test has the higher posterior probability, 90%.
+
+ {{https://lh5.googleusercontent.com/FO5VfTBjcTuogGvGz0XecZ_3oSABCNdswjFum-ILO1_37VPK6WiSD8vXMggPNPvGRagyXvrymeQU9iM9Ta69vVJyLT7guSE-6srV5efQMHKHSDdO0H_ZU4tjHaga2wlY_c_QBEE||height="364px;",width="562px;"}}
+
+ The screenshot above shows the case where the magic test and the file extension method do not agree on the same type. The posterior probability of the type estimated by the magic test is 69.23%.
+
+ {{https://lh4.googleusercontent.com/3Q68oP4-IFoAP5woGiqiaSLN_JwN11ZBylS2l9NlmPQg16ZxGezBK6vxBxl4sEz5xOgy_A-2pfAMgz2bmy4gEPu829tYAnnWG03rQaV-tLKs_j_B-WZ5kwGwF_oOSJeGDwF6SwU||height="368px;",width="595px;"}}
+
+ This screenshot shows the same disagreement from the other side: the posterior probability of the type estimated by the extension matching (glob) test. Because the magic test does not agree with it on the same type, the posterior of the type predicted by the glob test is lower, at 30.77%.
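The three posteriors quoted for the screenshots can be checked by hand with a small calculation. The sketch below is not the Tika implementation: the class `BayesianSelectionSketch` and its `posterior` method are hypothetical names, and it assumes a binary "candidate type is correct" hypothesis with prior 0.5, the default trust values listed above, and that a method returning byte-stream contributes no evidence.

```java
import java.util.Locale;

// Hypothetical illustration only; none of these names are Tika API.
final class BayesianSelectionSketch {

    // Default values quoted in the text (prior 0.5, magic 0.9, extension 0.8).
    static final double PRIOR = 0.5;
    static final double MAGIC_TRUST = 0.9;
    static final double EXTENSION_TRUST = 0.8;

    /**
     * Posterior probability that a candidate type is the true type.
     * trust[i]    = P(method i reports the candidate | candidate is correct)
     * negTrust[i] = P(method i does not report the candidate | candidate is wrong)
     * agrees[i]   = TRUE if method i reported the candidate, FALSE if it reported
     *               another type, null if it gave no evidence (assumption).
     */
    static double posterior(double prior, double[] trust, double[] negTrust, Boolean[] agrees) {
        double ifCorrect = prior;       // joint weight under "candidate is correct"
        double ifWrong = 1.0 - prior;   // joint weight under "candidate is wrong"
        for (int i = 0; i < agrees.length; i++) {
            if (agrees[i] == null) {
                continue;               // no evidence from this method
            }
            if (agrees[i]) {
                ifCorrect *= trust[i];
                ifWrong *= 1.0 - negTrust[i];
            } else {
                ifCorrect *= 1.0 - trust[i];
                ifWrong *= negTrust[i];
            }
        }
        return ifCorrect / (ifCorrect + ifWrong);
    }

    public static void main(String[] args) {
        // Index 0 is the magic test, index 1 is the extension test.
        double[] trust = {MAGIC_TRUST, EXTENSION_TRUST};
        double[] negTrust = {MAGIC_TRUST, EXTENSION_TRUST};

        // Only the magic test reports a type.
        double magicOnly = posterior(PRIOR, trust, negTrust, new Boolean[] {true, null});
        // Candidate is the magic type; the extension test reports a different type.
        double magicVsExt = posterior(PRIOR, trust, negTrust, new Boolean[] {true, false});
        // Candidate is the extension type; the magic test reports a different type.
        double extVsMagic = posterior(PRIOR, trust, negTrust, new Boolean[] {false, true});

        System.out.println(String.format(Locale.ROOT, "%.2f%%", 100 * magicOnly));
        System.out.println(String.format(Locale.ROOT, "%.2f%%", 100 * magicVsExt));
        System.out.println(String.format(Locale.ROOT, "%.2f%%", 100 * extVsMagic));
    }
}
```

With these assumptions the three values come out as 0.45/0.50, 0.09/0.13 and 0.04/0.13, which match the 90%, 69.23% and 30.77% shown in the screenshots. Raising `EXTENSION_TRUST` above `MAGIC_TRUST` would flip the second and third results, which is the kind of user preference the feature is meant to allow.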
