Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "BaysianMimeTypeSelector" page has been changed by Lukeliush:
https://wiki.apache.org/tika/BaysianMimeTypeSelector?action=diff&rev1=6&rev2=7

  Therefore, it would be great to have a feature like Tika-1517 where users can add weights or preferences on the detection method they want to use for detecting mime types.

- More information with this intuition of the feature - probabilistic mime type selection can be found in Tika 1517. The algorithm behind this feature is actually a simple or naive Bayesian rule, we eventually compute 3 scores each of which corresponds to a detection method, the one that gives the maximum scores or posterior probability will be returned as final detected type.
+ More information with this intuition of the feature - probabilistic mime type selection can be found in Tika 1517. The algorithm behind this feature is actually a simple or naive Bayesian rule, we eventually compute 3 scores each of which corresponds to a detection method, the one that gives the maximum score or posterior probability will be returned as final detected type.

  The idea can be also simply illustrated with the SamIam Bayesian network toolkit which is available for download on http://reasoning.cs.ucla.edu/samiam/index.php

- The following screenshot with some examples illustrate the intuition and idea with the toolkit, please note under the hood, the same Bayesian network is implemented in this feature in Tika.
+ The following screenshot with some examples illustrates the intuition and idea with the toolkit, please note under the hood, the same Bayesian network is implemented in this feature in Tika.

  The important thing to be noted is that each detection method needs to be assigned with a conditional probability and prior, which could be all regarded as a value of trust.
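As a rough illustration of the scoring idea described above, the following Java sketch applies Bayes' rule to three detection methods and returns the candidate with the highest posterior. It is only a sketch under stated assumptions, not the code in Tika: the class and method names, the candidate labels, the 0.8 trust values, and the false-positive rates are all hypothetical. Only the 0.5 prior and the 0.9 magic trust come from the page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BayesianSelectionSketch {

    /**
     * Posterior P(type is T | method predicted T) by Bayes' rule.
     * trust         = P(method predicts T | type is T)
     * falsePositive = P(method predicts T | type is not T)  (hypothetical parameter)
     */
    public static double posterior(double prior, double trust, double falsePositive) {
        double evidence = trust * prior + falsePositive * (1.0 - prior);
        return trust * prior / evidence;
    }

    /** Returns the key with the highest score, or null for an empty map. */
    public static String selectBest(Map<String, Double> scores) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double prior = 0.5; // prior from the page

        // One score per detection method; labels and non-magic values are made up.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("magic: application/pdf", posterior(prior, 0.9, 0.1));
        scores.put("extension: text/plain", posterior(prior, 0.8, 0.2));
        scores.put("metadata: text/html", posterior(prior, 0.8, 0.2));

        System.out.println(selectBest(scores));
    }
}
```

With a prior of 0.5 and a false-positive rate of one minus the trust, each posterior reduces to the trust value itself, so in this sketch the method with the highest trust wins whenever the methods disagree.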
  Initially

- The prior for the type detected by each detection method is 0.5, this forces our prior estimation to be stochastic; The virtue of the Bayesian rule is to incorporate the notion of conditional probability, which in this case needs to be specified by the user either by the own preference or the domain knowledge where they know for certain which method they want to trust most and how much they want to trust; In order to the exact and intuitive value, this mentioned toolkit can be used to explore their conditional probability settings.
+ The prior for the type detected by each detection method is 0.5, this forces our prior estimation to be stochastic; The virtue of the Bayesian rule is to incorporate the notion of conditional probability, which in this case needs to be specified by users in terms of their own preference or the domain knowledge where they know for certain which method they want to trust most and how much they want to trust; In order to estimate or find the exact and intuitive value, this mentioned toolkit can be used to explore their conditional probability settings.

- The following shows the initial conditional probability values used in the default setting of feature in Tika.
+ The following shows the initial conditional probability values used in the default setting for the feature in Tika.

  {{{
      /*
       * conditional probability: probability of the type estimated by the Magic
       * test given that the type is the type predicted by the magic test.
       */
      private static final float DEFAULT_MAGIC_TRUST = 0.9f;
