Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "BaysianMimeTypeSelector" page has been changed by Lukeliush:
https://wiki.apache.org/tika/BaysianMimeTypeSelector?action=diff&rev1=6&rev2=7

  
  Therefore, it would be great to have a feature like Tika-1517 where users can 
add weights or preferences for the detection methods they want to use for 
detecting mime types.
  
- More information on the intuition behind this feature, probabilistic mime 
type selection, can be found in Tika 1517. The algorithm behind this feature is 
a simple (naive) Bayesian rule: we compute 3 scores, each of which corresponds 
to a detection method, and the one that gives the maximum scores or posterior 
probability is returned as the final detected type.
+ More information on the intuition behind this feature, probabilistic mime 
type selection, can be found in Tika 1517. The algorithm behind this feature is 
a simple (naive) Bayesian rule: we compute 3 scores, each of which corresponds 
to a detection method, and the one that gives the maximum score or posterior 
probability is returned as the final detected type.
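
As a rough illustration of the rule described above, the following minimal 
sketch computes one posterior per detection method and returns the type with 
the highest one. It is not Tika's implementation: the class and field names 
are hypothetical, the false-positive rate P(method reports type | type is 
wrong) is an assumption that the page does not spell out, and every number 
except the 0.5 prior and the 0.9 magic trust is made up for the example.

```java
/** Hypothetical sketch of max-posterior selection; not Tika's actual class. */
final class PosteriorSelectorSketch {

    /** The type proposed by one detection method, plus the trust placed in it. */
    static final class Evidence {
        final String type;
        final double prior;         // P(type is correct) before consulting the method
        final double trust;         // P(method reports type | type is correct)
        final double falsePositive; // assumed: P(method reports type | type is wrong)

        Evidence(String type, double prior, double trust, double falsePositive) {
            this.type = type;
            this.prior = prior;
            this.trust = trust;
            this.falsePositive = falsePositive;
        }

        /** Bayes' rule for a binary hypothesis (type is correct vs. wrong). */
        double posterior() {
            double hit = trust * prior;
            double miss = falsePositive * (1.0 - prior);
            return hit / (hit + miss);
        }
    }

    /** Returns the type whose posterior is largest, or null if none is given. */
    static String select(Evidence... candidates) {
        Evidence best = null;
        for (Evidence e : candidates) {
            if (best == null || e.posterior() > best.posterior()) {
                best = e;
            }
        }
        return best == null ? null : best.type;
    }

    public static void main(String[] args) {
        // One candidate per detection method; extension and metadata values
        // are illustrative only.
        Evidence magic = new Evidence("application/pdf", 0.5, 0.9, 0.1);
        Evidence extension = new Evidence("text/plain", 0.5, 0.8, 0.2);
        Evidence metadata = new Evidence("text/html", 0.5, 0.75, 0.25);
        System.out.println(select(magic, extension, metadata));
    }
}
```

With these example numbers the magic-based candidate has the largest posterior, 
so its type is the one selected.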
  
  The idea can also be illustrated with the SamIam Bayesian network toolkit, 
which is available for download at 
http://reasoning.cs.ucla.edu/samiam/index.php
  
- The following screenshot with some examples illustrate the intuition and 
idea with the toolkit. Please note that, under the hood, the same Bayesian 
network is implemented in this feature in Tika.
+ The following screenshot with some examples illustrates the intuition and 
idea with the toolkit. Please note that, under the hood, the same Bayesian 
network is implemented in this feature in Tika.
  
  The important thing to note is that each detection method needs to be 
assigned a conditional probability and a prior, both of which can be regarded 
as values of trust.
  
  Initially
  
- The prior for the type detected by each detection method is 0.5, which 
forces our prior estimation to be stochastic. The virtue of the Bayesian rule 
is that it incorporates the notion of conditional probability, which in this 
case needs to be specified by the user either by the own preference or the 
domain knowledge where they know for certain which method they want to trust 
most and how much they want to trust it. In order to the exact and intuitive 
value, the toolkit mentioned above can be used to explore their conditional 
probability settings.
+ The prior for the type detected by each detection method is 0.5, which 
forces our prior estimation to be stochastic. The virtue of the Bayesian rule 
is that it incorporates the notion of conditional probability, which in this 
case needs to be specified by users in terms of their own preference or the 
domain knowledge where they know for certain which method they want to trust 
most and how much they want to trust it. In order to estimate or find the 
exact and intuitive value, the toolkit mentioned above can be used to explore 
their conditional probability settings.
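
To make the role of the 0.5 prior concrete, here is a short worked equation. 
It is only a sketch under stated assumptions: T denotes "the type proposed by 
a method is correct", E denotes "the method reports that type", and the 
false-positive term P(E | not T) is an assumed quantity that this page does 
not define. The value 0.9 is the default magic trust quoted on this page; the 
value 0.1 is made up for illustration.

```latex
\[
P(T \mid E)
  = \frac{P(E \mid T)\,P(T)}
         {P(E \mid T)\,P(T) + P(E \mid \neg T)\,\bigl(1 - P(T)\bigr)}
\]
% With the uniform prior P(T) = 0.5, the prior cancels:
\[
P(T \mid E) = \frac{P(E \mid T)}{P(E \mid T) + P(E \mid \neg T)}
\]
% Example with P(E | T) = 0.9 and an assumed P(E | not T) = 0.1:
\[
P(T \mid E) = \frac{0.9}{0.9 + 0.1} = 0.9
\]
```

Under these assumptions the initial ranking of the methods is driven entirely 
by the conditional probabilities the user supplies, since the equal priors 
drop out.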
  
- The following shows the initial conditional probability values used in the 
default setting of feature in Tika.
+ The following shows the initial conditional probability values used in the 
default setting for the feature in Tika.
  
  {{{
  /* conditional probability: probability of the type estimated by the Magic
   * test given that the type is the type predicted by the magic test. */
  private static final float DEFAULT_MAGIC_TRUST = 0.9f;
