Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "BaysianMimeTypeSelector" page has been changed by Lukeliush:
https://wiki.apache.org/tika/BaysianMimeTypeSelector?action=diff&rev1=3&rev2=4

  
  [[https://issues.apache.org/jira/browse/TIKA-1517|TIKA-1517 [MIME type 
selection with probability]]]
  
- The motivation is that the current implemenation within MimeTypes for 
detecting mime types in Tika is a bit stiff and less flexible(at the time the 
article is being written, the current version of MimeTypes which has 3 
detection approaches to identify mime types is implemented with a fall-back 
strategy), the detection highly depends on the magic byte detection and the 
last two approaches (i.e. extension and metatdatahint matching) are subsidiary 
and auxiliary in the final decsion. In other words, the decision that comes 
from the last two approach will not only be considered when there is a tie to 
break in the magic bytes detection as there might be multiple mime types 
estimated by magic bytes method, in this situation file extension and 
metadatahint will be used. It is also possible that in some situations the type 
given by the file extension and metadata hint matching are more specialized 
than magic bytes method, then the most specialized or specific type gets 
returned. This implementation seems to exhibt a bit inflexibilities in some 
situation where users a particular type of detection e.g. they might only trust 
their file extensions. Perhaps, in the future we might have more probablistic 
mime detection algorithms being considered for deploying into Tika, cerntainly 
from this perspective, the current implementation seems to give less space for 
extending with more detection methods in Tika.
+ The motivation is that the current implementation within MimeTypes for 
detecting mime types in Tika is somewhat rigid (at the time of writing, 
the current version of MimeTypes has 3 detection approaches and combines them 
with a fall-back strategy): the detection depends heavily on the magic byte 
detection.
  
- Therefore, it would be great to have something like this feature where user 
can add weights or preference on the method they want to use for detecitng mime 
types.
+ The last two approaches (i.e. extension and metadata hint matching) are 
subsidiary and auxiliary in the final detection decision. In other words, the 
decision that comes from the last two approaches will probably be considered 
when there is a tie to break in the magic bytes detection: the magic bytes 
method may estimate multiple mime types, and in this situation the file 
extension and metadata hint will be used.
  
+ It is also possible that in some situations the type given by file 
extension and metadata hint matching is more specialized than the one given by 
the magic bytes method; in that case the most specialized type gets returned. 
This implementation seems somewhat inflexible in situations where users prefer 
a particular type of detection, e.g. they might only trust or prefer their file 
extensions.
+ 
+ Perhaps in the future more probabilistic mime detection algorithms will be 
considered for deployment into Tika; from this perspective, the current 
implementation also seems to leave little room for extending Tika with more 
detection methods.
+ 
+ Therefore, it would be great to have a feature like TIKA-1517 where users can 
add weights or preferences to the detection methods they want to use for 
detecting mime types.
+ 
- More information with this intuition of the feature of probabilistic mime 
type selection can be found in Tika 1517. The algorithm behind this feature is 
actually a simple or naive baysian rule, we eventually compute 3 scores each of 
which corresponds to a detection method, the one that gives the maximum scores 
or posterior probability will be returned as final detected type.
+ More information on the intuition behind this feature, probabilistic mime 
type selection, can be found in TIKA-1517. The algorithm behind this feature is 
a simple (naive) Bayesian rule: we compute 3 scores, each of which corresponds 
to a detection method, and the type with the maximum score (posterior 
probability) is returned as the final detected type.
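
The scoring idea above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not the code from TIKA-1517: the class name, the Vote holder, and the example probabilities are all hypothetical, and each method's score is taken to be the Bayes posterior that its predicted type is correct, given a prior and two conditional probabilities chosen by the user.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of "score each detection method, return the best". */
public class BayesSelectorSketch {

    /** One method's prediction plus the trust values assigned to it (assumed names). */
    static final class Vote {
        final String type;             // mime type predicted by this method
        final double prior;            // prior that the predicted type is correct
        final double pHitGivenType;    // P(method predicts t | file really is t)
        final double pHitGivenNotType; // P(method predicts t | file is not t)

        Vote(String type, double prior, double pHitGivenType, double pHitGivenNotType) {
            this.type = type;
            this.prior = prior;
            this.pHitGivenType = pHitGivenType;
            this.pHitGivenNotType = pHitGivenNotType;
        }
    }

    /** Bayes rule: P(type is t | method predicted t). */
    static double posterior(double prior, double pHitGivenType, double pHitGivenNotType) {
        double numerator = pHitGivenType * prior;
        double evidence = numerator + pHitGivenNotType * (1.0 - prior);
        return evidence == 0.0 ? 0.0 : numerator / evidence;
    }

    /** Returns the type whose detection method has the highest posterior. */
    static String select(Map<String, Vote> votes) {
        String best = null;
        double bestScore = -1.0;
        for (Vote v : votes.values()) {
            double score = posterior(v.prior, v.pHitGivenType, v.pHitGivenNotType);
            if (score > bestScore) {
                bestScore = score;
                best = v.type;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Example trust values are made up; the extension is trusted most here.
        Map<String, Vote> votes = new LinkedHashMap<>();
        votes.put("magic", new Vote("application/xml", 0.5, 0.90, 0.10));
        votes.put("extension", new Vote("application/rss+xml", 0.5, 0.95, 0.05));
        votes.put("metadata", new Vote("text/plain", 0.5, 0.60, 0.40));
        System.out.println(select(votes));
    }
}
```

With a prior of 0.5 the posterior reduces to the ratio of the two conditional probabilities, so in this sketch the method given the highest trust determines the returned type.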
  
  The idea can also be illustrated simply with the SamIam Bayesian network 
toolkit, which is available for download at 
http://reasoning.cs.ucla.edu/samiam/index.php
  
- The following  screenshot with some examples illustrate the intuition and 
idea with the toolkit, please note under the hood, the same baysian network is 
implemented in this feature.
+ The following screenshot, with some examples, illustrates the intuition and 
idea with the toolkit. Please note that, under the hood, the same Bayesian 
network is implemented in this feature in Tika.
  
  The important thing to note is that each detection method needs to be 
assigned a conditional probability and a prior, both of which can be regarded 
as values of trust.
  
  Initially
  
- The prior for the type detected by each the detection method is 0.5, this 
force our prior estimation to be stochastic; The virtue of baysian rule is to 
incorperate the notion of condition probability, which needs to be specified by 
the user either by the own preference or the domain knowledege where they know 
for certain which method they want to trust most.
+ The prior for the type detected by each detection method is 0.5, which 
forces our prior estimation to be stochastic. The virtue of the Bayesian rule 
is to incorporate the notion of conditional probability, which in this case 
needs to be specified by the user, either from their own preference or from 
domain knowledge where they know for certain which method they want to trust 
most and how much they want to trust it. To find exact and intuitive values, 
the toolkit mentioned above can be used to explore conditional probability 
settings.
  
- The following shows the initial conditional probability value used in the 
default setting of feature in Tika.
+ The following shows the initial conditional probability values used in the 
default setting of the feature in Tika.
  
  // conditional probability: probability of the type estimated by the Magic 
test given that the type is the type predicted by the magic test.
  
@@ -60, +66 @@

  
  Kindly note that the metadata hint is usually present in the HTTP response 
header; if we give a file directly to the feature in Tika, the metadata hint is 
usually absent.
  
+ 
----------------------------------------------------------------------------------------------------------
+ 
+ How to use this feature in Tika (proposed use)
+ 
+ If a user wanted to use this feature, the following code would be needed.
+ 
+ {{{
+ tika = new Tika(new TikaConfig() {
+                 @Override
+                 protected Detector getDefaultDetector(MimeTypes types,
+                         ServiceLoader loader) {
+                     /*
+                      * here is an example with the use of the builder to
+                      * instantiate the object.
+                      */
+                     Builder builder =
+                             new ProbabilisticMimeDetectionSelector.Builder();
+                     ProbabilisticMimeDetectionSelector proDetector =
+                             new ProbabilisticMimeDetectionSelector(
+                             types, builder.priorMagicFileType(0.5f)
+                                     .priorExtensionFileType(0.5f)
+                                     .priorMetaFileType(0.5f));
+                     return new DefaultProbDetector(proDetector, loader);
+                 }
+             });
+ }}}
+ The idea is simply that we override getDefaultDetector() to provide the 
DefaultProbDetector, which extends CompositeDetector. A CompositeDetector 
(whose supertype is Detector) takes a list of detectors, and when its detect() 
method is called, each detector in the list is called sequentially, one after 
another. The original implementation of getDefaultDetector() in TikaConfig 
returns an instance of “DefaultDetector”, which also extends CompositeDetector 
and provides a list of detectors that includes “MimeTypes”, the native 
implementation with 3 detectors (i.e. magic bytes, extension and metadata 
hint). DefaultProbDetector replaces this MimeTypes with 
ProbabilisticMimeDetectionSelector.
+ 
+ In order to set the preferential weights, an instance of 
ProbabilisticMimeDetectionSelector can be created as in the example snippet 
above.
+ 
+ Alternatively, if we are happy to go with the default settings of 
ProbabilisticMimeDetectionSelector, it is fine to omit the arguments and call 
only “return new DefaultProbDetector();”.
+ If we do not want to write any extra code, the following can also be used.
+ 
+ {{{
+ /*
+              * an xml file needs to be given to tell where the detector is
+              * located; users can build their own detectors by including
+              * or excluding this feature or any detectors in the composite
+              * list at will.
+              */
+             tika = new Tika(new TikaConfig(
+                     new File("TIKA-detector-sample.xml")));
+ }}}
+ '''TIKA-detector-sample.xml'''
+ 
+ <?xml version="1.0" encoding="UTF-8"?>
+ <!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+   http://www.apache.org/licenses/LICENSE-2.0
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+ -->
+ <properties>
+   <detectors>
+     <detector class="org.apache.tika.detect.DefaultProbDetector"/>
+   </detectors>
+ </properties>
+ 
