[Tika Wiki] Update of "ContentMimeDetection" by Lukeliush

Apache Wiki Sun, 10 May 2015 14:26:07 -0700


The "ContentMimeDetection" page has been changed by Lukeliush:
https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=9&rev2=10

  
  In some files, some bytes occur far more often than others; in the extreme 
case, one or two bins hold most of the count. This leaves a large gap between 
the most frequent and the less frequent bins. The solution is to apply a 
companding function such as A-law or μ-law. Taking the square root of the bin 
values has a comparable effect at lower computational cost, so the square root 
is used in place of A-law or μ-law to bring out detail in the histogram.
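
The square-root companding described above can be sketched in Java as follows. This is a minimal illustration, not Tika code: the class name `SqrtCompanding` and both method names are hypothetical, and scaling by the largest bin before taking the square root is an assumption made here so that the output falls in [0, 1].

```java
import java.util.Arrays;

// Hypothetical helper, not part of Tika: illustrates square-root companding.
final class SqrtCompanding {

    /** Counts occurrences of each byte value (0..255) in the data. */
    static double[] histogram(byte[] data) {
        double[] bins = new double[256];
        for (byte b : data) {
            bins[b & 0xFF]++; // mask so the byte is treated as unsigned
        }
        return bins;
    }

    /**
     * Scales every bin by the largest bin (an assumption of this sketch),
     * then takes the square root to lift the small bins.
     */
    static double[] compand(double[] bins) {
        double max = Arrays.stream(bins).max().orElse(0.0);
        double[] out = new double[bins.length];
        if (max == 0.0) {
            return out; // empty input: every bin stays zero
        }
        for (int i = 0; i < bins.length; i++) {
            out[i] = Math.sqrt(bins[i] / max);
        }
        return out;
    }
}
```

With 100 occurrences of one byte and a single occurrence of another, the raw ratio between the two bins is 0.01; after companding it is 0.1, so the rare byte is easier to see in the histogram.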
  
- 
{{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||transform="rotate(0.00rad)",height="260px;",width="561px;",-webkit-transform="rotate(0.00rad)",border="none"}}
+ 
{{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||height="260px;",width="561px;"}}
  
  A-law companding function curve
+ 
+ 
{{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||transform="rotate(0.00rad)",height="260px;",width="561px;",-webkit-transform="rotate(0.00rad)",border="none"}}
  
  Square-root function curve
  
  The following figures show the difference.
  
+ 
{{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||transform="rotate(0.00rad)",height="260px;",width="561px;",-webkit-transform="rotate(0.00rad)",border="none"}}
+ 
  Byte frequencies '''without''' any companding.
+ 
+ 
{{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||transform="rotate(0.00rad)",height="260px;",width="561px;",-webkit-transform="rotate(0.00rad)",border="none"}}
  
  Byte frequencies with A-law
  
@@ -79, +85 @@

  alaw <- function(x, A=87.7){
  
   . th = 1/A
+  # element-wise & (not &&) so that each condition is a logical vector
+  cond1 <- (x >= 0 & x < th)
+  cond2 <- (x >= th & x <= 1)
+  x[cond1] <- A * abs(x[cond1]) / (1 + log(A))
+  x[cond2] <- sign(x[cond2]) * (1 + log(A * abs(x[cond2]))) / (1 + log(A))
+  x
-  cond1 <- (x>=0 && x < th)
-  cond2 <- (x>=th && x <= 1)
-  x[cond1] <- A * abs(x[cond1]) / (1+log(A))
-  x[cond2] <- sign(x[cond2])*(1+log(A*abs(x[cond2])))/(1+log(A)) x
  
  }
  
@@ -92, +95 @@

  
  For details of A-Law, please refer to 
http://en.wikipedia.org/wiki/A-law_algorithm.
  
+ 
{{https://lh6.googleusercontent.com/lMLfrOj-Esc8kHSAY_Ly2em1qJLPFFczaKn0jvWDT13-OJDn8JPmOu8pHO72jCpK6SDmskWhjONDPRFBPw6vM-wCOfRNDiLtBISJNawABNQGuABnR8hF2vA4cWMmFXDoRSwtMGI||transform="rotate(0.00rad)",height="251px;",width="541px;",-webkit-transform="rotate(0.00rad)",border="none"}}
+ 
  Byte frequencies with square root (power of 1/2)
+ 
+ 
{{https://lh6.googleusercontent.com/lMLfrOj-Esc8kHSAY_Ly2em1qJLPFFczaKn0jvWDT13-OJDn8JPmOu8pHO72jCpK6SDmskWhjONDPRFBPw6vM-wCOfRNDiLtBISJNawABNQGuABnR8hF2vA4cWMmFXDoRSwtMGI||transform="rotate(0.00rad)",height="251px;",width="541px;",-webkit-transform="rotate(0.00rad)",border="none"}}
  
  Byte frequencies with power of 1/3
  
@@ -142, +149 @@

  
  Your model might also use a different input size for the byte histogram; for 
example, you might choose a different bin size based on heuristics specific to 
your own data. In that case, you can override 
readByteFrequencies(final InputStream input) in TrainedModelDetector to supply 
your own byte histogram, and you must make sure the model parameters are set 
to match the same input size.
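
As an illustration of a histogram with a different bin size, here is a standalone sketch. It does not subclass TrainedModelDetector: the class name `BinnedHistogram`, the extra `numBins` parameter, and the `float[]` return type are all assumptions made for this example, since the text above gives only the method name and its InputStream argument.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical standalone helper; names and signature are assumptions.
final class BinnedHistogram {

    /**
     * Reads the whole stream and counts bytes into numBins equal-width bins.
     * numBins must divide 256 so that every bin covers the same byte range.
     */
    static float[] readByteFrequencies(InputStream input, int numBins) {
        if (numBins <= 0 || 256 % numBins != 0) {
            throw new IllegalArgumentException("numBins must divide 256");
        }
        int width = 256 / numBins;
        float[] bins = new float[numBins];
        byte[] buf = new byte[8192];
        try {
            int n;
            while ((n = input.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    bins[(buf[i] & 0xFF) / width]++; // unsigned byte value to bin index
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bins;
    }
}
```

Whatever bin count is chosen, the model's input layer has to expect a vector of that same length.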
  
+  . TrainedModelDetector implements the Detector interface, but it is 
abstract, so it must be subclassed; ExampleNNModelDetector is one such 
subclass. The purpose of subclassing is to implement loadDefaultModels, which 
reads the models and registers them in the model map <MediaType, TrainedModel> 
held by TrainedModelDetector. Once the map is populated, the detect method in 
TrainedModelDetector uses the loaded models to predict the mime type. 
TrainedModelDetector itself converts the given input stream into a byte 
frequency histogram and passes it as input to the models registered in the 
map. TrainedModel is an abstract class that represents a trained model, and 
NNTrainedModel is its subclass; a model object must provide a “predict” method 
that takes a byte histogram vector and returns a prediction probability.
+  . Classes for this feature (tika\tika-core\src\main\java):
+  . org.apache.tika.detect.TrainedModelDetector (abstract)
+  . org.apache.tika.detect.ExampleNNModelDetector
+  . org.apache.tika.detect.TrainedModel (abstract)
+  . org.apache.tika.detect.NNTrainedModel
+  . Example model file (tika\tika-core\src\main\resources):
+  . org.apache.tika.detect.tika-example.nnmodel
+  . Unit test (tika\tika-core\src\test\java)
-  . TrainedModelDetector implements the Detector interface, but it is abstract 
meaning we need to subclass it with our own version of TrainedModelDetector.
-  ExampleNNModelDetector is its subclass, the purpose of subclassing the 
TrainedModelDetector is to supply the implementation of the method of 
loadDefaultModels that reads and registers the models into the model map 
<MediaType, TrainedModel> in the TrainedModelDetector. Once the model map is 
populated with a set of mappings with keys and values, the detect method in the 
TrainedModelDetector will be able to use the loaded models to predict the mime 
types.
-  The job of the TrainedModelDetector is to convert the given input stream to 
byte frequency histogram and pass that as the input to the models that have 
been loaded or registered in the map.
-  There is also a TrainedModel(abstract) and its subclass NNTrainedModel.
-  The TrainedModel is an abstract class that represents an abstraction of a 
trained model; a model object must have a method of “predict” with input of 
byte histogram vector, it returns a probability of prediction. The following 
lists all of the classes for this feature (tika\tika-core\src\main\java)
-  org.apache.tika.detect.TrainedModelDetector (abstract) 
org.apache.tika.detect.ExampleNNModelDetector
-  org.apache.tika.detect.TrainedModel (abstract) 
org.apache.tika.detect.NNTrainedModel Example model file 
(tika\tika-core\src\main\resources) org.apache.tika.detect.tika-example.nnmodel 
Unit test (tika\tika-core\src\test\java)
    . org.apache.tika.detect.MimeDetectionWithNNTest
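
The relationship between the detector, the model map, and the models can be sketched as below. This is not the Tika source: `SketchModel`, `SketchDetector`, `ExampleSketchDetector`, and `register` are hypothetical stand-ins, media types are plain strings rather than MediaType objects, and picking the model with the highest probability (falling back to application/octet-stream) is an assumption about how detect might use the loaded models.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for an abstract trained model.
abstract class SketchModel {
    /** Returns a probability for the given byte histogram vector. */
    abstract double predict(double[] histogram);
}

// Hypothetical stand-in for an abstract detector that owns the model map.
abstract class SketchDetector {
    private final Map<String, SketchModel> models = new LinkedHashMap<>();

    protected void register(String mediaType, SketchModel model) {
        models.put(mediaType, model);
    }

    /** Subclasses supply the models, as the text describes for loadDefaultModels. */
    protected abstract void loadDefaultModels();

    /** Assumption of this sketch: the media type with the highest probability wins. */
    String detect(double[] histogram) {
        String best = "application/octet-stream";
        double bestP = 0.0;
        for (Map.Entry<String, SketchModel> e : models.entrySet()) {
            double p = e.getValue().predict(histogram);
            if (p > bestP) {
                bestP = p;
                best = e.getKey();
            }
        }
        return best;
    }
}

final class ExampleSketchDetector extends SketchDetector {
    ExampleSketchDetector() {
        loadDefaultModels();
    }

    @Override
    protected void loadDefaultModels() {
        // Toy model: the share of the histogram mass held by the bin for 'a' (0x61).
        register("text/x-toy", new SketchModel() {
            @Override
            double predict(double[] h) {
                double sum = 0.0;
                for (double v : h) {
                    sum += v;
                }
                return sum == 0.0 ? 0.0 : h[0x61] / sum;
            }
        });
    }
}
```

The toy model only shows where a real trained model would plug in; it carries no learned parameters.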
