Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "ContentMimeDetection" page has been changed by Lukeliush:
https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=10&rev2=11

  
  A-law companding function curve
  
- 
{{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||transform="rotate(0.00rad)",height="260px;",width="561px;",-webkit-transform="rotate(0.00rad)",border="none"}}
+ 
{{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||height="260px;",width="561px;"}}
  
  Square-root function curve
  
  The following shows the difference
  
- 
{{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||transform="rotate(0.00rad)",height="260px;",width="561px;",-webkit-transform="rotate(0.00rad)",border="none"}}
+ 
{{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||height="260px;",width="561px;"}}
  
  Byte frequencies '''without''' any companding.
  
- 
{{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||transform="rotate(0.00rad)",height="260px;",width="561px;",-webkit-transform="rotate(0.00rad)",border="none"}}
+ 
{{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||height="260px;",width="561px;"}}
  
  Byte frequencies with A-law
  
@@ -95, +95 @@

  
  For details of A-Law, please refer to 
http://en.wikipedia.org/wiki/A-law_algorithm.
  
- 
{{https://lh6.googleusercontent.com/lMLfrOj-Esc8kHSAY_Ly2em1qJLPFFczaKn0jvWDT13-OJDn8JPmOu8pHO72jCpK6SDmskWhjONDPRFBPw6vM-wCOfRNDiLtBISJNawABNQGuABnR8hF2vA4cWMmFXDoRSwtMGI||transform="rotate(0.00rad)",height="251px;",width="541px;",-webkit-transform="rotate(0.00rad)",border="none"}}
+ 
{{https://lh6.googleusercontent.com/lMLfrOj-Esc8kHSAY_Ly2em1qJLPFFczaKn0jvWDT13-OJDn8JPmOu8pHO72jCpK6SDmskWhjONDPRFBPw6vM-wCOfRNDiLtBISJNawABNQGuABnR8hF2vA4cWMmFXDoRSwtMGI||height="251px;",width="541px;"}}
  
  Byte frequencies with square root (power of 1/2)
  
- 
{{https://lh6.googleusercontent.com/lMLfrOj-Esc8kHSAY_Ly2em1qJLPFFczaKn0jvWDT13-OJDn8JPmOu8pHO72jCpK6SDmskWhjONDPRFBPw6vM-wCOfRNDiLtBISJNawABNQGuABnR8hF2vA4cWMmFXDoRSwtMGI||transform="rotate(0.00rad)",height="251px;",width="541px;",-webkit-transform="rotate(0.00rad)",border="none"}}
+ 
{{https://lh6.googleusercontent.com/lMLfrOj-Esc8kHSAY_Ly2em1qJLPFFczaKn0jvWDT13-OJDn8JPmOu8pHO72jCpK6SDmskWhjONDPRFBPw6vM-wCOfRNDiLtBISJNawABNQGuABnR8hF2vA4cWMmFXDoRSwtMGI||height="251px;",width="541px;"}}
  
  Byte frequencies with power of 1/3
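
The companding options compared above can be sketched in a few lines. This is only an illustration in Python, not the code used by Tika or by the R scripts: the function names are hypothetical, scaling the counts by the largest count is an assumption about the normalization, and A = 87.6 is simply the value commonly quoted for A-law.

```python
import math

def byte_histogram(data: bytes) -> list[float]:
    """Count each byte value (0-255) and scale so the largest count is 1.0.

    Scaling by the maximum count is an assumption made for this sketch.
    """
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    peak = max(counts)
    if peak == 0:
        return [0.0] * 256
    return [c / peak for c in counts]

def a_law(x: float, a: float = 87.6) -> float:
    """A-law compression of x in [0, 1]; a = 87.6 is an assumed default."""
    denom = 1.0 + math.log(a)
    if x < 1.0 / a:
        return a * x / denom
    return (1.0 + math.log(a * x)) / denom

def power_compand(x: float, exponent: float) -> float:
    """Power-law companding, e.g. exponent 1/2 (square root) or 1/3."""
    return x ** exponent

# Eight 'a' bytes and one 'b' byte: 'b' is rare relative to 'a'.
sample = b"aaaaaaaab"
hist = byte_histogram(sample)
sqrt_hist = [power_compand(v, 0.5) for v in hist]
alaw_hist = [a_law(v) for v in hist]
```

Both transforms leave 0 and 1 unchanged and lift the small values in between, which is why rare byte values become more visible in the companded histograms.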
  
@@ -129, +129 @@

  
  The following line in main.R is the last line and writes out the model; the 
name and structure of the output can be customized as needed.
  
+ 
{{https://lh4.googleusercontent.com/9NhU8MSntrg9JRxV55sG89v5MkBM_ZzI9wo5SoYN3chzirIB_R97VImM4LUc6Cps1wJSfDlZCAE-OdCCj6OGBmeGHyKn8falen0APY1UY0B4xgCZ1EUEX3JVYcqxznNEQ2ygXpw||transform="rotate(0.00rad)",height="27px;",width="602px;",-webkit-transform="rotate(0.00rad)",border="none"}}
+ 
  The exportNNParams function is implemented in the utility script 
‘myfunctions.R’; it can also be customized or replaced to produce a model file 
with a different syntax or structure.
  
  The following shows what the output model looks like in the model text 
file.
@@ -137, +139 @@

  
  The next line, which does not start with #, contains a series of 
floating-point numbers separated by tabs; these are the model parameters. Later 
we import the file into Tika and have ExampleNNModelDetector recreate the 
trained model from them, so that it can classify unseen files and determine 
whether a given input file is of GRB or non-GRB type.
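
Reading such a file back can be sketched as follows. This is a hedged illustration in Python, not the loader used by ExampleNNModelDetector: the function name and the sample text are hypothetical, and the only format details taken from the description above are that lines starting with # are not parameter lines and that parameters are tab-separated floating-point numbers.

```python
def read_model_params(text: str) -> list[list[float]]:
    """Return one list of floats per parameter line.

    Blank lines and lines starting with '#' are skipped; every other line
    is split on tabs and each field is parsed as a float.
    """
    models = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        models.append([float(tok) for tok in line.split("\t")])
    return models

# Hypothetical file content; a real model file has far more parameters.
example_text = "#hypothetical header line\n0.5\t-1.25\t3.0\n"
params = read_model_params(example_text)
```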
  
+ 
{{https://lh6.googleusercontent.com/ZkRhFs9ON4ELXTtClE9s0frCEsC_i7ktsWkmGlm10ktOCpJMorMB_UZA2K4pp6LIc8AK0c2LKhgss7ZQkhTop4eh9BBDYn-kQlC17PB21VUdMYjtvpHbUjY51XyS2iOgxSYjUIo||transform="rotate(0.00rad)",height="43px;",width="602px;",-webkit-transform="rotate(0.00rad)",border="none"}}
+ 
  The following shows, in a bit more detail, the output printed by the R 
program after training the model chosen above.
  
-  . [1] "Loading Dataset....." [1] "Begining Training Neural Networks" [1] 
"the length of weights 517" [1] "The time taken for training: 330.257000" [1] 
"The training error cost: 0.001380" [1] "The validation error cost: 0.025099" 
[1] "The testing error cost: 0.020883" [1] "Training Accuracy: 100.000000" [1] 
"Validation Accuracy: 99.650000" [1] "Testing Accuracy: 99.762349"
+  . [1] "Loading Dataset....."
+  [1] "Begining Training Neural Networks"
+  [1] "the length of weights 517"
+  [1] "The time taken for training: 330.257000"
+  [1] "The training error cost: 0.001380"
+  [1] "The validation error cost: 0.025099"
+  [1] "The testing error cost: 0.020883"
+  [1] "Training Accuracy: 100.000000"
+  [1] "Validation Accuracy: 99.650000"
+  [1] "Testing Accuracy: 99.762349"
  
  ''''Import the model into Tika ''''
  
@@ -149, +162 @@

  
  Your model might also use a different input size for the byte histogram; 
some might choose a different bin size based on heuristics specific to their 
own data. In that case, you can override 
readByteFrequencies(final InputStream input) :: TrainedModelDetector to supply 
your own version of the byte histogram; you also need to ensure that the model 
parameters reflect the same input size.
  
+ TrainedModelDetector implements the Detector interface, but it is abstract, 
so we need to subclass it with our own implementation.
-  . TrainedModelDetector implements the Detector interface, but it is abstract 
meaning we need to subclass it with our own version of TrainedModelDetector. 
ExampleNNModelDetector is its subclass, the purpose of subclassing the 
TrainedModelDetector is to supply the implementation of the method of 
loadDefaultModels that reads and registers the models into the model map 
<MediaType, TrainedModel> in the TrainedModelDetector. Once the model map is 
populated with a set of mappings with keys and values, the detect method in the 
TrainedModelDetector will be able to use the loaded models to predict the mime 
types. The job of the TrainedModelDetector is to convert the given input stream 
to byte frequency histogram and pass that as the input to the models that have 
been loaded or registered in the map. There is also a TrainedModel(abstract) 
and its subclass NNTrainedModel. The TrainedModel is an abstract class that 
represents an abstraction of a trained model; a model object must have a method 
of “predict” with input of byte histogram vector, it returns a probability of 
prediction. The following lists all of the classes for this feature 
(tika\tika-core\src\main\java) org.apache.tika.detect.TrainedModelDetector 
(abstract) org.apache.tika.detect.ExampleNNModelDetector 
org.apache.tika.detect.TrainedModel (abstract) 
org.apache.tika.detect.NNTrainedModel Example model file 
(tika\tika-core\src\main\resources) org.apache.tika.detect.tika-example.nnmodel 
Unit test (tika\tika-core\src\test\java)
-   . org.apache.tika.detect. MimeDetectionWithNNTest
  
+ ExampleNNModelDetector is one such subclass. The purpose of subclassing 
TrainedModelDetector is to implement the loadDefaultModels method, which reads 
the models and registers them in the model map <MediaType, TrainedModel> in 
TrainedModelDetector. Once the model map is populated, the detect method in 
TrainedModelDetector can use the loaded models to predict MIME types.
+ 
+ The job of the TrainedModelDetector is to convert the given input stream to 
a byte frequency histogram and pass that as the input to the models that have 
been loaded or registered in the map.
+ 
+ There is also a TrainedModel (abstract) and its subclass NNTrainedModel.
+ 
+ TrainedModel is an abstract class that represents a trained model; a model 
object must provide a “predict” method that takes a byte histogram vector as 
input and returns a prediction probability.
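
The relationship between the detector, the model map and the models can be sketched as below. This is a Python analogue for illustration only, not the Tika Java classes: LogisticModel is a hypothetical single-unit stand-in for NNTrainedModel, the media type string is a hypothetical label, and picking the most probable model above a threshold is an assumption about how detect combines the predictions.

```python
import math
from abc import ABC, abstractmethod

class TrainedModel(ABC):
    """A trained model: maps a byte histogram to a probability."""

    @abstractmethod
    def predict(self, histogram: list[float]) -> float:
        ...

class LogisticModel(TrainedModel):
    """Hypothetical stand-in for NNTrainedModel: one logistic unit."""

    def __init__(self, weights: list[float], bias: float = 0.0):
        self.weights = weights
        self.bias = bias

    def predict(self, histogram: list[float]) -> float:
        z = self.bias + sum(w * x for w, x in zip(self.weights, histogram))
        return 1.0 / (1.0 + math.exp(-z))

class ModelDetector:
    """Holds a map of media type -> model and queries each model."""

    def __init__(self) -> None:
        self.models: dict[str, TrainedModel] = {}

    def register(self, media_type: str, model: TrainedModel) -> None:
        self.models[media_type] = model

    def detect(self, histogram: list[float], threshold: float = 0.5,
               fallback: str = "application/octet-stream") -> str:
        # Assumption: return the most probable type above the threshold.
        best_type, best_p = fallback, threshold
        for media_type, model in self.models.items():
            p = model.predict(histogram)
            if p > best_p:
                best_type, best_p = media_type, p
        return best_type

detector = ModelDetector()
detector.register("application/x-grb", LogisticModel([4.0, -4.0]))
match = detector.detect([1.0, 0.0])     # strong response from the model
no_match = detector.detect([0.0, 1.0])  # weak response, falls back
```

A two-bin histogram is used here only to keep the example short; the histogram length has to match the number of inputs the model was trained with.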
+ 
+ The following lists all of the classes for this feature 
(tika\tika-core\src\main\java)
+ 
+ org.apache.tika.detect.TrainedModelDetector (abstract)
+ 
+ org.apache.tika.detect.ExampleNNModelDetector
+ 
+ org.apache.tika.detect.TrainedModel (abstract)
+ 
+ org.apache.tika.detect.NNTrainedModel
+ 
+ Example model file (tika\tika-core\src\main\resources)
+ 
+ org.apache.tika.detect.tika-example.nnmodel
+ 
+ Unit test (tika\tika-core\src\test\java)
+ 
+                 org.apache.tika.detect.MimeDetectionWithNNTest
+ 
