[Tika Wiki] Update of "ContentMimeDetection" by Lukeliush

Apache Wiki Sun, 10 May 2015 14:23:57 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.


The "ContentMimeDetection" page has been changed by Lukeliush:
https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=8&rev2=9

  Tika - Content based MIME type Detection
- 
- 
- 
- 
  
  JIRA issue with the TIKA feature
  
@@ -12, +8 @@

  
  '''Motivation'''
  
- 
- 
- 
- 
- The feature of TIKA-1582 is an extension of TIKA MIME detection based on file 
contents, i.e. the file byte histograms, and this feature follows a standard 
data mining process that extracts the knowledge out of the data (bytes). The 
motivation of this feature is to offer users with an option where content based 
detection approach can be used, the content can be defined in several ways, 
they can be the entire file bytes, byte n-grams, byte histograms, etc. In this 
feature, the content byte histogram is used.  
+ TIKA-1582 extends Tika's MIME detection with detection based on file 
contents, specifically the byte histogram of a file. It follows a standard 
data mining process that extracts knowledge from the data (bytes). The 
motivation is to give users the option of a content-based detection approach. 
The content can be defined in several ways, such as the entire file bytes, 
byte n-grams, or byte histograms; this feature uses the byte histogram.
  
  Some files are very large, and building byte histograms for them takes a 
significant amount of time. With domain-specific knowledge or heuristics (for 
example, a file may contain critical regions that help with detection), we can 
reduce the effort required for knowledge discovery or for mining the particular 
patterns used in type detection.
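
As a rough illustration of what a byte histogram is, the sketch below counts how often each of the 256 byte values occurs in a stream. It reads in chunks and can stop after a byte budget, so a very large file need not be read in full. It is written in Python for brevity (the project itself uses R for training and Java inside Tika); the name `byte_histogram` and the `max_bytes` cut-off are assumptions made for this example and are not part of TIKA-1582.

```python
import io
from collections import Counter


def byte_histogram(stream, max_bytes=None, chunk_size=8192):
    """Return a list of 256 counts, one per byte value, for a binary stream."""
    counts = Counter()
    remaining = max_bytes  # None means "read the whole stream"
    while remaining is None or remaining > 0:
        size = chunk_size if remaining is None else min(chunk_size, remaining)
        chunk = stream.read(size)
        if not chunk:
            break
        counts.update(chunk)  # iterating over bytes yields ints in 0..255
        if remaining is not None:
            remaining -= len(chunk)
    return [counts.get(value, 0) for value in range(256)]


# Hypothetical input: a few bytes standing in for the start of a file.
hist = byte_histogram(io.BytesIO(b"GRIB\x00\x00GRIB"))
```

Limiting the read with `max_bytes` is only one possible heuristic; which regions of a file are worth sampling depends on the file types involved.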
  
@@ -38, +30 @@

  
  Please refer to the code repository for the details of training and preparing 
a model. The neural network and logistic regression learners are implemented in 
R, and the following describes the preprocessing and learning implementation in 
R.
  
- 
- 
- 
- 
  Project source repository
  
  https://github.com/LukeLiush/filetypeDetection
  
- 
- 
- 
- 
  The goal of the example is to distinguish GRB files from non-GRB files.
  
- 
- 
- 
- 
  '''Data preparation'''
- 
- 
- 
- 
  
  The positive training examples are collected from the AMD polar web sites 
(*.gsfc.nasa.gov), for example ftp://hydro1.sci.gsfc.nasa.gov/data/
  
  The negative training examples are collected from the following source: 
http://digitalcorpora.org/corp/files/govdocs1/zipfiles/
- 
- 
  
  Once the GRB and non-GRB files are collected, the next step is to prepare the 
data set in a form that R can manipulate easily.
  
@@ -88, +62 @@

  
  In some files, some bytes have much higher frequencies than others; in the 
extreme case, only one or two bins account for most of the count. This leaves a 
large gap between the most frequent and the least frequent bins. The solution 
is to apply a companding function such as A-law or μ-law. Taking the square 
root of the bin values has a similar effect, so, given its lower computational 
cost, the square root is used in place of A-law or μ-law to enhance the 
histogram detail.
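
The effect of square-root companding can be sketched as follows (in Python, for brevity). Scaling by the largest bin before taking the square root is an assumption made for this example; the text above does not state how the histogram is normalized, and `sqrt_compand` is a hypothetical name.

```python
import math


def sqrt_compand(histogram):
    """Scale counts to [0, 1] by the largest bin, then take the square root."""
    peak = max(histogram)
    if peak == 0:
        # An empty input has an all-zero histogram; avoid dividing by zero.
        return [0.0] * len(histogram)
    return [math.sqrt(count / peak) for count in histogram]


# One dominant bin next to much smaller bins.
raw = [10000, 100, 1, 0]
companded = sqrt_compand(raw)
```

With linear scaling alone the second bin would sit at 0.01 of the peak; after the square root it sits at about 0.1, so the small bins become easier to tell apart.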
  
+ 
{{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||transform="rotate(0.00rad)",height="260px;",width="561px;",-webkit-transform="rotate(0.00rad)",border="none"}}
- 
- 
- 
- 
- 
- 
  
  A-law companding function curve
  
- 
- 
  Square-root function curve
- 
- 
- 
- 
- 
- 
  
  The following shows the difference
  
  Byte frequencies '''without''' any companding.
  
- 
- 
  Byte frequencies with A-law
- 
- 
  
  The following is an implementation of the A-law formula in R.
  
  alaw <- function(x, A=87.7){
  
-        th = 1/A
+  th <- 1/A
- 
-        cond1 <- (x>=0 && x < th)
+  # element-wise '&' (not '&&') so every bin is tested, not only the first
+  cond1 <- (x >= 0 & x < th)
- 
-        cond2 <- (x>=th && x <= 1)
+  cond2 <- (x >= th & x <= 1)
- 
-        x[cond1] <- A * abs(x[cond1]) / (1+log(A))
+  x[cond1] <- A * abs(x[cond1]) / (1+log(A))
- 
-        x[cond2] <- sign(x[cond2])*(1+log(A*abs(x[cond2])))/(1+log(A))
+  x[cond2] <- sign(x[cond2])*(1+log(A*abs(x[cond2])))/(1+log(A))
+  x
- 
-        x
  
  }
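
For comparison, the same A-law formula applied to each element independently can be sketched in Python. This is an illustrative re-statement, not the project's code; it assumes the input values have already been scaled to the range [-1, 1].

```python
import math


def alaw(values, A=87.7):
    """Apply A-law companding element-wise to values in [-1, 1]."""
    threshold = 1.0 / A
    denom = 1.0 + math.log(A)
    out = []
    for x in values:
        magnitude = abs(x)
        if magnitude < threshold:
            # Linear segment for small magnitudes.
            y = A * magnitude / denom
        else:
            # Logarithmic segment for larger magnitudes.
            y = (1.0 + math.log(A * magnitude)) / denom
        out.append(math.copysign(y, x))  # restore the sign of the input
    return out


compressed = alaw([0.001, 0.01, 0.1, 1.0])
```

Small inputs are boosted far more than large ones, which is what narrows the gap between dominant and rare bins.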
  
@@ -141, +92 @@

  
  For details of A-Law, please refer to 
http://en.wikipedia.org/wiki/A-law_algorithm.
  
- 
- 
- 
- 
  Byte frequencies with square root (power of 1/2)
- 
- 
- 
- 
- 
- 
  
  Byte frequencies with power of 1/3
  
@@ -163, +104 @@

  
  The neural network can be seen as a function: its input is the vector of 
preprocessed histogram values, and its output is a yes/no answer (1 or 0). The 
network can also output a probability that indicates how likely it considers a 
given input histogram to be GRB rather than non-GRB. It is worth stressing that 
non-GRB is a very broad class, so we may need as many negative training 
examples as possible; if we know which types we are dealing with, the problem 
can be simplified to a smaller set of classes.
  
  Training with too many negative examples can also give poor results. In an 
extreme case with 10 positive examples and 10 million negative examples, the 
model is likely to become biased and assign everything it sees to the negative 
class, so the choice of training set matters. One way to reduce this bias is to 
randomly pick a portion of the negative training data. This in turn calls for 
thorough performance testing of each model trained with different data or 
different training parameters, such as the regularization term or the network 
structure (number of layers and number of units).
  
  In addition, the choice of structure and tuning parameters depends on how 
well the model fits the data. When it over-fits the training data, we may want 
to adjust the regularization term or add more training data; when it 
under-fits, we may want to increase the complexity of the network structure. 
Again, the choice of structure depends on the patterns hidden in the data.
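
Randomly picking a portion of the negative examples, as mentioned above, might look like the following Python sketch. The function name, the `ratio` parameter and the example file names are hypothetical; the project's own sampling code lives in its R scripts and may differ.

```python
import random


def undersample_negatives(positives, negatives, ratio=1.0, seed=0):
    """Keep all positives and a random subset of negatives.

    ratio is the desired number of negatives per positive example.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    keep = min(len(negatives), int(len(positives) * ratio))
    sampled = rng.sample(negatives, keep)
    data = [(x, 1) for x in positives] + [(x, 0) for x in sampled]
    rng.shuffle(data)
    return data


# Hypothetical file names standing in for real training examples.
positives = [f"grb_{i}" for i in range(10)]
negatives = [f"other_{i}" for i in range(1000)]
training_set = undersample_negatives(positives, negatives, ratio=3.0)
```

Each choice of `ratio` or `seed` yields a different training set, so every resulting model still needs to be evaluated separately.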
  
- 
- 
  Training a linear logistic regression model is less complex than training a 
neural network, and it can be done with gradient descent or a similar 
optimizer (a linear SVM is a closely related alternative). Its cost function is 
convex, so training converges to a global optimum; the resulting linear 
boundary classifies the data perfectly only if the data is linearly separable. 
It is also cheap in terms of computational complexity.
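
A minimal sketch of logistic regression trained with batch gradient descent is shown below, in Python and on made-up two-feature data. The learning rate, epoch count and data are assumptions chosen for illustration; the project's actual training is done in R.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability that x belongs to the positive class."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def train_logistic(samples, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    m = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = predict(weights, bias, x) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


# Toy, linearly separable data: label 1 when the first feature dominates.
samples = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
labels = [0, 0, 1, 1]
weights, bias = train_logistic(samples, labels)
```

On real histograms each sample would have one feature per bin, and a regularization term is usually added to keep the weights bounded.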
  
  '''Evaluation''':
@@ -173, +112 @@

  
  This is where the prepared test set is used. The details of performance 
evaluation, such as recall, precision and ROC, are omitted here; the idea is to 
decide whether the model meets our goals.
  
- 
- 
  '''Use of the knowledge'''
  
  ''''Output the model ''''
- 
- 
- 
- 
- 
- 
- 
- 
  
  When neural network training finishes, the model parameters and configuration 
(e.g. number of input units, hidden units, etc.) are written to a text file 
called ‘tika-example.nnmodel’ in the same directory as ‘main.R’.
  
  This file needs to be copied to Tika so that Tika can detect the type for 
which the model was trained, e.g. the GRB type. Note that you can create many 
models for many different MIME types; GRB file type detection is used here as 
one example to demonstrate the approach.
  
  The following line in main.R is the last line and outputs the model; the name 
and structure can be customized to suit your preferences.
- 
-   
  
  The exportNNParams function is implemented in the utility script 
‘myfunctions.R’; it can also be customized or replaced to create your own model 
file with a different syntax or structure.
  
@@ -203, +130 @@

  
  The next line, without a # at the front, contains a series of floating-point 
numbers separated by tabs; these are the model parameters. Later we import the 
file into Tika, where ExampleNNModelDetector uses them to recreate the trained 
model so that it can classify unseen files and determine whether a given input 
file is of GRB or non-GRB type.
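
Reading such a file can be sketched as follows, in Python. The sketch relies only on what is described above (lines starting with # are not parameter lines, and parameters are tab-separated floating-point numbers); the header text in the example is made up, and this is not the parsing code used by ExampleNNModelDetector.

```python
def read_model_params(lines):
    """Return one list of floats per non-comment, non-empty line."""
    models = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '#' lines
        models.append([float(field) for field in line.split("\t")])
    return models


example = [
    "# hypothetical header line; the real header content is not shown here",
    "0.25\t-1.5\t3.0e-2",
]
params = read_model_params(example)
```

Each returned list corresponds to one parameter line, so a file holding several models would yield several lists.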
  
-   
- 
  The following shows, in a bit more detail, the output printed by the R 
program after training the model chosen above.
  
+  [1] "Loading Dataset....."
+  [1] "Begining Training Neural Networks"
+  [1] "the length of weights 517"
+  [1] "The time taken for training: 330.257000"
+  [1] "The training error cost: 0.001380"
+  [1] "The validation error cost: 0.025099"
+  [1] "The testing error cost: 0.020883"
+  [1] "Training Accuracy: 100.000000"
+  [1] "Validation Accuracy: 99.650000"
+  [1] "Testing Accuracy: 99.762349"
-   [1] "Loading Dataset....."
- 
-   [1] "Begining Training Neural Networks"
- 
-   [1] "the length of weights 517"
- 
-   [1] "The time taken for training: 330.257000"
- 
-   [1] "The training error cost: 0.001380"
- 
-   [1] "The validation error cost: 0.025099"
- 
-   [1] "The testing error cost: 0.020883"
- 
-   [1] "Training Accuracy: 100.000000"
- 
-   [1] "Validation Accuracy: 99.650000"
- 
-   [1] "Testing Accuracy: 99.762349"
- 
- 
- 
- 
  
  ''''Import the model into Tika ''''
- 
- 
  
  Once training is done, a model file has been generated as described above. 
That model file contains only one model; however, you can write multiple models 
to one file or use several model files, according to your needs.
  
@@ -241, +142 @@

  
  Your model might use a byte histogram input of a different size; for example, 
you might choose a different bin size based on heuristics specific to your own 
data. In that case you can override readByteFrequencies(final InputStream 
input) in TrainedModelDetector to provide your own byte histogram, and you must 
also ensure that the model parameters are set to reflect the same input size.
  
-   TrainedModelDetector implements the Detector interface, but it is abstract 
meaning we need to subclass it with our own version of TrainedModelDetector.
+  TrainedModelDetector implements the Detector interface, but it is abstract, 
so we need to subclass it with our own detector.
+  ExampleNNModelDetector is such a subclass. The purpose of subclassing 
TrainedModelDetector is to supply an implementation of loadDefaultModels, 
which reads the models and registers them in the model map 
<MediaType, TrainedModel> held by TrainedModelDetector. Once the model map is 
populated, the detect method in TrainedModelDetector can use the loaded models 
to predict MIME types.
+  The job of TrainedModelDetector is to convert the given input stream into a 
byte frequency histogram and pass it as input to the models that have been 
loaded or registered in the map.
+  There is also a TrainedModel class (abstract) and its subclass NNTrainedModel.
+  TrainedModel is an abstract class that represents a trained model; a model 
object must have a "predict" method that takes a byte histogram vector as input 
and returns a prediction probability.
+  The following lists all of the classes for this feature 
(tika\tika-core\src\main\java)
+  org.apache.tika.detect.TrainedModelDetector (abstract)
+  org.apache.tika.detect.ExampleNNModelDetector
+  org.apache.tika.detect.TrainedModel (abstract)
+  org.apache.tika.detect.NNTrainedModel
+  Example model file (tika\tika-core\src\main\resources)
+  org.apache.tika.detect.tika-example.nnmodel
+  Unit test (tika\tika-core\src\test\java)
+  org.apache.tika.detect.MimeDetectionWithNNTest
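
The relationship between the detector, the model map and the models can be sketched in a language-neutral way. The Python below is not the Tika API: `ModelDetector`, `ConstantModel`, the 0.5 threshold and the rule of picking the most probable type are all assumptions made for illustration, and the media type string is only an example.

```python
from abc import ABC, abstractmethod


class TrainedModel(ABC):
    """A trained model maps a byte histogram to a probability."""

    @abstractmethod
    def predict(self, histogram):
        ...


class ConstantModel(TrainedModel):
    """Hypothetical stand-in model that always returns the same probability."""

    def __init__(self, probability):
        self.probability = probability

    def predict(self, histogram):
        return self.probability


class ModelDetector:
    """Holds a media type -> model map and asks each model for a probability."""

    def __init__(self, threshold=0.5, fallback="application/octet-stream"):
        self.models = {}
        self.threshold = threshold  # assumed decision threshold
        self.fallback = fallback

    def register(self, media_type, model):
        self.models[media_type] = model

    def detect(self, data):
        # Convert the input bytes into a 256-bin frequency histogram.
        histogram = [0] * 256
        for byte in data:
            histogram[byte] += 1
        # Keep the most probable type whose probability beats the threshold.
        best_type, best_prob = self.fallback, self.threshold
        for media_type, model in self.models.items():
            prob = model.predict(histogram)
            if prob > best_prob:
                best_type, best_prob = media_type, prob
        return best_type


detector = ModelDetector()
detector.register("application/x-grib", ConstantModel(0.9))
detected = detector.detect(b"GRIB")
```

A real model would compute its probability from the histogram and the imported parameters rather than returning a constant.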
  
-   ExampleNNModelDetector is its subclass, the purpose of subclassing the 
TrainedModelDetector is to supply the implementation of the method of 
loadDefaultModels that reads and registers the models into the model map 
<MediaType, TrainedModel> in the TrainedModelDetector. Once the model map is 
populated with a set of mappings with keys and values, the detect method in the 
TrainedModelDetector will be able to use the loaded models to predict the mime 
types.
- 
-   The job of the TrainedModelDetector is to convert the given input stream to 
byte frequency histogram and pass that as the input to the models that have 
been loaded or registered in the map.
- 
-   There is also a TrainedModel(abstract) and its subclass NNTrainedModel.
- 
-   The TrainedModel is an abstract class that represents an abstraction of a 
trained model; a model object must have a method of “predict” with input of 
byte histogram vector, it returns a probability of prediction.
- 
-   The following lists all of the classes for this feature 
(tika\tika-core\src\main\java)
- 
-   org.apache.tika.detect.TrainedModelDetector (abstract)
- 
-   org.apache.tika.detect.ExampleNNModelDetector
- 
-   org.apache.tika.detect.TrainedModel (abstract)
- 
-   org.apache.tika.detect.NNTrainedModel
- 
-   Example model file (tika\tika-core\src\main\resources)
- 
-   org.apache.tika.detect.tika-example.nnmodel
- 
-   Unit test (tika\tika-core\src\test\java)
- 
-                   org.apache.tika.detect. MimeDetectionWithNNTest
-
