[Tika Wiki] Update of "ContentMimeDetection" by Lukeliush

Apache Wiki Sun, 10 May 2015 15:20:06 -0700

The "ContentMimeDetection" page has been changed by Lukeliush:
https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=17&rev2=18

  
  Some files are very large, and building byte histograms for them takes a 
significant amount of time. With domain-specific knowledge or heuristics (e.g. 
there might be crucial regions in the file that help with detection), we can 
further reduce the effort required for knowledge discovery or for mining the 
particular patterns used in type detection.
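
As a rough illustration of the byte histogram idea, the following Python sketch counts byte values either over a whole input or only over selected regions. The function name, the `regions` parameter, and the normalisation to relative frequencies are assumptions made for this sketch; they are not taken from the Tika code or the R programs.

```python
from typing import Iterable, List, Optional, Tuple


def byte_histogram(
    data: bytes,
    regions: Optional[Iterable[Tuple[int, int]]] = None,
) -> List[float]:
    """Return a 256-bin relative-frequency histogram of byte values.

    `regions` is a hypothetical list of (start, end) offsets that restricts
    counting to parts of the input; None means the whole input is read.
    """
    counts = [0] * 256
    spans = [(0, len(data))] if regions is None else regions
    total = 0
    for start, end in spans:
        chunk = data[start:end]
        for value in chunk:  # iterating over bytes yields ints in 0..255
            counts[value] += 1
        total += len(chunk)
    if total == 0:
        return [0.0] * 256  # empty input: avoid division by zero
    return [c / total for c in counts]


if __name__ == "__main__":
    sample = b"GRIB" + bytes(12)  # arbitrary sample bytes for the demo
    full = byte_histogram(sample)
    head = byte_histogram(sample, regions=[(0, 4)])
    print(full[0], head[0])
```

Restricting the histogram to a few regions trades coverage for speed, so whether it is acceptable depends on how representative those regions are for the file types in question.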
  
- Please also note, this content based mime detection does require users to 
have some knowledge with data mining and machine learning, and the choice of 
learning algorithms used in the pattern mining does not seem to matter, the 
knowledge to be mined is actually the classification, and there are many 
classification learning algorithms invented or reinvented, the question of 
which one is the best depends on a goal and data, each of the learning 
algorithms requires lots of effort for performance testing, and some data might 
be linear separable, some are not; and a or a set of goals is very important as 
it often is in the context of performance tuning; we can also think about it as 
a performance tuning problem where we need to set a set of goals in terms of 
the scalability, complexity or accuracy, so we want to leave the choice of 
algorithms to users based on their goals and data in their environment. As an 
example, we have actually implemented two algorithms for mining patterns with 
the GRB file types, one is linear logistic regression and the other is neural 
network. Again, the neural network with back-propagation is a bit more complex 
with training, and logistic regression is far cheaper in terms of complexity, 
and it turns out that logistic regression also gives a good result with high 
accuracy, and it is worthy noting that it is always better to circumscribe the 
mime types to be detected; in the example model we have built, we attempt to 
classify grb files from non-grb files, and one of the challenges is to identify 
the non-grb file types whose class can be enormously large, the best practice 
is to circumscribe a set of types to be classified, again domain specific 
knowledge come into the play for well-defining a set of types in the user 
specific environment.
+ Please also note that this content-based MIME detection requires users to 
have some knowledge of data mining and machine learning. The particular 
learning algorithm used in the pattern mining does not seem to matter in 
itself: the knowledge to be mined is a classification, many classification 
learning algorithms have been invented or reinvented, and which one is best 
depends on the goal and the data. Each learning algorithm requires a lot of 
effort for performance testing, and some data is linearly separable while other 
data is not. As in performance tuning, a goal or set of goals (in terms of 
scalability, complexity, or accuracy) is very important, so we leave the choice 
of algorithm to users, based on their goals and the data in their environment. 
As an example, we have implemented two algorithms for classifying the GRB file 
type from non-GRB types: linear logistic regression and a neural network. The 
neural network with back-propagation is somewhat more complex to train, while 
logistic regression is far cheaper in terms of complexity and also turns out to 
give a good result with high accuracy. It is also worth noting that it is 
always better to circumscribe the MIME types to be detected. In the example 
model we have built, we attempt to classify GRB files from non-GRB files, and 
one of the challenges is identifying the non-GRB file types, a class that can 
be enormously large. The best practice is again to circumscribe the set of 
types to be classified, and domain-specific knowledge comes into play for 
defining that set well in the user-specific environment.
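
To make the logistic regression option concrete, here is a minimal Python sketch of binary logistic regression trained by batch gradient descent. It is not the R implementation from the repo; the function names, learning rate, epoch count, and toy feature vectors are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Probability that x belongs to the positive class (e.g. GRB)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_logistic(
    xs: Sequence[Sequence[float]],
    ys: Sequence[int],
    lr: float = 0.5,      # assumed learning rate
    epochs: int = 2000,   # assumed number of passes over the data
) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(xs), len(xs[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(xs, ys):
            err = predict(w, b, x) - y  # derivative of log loss w.r.t. z
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


if __name__ == "__main__":
    # Toy two-feature data standing in for (much longer) byte histograms.
    xs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
    ys = [1, 1, 0, 0]
    w, b = train_logistic(xs, ys)
    print(predict(w, b, [0.9, 0.1]), predict(w, b, [0.1, 0.9]))
```

With real data the inputs would be histogram features rather than two hand-picked numbers, and the result should still be checked against held-out files before the parameters are used.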
  
- This approach could also enhance identification safety, so it only trusts the 
file with the type which has the similar byte histogram pattern it has seen in 
the training, this has pros and cons, the pros is that it enhance the security 
aspect of the file type identification, but the cons is slow detection which 
requires the reading the entire bytes of a file for computing the byte 
histogram and it might be also myopic to the training data which might be less 
representative.
+ This feature could also enhance identification safety, since it only trusts a 
file whose type has a byte histogram pattern similar to one it has seen in the 
training set. This has pros and cons: one of the pros, as mentioned, is that it 
enhances the security aspect of file type identification; the cons are slow 
detection, which requires reading all the bytes of a file to compute the byte 
histogram, and possible myopia with respect to training data that might be 
biased or less representative.
  
  '''Methods'''
  
@@ -21, +21 @@

  
  Raw data -> feature selection and data cleaning -> pre-processing and 
transformation -> learning patterns (machine learning) -> knowledge evaluation 
-> use of the knowledge (prediction/classification) in Tika.
  
- (It is worth noting that the feature selection requires learning the 
application domain which in our case is specific to the user domain and 
environment)
+ (It is worth noting that the feature selection requires learning the 
application domain)
  
- Also please note the model has to be ready before it can be used in Tika; by 
"ready", we mean the model has to pass the final knowledge evaluation test. As 
shall be seen shortly, as an example Tika is only implementing the prediction 
phase, so the model parameters need to be loaded and read into Tika for 
prediction or classification; The process of training can be lengthy and 
tedious, sometimes training might need to be converted to map-reduce operations 
when training data is too large to fit memory, again this depends on the user's 
goal.
+ Also please note that the model has to be ready before it can be used in 
Tika; by "ready", we mean the model has passed the final knowledge evaluation 
test. As shall be seen shortly, Tika in this example implements only the 
prediction phase, so the model parameters need to be loaded into Tika for 
prediction or classification. The process of training can be lengthy and 
tedious, and training might need to be converted to parallel/map-reduce 
operations when the training data is too large to fit in memory; again, this 
depends on the user's goal.
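
The prediction-only phase can be pictured with the following Python sketch: parameters produced elsewhere are parsed and used for a single forward pass through a one-hidden-layer network. The flat, whitespace-separated parameter layout and all function names are hypothetical; the actual format of the example model file is not described here and may differ.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def parse_params(text: str) -> List[float]:
    """Parse whitespace-separated floats (hypothetical model file layout)."""
    return [float(tok) for tok in text.split()]


def predict(params: Sequence[float], x: Sequence[float], n_hidden: int) -> float:
    """Forward pass: input -> sigmoid hidden layer -> sigmoid output.

    Assumed layout: n_hidden rows of [bias, w_1..w_n] for the hidden layer,
    followed by one row [bias, v_1..v_n_hidden] for the output unit.
    """
    n_in = len(x)
    expected = n_hidden * (n_in + 1) + (n_hidden + 1)
    if len(params) != expected:
        raise ValueError(f"expected {expected} parameters, got {len(params)}")
    hidden = []
    pos = 0
    for _ in range(n_hidden):
        row = params[pos:pos + n_in + 1]
        pos += n_in + 1
        hidden.append(sigmoid(row[0] + sum(w * xi for w, xi in zip(row[1:], x))))
    out = params[pos:]
    return sigmoid(out[0] + sum(v * h for v, h in zip(out[1:], hidden)))


if __name__ == "__main__":
    params = parse_params("0 0 0  0 0 0  10 0 0")  # made-up values
    print(predict(params, [0.3, 0.7], n_hidden=2))
```

Keeping training and prediction separate like this means the detector only needs the arithmetic of the forward pass, not the training machinery.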
  
  __''The following briefly walks you through how the feature and the example 
are implemented for this data problem. Please also refer to the attached docx 
for further information on the implementation in R.''__
  
- Please also refer to the code repo for details of implementation for training 
a model, the neural network and logistic regression learning are all 
implemented in R and the following briefly describes the pre-processing and 
learning implementation in R and how to load the model parameters trained from 
the R programs into the Tika for mime detection.
+ Please also refer to the code repo for details of the implementation for 
training a model. The neural network and logistic regression learning are both 
implemented in R. The following briefly describes the pre-processing and 
learning implementation in R, and how to load the model parameters trained by 
the R programs into Tika for MIME detection.
  
- Please note again, the training program can be created or written in any 
programming language, the R implemenation is posted as an example, Tika only 
needs to load the well-trained model parameters from the training program and 
be able to use them. The job of the feature in Tika generally have 4 steps as 
follow, and also it is flexible that you can overwrite the detect method of the 
TrainedModelDetector to define your own selected features if you have different 
features defined in your training.
+ The training program can be written in any programming language; the R 
implementation is posted as an example. Tika only needs to load the 
well-trained model parameters produced by the training program and be able to 
use them. The job of the feature in Tika generally has 4 steps, as follows. It 
is also flexible: you can override the detect method of the 
TrainedModelDetector to define your own selected features if you have 
different features defined for training.
  
   1. read the input in bytes
   1. convert it to the byte histogram
@@ -40, +40 @@

  
  https://github.com/LukeLiush/filetypeDetection
  
- The goal of the example is to be able to classify grb file types from non-grb 
types.
+ The goal in the example model is to be able to classify GRB file types from 
non-GRB types.
  
  '''Data preparation'''
  
@@ -177, +177 @@

  
  The following lists all of the classes for this feature 
(tika\tika-core\src\main\java)
  
- org.apache.tika.detect.TrainedModelDetector (abstract)
+   org.apache.tika.detect.TrainedModelDetector (abstract)
  
- org.apache.tika.detect.ExampleNNModelDetector
+   org.apache.tika.detect.ExampleNNModelDetector
  
- org.apache.tika.detect.TrainedModel (abstract)
+   org.apache.tika.detect.TrainedModel (abstract)
  
- org.apache.tika.detect.NNTrainedModel
+   org.apache.tika.detect.NNTrainedModel
  
  Example model file (tika\tika-core\src\main\resources)
  
- org.apache.tika.detect.tika-example.nnmodel
+   org.apache.tika.detect.tika-example.nnmodel
  
  Unit test (tika\tika-core\src\test\java)
  
-  . org.apache.tika.detect. MimeDetectionWithNNTest
+   org.apache.tika.detect.MimeDetectionWithNNTest
