[Tika Wiki] Update of "ContentMimeDetection" by Lukeliush

Apache Wiki Sun, 10 May 2015 13:33:06 -0700

The "ContentMimeDetection" page has been changed by Lukeliush:
https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=3&rev2=4

  
  TIKA-1582 extends Tika MIME detection with detection based on file contents, 
i.e. file byte histograms, and follows a standard data mining process that 
extracts knowledge from the data (bytes). The motivation is to offer users a 
content-based detection option. The content can be defined in several ways: 
the entire file bytes, byte n-grams, byte histograms, etc. In this feature, the 
content byte histogram is used.
  
- Some files are very huge in size, building byte histograms for those files 
requires significant amount time, but it is worth noting that with domain 
specific knowledge or the heuristics (e.g. there might be some crucial and 
critical regions in the file that could help with the detection.), we can 
further reduce the amount of effort required for knowledge discovery or mining 
particular patterns that we can use in the type detection.
+ Some files are very large, and building byte histograms for them takes a 
significant amount of time. With domain-specific knowledge or heuristics (e.g. 
knowing that certain critical regions of a file help with detection), we can 
reduce the effort required to discover the patterns that we use in type 
detection.
  
- Please also note, this content based mime detection does require users to 
have some knowledge with data mining and machine learning, and the choice of 
learning algorithms used in the pattern mining does not seem to matter, the 
knowledge to be mined is actually the classification, and there are many 
classification learning algorithms invented or revented, the question of which 
one is the best depends on a goal and data, each of the learning algorithms 
requires lots of effort for performance testing, and some data might be linear 
sepeartable, some are not; and a or a set of goals is very important as it 
often is in the context of performance tuning; we can also think about it as a 
performance tuning problem where we need to set a set of goals in terms of the 
scalability, complexity or accuracy, so we want to leave the choice of 
algorithms to users based on their goals and data in their enviroment. As an 
example, we have actually implemented two algorithms for mining patterns with 
the GRB file types, one is linear logistic regression and the other is neural 
network. Again, the neural network with back-propagation is a bit more complex 
with training, and logistic regression is far cheaper in terms of complexity, 
and it turns out that logistic regression also gives a good result with high 
accuracy, and it is worthy noting that it is always better to circumscribe the 
mime types to be detected; in the example model we have built, we attempt to 
classify grb files from non-grb files, and one of the challenges is to identify 
the non-grb file types whose class can be enormously large, the best practice 
is to circumscribe a set of types to be classified, again domain specific 
knowledge come into the play for well-defining a set of types in the user 
specific enviroment.
+ Please also note that content-based MIME detection requires users to have 
some knowledge of data mining and machine learning. The choice of learning 
algorithm for the pattern mining is left open: the knowledge to be mined is a 
classification, and many classification algorithms exist. Which one is best 
depends on the goal and the data. Each algorithm requires considerable 
performance testing, and some data is linearly separable while other data is 
not. Choosing an algorithm can be viewed as a performance tuning problem in 
which we set goals for scalability, complexity or accuracy, so we leave the 
choice to users, based on the goals and data in their environment. As an 
example, we have implemented two algorithms for mining patterns in the GRB file 
type: linear logistic regression and a neural network. The neural network with 
back-propagation is somewhat more complex to train, while logistic regression 
is far cheaper in terms of complexity and also gives good results with high 
accuracy. It is also worth noting that it is always better to circumscribe the 
MIME types to be detected. In the example model we have built, we classify GRB 
files against non-GRB files, and one of the challenges is that the non-GRB 
class can be enormously large. The best practice is to circumscribe the set of 
types to be classified; again, domain-specific knowledge comes into play in 
defining that set for the user's environment.
  
- This approach could also enhance identification safty, so it only trusts the 
file with the type which has the similar byte histogram pattern it has seen in 
the training, this has pros and cons, the pros is that it enhance the security 
aspect of the file type identification, but the cons is slow detection which 
requires the reading the entire bytes of a file for computing the byte 
histogram and it might be also myopic to the training data which might be less 
representative.
+ This approach could also enhance identification safety, because it only 
trusts a file's type when the file has a byte histogram pattern similar to one 
seen in training. This has pros and cons. The pro is that it strengthens the 
security aspect of file type identification. The cons are slow detection, 
since the entire file must be read to compute the byte histogram, and possible 
myopia toward training data that may not be representative.
  
  Methods:
  
  As mentioned, the content-based mime detection follows a standard data mining 
process:
  
- Raw data - > feature selection and data cleaning -> preprocessing and 
transformation -> learning patterns(machine learning) -> knowledge evaluation 
-> the use of knowledge(prediction/classification) In Tika.
+ Raw data -> feature selection and data cleaning -> pre-processing and 
transformation -> learning patterns (machine learning) -> knowledge evaluation 
-> the use of knowledge (prediction/classification) in TIKA.
  
- (It is worth noting that the feature selection requires learning the 
application domain which in our case is specific to the user domain and 
enviroment)
+ (It is worth noting that the feature selection requires learning the 
application domain which in our case is specific to the user domain and 
environment)
  
- Also please note the model has to be ready befored it can be used in Tika; by 
"ready", we mean the model has to pass the final knowledge evaluation test. As 
shall be seen shortly, as an example Tika is only implementing the prediction 
phase, so the model parameters need to be loaded and read into tika for 
prediction or classification; The process of training can be lengthy and 
tedious, sometimes training might require parallel computation on e.g. 
map-reduce when training data is too large to fit memory, again this depends on 
the user's goal.
+ Also please note that the model has to be ready before it can be used in 
Tika; by "ready", we mean the model has passed the final knowledge evaluation 
test. As shall be seen shortly, Tika implements only the prediction phase in 
this example, so the model parameters need to be loaded into Tika for 
prediction or classification. The training process can be lengthy and tedious, 
and may require parallel computation (e.g. map-reduce) when the training data 
is too large to fit in memory; again, this depends on the user's goal.
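
The prediction-only step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Tika implementation: the JSON parameter format and the names `load_model` and `predict` are hypothetical, and the model is assumed to be a linear logistic classifier trained offline.

```python
import json
import math


def load_model(text):
    """Parse model parameters from a JSON string (hypothetical format)."""
    params = json.loads(text)
    return params["weights"], params["bias"]


def predict(weights, bias, features, threshold=0.5):
    """Return (label, probability) for one feature vector."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    z = max(-60.0, min(60.0, z))  # clamp so math.exp cannot overflow
    prob = 1.0 / (1.0 + math.exp(-z))
    return (1 if prob >= threshold else 0), prob


# Hypothetical parameters exported by an offline training run.
model_text = '{"weights": [2.0, -2.0], "bias": 0.0}'
weights, bias = load_model(model_text)
label, prob = predict(weights, bias, [0.9, 0.1])
```

Keeping training outside the detector means only the parameters and this small scoring function have to be present at detection time.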
  
  ''The following will briefly walk you through how the feature and example are 
implemented in this data problem.''
  
- Please also refer to the code repo for details of implemenation for training 
or preparing for a model, the neural network and logistic regression learning 
are implemented in R and the following describes the preprocessing and learning 
implemenation in R.
+ Please also refer to the code repo for details of the implementation for 
training or preparing a model. The neural network and logistic regression 
learning are implemented in R, and the following describes the pre-processing 
and learning implementation in R.
  
  
  
@@ -80, +80 @@

  
  All of the sets are treated as matrices which need to be saved as files; 
those files are loaded into the R program through ‘loadAndProcess.R’.
  
- '''Preprocessing'''
+ '''Pre processing'''
  
  1)      Read the byte content of the file and build the byte histogram.
  
  Build frequencies by dividing each bin value by the maximum bin count, so 
that each bin value falls in the range between 0 and 1.
  
- as isome files some bytes have higher frequencies whereas other bytes are 
less frequent, or in a critical situation, some files have only one or two bins 
that occupy the majority of the count, this makes a large gap between the most 
frequent and less frequent, the solution is to apply a companding function - A 
law or u law; square-rooting the bin values also provide the same effect, so by 
considering the computational cost, the square-root is chosen to enhance the 
histogram detail in place of A law or u law.
+ In some files, some bytes have higher frequencies whereas other bytes are 
less frequent; in an extreme case, only one or two bins account for the 
majority of the count. This leaves a large gap between the most frequent and 
the less frequent bins. The solution is to apply a companding function such as 
A-law or μ-law. Taking the square root of the bin values has a similar effect, 
so, considering the computational cost, the square root is chosen in place of 
A-law or μ-law to enhance the histogram detail.
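
The pre-processing steps above (byte histogram, division by the maximum bin count, square-root companding) can be sketched in Python as below. The project's own pre-processing is written in R; this is only an illustrative sketch, and the function name `byte_histogram_features` is hypothetical.

```python
import math
from collections import Counter


def byte_histogram_features(data):
    """Return 256 features: max-normalized, square-rooted byte counts."""
    counts = Counter(data)  # iterating over bytes yields ints in 0..255
    max_count = max(counts.values(), default=0)
    if max_count == 0:
        # Empty input: no bytes seen, so every bin is zero.
        return [0.0] * 256
    # Dividing by the largest bin maps values into [0, 1]; the square root
    # then lifts small bins so that rare bytes are not drowned out.
    return [math.sqrt(counts.get(b, 0) / max_count) for b in range(256)]


# Toy input: four distinct header bytes followed by 96 zero bytes.
features = byte_histogram_features(b"GRIB" + bytes(96))
```

In this toy input the zero byte dominates, so its bin is 1.0, while the bin for `G` becomes sqrt(1/96) rather than 1/96, which is the gap-narrowing effect the square root is chosen for.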
  
  
  
@@ -167, +167 @@

  
  Once we have preprocessed our inputs, i.e. byte histograms, we are then ready 
to train a model with a machine learning algorithm.
  
- The Neural network can be seen as a function, in this case its input is  a 
vector of the preprocessed histogram and its output simply is a yes/no (1 or 
0); With neural network, we can actually have a probability that might tell how 
likely it believes a given input histogram is a GRB or non-GRB, again it is 
worth stressing that non-GRB is a huge class to be classified, we might need to 
have a s many negative training examples as possible, but again if we know what 
types we are dealing with, the problem might be further simplified with smaller 
set of classes; Also it is worthy noting, training with too many negative 
examples can also produce an unpromising result, in an extreme cases where you 
might have 10 positve examples and 10 million negative examples, with a huge 
different like this it is likely you might come up with a biased model towards 
the one that dump everything it has seen into negative class, so the choice of 
training set might be important, there are some cross-validation method that 
might help assuage this bias .e.g we can randomly pick some portion of negative 
training data, but again this leads to very thorough performance testing with 
each of the models you have trained with different data or even the training 
parameters such as a different regularized term, different structure (different 
layers and different number of weight units). In addtions, the choice of the 
structure or the tuning parameters depend on how well the model fit the data, 
when it overfits the training data, we might want to adjust the regularized 
terms or add more training data; but when it underfits, we might also want to 
increase the complexity of the network structure, but again the choice of 
strucutre depends on the patterns hidden in the data.
+ The neural network can be seen as a function: its input is a vector of the 
preprocessed histogram and its output is simply a yes/no (1 or 0). With a 
neural network we can also obtain a probability that tells how likely it 
believes a given input histogram is GRB or non-GRB. Again, it is worth 
stressing that non-GRB is a huge class, so we might need as many negative 
training examples as possible; but if we know what types we are dealing with, 
the problem can be simplified to a smaller set of classes. It is also worth 
noting that training with too many negative examples can produce a poor 
result: in an extreme case with 10 positive examples and 10 million negative 
examples, a difference this large is likely to yield a model biased toward 
putting everything it sees into the negative class. The choice of training set 
is therefore important. Some cross-validation methods might help reduce this 
bias, e.g. we can randomly pick a portion of the negative training data, but 
this leads to very thorough performance testing of each model trained with 
different data or different training parameters, such as a different 
regularization term or a different structure (different layers and different 
numbers of units). In addition, the choice of structure or tuning parameters 
depends on how well the model fits the data: when it over-fits the training 
data, we might adjust the regularization term or add more training data; when 
it under-fits, we might increase the complexity of the network structure. 
Again, the choice of structure depends on the patterns hidden in the data.
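
Randomly picking a portion of the negative training data, as mentioned above, can be sketched as simple random undersampling. This is a minimal sketch under assumptions: the function name `undersample_negatives`, the `ratio` parameter and the placeholder example names are all hypothetical, and a suitable ratio has to be found through testing.

```python
import random


def undersample_negatives(positives, negatives, ratio=1.0, seed=0):
    """Keep all positives and at most ratio * len(positives) random negatives."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    keep = min(len(negatives), int(ratio * len(positives)))
    sampled = rng.sample(negatives, keep)  # sampling without replacement
    return [(x, 1) for x in positives] + [(x, 0) for x in sampled]


# Placeholder examples standing in for histogram feature vectors.
positives = [f"grb_{i}" for i in range(10)]
negatives = [f"other_{i}" for i in range(10_000)]
balanced = undersample_negatives(positives, negatives, ratio=3.0)
```

Each different sample of negatives gives a different training set, so every resulting model still needs its own evaluation.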
  
  
  
- The linear logistic regression training seems to be less complex compared to 
neural network training, which can be implemented with svm, gradient descent, 
etc. It is a globably optimal solution as long as the data is linearly 
seperable; and it is cheap in terms of computational complexity.
+ Linear logistic regression training is less complex than neural network 
training, and it can be implemented with gradient descent, etc. (a linear SVM 
is a comparable linear alternative). Its cost function is convex, so training 
does not get stuck in local optima, and the model separates the classes well 
as long as the data is linearly separable; it is also cheap in terms of 
computational complexity.
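
As an illustration of how logistic regression can be trained with gradient descent, here is a minimal pure-Python sketch. It is not the project's R code: the function names, learning rate, epoch count and toy data are assumptions chosen only to keep the example small.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, lr=0.5, epochs=2000, l2=0.0):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # l2 is an optional regularization strength applied to the weights.
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy, linearly separable data standing in for histogram features.
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
```

With real histograms each row of `X` would have 256 entries, and the learning rate and regularization strength would need tuning against a validation set.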
  
  '''Evaluation''':
  
- Once we finish training, we need to score our model and decide whether the 
model meets our goal, then the knowledege Evaluation is also very significant 
in the process,
+ Once we finish training, we need to score our model and decide whether the 
model meets our goal; knowledge evaluation is therefore a very significant step 
in the process.
  
  This is where the prepared test set is used. Again, the details of 
performance evaluation such as recall, precision, ROC, etc are skipped, but the 
idea is to decide whether our model meets our goals.
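
For readers who want a concrete starting point, precision and recall on a test set can be computed as in the following Python sketch. The labels are made up for illustration only and are not results from the GRB model; the function name `precision_recall` is hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels for illustration only; 1 = GRB, 0 = non-GRB.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
```

Whether a given precision or recall is acceptable depends on the goals set before training.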
