[Tika Wiki] Update of "ContentMimeDetection" by Lukeliush

Apache Wiki Mon, 11 May 2015 17:47:37 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.


The "ContentMimeDetection" page has been changed by Lukeliush:
https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=27&rev2=28

  
  Some files are very large, and building byte histograms for them takes a 
significant amount of time. With domain-specific knowledge or heuristics (e.g. 
there might be crucial regions in the file that help with detection), however, 
we can further reduce the effort required for knowledge discovery or for mining 
the particular patterns used in type detection.
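
A minimal sketch of the "crucial region" idea follows, assuming that the
interesting region can be described by a byte offset and a length. The function
name `region_histogram`, its defaults, and the choice of one bin per possible
byte value (256) are assumptions made for illustration; they are not taken from
the Tika code.

```python
def region_histogram(path, offset=0, length=4096, chunk_size=65536):
    """Count byte values in path[offset:offset + length] only.

    Hypothetical helper: the region (offset, length) would come from domain
    knowledge about the file type, so the rest of a huge file is never read.
    """
    counts = [0] * 256  # assumption: one bin per possible byte value
    remaining = length
    with open(path, "rb") as f:
        f.seek(offset)
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:  # reached end of file before the region ended
                break
            for b in chunk:
                counts[b] += 1
            remaining -= len(chunk)
    return counts
```

Reading in bounded chunks keeps memory use flat, and the cost depends on the
size of the chosen region rather than on the size of the file.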
  
- Please also note that this content-based MIME detection requires users to 
have some knowledge of data mining and machine learning. The choice of learning 
algorithm used in the pattern mining does not seem to matter: the knowledge to 
be mined is a classification, and many classification learning algorithms have 
been invented or reinvented. Which one is best depends on the goal and the 
data. Each learning algorithm requires a lot of effort, specifically thorough 
performance tuning, testing and empirical analysis, and some data might be 
linearly separable while other data is not. Having a goal, or a set of goals, 
is very important, as it often is in performance tuning; we can think of this 
as a performance tuning problem in which we need a set of goals in terms of 
scalability, complexity or accuracy. We therefore want to leave the choice of 
algorithm to users, based on their goals and the data in their environment. As 
an example, we have implemented two algorithms for classifying the GRB file 
type from non-GRB types: linear logistic regression (gradient descent) and a 
neural network (back-propagation). The neural network with back-propagation is 
a bit more complex to train, and logistic regression is far cheaper in terms of 
complexity. With the collected GRB data, logistic regression also gives a good 
result with high accuracy. It is worth noting that it is always better to 
circumscribe the MIME types to be detected. The example model attempts to 
classify GRB files from non-GRB files, and one of the observed challenges is 
identifying the non-GRB file types, whose class can be enormously large. The 
best practice is again to circumscribe the set of types to be classified, and 
domain-specific knowledge comes into play in defining that set well for the 
user-specific environment.
+ Please also note that this content-based MIME detection requires users to 
have some knowledge of data mining and machine learning. The choice of learning 
algorithm used in the pattern mining does not seem to matter: the knowledge to 
be mined is a classification, and many classification learning algorithms have 
been invented or reinvented. Which one is best depends on the goal and the 
data. Each learning algorithm requires a lot of effort, with thorough 
performance testing and empirical analysis, and some data might be linearly 
separable while other data is not. Having a goal, or a set of goals, is very 
important, as it often is in performance tuning; we can think of this as a 
performance tuning problem in which we need a set of goals in terms of 
scalability, complexity, accuracy, etc. To set those goals, we might first need 
to understand our data: do we have enough data, which features do we need, and 
do we need to transform the input with a high-order function? These design 
questions seem to matter the most and depend heavily on the user-specific data, 
so we want to leave the choice of algorithm to users, based on their goals and 
the data in their environment.
+ 
+ As an example, we have implemented two algorithms for classifying the GRB 
file type from non-GRB types: linear logistic regression (gradient descent) and 
a neural network (back-propagation). The neural network with back-propagation 
is a bit more complex to train, and slower too, whereas logistic regression is 
far cheaper in terms of complexity. With the GRB data collected in our tests, 
logistic regression also gives a good result with high accuracy. It is worth 
noting that it is always better to circumscribe the MIME types to be detected. 
The example model attempts to classify GRB files from non-GRB files, and one of 
the observed challenges is identifying the non-GRB file types, whose class can 
be enormously large. The best practice is again to circumscribe the set of 
types to be classified, and domain-specific knowledge comes into play in 
defining that set well for the user-specific environment.
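
As a rough illustration of the cheaper of the two approaches named above,
logistic regression trained by batch gradient descent can be sketched as
follows. This is not the implementation used for the GRB experiments; the
function names, learning rate and epoch count are assumptions chosen for the
sketch, and a real model would be tuned against the user's own data and goals.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, b, x):
    """Probability that feature vector x belongs to the positive class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(X, y, lr=0.1, epochs=500):
    """Fit weights w and bias b by batch gradient descent on the log loss.

    X is a list of feature vectors, y a list of 0/1 labels
    (e.g. 1 = GRB, 0 = non-GRB). lr and epochs are illustrative defaults.
    """
    m = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi  # derivative of log loss w.r.t. z
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / m for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b
```

A file would then be classified as the positive type when `predict` returns a
value above a chosen threshold such as 0.5.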
  
  This feature could also enhance identification safety, because it only trusts 
files whose byte histogram patterns are similar to those seen in its training 
set. This has pros and cons. One pro, as mentioned, is that it strengthens the 
security aspect of MIME type identification. The cons are slow detection, since 
computing the byte histogram requires reading all the bytes of a file, and 
possible myopia toward the training data, which might be biased or 
unrepresentative.
  
@@ -52, +54 @@

  
  We need to split the dataset into 3 chunks: a training set, a validation set 
and a test set.
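
The three-way split can be sketched as below. The 60/20/20 proportions, the
fixed seed and the name `split_dataset` are assumptions for illustration only;
the page does not say which proportions were used.

```python
import random


def split_dataset(examples, train=0.6, validation=0.2, seed=0):
    """Shuffle examples and cut them into (training, validation, test).

    Whatever is left after the training and validation shares becomes the
    test set. The proportions are illustrative, not prescribed.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatable splits
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )
```

Shuffling before cutting avoids a split that follows the order in which the
files were collected.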
  
- We convert the stream of bytes to a histogram with 255 bins, each of which 
stores a count of occurrences. [You can define your input with a smaller number 
of histogram bins, or with bins selected on the basis of domain knowledge. You 
can also apply a feature selection algorithm such as SOM, PCA or LCA when the 
feature space is too large (e.g. you might want to work with the entire bytes 
as the input), and you can apply your own custom functions, such as power or 
sqrt, to the input variables so that the model has a non-linear effect. There 
are many other practical tricks for training a good model, but most of them 
require some understanding of the application domain (i.e. in this case, the 
file types to be classified). To begin with, we probably need to understand our 
goal and the data (and the domain if possible); usually we visualize the data 
and start with some simple algorithms to explore it, and then decide whether a 
more complex algorithm or function is needed.]
+ We convert the stream of bytes to a histogram with 255 bins, each of which 
stores a count of occurrences. [The input is flexible: you can define your own 
input with a smaller number of histogram bins, or with bins selected on the 
basis of domain knowledge. You can also apply a feature selection algorithm 
such as SOM, PCA or LCA when the feature space may be too large (e.g. you might 
want to work with the entire bytes as input variables), and you can transform 
the input variables with a high-order function so that the model has a 
non-linear effect. There are many other practical tricks for training a good 
model, but most of them require some understanding of the application domain 
(i.e. in this case, the file types to be classified). To begin with, we 
probably need to understand our goal and the data (and the domain if possible); 
usually we visualize the data and start with some simple algorithms to explore 
it, and then decide whether a more complex algorithm or function is needed.]
  
  Our training data have 255 features, each of which corresponds to a byte 
value, and each training example is labelled with an actual output indicating 
its class.
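
Building one labelled training example from raw bytes might look like the
sketch below. It assumes one bin per possible byte value (256 bins), which
differs from the 255 stated above and should be adjusted if a value is
deliberately excluded. The names `byte_histogram` and `make_example` are
hypothetical, and the optional transform stands in for whatever input
transformation a user chooses.

```python
import math

NUM_BINS = 256  # assumption: one bin per possible byte value


def byte_histogram(data):
    """Return a list of NUM_BINS counts, one per byte value in data."""
    counts = [0] * NUM_BINS
    for b in data:  # iterating over bytes yields ints in 0..255
        counts[b] += 1
    return counts


def make_example(data, label, transform=None):
    """Pair the histogram features of data with its class label.

    transform, if given, is applied to every bin; passing math.sqrt is one
    way to give a linear model a non-linear view of the counts.
    """
    features = [float(c) for c in byte_histogram(data)]
    if transform is not None:
        features = [transform(v) for v in features]
    return features, label
```

For instance, `make_example(content, 1, transform=math.sqrt)` would produce a
positive example with square-rooted counts, where `content` holds the bytes of
a file known to be of the target type.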
