[Tika Wiki] Update of "ContentMimeDetection" by Lukeliush

Apache Wiki Mon, 11 May 2015 17:55:04 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.


The "ContentMimeDetection" page has been changed by Lukeliush:
https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=28&rev2=29

  
  Some files are very large, and building byte histograms for them takes a 
significant amount of time. It is worth noting, however, that with 
domain-specific knowledge or heuristics (e.g. there may be crucial regions in 
the file that help with detection), we can further reduce the effort required 
to discover or mine the patterns used in type detection.
  
- Please also note, this content based mime detection does require users to 
have some knowledge with data mining and machine learning, and the choice of 
learning algorithms used in the pattern mining does not seem to matter, the 
knowledge to be mined is the classification, and there are many classification 
learning algorithms invented or reinvented, the question of which one is the 
best depends on a goal and data, each of the learning algorithms requires lots 
of effort with thorough performance testing and empirical analysis, and some 
data might be linearly separable, some are not; and a or a set of goals is very 
important as it often is in the context of performance tuning; we can also 
think about it as a performance tuning problem where we need to have a set of 
goals in terms of the scalability, complexity, accuracy, etc. And in order to 
set our goals, we might first need to understand our data, e.g. do we have 
enough data or what features do i need to use, do we need to transform input 
with high order function; and all those design question seem to matter the most 
and highly depend on the user-specific data, therefore we want to leave the 
choice of algorithms to users based on their goals and data in their 
environment.
+ Please also note that this content-based MIME detection does require users to 
have some knowledge of data mining and machine learning. The choice of learning 
algorithm used in the pattern mining does not seem to matter much in itself: 
the knowledge to be mined is a classification, and many classification 
algorithms have been invented or reinvented. Which one is best depends on the 
goal and the data, and each algorithm requires considerable effort in thorough 
performance testing and empirical analysis; some data are linearly separable 
and some are not. A goal, or a set of goals, is very important, as it often is 
in performance tuning; we can also think of this as a performance tuning 
problem in which we need a set of goals in terms of scalability, complexity, 
accuracy, etc. In order to set our goals, we might first need to understand our 
data, e.g. do we have enough data, which features do we need to use, and do we 
need to transform the input? All those design questions seem to matter the most 
and depend heavily on the user-specific data, and more importantly they largely 
affect the choice of algorithm. We therefore want to leave the choice of 
algorithm to users, based on their goals and the data in their environment.
  
  As an example, we have implemented two algorithms for classifying the GRB 
file type against non-GRB types: one is linear logistic regression (trained 
with gradient descent) and the other is a neural network (trained with 
back-propagation). The neural network is somewhat more complex to train and 
slower too, whereas logistic regression is far cheaper in terms of complexity; 
with the GRB data collected in our tests, logistic regression also gives good 
results with high accuracy. It is worth noting that it is always better to 
circumscribe the MIME types to be detected. The example model attempts to 
separate GRB files from non-GRB files, and one of the observed challenges is 
identifying the non-GRB file types, a class that can be enormously large. The 
best practice is again to circumscribe the set of types to be classified, and 
domain-specific knowledge comes into play in defining that set well for the 
user-specific environment.
  
@@ -54, +54 @@

  
  We need to split the dataset into three chunks: a training set, a validation 
set and a test set.
  
- We convert the stream of bytes to the histogram with 255 bins each of which 
stores a count of occurrences, [it is also flexible that you define your own 
input with smaller number of histogram bins or the selected bins based on the 
domain knowledge, you can also apply a feature selection algorithm such as SOM, 
PCA or LCA when the features space may be too huge (e.g. you might want to work 
with the entire bytes as input variables), and you can also transform the input 
variables with high-order function for the model to have non-linear effect, 
there are also many other practical tricks to achieve training a good model, 
but most of them might require a bit understanding with the application domain 
(i.e. in this case, the file types to be classified); To begin with, we 
probably need to understand our goal and the data  (domain if possible), 
usually we need to visualize the data and we start with some simple algorithms 
to explore the data and then decide whether a more complex algorithm or 
function is needed].
+ We convert the stream of bytes to a histogram with 255 bins, each of which 
stores a count of occurrences. [You are also free to define your own input with 
a smaller number of histogram bins, or with bins selected on the basis of 
domain knowledge. You can apply a feature selection algorithm such as SOM, PCA 
or LCA when the feature space may be too large (e.g. you might want to work 
with the entire byte sequence as input variables), and you can transform the 
input variables with a custom function (or a kernel with an SVM) to give the 
model a non-linear effect. There are many other practical tricks for training a 
good model, but most of them require some understanding of the application 
domain (i.e. in this case, the file types to be classified). To begin with, we 
probably need to understand our goal and the data (and the domain if possible); 
usually we need to visualize the data, start with some simple algorithms to 
explore it, and then decide whether a more complex algorithm or function is 
needed.]
  
  Our training data have 255 features, each of which corresponds to a byte 
value, and each training example is labelled with an actual output indicating 
its class.
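
The histogram and dataset-split steps described above can be sketched as 
follows. This is an illustrative sketch only, not the page author's 
implementation. It assumes 256 bins (one per possible byte value, 0 through 
255), which differs from the 255 quoted above and should be adjusted to match 
the model actually used. The function names, the max_bytes parameter (a simple 
way to restrict counting to a leading region of a large file) and the 60/20/20 
split fractions are hypothetical.

```python
import random


def byte_histogram(data, max_bytes=None):
    """Count how often each byte value occurs in `data` (a bytes object).

    If max_bytes is given, only the first max_bytes bytes are counted,
    which keeps the cost bounded for very large inputs.
    """
    if max_bytes is not None:
        data = data[:max_bytes]
    counts = [0] * 256  # assumption: one bin per possible byte value
    for b in data:      # iterating over bytes yields ints in Python 3
        counts[b] += 1
    return counts


def normalize(counts):
    """Scale counts to relative frequencies so file size does not dominate."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def split_dataset(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle and split into (training, validation, test) lists.

    Whatever is left after the training and validation portions
    becomes the test set.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Each labelled example pairs a feature vector with its class (1 or 0).
sample = (normalize(byte_histogram(b"example bytes", max_bytes=1024)), 1)
```

Normalizing the counts is an optional design choice; raw counts also work as 
features, but they make files of different sizes harder to compare.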
