[Tika Wiki] Update of "ContentMimeDetection" by Lukeliush

Apache Wiki Sun, 10 May 2015 15:42:47 -0700

The "ContentMimeDetection" page has been changed by Lukeliush:
https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=20&rev2=21

  
  We need to split the dataset into 3 chunks, training set, validation set and 
test set.
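
A three-way split like the one described above could be sketched as follows. This is only an illustration: the function name `split_dataset`, the 60/20/20 proportions and the fixed seed are assumptions, none of them come from the page.

```python
import random


def split_dataset(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle examples and split them into training, validation and test lists.

    The fractions are illustrative defaults (an assumption, not taken from the
    wiki page); whatever is left after training and validation becomes the
    test set.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a non-empty test set")

    shuffled = list(examples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test
```

Shuffling before slicing matters when the files were collected type by type; otherwise one chunk could end up holding only a single class.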
  
- We convert the stream of bytes to the histogram with 255 bins each of which 
stores a count of occurrences, [you can define your input with smaller number of 
histogram bins or the selected bins based on the domain knowledge, you can also 
apply a feature selection algorithm such as SOM, PCA or LCA when the features 
space is too huge (e.g. you might want to work with the entire bytes as the 
input), and you can also apply own custom functions on the input variables for 
the model to have non-linear effect, but to begin with, we need to understand 
our goal and the data, usually we need to visualize the data and we start with 
some simple algorithms to explore the data and then decide whether a more 
complex algorithm is needed].
+ We convert the stream of bytes to a histogram with 255 bins, each of which 
stores a count of occurrences. [You can define your input with a smaller number 
of histogram bins, or with bins selected using domain knowledge. You can also 
apply a feature selection algorithm such as SOM, PCA or LCA when the feature 
space is too large (e.g. you might want to work with the entire byte sequence 
as the input), and you can apply your own custom functions, such as power or 
sqrt, to the input variables so that the model has a non-linear effect. There 
are also many other practical tricks for training a good model, but most of 
them might require some understanding of the application domain (i.e. in this 
case, the file types to be classified). To begin with, we probably need to 
understand our goal and the data (and the domain if possible); usually we 
visualize the data and start with some simple algorithms to explore it, and 
then decide whether a more complex algorithm or function is needed.]
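
The byte histogram and the optional sqrt transform described above might look like the sketch below. It is not Tika's implementation. The names `byte_histogram` and `sqrt_features` are hypothetical, dividing by the total count is an added assumption so that files of different sizes are comparable, and the sketch uses 256 bins (one per possible byte value, 0 to 255), whereas the page text says 255.

```python
import math

# Assumption: one bin per possible byte value (0..255). The page says 255 bins.
NUM_BINS = 256


def byte_histogram(data):
    """Count how often each byte value occurs in a bytes object."""
    counts = [0] * NUM_BINS
    for value in data:  # iterating over bytes yields ints in 0..255
        counts[value] += 1
    return counts


def sqrt_features(counts):
    """Turn raw counts into relative frequencies, then apply sqrt.

    The sqrt is one example of a custom non-linear function on the inputs;
    the normalisation by the total is an assumption of this sketch.
    """
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [math.sqrt(count / total) for count in counts]
```

A labelled training example would then pair such a feature vector with its class, for instance `(sqrt_features(byte_histogram(file_bytes)), "application/pdf")`.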
  
- Our training data have the 255 features each of which correspond to a byte, 
and each training example is labelled with an actual output indicating its 
class.
+ Our training data have the 255 features each of which corresponds to a byte, 
and each training example is labelled with an actual output indicating its 
class.
  
  '''Pre processing'''
