[Tika Wiki] Update of "ContentMimeDetection" by Lukeliush

Apache Wiki Sun, 10 May 2015 23:56:09 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.


The "ContentMimeDetection" page has been changed by Lukeliush:
https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=25&rev2=26

  
  Once we have preprocessed our inputs, i.e. byte histograms, we are then ready 
to train a model with a machine learning algorithm.
  
- The Neural network can be seen as a function, in this case its input is  a 
vector of the preprocessed histogram and its output simply is a yes/no (1 or 
0); With neural network, we can actually have a probability that might tell how 
likely it believes a given input histogram is a GRB or non-GRB, again it is 
worth stressing that non-GRB is a huge class to be classified, we might need to 
have a s many negative training examples as possible, but again if we know what 
types we are dealing with, the problem might be further simplified with smaller 
set of classes; Also it is worthy noting, training with too many negative 
examples can also produce an unpromising result, in an extreme cases where you 
might have 10 positive examples and 10 million negative examples, with a huge 
different like this it is likely you might come up with a biased model towards 
the one that dump everything it has seen into negative class, so the choice of 
training set might be important, there are some cross-validation method that 
might help assuage this bias .e.g we can randomly pick some portion of negative 
training data, but again this leads to very thorough performance testing with 
each of the models you have trained with different data or even the training 
parameters such as a different regularized term, different structure (different 
layers and different number of weight units). In additions, the choice of the 
structure or the tuning parameters depend on how well the model fit the data, 
when it over-fits the training data, we might want to adjust the regularized 
terms or add more training data; but when it under-fits, we might also want to 
increase the complexity of the network structure, but again the choice of 
structure depends on the patterns hidden in the data.
+ The neural network can be seen as a function: its input is a vector of the 
preprocessed histogram and its output is simply a yes/no (1 or 0). In practice 
the network outputs a probability that tells how likely it believes a given 
input histogram is GRB rather than non-GRB. It is worth stressing that non-GRB 
is a huge class, so we may need as many negative training examples as possible; 
if we know which types we are dealing with, the problem can be simplified to a 
smaller set of classes. It is also worth noting that training with too many 
negative examples can produce a poor result: in an extreme case with 10 
positive examples and 10 million negative examples, the model is likely to be 
biased towards dumping everything it sees into the negative class. The choice 
of training set is therefore important. Randomly picking a portion of the 
negative training data, checked with cross-validation, can help reduce this 
bias, but it calls for thorough performance testing of each model trained with 
different data or different training parameters, such as a different 
regularization term or a different structure (different numbers of layers and 
of units per layer). In addition, the choice of structure and tuning parameters 
depends on how well the model fits the data: when it over-fits the training 
data, we might adjust the regularization term or add more training data; when 
it under-fits, we might increase the complexity of the network structure. 
Again, the choice of structure depends on the patterns hidden in the data.
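
The random selection of negative examples described above could look roughly
like the following. This is only a minimal sketch: the function name
`undersample_negatives`, the `ratio` parameter, and the representation of each
example as a list of histogram values are assumptions made for illustration,
not part of Tika.

```python
import random


def undersample_negatives(positives, negatives, ratio=1.0, seed=0):
    """Build a labelled training set that keeps every positive example and
    at most ratio * len(positives) randomly chosen negative examples.

    positives, negatives: lists of feature vectors (e.g. byte histograms).
    Returns a shuffled list of (features, label) pairs, label 1 or 0.
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    keep = min(len(negatives), int(ratio * len(positives)))
    sampled = rng.sample(negatives, keep)
    data = [(x, 1) for x in positives] + [(x, 0) for x in sampled]
    rng.shuffle(data)
    return data
```

Training several models on different random draws (different `seed` values)
and comparing them on a held-out set is one way to check that the result does
not depend on a single lucky or unlucky sample.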
  
  Training a linear logistic regression model is less complex than training a 
neural network; it can be done with gradient descent, among other methods (a 
linear SVM is a related alternative classifier). Because its loss function is 
convex, training converges to a global optimum, although the model only 
classifies well when the data is close to linearly separable; it is also cheap 
in terms of computational complexity.
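
As a rough illustration of how little is involved, logistic regression can be
trained with plain batch gradient descent in a few lines. This is a sketch
under assumptions, not the Tika implementation: the names `train_logistic` and
`predict_proba`, the learning rate, the epoch count, and the L2 penalty `l2`
are all hypothetical choices.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(samples, labels, lr=0.5, epochs=500, l2=0.01):
    """Fit weights w and bias b by batch gradient descent on the
    L2-regularized logistic loss. samples: list of equal-length vectors."""
    n, d = len(samples), len(samples[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(samples, labels):
            # Prediction error for this example drives the gradient.
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        # Average gradient plus the L2 penalty on the weights (not the bias).
        w = [wi - lr * (g / n + l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that x belongs to the positive class."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

With real byte histograms each input vector would have 256 entries; the
regularization term keeps the weights finite even when the two classes are
perfectly separable.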
