[Tika Wiki] Update of "ContentMimeDetection" by Lukeliush

Apache Wiki Mon, 11 May 2015 00:07:15 -0700

The "ContentMimeDetection" page has been changed by Lukeliush:
https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=26&rev2=27

  
  Once we have preprocessed our inputs, i.e. byte histograms, we are then ready 
to train a model with a machine learning algorithm.
  
- The Neural network can be seen as a function, in this case its input is  a 
vector of the preprocessed histogram and its output simply is a yes/no (1 or 
0); With neural network, we can actually have a probability that might tell how 
likely it believes a given input histogram is a GRB or non-GRB, again it is 
worth stressing that non-GRB is a huge class to be classified, we might need to 
have a s many negative training examples as possible, but again if we know what 
types we are dealing with, the problem might be further simplified with smaller 
set of classes; Also it is worthy noting, training with too many negative 
examples can also produce an unpromising result, in an extreme cases where you 
might have 10 positive examples and 10 million negative examples, with that 
huge difference it is likely you might have a biased model towards the one that 
dump everything it has seen into negative class, so the choice of training set 
might be important, there are some cross-validation method that might help 
assuage this bias .e.g we can randomly pick some portion of negative training 
data, but again this leads to very thorough performance testing with each of 
the models you have trained with different data or even the training parameters 
such as a different regularized term, different structure (different layers and 
different number of weight units). In additions, the choice of the structure or 
the tuning parameters depend on how well the model fit the data, when it 
over-fits the training data, we might want to adjust the regularized terms or 
add more training data; but when it under-fits, we might also want to increase 
the complexity of the network structure, but again the choice of structure 
depends on the patterns hidden in the data.
+ The neural network can be seen as a function: in this case its input is a 
vector built from the preprocessed histogram, and its output is a yes/no 
decision (1 or 0). A neural network can also output a probability that tells 
how likely it believes a given input histogram is GRB rather than non-GRB. It 
is worth stressing that non-GRB is a huge class, so we might need as many 
negative training examples as possible; if we know which types we are dealing 
with, though, the problem can be simplified to a smaller set of classes. It is 
also worth noting that training with too many negative examples can produce a 
poor result. In an extreme case with 10 positive examples and 10 million 
negative examples, the model is likely to become biased toward putting 
everything it sees into the negative class, so the choice of training set is 
important. Some cross-validation methods might help reduce this bias, e.g. we 
can randomly pick only a portion of the negative training data 
(undersampling), but thorough performance testing is then needed for each 
model trained on different data or with different training parameters (e.g. a 
different regularization term or a different network structure). In addition, 
the choice of structure and tuning parameters depends on how well the model 
fits the data: when it over-fits the training data, we might want to adjust 
the regularization term or add more training data; when it under-fits, we 
might want to increase the complexity of the network structure. Again, the 
choice of structure depends on the patterns hidden in the data.
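
The idea of randomly picking a portion of the negative training data can be 
sketched roughly as follows. This is only an illustration, not the code used 
for the wiki experiments: the function name `undersample_negatives`, the 
`ratio` parameter and the toy data are all assumptions made for the example.

```python
import random


def undersample_negatives(examples, ratio=1.0, seed=0):
    """Keep every positive example and a random subset of the negatives.

    examples: list of (histogram, label) pairs; label 1 = GRB, 0 = non-GRB.
    ratio:    hypothetical knob, number of negatives kept per positive.
    """
    positives = [ex for ex in examples if ex[1] == 1]
    negatives = [ex for ex in examples if ex[1] == 0]

    # Never ask for more negatives than are available.
    keep = min(len(negatives), int(round(ratio * len(positives))))

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = positives + rng.sample(negatives, keep)
    rng.shuffle(balanced)
    return balanced


# Toy data (made up): 10 positives and 1000 negatives, with dummy histograms.
data = [([1.0, 0.0], 1)] * 10 + [([0.0, 1.0], 0)] * 1000
balanced = undersample_negatives(data, ratio=3)
n_pos = sum(label for _, label in balanced)
n_neg = len(balanced) - n_pos
```

Each choice of `ratio` and `seed` yields a different training set, so every 
model trained this way still has to be scored separately on the test set.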
  
- The linear logistic regression training seems to be less complex compared to 
neural network training, which can be implemented with svm, gradient descent, 
etc. It is a globally optimal solution as long as the data is linearly 
separable; and it is cheap in terms of computational complexity.
+ Training a linear logistic regression model is far less complex than 
training the neural network. It can be fitted with gradient descent and 
similar optimizers (a linear SVM is a closely related alternative). Because 
the logistic loss is convex, training heads toward a global optimum rather 
than a local one, and the model works well as long as the data is close to 
linearly separable. It is also computationally cheap, although that 
simplicity is traded for accuracy. Again, the choice of this algorithm depends 
on the data. In the tests with the collected GRB training data, the trained 
logistic regression model also seems to generalize well, with reasonably high 
accuracy.
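
As a rough illustration of how small such a model is, the sketch below trains 
a logistic regression classifier with plain batch gradient descent. The page 
does not say which implementation was used for the GRB tests, so the function 
names, the learning rate, the L2 penalty and the three-bin toy "histograms" 
are all assumptions chosen for the example.

```python
import math


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    """Probability that histogram x belongs to the positive class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(X, y, lr=0.5, epochs=2000, l2=0.001):
    """Batch gradient descent on the L2-regularized logistic loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # derivative of loss w.r.t. z
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        for j in range(d):
            w[j] -= lr * (grad_w[j] / n + l2 * w[j])
        b -= lr * grad_b / n
    return w, b


# Toy normalized histograms (made up): positives are heavy in the first bin.
X = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [0.2, 0.1, 0.7]]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
p_pos = predict_proba(w, b, [0.8, 0.1, 0.1])
p_neg = predict_proba(w, b, [0.1, 0.2, 0.7])
```

The small L2 term keeps the weights finite even when the toy data is perfectly 
separable; without it the weights would keep growing as training continues.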
  
  '''Evaluation''':
  
+ Once we finish training, we need to score our model and decide whether it 
meets our goal, so the Evaluation step is also very significant in the 
process; this is where the prepared test set is used. Again, the details of 
performance evaluation (recall, precision, ROC, etc.) are skipped here, but the 
idea is to decide whether our model meets our goals.
- Once we finish training, we need to score our model and decide whether the 
model meets our goal, then the knowledge Evaluation is also very significant in 
the process,
- 
- This is where the prepared test set is used. Again, the details of 
performance evaluation such as recall, precision, ROC, etc are skipped, but the 
idea is to decide whether our model meets our goals.
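
Although the details are skipped on the page, a minimal sketch of scoring 
predictions on a test set may help show why accuracy alone can mislead when 
the negative class dominates. The helper name `score` and the label lists are 
made up for illustration.

```python
def score(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Define the ratios as 0.0 when their denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Made-up test labels: 3 positives (GRB) and 5 negatives (non-GRB).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_model = [1, 1, 0, 1, 0, 0, 0, 0]        # a model that finds 2 of 3 positives
y_all_negative = [0] * len(y_true)        # a model that rejects everything

model_scores = score(y_true, y_model)
baseline_scores = score(y_true, y_all_negative)
```

Here the reject-everything baseline still gets 5 of 8 labels right but has a 
recall of zero, which is the biased behaviour the training discussion warns 
about.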
  
  '''Use of the knowledge'''
