Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "ContentMimeDetection" page has been changed by Lukeliush: https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=25&rev2=26 Once we have preprocessed our inputs, i.e. byte histograms, we are then ready to train a model with a machine learning algorithm. - The Neural network can be seen as a function, in this case its input is a vector of the preprocessed histogram and its output simply is a yes/no (1 or 0); With neural network, we can actually have a probability that might tell how likely it believes a given input histogram is a GRB or non-GRB, again it is worth stressing that non-GRB is a huge class to be classified, we might need to have a s many negative training examples as possible, but again if we know what types we are dealing with, the problem might be further simplified with smaller set of classes; Also it is worthy noting, training with too many negative examples can also produce an unpromising result, in an extreme cases where you might have 10 positive examples and 10 million negative examples, with a huge different like this it is likely you might come up with a biased model towards the one that dump everything it has seen into negative class, so the choice of training set might be important, there are some cross-validation method that might help assuage this bias .e.g we can randomly pick some portion of negative training data, but again this leads to very thorough performance testing with each of the models you have trained with different data or even the training parameters such as a different regularized term, different structure (different layers and different number of weight units). In additions, the choice of the structure or the tuning parameters depend on how well the model fit the data, when it over-fits the training data, we might want to adjust the regularized terms or add more training data; but when it under-fits, we might also want to increase the complexity of the network structure, but again the choice of structure depends on the patterns hidden in the data. + The Neural network can be seen as a function, in this case its input is a vector of the preprocessed histogram and its output simply is a yes/no (1 or 0); With neural network, we can actually have a probability that might tell how likely it believes a given input histogram is a GRB or non-GRB, again it is worth stressing that non-GRB is a huge class to be classified, we might need to have a s many negative training examples as possible, but again if we know what types we are dealing with, the problem might be further simplified with smaller set of classes; Also it is worthy noting, training with too many negative examples can also produce an unpromising result, in an extreme cases where you might have 10 positive examples and 10 million negative examples, with that huge difference it is likely you might have a biased model towards the one that dump everything it has seen into negative class, so the choice of training set might be important, there are some cross-validation method that might help assuage this bias .e.g we can randomly pick some portion of negative training data, but again this leads to very thorough performance testing with each of the models you have trained with different data or even the training parameters such as a different regularized term, different structure (different layers and different number of weight units). In additions, the choice of the structure or the tuning parameters depend on how well the model fit the data, when it over-fits the training data, we might want to adjust the regularized terms or add more training data; but when it under-fits, we might also want to increase the complexity of the network structure, but again the choice of structure depends on the patterns hidden in the data. The linear logistic regression training seems to be less complex compared to neural network training, which can be implemented with svm, gradient descent, etc. It is a globally optimal solution as long as the data is linearly separable; and it is cheap in terms of computational complexity.
