Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "ContentMimeDetection" page has been changed by Lukeliush:
https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=24&rev2=25

  Some files are very large, and building byte histograms for them takes a significant amount of time. With domain-specific knowledge or heuristics (for example, certain critical regions of a file may be enough to help with detection), we can further reduce the effort required to discover or mine the patterns used in type detection.
- Please also note that this content-based MIME detection requires users to have some knowledge of data mining and machine learning. The choice of learning algorithm used in the pattern mining does not seem to matter: the knowledge to be mined is actually the classification, and many classification algorithms have been invented or reinvented. Which one is best depends on the goal and the data. Each learning algorithm requires a lot of effort for performance testing, and some data are linearly separable while others are not. A goal, or a set of goals, is very important, as it often is in performance tuning; we can treat this as a performance tuning problem in which we need to set goals for scalability, complexity or accuracy. We therefore leave the choice of algorithm to users, based on their goals and the data in their environment. As an example, we have implemented two algorithms for classifying the GRB file type against non-GRB types: linear logistic regression and a neural network.
Again, the neural network with back-propagation is somewhat more complex to train, while logistic regression is far cheaper in terms of complexity, and it turns out that logistic regression also gives good results with high accuracy. It is worth noting that it is always better to circumscribe the MIME types to be detected. In the example model we have built, we attempt to classify GRB files against non-GRB files, and one of the challenges is identifying the non-GRB file types, whose class can be enormously large. The best practice is again to circumscribe the set of types to be classified; domain-specific knowledge comes into play in defining that set for the user's environment.
+ Please also note that this content-based MIME detection requires users to have some knowledge of data mining and machine learning. The choice of learning algorithm used in the pattern mining does not seem to matter: the knowledge to be mined is the classification, and many classification algorithms have been invented or reinvented. Which one is best depends on the goal and the data. Each learning algorithm requires a lot of effort, specifically thorough performance tuning, testing and empirical analysis, and some data are linearly separable while others are not. A goal, or a set of goals, is very important, as it often is in performance tuning; we can treat this as a performance tuning problem in which we need a set of goals for scalability, complexity or accuracy. We therefore leave the choice of algorithm to users, based on their goals and the data in their environment. As an example, we have implemented two algorithms for classifying the GRB file type against non-GRB types: linear logistic regression (gradient descent) and a neural network (backpropagation).
Again, the neural network with back-propagation is somewhat more complex to train, while logistic regression is far cheaper in terms of complexity, and with the collected GRB data it turns out that logistic regression also gives good results with high accuracy. It is worth noting that it is always better to circumscribe the MIME types to be detected. The example model attempts to classify GRB files against non-GRB files, and one of the observed challenges is identifying the non-GRB file types, whose class can be enormously large. The best practice is again to circumscribe the set of types to be classified; domain-specific knowledge comes into play in defining that set for the user's environment.
- This feature could also enhance identification safety: it only trusts a file whose type has a byte histogram pattern similar to one it has seen in the training set. This has pros and cons. The advantage, as mentioned, is that it enhances the security of file type identification. The disadvantage is slow detection, because the entire file must be read to compute the byte histogram; the detector may also be myopic to training data that is biased or unrepresentative.
+ This feature could also enhance identification safety: it only trusts files whose byte histogram patterns are similar to those it has seen in its training set. This has pros and cons. The advantage, as mentioned, is that it enhances the security of MIME type identification. The disadvantage is slow detection, because the entire file must be read to compute the byte histogram; the detector may also be myopic to training data that is biased or unrepresentative.
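The byte-histogram classification discussed above can be illustrated with a minimal Java sketch. This is not the Tika implementation: the class name `ByteHistogramSketch`, both method names, the scaling of counts to relative frequencies, and the all-zero placeholder weights are assumptions made purely for illustration. Real weights would come from a separate training program.

```java
// Illustrative sketch only; none of these names are Tika API.
final class ByteHistogramSketch {

    /** Returns the relative frequency of each byte value 0..255 in data. */
    static double[] histogram(byte[] data) {
        long[] counts = new long[256];
        for (byte b : data) {
            counts[b & 0xFF]++; // mask so the signed Java byte indexes 0..255
        }
        double[] freq = new double[256];
        if (data.length == 0) {
            return freq; // empty input: all-zero histogram
        }
        for (int i = 0; i < 256; i++) {
            // Assumption: scale counts to frequencies so file size does not dominate.
            freq[i] = (double) counts[i] / data.length;
        }
        return freq;
    }

    /** Logistic-regression score sigmoid(bias + w . x), a value between 0 and 1. */
    static double predict(double[] features, double[] weights, double bias) {
        if (features.length != weights.length) {
            throw new IllegalArgumentException("feature/weight length mismatch");
        }
        double z = bias;
        for (int i = 0; i < features.length; i++) {
            z += weights[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        byte[] sample = {10, 10, 20, (byte) 200}; // arbitrary stand-in for file content
        double[] weights = new double[256];       // placeholder weights, not a trained model
        double score = predict(histogram(sample), weights, 0.0);
        System.out.println("score = " + score);
    }
}
```

A caller would compare the score against a threshold of its own choosing to decide between the target type and everything else. Because the histogram needs every byte of the input, the cost grows with file size, which is the slow-detection trade-off noted above.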
  '''Methods'''
@@ -29, +29 @@
  Please also refer to the code repo for details of the implementation for training a model. The neural network and logistic regression learners are both implemented in R; the following briefly describes the pre-processing and learning implementation in R and how to load the model parameters trained by the R programs into Tika for MIME detection.
- The training program can be written in any programming language; the R implementation is posted as an example. Tika only needs to load the trained model parameters from the training program and be able to use them. The feature in Tika generally works in 4 steps, as follows. It is also flexible: you can override the detect method of the TrainedModelDetector to define your own selected features if you have different features defined for training.
+ The training program can be written in any programming language; the R implementation is posted as an example. Tika only needs to load the trained model parameters from the training program and be able to use them to make predictions. The feature in Tika generally works in 4 steps, as follows. It is also flexible: you can override the detect method of the TrainedModelDetector to define your own selected features if you have different features defined when training.
   1. read the input in bytes
   1. convert it to the byte histogram
