Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "ContentMimeDetection" page has been changed by Lukeliush:
https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=3&rev2=4

The feature of TIKA-1582 is an extension of TIKA MIME detection based on file contents, i.e. the file byte histograms, and this feature follows a standard data mining process that extracts the knowledge out of the data (bytes). The motivation of this feature is to offer users with an option where content based detection approach can be used, the content can be defined in several ways, they can be the entire file bytes, byte n-grams, byte histograms, etc. In this feature, the content byte histogram is used.
- Some files are very huge in size, building byte histograms for those files requires significant amount time, but it is worth noting that with domain specific knowledge or the heuristics (e.g. there might be some crucial and critical regions in the file that could help with the detection.), we can further reduce the amount of effort required for knowledge discovery or mining particular patterns that we can use in the type detection.
+ Some files are very huge in size, building byte histograms for those files requires significant amount of time, but it is worth noting that with domain specific knowledge or the heuristics (e.g. there might be some crucial and critical regions in the file that could help with the detection.), we can further reduce the amount of effort required for knowledge discovery or mining particular patterns that we can use in the type detection.
- Please also note, this content based mime detection does require users to have some knowledge with data mining and machine learning, and the choice of learning algorithms used in the pattern mining does not seem to matter, the knowledge to be mined is actually the classification, and there are many classification learning algorithms invented or revented, the question of which one is the best depends on a goal and data, each of the learning algorithms requires lots of effort for performance testing, and some data might be linear sepeartable, some are not; and a or a set of goals is very important as it often is in the context of performance tuning; we can also think about it as a performance tuning problem where we need to set a set of goals in terms of the scalability, complexity or accuracy, so we want to leave the choice of algorithms to users based on their goals and data in their enviroment. As an example, we have actually implemented two algorithms for mining patterns with the GRB file types, one is linear logistic regression and the other is neural network. Again, the neural network with back-propagation is a bit more complex with training, and logistic regression is far cheaper in terms of complexity, and it turns out that logistic regression also gives a good result with high accuracy, and it is worthy noting that it is always better to circumscribe the mime types to be detected; in the example model we have built, we attempt to classify grb files from non-grb files, and one of the challenges is to identify the non-grb file types whose class can be enormously large, the best practice is to circumscribe a set of types to be classified, again domain specific knowledge come into the play for well-defining a set of types in the user specific enviroment. 
+ Please also note, this content based mime detection does require users to have some knowledge with data mining and machine learning, and the choice of learning algorithms used in the pattern mining does not seem to matter, the knowledge to be mined is actually the classification, and there are many classification learning algorithms invented or reinvented, the question of which one is the best depends on a goal and data, each of the learning algorithms requires lots of effort for performance testing, and some data might be linear separable, some are not; and a or a set of goals is very important as it often is in the context of performance tuning; we can also think about it as a performance tuning problem where we need to set a set of goals in terms of the scalability, complexity or accuracy, so we want to leave the choice of algorithms to users based on their goals and data in their environment. As an example, we have actually implemented two algorithms for mining patterns with the GRB file types, one is linear logistic regression and the other is neural network. Again, the neural network with back-propagation is a bit more complex with training, and logistic regression is far cheaper in terms of complexity, and it turns out that logistic regression also gives a good result with high accuracy, and it is worthy noting that it is always better to circumscribe the mime types to be detected; in the example model we have built, we attempt to classify grb files from non-grb files, and one of the challenges is to identify the non-grb file types whose class can be enormously large, the best practice is to circumscribe a set of types to be classified, again domain specific knowledge come into the play for well-defining a set of types in the user specific environment. 
- This approach could also enhance identification safty, so it only trusts the file with the type which has the similar byte histogram pattern it has seen in the training, this has pros and cons, the pros is that it enhance the security aspect of the file type identification, but the cons is slow detection which requires the reading the entire bytes of a file for computing the byte histogram and it might be also myopic to the training data which might be less representative.
+ This approach could also enhance identification safety, so it only trusts the file with the type which has the similar byte histogram pattern it has seen in the training, this has pros and cons, the pros is that it enhance the security aspect of the file type identification, but the cons is slow detection which requires the reading the entire bytes of a file for computing the byte histogram and it might be also myopic to the training data which might be less representative.
Methods:
As mentioned, the content-based mime detection follows a standard data mining process:
- Raw data - > feature selection and data cleaning -> preprocessing and transformation -> learning patterns(machine learning) -> knowledge evaluation -> the use of knowledge(prediction/classification) In Tika.
+ Raw data - > feature selection and data cleaning -> pre-processing and transformation -> learning patterns(machine learning) -> knowledge evaluation -> the use of knowledge(prediction/classification) In TIKA.
- (It is worth noting that the feature selection requires learning the application domain which in our case is specific to the user domain and enviroment)
+ (It is worth noting that the feature selection requires learning the application domain which in our case is specific to the user domain and environment)
- Also please note the model has to be ready befored it can be used in Tika; by "ready", we mean the model has to pass the final knowledge evaluation test. As shall be seen shortly, as an example Tika is only implementing the prediction phase, so the model parameters need to be loaded and read into tika for prediction or classification; The process of training can be lengthy and tedious, sometimes training might require parallel computation on e.g. map-reduce when training data is too large to fit memory, again this depends on the user's goal.
+ Also please note the model has to be ready before it can be used in Tika; by "ready", we mean the model has to pass the final knowledge evaluation test. As shall be seen shortly, as an example Tika is only implementing the prediction phase, so the model parameters need to be loaded and read into Tika for prediction or classification; The process of training can be lengthy and tedious, sometimes training might require parallel computation on e.g. map-reduce when training data is too large to fit memory, again this depends on the user's goal.
''The following will briefly walk you through how the feature and example is implemented in this data problem.''
- Please also refer to the code repo for details of implemenation for training or preparing for a model, the neural network and logistic regression learning are implemented in R and the following describes the preprocessing and learning implemenation in R.
+ Please also refer to the code repo for details of implementation for training or preparing for a model, the neural network and logistic regression learning are implemented in R and the following describes the pre processing and learning implementation in R.
@@ -80, +80 @@
All of the sets are treated as matrices which need to be saved as files; those files are loaded into the R program thru the ‘loadAndProcess.R’;
- '''Preprocessing'''
+ '''Pre processing'''
1) Read byte content of the file build byte histogram. Build frequency by dividing each bin value with the max count of occurrence to have each bin value to fall in the range between 0 and 1.
- as isome files some bytes have higher frequencies whereas other bytes are less frequent, or in a critical situation, some files have only one or two bins that occupy the majority of the count, this makes a large gap between the most frequent and less frequent, the solution is to apply a companding function - A law or u law; square-rooting the bin values also provide the same effect, so by considering the computational cost, the square-root is chosen to enhance the histogram detail in place of A law or u law.
+ some files some bytes have higher frequencies whereas other bytes are less frequent, or in a critical situation, some files have only one or two bins that occupy the majority of the count, this makes a large gap between the most frequent and less frequent, the solution is to apply a companding function - A law or u law; square-rooting the bin values also provide the same effect, so by considering the computational cost, the square-root is chosen to enhance the histogram detail in place of A law or u law.
@@ -167, +167 @@
Once we have preprocessed our inputs, i.e. byte histograms, we are then ready to train a model with a machine learning algorithm.
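The pre-processing described above (byte histogram, division by the maximum bin count, then square-root companding) can be sketched as follows. This is a minimal illustration in Python, not the R implementation from the code repo; the function name `byte_histogram_features` and the sample bytes are hypothetical.

```python
import math
from collections import Counter


def byte_histogram_features(data: bytes) -> list[float]:
    """Return 256 features in [0, 1] for the given file content.

    Hypothetical helper: counts each byte value, divides every bin by the
    largest bin count, then takes the square root so that rare bytes are
    not drowned out by one or two dominant bins.
    """
    counts = Counter(data)  # iterating over bytes yields ints 0..255
    max_count = max(counts.values(), default=0)
    if max_count == 0:
        # Empty input: no bytes seen, so every bin stays at zero.
        return [0.0] * 256
    return [math.sqrt(counts.get(b, 0) / max_count) for b in range(256)]


# Arbitrary sample content: four distinct bytes followed by 1000 zero bytes.
features = byte_histogram_features(b"abcd" + bytes(1000))
```

In this sample the zero byte dominates, so a rare byte such as `a` has a normalised frequency of 0.001 before the square root and roughly 0.03 after it, which is the gap-narrowing effect the companding step is meant to provide.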
- The Neural network can be seen as a function, in this case its input is a vector of the preprocessed histogram and its output simply is a yes/no (1 or 0); With neural network, we can actually have a probability that might tell how likely it believes a given input histogram is a GRB or non-GRB, again it is worth stressing that non-GRB is a huge class to be classified, we might need to have a s many negative training examples as possible, but again if we know what types we are dealing with, the problem might be further simplified with smaller set of classes; Also it is worthy noting, training with too many negative examples can also produce an unpromising result, in an extreme cases where you might have 10 positve examples and 10 million negative examples, with a huge different like this it is likely you might come up with a biased model towards the one that dump everything it has seen into negative class, so the choice of training set might be important, there are some cross-validation method that might help assuage this bias .e.g we can randomly pick some portion of negative training data, but again this leads to very thorough performance testing with each of the models you have trained with different data or even the training parameters such as a different regularized term, different structure (different layers and different number of weight units). In addtions, the choice of the structure or the tuning parameters depend on how well the model fit the data, when it overfits the training data, we might want to adjust the regularized terms or add more training data; but when it underfits, we might also want to increase the complexity of the network structure, but again the choice of strucutre depends on the patterns hidden in the data. 
+ The Neural network can be seen as a function, in this case its input is a vector of the preprocessed histogram and its output simply is a yes/no (1 or 0); With neural network, we can actually have a probability that might tell how likely it believes a given input histogram is a GRB or non-GRB, again it is worth stressing that non-GRB is a huge class to be classified, we might need to have a s many negative training examples as possible, but again if we know what types we are dealing with, the problem might be further simplified with smaller set of classes; Also it is worthy noting, training with too many negative examples can also produce an unpromising result, in an extreme cases where you might have 10 positive examples and 10 million negative examples, with a huge different like this it is likely you might come up with a biased model towards the one that dump everything it has seen into negative class, so the choice of training set might be important, there are some cross-validation method that might help assuage this bias .e.g we can randomly pick some portion of negative training data, but again this leads to very thorough performance testing with each of the models you have trained with different data or even the training parameters such as a different regularized term, different structure (different layers and different number of weight units). In additions, the choice of the structure or the tuning parameters depend on how well the model fit the data, when it over-fits the training data, we might want to adjust the regularized terms or add more training data; but when it under-fits, we might also want to increase the complexity of the network structure, but again the choice of structure depends on the patterns hidden in the data.
- The linear logistic regression training seems to be less complex compared to neural network training, which can be implemented with svm, gradient descent, etc. It is a globably optimal solution as long as the data is linearly seperable; and it is cheap in terms of computational complexity.
+ The linear logistic regression training seems to be less complex compared to neural network training, which can be implemented with svm, gradient descent, etc. It is a globally optimal solution as long as the data is linearly separable; and it is cheap in terms of computational complexity.
'''Evaluation''':
- Once we finish training, we need to score our model and decide whether the model meets our goal, then the knowledege Evaluation is also very significant in the process,
+ Once we finish training, we need to score our model and decide whether the model meets our goal, then the knowledge Evaluation is also very significant in the process,
This is where the prepared test set is used. Again, the details of performance evaluation such as recall, precision, ROC, etc are skipped, but the idea is to decide whether our model meets our goals.
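The evaluation details are skipped here, but as a rough sketch, precision and recall on a held-out test set could be computed as below. The function name `evaluate`, the label convention (1 = GRB, 0 = non-GRB) and the toy labels are assumptions made for illustration only; they are not results from the actual model.

```python
def evaluate(y_true, y_pred):
    """Return (precision, recall, accuracy) for binary labels.

    Hypothetical helper: labels are 1 for the positive class and 0 for the
    negative class. Ratios with an empty denominator are reported as 0.0.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


# Toy test set: three positives and five negatives (made-up labels).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, accuracy = evaluate(y_true, y_pred)

# A degenerate model that puts everything into the negative class.
all_negative = evaluate(y_true, [0] * len(y_true))
```

The second call illustrates why accuracy alone can mislead when negatives dominate: a model that dumps every input into the negative class still gets most of the toy examples right, while its precision and recall are both zero.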
