Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "ContentMimeDetection" page has been changed by Lukeliush:
https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=8&rev2=9

Tika - Content based MIME type Detection

JIRA issue with the TIKA feature
@@ -12, +8 @@
'''Motivation'''

The feature of TIKA-1582 extends Tika MIME detection to work on file contents, namely file byte histograms, and it follows a standard data mining process that extracts knowledge from the data (bytes). The motivation is to offer users a content-based detection option. Content can be defined in several ways: the entire file bytes, byte n-grams, byte histograms, and so on. This feature uses the content byte histogram.

Some files are very large, and building byte histograms for them takes a significant amount of time. With domain-specific knowledge or heuristics (e.g. there may be critical regions in the file that help with detection), we can reduce the effort required for knowledge discovery or for mining the particular patterns used in type detection.
@@ -38, +30 @@
Please also refer to the code repository for details of how a model is trained and prepared. The neural network and logistic regression learning are implemented in R, and the following describes the preprocessing and learning implementation in R.
Project source repository: https://github.com/LukeLiush/filetypeDetection

The goal of the example is to classify GRB files against non-GRB files.

'''Data preparation'''

The positive training examples are collected from the AMD polar web sites (*.gsfc.nasa.gov), e.g. ftp://hydro1.sci.gsfc.nasa.gov/data/. The negative training examples are collected from, e.g., http://digitalcorpora.org/corp/files/govdocs1/zipfiles/.

Once the GRB and non-GRB files are collected, the next step is to prepare the data set so that R can manipulate it easily.
@@ -88, +62 @@
In some files, a few bytes have high frequencies while the others are rare; in the extreme case, one or two bins hold the majority of the count. This leaves a large gap between the most frequent and the less frequent bins. The solution is to apply a companding function such as A-law or μ-law. Taking the square root of the bin values has a similar effect, so, considering the computational cost, the square root is chosen over A-law or μ-law to enhance the histogram detail.
+ {{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||transform="rotate(0.00rad)",height="260px;",width="561px;",-webkit-transform="rotate(0.00rad)",border="none"}}

A-law companding function curve

Square-root function curve

The following shows the difference.

Byte frequencies '''without''' any companding.

Byte frequencies with A-law

The following is the A-law formula implementation in R.

alaw <- function(x, A=87.7){
  # Piecewise A-law compression for x in [0, 1]; the knee sits at 1/A.
  th <- 1/A
  # Element-wise '&' (not '&&') so the masks cover the whole vector.
  cond1 <- (x >= 0 & x < th)
  cond2 <- (x >= th & x <= 1)
  x[cond1] <- A * abs(x[cond1]) / (1+log(A))
  x[cond2] <- sign(x[cond2])*(1+log(A*abs(x[cond2])))/(1+log(A))
  x
}
@@ -141, +92 @@
For details of A-law, please refer to http://en.wikipedia.org/wiki/A-law_algorithm.

Byte frequencies with square root (power of 1/2)

Byte frequencies with power of 1/3
@@ -163, +104 @@
The neural network can be seen as a function: its input is a vector of the preprocessed histogram and its output is simply yes/no (1 or 0). A neural network can also give a probability that indicates how likely it considers a given input histogram to be GRB or non-GRB. It is worth stressing that non-GRB is a huge class, so we may need as many negative training examples as possible; if we know which types we are dealing with, the problem can be simplified to a smaller set of classes. It is also worth noting that training with too many negative examples can produce a poor result. In an extreme case with 10 positive examples and 10 million negative examples, the model is likely to become biased toward assigning everything it sees to the negative class, so the choice of training set matters. Some cross-validation methods may help reduce this bias, e.g. randomly picking a portion of the negative training data. This in turn calls for thorough performance testing of each model trained with different data or different training parameters, such as a different regularization term or a different structure (different layers and a different number of weight units).
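The view of the network as a function from a histogram to a probability can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name `TinyNetworkSketch`, the single hidden layer, the sigmoid activation, and the weights are all made up for the example and are not taken from the R implementation in the repository.

```java
/** Hypothetical one-hidden-layer network; an illustration, not the repository's R code. */
public class TinyNetworkSketch {

    /** Logistic sigmoid, mapping any real number into (0, 1). */
    public static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /**
     * Applies one fully connected layer. Each row of weights holds the bias
     * in column 0, followed by one weight per input value.
     */
    public static double[] layer(double[][] weights, double[] input) {
        double[] out = new double[weights.length];
        for (int unit = 0; unit < weights.length; unit++) {
            double z = weights[unit][0]; // bias term
            for (int i = 0; i < input.length; i++) {
                z += weights[unit][i + 1] * input[i];
            }
            out[unit] = sigmoid(z);
        }
        return out;
    }

    /** Returns the probability that the histogram belongs to the positive class. */
    public static double predict(double[][] hidden, double[][] output, double[] histogram) {
        return layer(output, layer(hidden, histogram))[0];
    }

    public static void main(String[] args) {
        // Made-up weights for a 3-bin histogram, 2 hidden units, 1 output unit.
        double[][] hidden = {{0.0, 1.0, -1.0, 0.5}, {0.1, -0.5, 0.5, 1.0}};
        double[][] output = {{-0.2, 1.5, -1.5}};
        double p = predict(hidden, output, new double[] {0.2, 0.7, 0.1});
        // Thresholding the probability at 0.5 gives the yes/no (1 or 0) answer.
        System.out.println("probability = " + p + ", label = " + (p >= 0.5 ? 1 : 0));
    }
}
```

Returning the probability and thresholding it afterwards keeps the yes/no decision separate from the model, so the cut-off can be tuned without retraining.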
In addition, the choice of structure and tuning parameters depends on how well the model fits the data. When it overfits the training data, we may want to adjust the regularization term or add more training data; when it underfits, we may want to increase the complexity of the network structure. Again, the choice of structure depends on the patterns hidden in the data.

Linear logistic regression training is less complex than neural network training; a linear model of this kind can be fitted with gradient descent, and a linear SVM is a related alternative. Its cost function is convex, so training reaches a globally optimal solution, and the model works well as long as the data is linearly separable. It is also computationally cheap.

'''Evaluation''':
@@ -173, +112 @@
This is where the prepared test set is used. Again, the details of performance evaluation such as recall, precision, ROC, etc. are skipped, but the idea is to decide whether our model meets our goals.

'''Use of the knowledge'''

'''Output the model'''

When neural network training finishes, the model parameters and configuration (e.g. the number of input units, hidden units, etc.) are written to a text file called ‘tika-example.nnmodel’ in the same directory as ‘main.R’. This file must be copied to Tika so that Tika can detect the type for which the model was trained, e.g. the GRB type. You can create many models for many different MIME types; GRB detection is used here as one example. The following line in main.R is the last line, which outputs the model; the name and structure can be customized as preferred.

The exportNNParams function is implemented in the utility file ‘myfunctions.R’; it can also be customized or replaced to create your own model file with a different syntax or structure.
@@ -203, +130 @@
The next line, without # at the front, contains a series of floating-point numbers separated by tabs; these are the model parameters. Later we import the file into Tika and have ExampleNNModelDetector recreate the trained model from them, so that it can classify an unseen file and determine whether the given input file is GRB or non-GRB.

The following shows, in a bit more detail, the output printed by the R program after training the model chosen above.

[1] "Loading Dataset....."
[1] "Begining Training Neural Networks"
[1] "the length of weights 517"
[1] "The time taken for training: 330.257000"
[1] "The training error cost: 0.001380"
[1] "The validation error cost: 0.025099"
[1] "The testing error cost: 0.020883"
[1] "Training Accuracy: 100.000000"
[1] "Validation Accuracy: 99.650000"
[1] "Testing Accuracy: 99.762349"

'''Import the model into Tika'''

Once the training is done, a model file has been generated as mentioned above. That model file contains only one model; however, you can write multiple models to one file or keep several model files, according to your needs.
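Reading such a file back can be sketched in Java as follows. The sketch relies only on what is stated here: parameter lines are tab-separated floating-point numbers, and lines starting with # are not parameter lines. The class `ModelFileSketch` and its methods are hypothetical, and simply skipping the # lines is an assumption made for brevity; the actual detector presumably reads the model configuration from them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical reader for tab-separated model parameters; not Tika's own loader. */
public class ModelFileSketch {

    /** Parses one parameter line: floating-point numbers separated by tabs. */
    public static double[] parseParams(String line) {
        String[] fields = line.trim().split("\t");
        double[] params = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            params[i] = Double.parseDouble(fields[i].trim());
        }
        return params;
    }

    /** Collects one parameter vector per line, skipping '#' lines and blank lines. */
    public static List<double[]> parseModels(List<String> lines) {
        List<double[]> models = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty() || line.startsWith("#")) {
                continue; // assumption: '#' lines are ignored in this sketch
            }
            models.add(parseParams(line));
        }
        return models;
    }

    public static void main(String[] args) {
        // Made-up file content; the real header syntax is not reproduced here.
        List<String> lines = Arrays.asList("# hypothetical header line", "0.5\t-1.25\t3.0");
        double[] params = parseModels(lines).get(0);
        System.out.println(params.length + " parameters, first = " + params[0]);
    }
}
```

Keeping one parameter vector per non-# line means a file with several models needs no extra delimiter, which matches the note above that one file may hold multiple models.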
@@ -241, +142 @@
Your model might also use a different input size for the byte histogram; some might choose a different bin size based on heuristics specific to their own data. In that case, you can override the readByteFrequencies(final InputStream input) method of TrainedModelDetector to supply your own version of the byte histogram, and you must ensure that the model parameters are set to match the same input size.

TrainedModelDetector implements the Detector interface, but it is abstract, meaning we need to subclass it with our own version of TrainedModelDetector.

ExampleNNModelDetector is its subclass. The purpose of subclassing TrainedModelDetector is to supply the implementation of the loadDefaultModels method, which reads the models and registers them in the model map <MediaType, TrainedModel> in TrainedModelDetector. Once the model map is populated, the detect method in TrainedModelDetector can use the loaded models to predict MIME types.

The job of TrainedModelDetector is to convert the given input stream into a byte frequency histogram and pass it as input to the models that have been loaded or registered in the map.

There is also a TrainedModel (abstract) and its subclass NNTrainedModel.

TrainedModel is an abstract class that represents a trained model; a model object must have a “predict” method that takes a byte histogram vector as input and returns a prediction probability.

The following lists all of the classes for this feature (tika\tika-core\src\main\java)

org.apache.tika.detect.TrainedModelDetector (abstract)
org.apache.tika.detect.ExampleNNModelDetector
org.apache.tika.detect.TrainedModel (abstract)
org.apache.tika.detect.NNTrainedModel

Example model file (tika\tika-core\src\main\resources)

org.apache.tika.detect.tika-example.nnmodel

Unit test (tika\tika-core\src\test\java)

org.apache.tika.detect.MimeDetectionWithNNTest
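The overall flow described on this page (byte histogram, square-root companding, then a model that returns a probability) can be sketched in plain Java as follows. This is a self-contained illustration and not the Tika code: `HistogramDetectorSketch`, its `Model` interface, the string type label "example/grb", the 0.5 threshold, and the scaling of the largest bin to 1 are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the detector flow; it does not use any Tika class. */
public class HistogramDetectorSketch {

    /** Hypothetical stand-in for a trained model: histogram in, probability out. */
    public interface Model {
        double predict(double[] histogram);
    }

    /** Counts how often each byte value (0..255) occurs in the stream. */
    public static long[] countBytes(InputStream in) throws IOException {
        long[] counts = new long[256];
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            for (int i = 0; i < n; i++) {
                counts[buffer[i] & 0xFF]++; // mask to read the byte as unsigned
            }
        }
        return counts;
    }

    /** Square-root companding, then scaling so the largest bin is 1 (scaling is an assumption). */
    public static double[] compand(long[] counts) {
        double[] bins = new double[counts.length];
        double max = 0.0;
        for (int i = 0; i < counts.length; i++) {
            bins[i] = Math.sqrt(counts[i]);
            max = Math.max(max, bins[i]);
        }
        if (max > 0.0) {
            for (int i = 0; i < bins.length; i++) {
                bins[i] /= max;
            }
        }
        return bins;
    }

    /** Returns the type whose model gives the highest probability at or above the threshold, else the fallback. */
    public static String detect(Map<String, Model> models, double[] histogram,
                                double threshold, String fallback) {
        String best = fallback;
        double bestProbability = threshold;
        for (Map.Entry<String, Model> entry : models.entrySet()) {
            double p = entry.getValue().predict(histogram);
            if (p >= bestProbability) {
                best = entry.getKey();
                bestProbability = p;
            }
        }
        return best;
    }

    public static void main(String[] args) throws IOException {
        Map<String, Model> models = new LinkedHashMap<>();
        // Toy model: returns the scaled bin value for 'G'; not a trained network.
        models.put("example/grb", h -> h['G']);
        byte[] data = "GGGGxy".getBytes(StandardCharsets.US_ASCII);
        double[] histogram = compand(countBytes(new ByteArrayInputStream(data)));
        System.out.println(detect(models, histogram, 0.5, "application/octet-stream"));
    }
}
```

Keeping the histogram step separate from the models mirrors the split described above, where the detector builds the input vector once and each registered model only has to implement predict.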
