Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "ContentMimeDetection" page has been changed by Lukeliush:
https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=9&rev2=10

  In some files, a few byte values occur far more often than the others; in the extreme case, one or two bins account for most of the count. This leaves a large gap between the most frequent and the least frequent bins. One remedy is to apply a companding function such as A-law or μ-law. Taking the square root of the bin values has a similar effect at a lower computational cost, so the square root is used in place of A-law or μ-law to enhance the histogram detail.

- {{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||transform="rotate(0.00rad)",height="260px;",width="561px;",-webkit-transform="rotate(0.00rad)",border="none"}}
+ {{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||height="260px;",width="561px;"}}
  A-law companding function curve
+
+ {{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||transform="rotate(0.00rad)",height="260px;",width="561px;",-webkit-transform="rotate(0.00rad)",border="none"}}
  Square-root function curve

  The following shows the difference
+ {{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||transform="rotate(0.00rad)",height="260px;",width="561px;",-webkit-transform="rotate(0.00rad)",border="none"}}
+ Byte frequencies '''without''' any companding.
+
+ {{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||transform="rotate(0.00rad)",height="260px;",width="561px;",-webkit-transform="rotate(0.00rad)",border="none"}}
  Byte frequencies with A-law

@@ -79, +85 @@

  alaw <- function(x, A=87.7){
  . th = 1/A
+ cond1 <- (x>=0 & x < th)
+ cond2 <- (x>=th & x <= 1)
+ x[cond1] <- A * abs(x[cond1]) / (1+log(A))
+ x[cond2] <- sign(x[cond2])*(1+log(A*abs(x[cond2])))/(1+log(A))
- cond1 <- (x>=0 && x < th)
- cond2 <- (x>=th && x <= 1)
- x[cond1] <- A * abs(x[cond1]) / (1+log(A))
- x[cond2] <- sign(x[cond2])*(1+log(A*abs(x[cond2])))/(1+log(A))
  x
  }

@@ -92, +95 @@

  For details of A-Law, please refer to http://en.wikipedia.org/wiki/A-law_algorithm.

+ {{https://lh6.googleusercontent.com/lMLfrOj-Esc8kHSAY_Ly2em1qJLPFFczaKn0jvWDT13-OJDn8JPmOu8pHO72jCpK6SDmskWhjONDPRFBPw6vM-wCOfRNDiLtBISJNawABNQGuABnR8hF2vA4cWMmFXDoRSwtMGI||transform="rotate(0.00rad)",height="251px;",width="541px;",-webkit-transform="rotate(0.00rad)",border="none"}}
+ Byte frequencies with square root (power of 1/2)
+
+ {{https://lh6.googleusercontent.com/lMLfrOj-Esc8kHSAY_Ly2em1qJLPFFczaKn0jvWDT13-OJDn8JPmOu8pHO72jCpK6SDmskWhjONDPRFBPw6vM-wCOfRNDiLtBISJNawABNQGuABnR8hF2vA4cWMmFXDoRSwtMGI||transform="rotate(0.00rad)",height="251px;",width="541px;",-webkit-transform="rotate(0.00rad)",border="none"}}
  Byte frequencies with power of 1/3

@@ -142, +149 @@

  Your model might use a byte histogram of a different input size; for example, you might choose a different bin size based on heuristics specific to your own data. In that case, you can override readByteFrequencies(final InputStream input) in TrainedModelDetector to supply your own version of the byte histogram. You also need to make sure the model parameters are set to match the same input size.
+ .
+ TrainedModelDetector implements the Detector interface, but it is abstract, so we need to subclass it with our own version. ExampleNNModelDetector is one such subclass; its purpose is to implement the loadDefaultModels method, which reads the models and registers them in the model map <MediaType, TrainedModel> held by TrainedModelDetector. Once the model map is populated, the detect method in TrainedModelDetector can use the loaded models to predict the mime types. The job of TrainedModelDetector is to convert the given input stream to a byte frequency histogram and pass it as input to the models that have been registered in the map. There is also an abstract TrainedModel class and its subclass NNTrainedModel. TrainedModel represents a trained model: a model object must have a “predict” method that takes a byte histogram vector and returns a prediction probability.
+ The following lists all of the classes for this feature (tika\tika-core\src\main\java)
+ org.apache.tika.detect.TrainedModelDetector (abstract)
+ org.apache.tika.detect.ExampleNNModelDetector
+ org.apache.tika.detect.TrainedModel (abstract)
+ org.apache.tika.detect.NNTrainedModel
+ Example model file (tika\tika-core\src\main\resources)
+ org.apache.tika.detect.tika-example.nnmodel
+ Unit test (tika\tika-core\src\test\java)
- . TrainedModelDetector implements the Detector interface, but it is abstract meaning we need to subclass it with our own version of TrainedModelDetector.
- ExampleNNModelDetector is its subclass, the purpose of subclassing the TrainedModelDetector is to supply the implementation of the method of loadDefaultModels that reads and registers the models into the model map <MediaType, TrainedModel> in the TrainedModelDetector. Once the model map is populated with a set of mappings with keys and values, the detect method in the TrainedModelDetector will be able to use the loaded models to predict the mime types.
- The job of the TrainedModelDetector is to convert the given input stream to byte frequency histogram and pass that as the input to the models that have been loaded or registered in the map.
- There is also a TrainedModel(abstract) and its subclass NNTrainedModel.
- The TrainedModel is an abstract class that represents an abstraction of a trained model; a model object must have a method of “predict” with input of byte histogram vector, it returns a probability of prediction. The following lists all of the classes for this feature (tika\tika-core\src\main\java)
- org.apache.tika.detect.TrainedModelDetector (abstract) org.apache.tika.detect.ExampleNNModelDetector
- org.apache.tika.detect.TrainedModel (abstract) org.apache.tika.detect.NNTrainedModel Example model file (tika\tika-core\src\main\resources) org.apache.tika.detect.tika-example.nnmodel Unit test (tika\tika-core\src\test\java)
  . org.apache.tika.detect.MimeDetectionWithNNTest
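
The division of labour described in the diff (an abstract detector that owns the model map and builds the histogram, a subclass that registers models, and a model with a "predict" method) can be sketched roughly as follows. This is only an illustration: SketchDetector, SketchModel, ExampleSketchDetector, the String-keyed map, the 0.5 threshold and the toy "printable ASCII" model are all hypothetical stand-ins and are not the Tika classes or their signatures.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative stand-ins only; these are NOT the Tika classes. */
final class DetectorSketch {

    /** Hypothetical model: byte histogram in, probability out. */
    interface SketchModel {
        float predict(float[] histogram);
    }

    /** Hypothetical abstract detector that owns the model map. */
    abstract static class SketchDetector {
        private final Map<String, SketchModel> models = new LinkedHashMap<>();

        /** Subclasses register their models here. */
        abstract void loadDefaultModels();

        final void register(String mimeType, SketchModel model) {
            models.put(mimeType, model);
        }

        /** 256-bin histogram, scaled by the largest bin, then square-rooted. */
        float[] readByteFrequencies(InputStream input) throws IOException {
            float[] histogram = new float[256];
            byte[] buffer = new byte[4096];
            float max = 0f;
            int n;
            while ((n = input.read(buffer)) != -1) {
                for (int i = 0; i < n; i++) {
                    int bin = buffer[i] & 0xFF; // unsigned byte value
                    histogram[bin]++;
                    max = Math.max(max, histogram[bin]);
                }
            }
            for (int i = 0; i < histogram.length && max > 0f; i++) {
                histogram[i] = (float) Math.sqrt(histogram[i] / max);
            }
            return histogram;
        }

        /** Returns the best-scoring registered type, or a generic fallback. */
        String detect(InputStream input) throws IOException {
            float[] histogram = readByteFrequencies(input);
            String best = "application/octet-stream";
            float bestScore = 0.5f; // assumed acceptance threshold
            for (Map.Entry<String, SketchModel> e : models.entrySet()) {
                float p = e.getValue().predict(histogram);
                if (p > bestScore) {
                    bestScore = p;
                    best = e.getKey();
                }
            }
            return best;
        }

        /** Convenience overload for in-memory data. */
        String detect(byte[] data) {
            try {
                return detect(new ByteArrayInputStream(data));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Hypothetical concrete detector with one toy model. */
    static final class ExampleSketchDetector extends SketchDetector {
        ExampleSketchDetector() {
            loadDefaultModels();
        }

        @Override
        void loadDefaultModels() {
            // Toy model: share of histogram mass in printable ASCII (32..126).
            register("text/plain", h -> {
                float printable = 0f;
                float total = 0f;
                for (int i = 0; i < h.length; i++) {
                    total += h[i];
                    if (i >= 32 && i <= 126) {
                        printable += h[i];
                    }
                }
                return total == 0f ? 0f : printable / total;
            });
        }
    }

    public static void main(String[] args) {
        SketchDetector detector = new ExampleSketchDetector();
        System.out.println(detector.detect("hello world".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(detector.detect(new byte[] {0, 1, 2, (byte) 200}));
    }
}
```

A subclass that wants a different histogram size would override readByteFrequencies in this sketch and register models that expect the same vector length.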

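
To make the companding comparison in the diff concrete, here is a minimal Java sketch that scales raw bin counts by the largest bin and then applies either the A-law curve or a power function (1/2 or 1/3). The class and method names are made up for illustration, and A = 87.7 is simply the default used in the R snippet on the page; nothing here is taken from the Tika source.

```java
import java.util.Arrays;

/** Illustrative companding helpers; names are hypothetical. */
final class CompandingSketch {

    /** Scale raw bin counts into [0, 1] by dividing by the largest bin. */
    static double[] normalize(long[] counts) {
        long max = 0;
        for (long c : counts) {
            max = Math.max(max, c);
        }
        double[] out = new double[counts.length];
        for (int i = 0; i < counts.length && max > 0; i++) {
            out[i] = (double) counts[i] / max;
        }
        return out;
    }

    /** A-law compression for x in [-1, 1]. */
    static double aLaw(double x, double a) {
        double ax = Math.abs(x);
        double denom = 1.0 + Math.log(a);
        double y = ax < 1.0 / a
                ? a * ax / denom                      // linear segment near zero
                : (1.0 + Math.log(a * ax)) / denom;   // logarithmic segment
        return Math.signum(x) * y;
    }

    /** Power-law companding, e.g. exponent 0.5 (square root) or 1/3. */
    static double power(double x, double exponent) {
        return Math.pow(x, exponent);
    }

    public static void main(String[] args) {
        // One dominant bin and a few rare ones, as in the skewed case described above.
        double[] bins = normalize(new long[] {1000, 40, 10, 0});
        double[] viaALaw = Arrays.stream(bins).map(v -> aLaw(v, 87.7)).toArray();
        double[] viaSqrt = Arrays.stream(bins).map(v -> power(v, 0.5)).toArray();
        double[] viaCbrt = Arrays.stream(bins).map(v -> power(v, 1.0 / 3.0)).toArray();
        System.out.println("normalized: " + Arrays.toString(bins));
        System.out.println("A-law:      " + Arrays.toString(viaALaw));
        System.out.println("sqrt:       " + Arrays.toString(viaSqrt));
        System.out.println("cube root:  " + Arrays.toString(viaCbrt));
    }
}
```

All three transforms lift the small bins toward the dominant one; the square root needs one call per bin, which is the cost argument made on the page.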