Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "ContentMimeDetection" page has been changed by Lukeliush:
https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=10&rev2=11

  A-law companding function curve
- {{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||transform="rotate(0.00rad)",height="260px;",width="561px;",-webkit-transform="rotate(0.00rad)",border="none"}}
+ {{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||height="260px;",width="561px;"}}
  Square-root function curve
  The following shows the difference
- {{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||transform="rotate(0.00rad)",height="260px;",width="561px;",-webkit-transform="rotate(0.00rad)",border="none"}}
+ {{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||height="260px;",width="561px;"}}
  Byte frequencies '''without''' any companding.
- {{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||transform="rotate(0.00rad)",height="260px;",width="561px;",-webkit-transform="rotate(0.00rad)",border="none"}}
+ {{https://lh6.googleusercontent.com/Soeeu7bv02MOLRulV9mMKy3WTb2RXU1PafO47m1g2_i8ATiVpBkTcgCozMG9VIgDENa7MYU-DbXctIK4iIWRZAnsJEg_Ye49tTN0FqnRrxmUsOTo3Ap9vaAeI4m9XiEceIeaIC4||height="260px;",width="561px;"}}
  Byte frequencies with A-law
@@ -95, +95 @@
  For details of A-Law, please refer to http://en.wikipedia.org/wiki/A-law_algorithm.
- {{https://lh6.googleusercontent.com/lMLfrOj-Esc8kHSAY_Ly2em1qJLPFFczaKn0jvWDT13-OJDn8JPmOu8pHO72jCpK6SDmskWhjONDPRFBPw6vM-wCOfRNDiLtBISJNawABNQGuABnR8hF2vA4cWMmFXDoRSwtMGI||transform="rotate(0.00rad)",height="251px;",width="541px;",-webkit-transform="rotate(0.00rad)",border="none"}}
+ {{https://lh6.googleusercontent.com/lMLfrOj-Esc8kHSAY_Ly2em1qJLPFFczaKn0jvWDT13-OJDn8JPmOu8pHO72jCpK6SDmskWhjONDPRFBPw6vM-wCOfRNDiLtBISJNawABNQGuABnR8hF2vA4cWMmFXDoRSwtMGI||height="251px;",width="541px;"}}
  Byte frequencies with square root (power of 1/2)
- {{https://lh6.googleusercontent.com/lMLfrOj-Esc8kHSAY_Ly2em1qJLPFFczaKn0jvWDT13-OJDn8JPmOu8pHO72jCpK6SDmskWhjONDPRFBPw6vM-wCOfRNDiLtBISJNawABNQGuABnR8hF2vA4cWMmFXDoRSwtMGI||transform="rotate(0.00rad)",height="251px;",width="541px;",-webkit-transform="rotate(0.00rad)",border="none"}}
+ {{https://lh6.googleusercontent.com/lMLfrOj-Esc8kHSAY_Ly2em1qJLPFFczaKn0jvWDT13-OJDn8JPmOu8pHO72jCpK6SDmskWhjONDPRFBPw6vM-wCOfRNDiLtBISJNawABNQGuABnR8hF2vA4cWMmFXDoRSwtMGI||height="251px;",width="541px;"}}
  Byte frequencies with power of 1/3
@@ -129, +129 @@
  The following line in main.R, the last one, outputs the model; its name and structure can be customized as preferred.
+ {{https://lh4.googleusercontent.com/9NhU8MSntrg9JRxV55sG89v5MkBM_ZzI9wo5SoYN3chzirIB_R97VImM4LUc6Cps1wJSfDlZCAE-OdCCj6OGBmeGHyKn8falen0APY1UY0B4xgCZ1EUEX3JVYcqxznNEQ2ygXpw||transform="rotate(0.00rad)",height="27px;",width="602px;",-webkit-transform="rotate(0.00rad)",border="none"}}
+
  The exportNNParams method implementation resides in the utility class, i.e. ‘myfunctions.R’; it can also be customized or replaced to create your own model file with a different syntax or structure. The following shows what the output model looks like in that model text file.
@@ -137, +139 @@
  The next line, without a # at the front, is a series of tab-separated floating-point numbers; these are the model parameters. Later we need to import the file into Tika and have the ExampleNNModelDetector recreate the trained model from them, so that it can classify an unseen file and determine whether the given input file is a GRB or non-GRB type.
+ {{https://lh6.googleusercontent.com/ZkRhFs9ON4ELXTtClE9s0frCEsC_i7ktsWkmGlm10ktOCpJMorMB_UZA2K4pp6LIc8AK0c2LKhgss7ZQkhTop4eh9BBDYn-kQlC17PB21VUdMYjtvpHbUjY51XyS2iOgxSYjUIo||transform="rotate(0.00rad)",height="43px;",width="602px;",-webkit-transform="rotate(0.00rad)",border="none"}}
+
  The following shows, in a bit more detail, the output printed by the R program after training the chosen model above.
- . [1] "Loading Dataset....." [1] "Begining Training Neural Networks" [1] "the length of weights 517" [1] "The time taken for training: 330.257000" [1] "The training error cost: 0.001380" [1] "The validation error cost: 0.025099" [1] "The testing error cost: 0.020883" [1] "Training Accuracy: 100.000000" [1] "Validation Accuracy: 99.650000" [1] "Testing Accuracy: 99.762349"
+ . [1] "Loading Dataset....."
+ [1] "Begining Training Neural Networks"
+ [1] "the length of weights 517"
+ [1] "The time taken for training: 330.257000"
+ [1] "The training error cost: 0.001380"
+ [1] "The validation error cost: 0.025099"
+ [1] "The testing error cost: 0.020883"
+ [1] "Training Accuracy: 100.000000"
+ [1] "Validation Accuracy: 99.650000"
+ [1] "Testing Accuracy: 99.762349"
  ''''Import the model into Tika ''''
@@ -149, +162 @@
  It is also possible that your model uses a different input size for the byte histogram; some might choose a different bin size based on heuristics specific to their own data. In that case, you can override readByteFrequencies(final InputStream input) in TrainedModelDetector to provide your own version of the byte histogram, and you also need to ensure the model parameters reflect the same input size.
+ TrainedModelDetector implements the Detector interface, but it is abstract meaning we need to subclass it with our own version of TrainedModelDetector.
- . TrainedModelDetector implements the Detector interface, but it is abstract meaning we need to subclass it with our own version of TrainedModelDetector. ExampleNNModelDetector is its subclass, the purpose of subclassing the TrainedModelDetector is to supply the implementation of the method of loadDefaultModels that reads and registers the models into the model map <MediaType, TrainedModel> in the TrainedModelDetector. Once the model map is populated with a set of mappings with keys and values, the detect method in the TrainedModelDetector will be able to use the loaded models to predict the mime types. The job of the TrainedModelDetector is to convert the given input stream to byte frequency histogram and pass that as the input to the models that have been loaded or registered in the map. There is also a TrainedModel(abstract) and its subclass NNTrainedModel. The TrainedModel is an abstract class that represents an abstraction of a trained model; a model object must have a method of “predict” with input of byte histogram vector, it returns a probability of prediction. The following lists all of the classes for this feature (tika\tika-core\src\main\java) org.apache.tika.detect.TrainedModelDetector (abstract) org.apache.tika.detect.ExampleNNModelDetector org.apache.tika.detect.TrainedModel (abstract) org.apache.tika.detect.NNTrainedModel Example model file (tika\tika-core\src\main\resources) org.apache.tika.detect.tika-example.nnmodel Unit test (tika\tika-core\src\test\java)
- . org.apache.tika.detect. MimeDetectionWithNNTest
+ ExampleNNModelDetector is its subclass, the purpose of subclassing the TrainedModelDetector is to supply the implementation of the method of loadDefaultModels that reads and registers the models into the model map <MediaType, TrainedModel> in the TrainedModelDetector. Once the model map is populated with a set of mappings with keys and values, the detect method in the TrainedModelDetector will be able to use the loaded models to predict the mime types.
+
+ The job of the TrainedModelDetector is to convert the given input stream to byte frequency histogram and pass that as the input to the models that have been loaded or registered in the map.
+
+ There is also a TrainedModel(abstract) and its subclass NNTrainedModel.
+
+ The TrainedModel is an abstract class that represents an abstraction of a trained model; a model object must have a method of “predict” with input of byte histogram vector, it returns a probability of prediction.
+
+ The following lists all of the classes for this feature (tika\tika-core\src\main\java)
+
+ org.apache.tika.detect.TrainedModelDetector (abstract)
+
+ org.apache.tika.detect.ExampleNNModelDetector
+
+ org.apache.tika.detect.TrainedModel (abstract)
+
+ org.apache.tika.detect.NNTrainedModel
+
+ Example model file (tika\tika-core\src\main\resources)
+
+ org.apache.tika.detect.tika-example.nnmodel
+
+ Unit test (tika\tika-core\src\test\java)
+
+ org.apache.tika.detect. MimeDetectionWithNNTest
+
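
The histogram step described in the page text (turning an input stream into a byte frequency histogram and companding it with a square root) might look roughly like the following Java sketch. It does not use any Tika classes. The class name ByteHistogramSketch, the method compandedHistogram, the 256-bin layout, and the scaling by the largest bin are all assumptions made for illustration; they are not taken from Tika's readByteFrequencies implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Tika: builds a companded byte histogram.
class ByteHistogramSketch {

    /**
     * Counts how often each byte value (0..255) occurs in the stream,
     * scales the counts by the largest bin (an assumed normalization),
     * and applies square-root companding (power of 1/2).
     */
    public static double[] compandedHistogram(InputStream input) throws IOException {
        long[] counts = new long[256];
        byte[] buffer = new byte[8192];
        int n;
        while ((n = input.read(buffer)) != -1) {
            for (int i = 0; i < n; i++) {
                counts[buffer[i] & 0xFF]++; // mask to treat the byte as unsigned
            }
        }

        long max = 0;
        for (long c : counts) {
            max = Math.max(max, c);
        }

        double[] features = new double[256];
        if (max == 0) {
            return features; // empty input: all bins stay at 0.0
        }
        for (int i = 0; i < 256; i++) {
            // Square root lifts small frequencies so rare bytes stay visible.
            features[i] = Math.sqrt((double) counts[i] / max);
        }
        return features;
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = {'a', 'a', 'a', 'a', 'b'};
        double[] f = compandedHistogram(new ByteArrayInputStream(sample));
        System.out.println(f['a'] + " " + f['b']); // prints "1.0 0.5"
    }
}
```

In this sample, 'b' occurs a quarter as often as 'a' but ends up at 0.5 after companding, which is the effect the "power of 1/2" plot is meant to show. A model trained on such a vector would need its input size to match the 256 bins assumed here.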
