Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "ContentMimeDetection" page has been changed by Lukeliush: https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=7&rev2=8 - '''Use of the knowledge''' + '''Use of the knowledge''' - '''Output the model <<BR>>''' + ''''Output the model '''' @@ -187, +187 @@ - When finishing neural network training, in the end the model parameters and configuration (e.g. number of input units, hidden units, etc) are written in a text file called ‘tika-example.nnmodel’ in the same directory with ‘main.R’; + When finishing neural network training, in the end the model parameters and configuration (e.g. number of input units, hidden units, etc) are written in a text file called ‘tika-example.nnmodel’ in the same directory with ‘main.R’; - As we need to copy this file to Tika to allow Tika to detect the type for which the model is trained e.g. GRB type, note you can create many models for many different mime types, but GRB file type detection is discussed and used as one example to demonstrate the use. + As we need to copy this file to Tika to allow Tika to detect the type for which the model is trained e.g. GRB type, note you can create many models for many different mime types, but GRB file type detection is discussed and used as one example to demonstrate the use. - The following line in main.R is the last line used to output the model, the name and structure can be customized according to different relish. + The following line in main.R is the last line used to output the model, the name and structure can be customized according to different relish. - The exportNNParams method implementation resides in the utility class i.e. ‘myfunctions.R’; it can be also customized or replaced to create your own model file with different syntax or structure. + The exportNNParams method implementation resides in the utility class i.e. ‘myfunctions.R’; it can be also customized or replaced to create your own model file with different syntax or structure. - The following shows what the outputted model look like in that model text file. + The following shows what the outputted model look like in that model text file. - The first line begins with # which indicates that this line is a model description that tells the type to be classified, the number of inputs, number of hidden units, output units and test set error cost; they are delimited by a tab. + The first line begins with # which indicates that this line is a model description that tells the type to be classified, the number of inputs, number of hidden units, output units and test set error cost; they are delimited by a tab. - The next line without # at the front shows a series of floating numbers separated by a tab, and they are model parameters, later we need to import the file into Tika and have the ExampleNNModelDetector to recreate the trained model with them in Tika so it can predict and classify the unseen file and determine with the imported model whether the given input file is a GRB or non-GRB type. + The next line without # at the front shows a series of floating numbers separated by a tab, and they are model parameters, later we need to import the file into Tika and have the ExampleNNModelDetector to recreate the trained model with them in Tika so it can predict and classify the unseen file and determine with the imported model whether the given input file is a GRB or non-GRB type. - The following shows the printing formation produced by the R program after training in a bit more detail with the outputted/chosen model above. + The following shows the printing formation produced by the R program after training in a bit more detail with the outputted/chosen model above. [1] "Loading Dataset....." @@ -231, +231 @@ - '''Import the model into Tika ''' + ''''Import the model into Tika '''' Once the training is done, there is a model file that is generated as mentioned above. The above model file only have one model, however you can have multiple models written in that file or you can have several model files according to your needs. - Copy the ‘tika-example.nnmodel’ to the default directory of tika\tika-core\test \resources\org\apache\tika\detect\, alternatively in your own version of TrainedModelDetector, you invoke getDefaultModel with a different model file location, the purpose of this method is to read the model files and load those models into memory as an object instance i.e. TrainedModel; If your model file(s) have a different syntax or format, you might need overwrite this method getDefaultModel to provide reading and loading implementation that respect your syntax; + Copy the ‘tika-example.nnmodel’ to the default directory of tika\tika-core\test \resources\org\apache\tika\detect\, alternatively in your own version of TrainedModelDetector, you invoke getDefaultModel with a different model file location, the purpose of this method is to read the model files and load those models into memory as an object instance i.e. TrainedModel; If your model file(s) have a different syntax or format, you might need overwrite this method getDefaultModel to provide reading and loading implementation that respect your syntax; - It is also possible that your model might use different size of input of byte histograms, some might consider a different bin size with some heuristics specific to their own data, in that case, it is possible to overwrite the readByteFrequencies(final InputStream input) :: TrainedModelDetector by providing your own version of byte histograms, and you also need to ensure the model parameters are used and set to reflect the same size of input. + It is also possible that your model might use different size of input of byte histograms, some might consider a different bin size with some heuristics specific to their own data, in that case, it is possible to overwrite the readByteFrequencies(final InputStream input) :: TrainedModelDetector by providing your own version of byte histograms, and you also need to ensure the model parameters are used and set to reflect the same size of input. TrainedModelDetector implements the Detector interface, but it is abstract meaning we need to subclass it with our own version of TrainedModelDetector.
