Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "ContentMimeDetection" page has been changed by Lukeliush:
https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=7&rev2=8

  
  
  
-   '''Use of the knowledge'''
+ '''Use of the knowledge'''
  
- '''Output the model <<BR>>'''
+ ''''Output the model ''''
  
  
  
@@ -187, +187 @@

  
  
  
-   When neural network training finishes, the model parameters and
configuration (e.g. number of input units, hidden units, etc.) are written to a
text file called ‘tika-example.nnmodel’ in the same directory as ‘main.R’.
+ When neural network training finishes, the model parameters and
configuration (e.g. number of input units, hidden units, etc.) are written to a
text file called ‘tika-example.nnmodel’ in the same directory as ‘main.R’.
  
-   This file needs to be copied to Tika so that Tika can detect the type for
which the model was trained, e.g. the GRB type. Note that you can create many
models for many different mime types; GRB file type detection is used here as
one example to demonstrate the use.
+ This file needs to be copied to Tika so that Tika can detect the type for
which the model was trained, e.g. the GRB type. Note that you can create many
models for many different mime types; GRB file type detection is used here as
one example to demonstrate the use.
  
-   The following line in main.R is the last line, used to output the model;
the name and structure can be customized to your preference.
+ The following line in main.R is the last line, used to output the model;
the name and structure can be customized to your preference.
  
    
  
-   The exportNNParams function is implemented in the utility file
‘myfunctions.R’; it can also be customized or replaced to create your own model
file with a different syntax or structure.
+ The exportNNParams function is implemented in the utility file
‘myfunctions.R’; it can also be customized or replaced to create your own model
file with a different syntax or structure.
  
-   The following shows what the output model looks like in that model text
file.
+ The following shows what the output model looks like in that model text
file.
  
-   The first line begins with #, which indicates that this line is a model
description giving the type to be classified, the number of inputs, the number
of hidden units, the number of output units and the test set error cost; the
fields are delimited by tabs.
+ The first line begins with #, which indicates that this line is a model
description giving the type to be classified, the number of inputs, the number
of hidden units, the number of output units and the test set error cost; the
fields are delimited by tabs.
  
-   The next line, without # at the front, holds a series of floating-point
numbers separated by tabs; these are the model parameters. Later we need to
import the file into Tika and have the ExampleNNModelDetector recreate the
trained model from them, so that it can classify an unseen input file as GRB
or non-GRB type.
+ The next line, without # at the front, holds a series of floating-point
numbers separated by tabs; these are the model parameters. Later we need to
import the file into Tika and have the ExampleNNModelDetector recreate the
trained model from them, so that it can classify an unseen input file as GRB
or non-GRB type.
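
To make the two-line layout described above concrete, here is a minimal Java sketch of a reader for it. It is not Tika's own loading code: the class name NNModelFileSketch and the holder ParsedModel are hypothetical, and the sketch assumes the description fields appear in the order listed above (type, inputs, hidden units, output units, test set error cost) directly after the # marker.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not part of Tika.
class NNModelFileSketch {

    /** Hypothetical holder for one parsed model. */
    static final class ParsedModel {
        final String type;
        final int numInputs;
        final int numHidden;
        final int numOutputs;
        final double testErrorCost;
        final double[] params;

        ParsedModel(String type, int numInputs, int numHidden,
                    int numOutputs, double testErrorCost, double[] params) {
            this.type = type;
            this.numInputs = numInputs;
            this.numHidden = numHidden;
            this.numOutputs = numOutputs;
            this.testErrorCost = testErrorCost;
            this.params = params;
        }
    }

    /** Parses description/parameter line pairs; a file may hold several models. */
    static List<ParsedModel> parse(List<String> lines) {
        List<ParsedModel> models = new ArrayList<>();
        String[] header = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            if (line.startsWith("#")) {
                // Assumed field order: type, inputs, hidden, outputs, error cost.
                header = line.substring(1).trim().split("\t");
                if (header.length < 5) {
                    throw new IllegalArgumentException("Incomplete description line: " + raw);
                }
            } else {
                if (header == null) {
                    throw new IllegalArgumentException("Parameter line without a description line");
                }
                String[] tokens = line.split("\t");
                double[] params = new double[tokens.length];
                for (int i = 0; i < tokens.length; i++) {
                    params[i] = Double.parseDouble(tokens[i].trim());
                }
                models.add(new ParsedModel(
                        header[0].trim(),
                        Integer.parseInt(header[1].trim()),
                        Integer.parseInt(header[2].trim()),
                        Integer.parseInt(header[3].trim()),
                        Double.parseDouble(header[4].trim()),
                        params));
                header = null; // each description line is paired with one parameter line
            }
        }
        return models;
    }
}
```

Calling NNModelFileSketch.parse on the lines of a model file yields one ParsedModel per description/parameter pair, which is one way the "multiple models in one file" case could be handled.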
  
    
  
-   The following shows, in a bit more detail, the output printed by the R
program after training, for the chosen model above.
+ The following shows, in a bit more detail, the output printed by the R
program after training, for the chosen model above.
  
    [1] "Loading Dataset....."
  
@@ -231, +231 @@

  
  
  
-   '''Import the model into Tika '''
+ ''''Import the model into Tika ''''
  
  
  
  Once the training is done, a model file is generated as mentioned above. The
above model file has only one model; however, you can have multiple models
written in that file, or several model files, according to your needs.
  
-   Copy ‘tika-example.nnmodel’ to the default directory
tika\tika-core\test \resources\org\apache\tika\detect\. Alternatively, in your
own version of TrainedModelDetector, you can invoke getDefaultModel with a
different model file location. The purpose of this method is to read the model
files and load those models into memory as object instances, i.e.
TrainedModel. If your model file(s) have a different syntax or format, you
might need to override getDefaultModel to provide a reading and loading
implementation that respects your syntax.
+ Copy ‘tika-example.nnmodel’ to the default directory
tika\tika-core\test \resources\org\apache\tika\detect\. Alternatively, in your
own version of TrainedModelDetector, you can invoke getDefaultModel with a
different model file location. The purpose of this method is to read the model
files and load those models into memory as object instances, i.e.
TrainedModel. If your model file(s) have a different syntax or format, you
might need to override getDefaultModel to provide a reading and loading
implementation that respects your syntax.
  
-   It is also possible that your model uses a different input size for the
byte histogram; some might choose a different bin size based on heuristics
specific to their own data. In that case, you can override
readByteFrequencies(final InputStream input) :: TrainedModelDetector to
provide your own version of the byte histogram, and you also need to ensure
that the model parameters are set to reflect the same input size.
+ It is also possible that your model uses a different input size for the byte
histogram; some might choose a different bin size based on heuristics
specific to their own data. In that case, you can override
readByteFrequencies(final InputStream input) :: TrainedModelDetector to
provide your own version of the byte histogram, and you also need to ensure
that the model parameters are set to reflect the same input size.
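
As an illustration of a byte histogram with a configurable bin size, here is a minimal, standalone Java sketch. It is not the readByteFrequencies implementation from TrainedModelDetector: the class ByteHistogramSketch, the method byteHistogram, the numBins parameter, and the choice to normalise counts by the total number of bytes are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch; not part of Tika.
class ByteHistogramSketch {

    /**
     * Counts byte values from the stream into numBins equal-width bins and
     * returns relative frequencies (assumed normalisation: counts / total bytes).
     */
    static double[] byteHistogram(InputStream input, int numBins) throws IOException {
        if (numBins <= 0 || 256 % numBins != 0) {
            throw new IllegalArgumentException("numBins must divide 256 evenly");
        }
        int binWidth = 256 / numBins;
        long[] counts = new long[numBins];
        long total = 0;
        byte[] buffer = new byte[8192];
        int read;
        while ((read = input.read(buffer)) != -1) {
            for (int i = 0; i < read; i++) {
                int value = buffer[i] & 0xFF; // treat the byte as unsigned 0..255
                counts[value / binWidth]++;
            }
            total += read;
        }
        double[] histogram = new double[numBins];
        if (total == 0) {
            return histogram; // empty input: all-zero histogram
        }
        for (int i = 0; i < numBins; i++) {
            histogram[i] = (double) counts[i] / total;
        }
        return histogram;
    }
}
```

Whatever bin count is chosen, it has to match the number of inputs recorded in the model description line, otherwise the imported parameters no longer line up with the feature vector.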
  
    TrainedModelDetector implements the Detector interface, but it is abstract,
meaning we need to subclass it with our own version of TrainedModelDetector.
  
