Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "ContentMimeDetection" page has been changed by Lukeliush:
https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=15&rev2=16

  
  (It is worth noting that the feature selection requires learning the 
application domain which in our case is specific to the user domain and 
environment)
  
- Also please note the model has to be ready before it can be used in Tika; by 
"ready", we mean the model has to pass the final knowledge evaluation test. As 
shall be seen shortly, as an example Tika is only implementing the prediction 
phase, so the model parameters need to be loaded and read into Tika for 
prediction or classification; The process of training can be lengthy and 
tedious, sometimes training might require parallel computation on e.g. 
map-reduce when training data is too large to fit memory, again this depends on 
the user's goal.
+ Please also note that the model has to be ready before it can be used in Tika; 
by "ready", we mean the model has passed the final knowledge evaluation test. As 
shall be seen shortly, Tika implements only the prediction phase, so the trained 
model parameters need to be loaded into Tika for prediction or classification. 
Training can be lengthy and tedious; when the training data is too large to fit 
in memory, training might need to be converted to map-reduce operations. Again, 
this depends on the user's goal.
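
As a rough illustration of "loading the model parameters", here is a minimal 
Python sketch. It is not Tika's loader: the file layout (whitespace-separated 
numbers, one bias term followed by 256 weights, one per byte value) and the 
function name parse_model_parameters are assumptions made only for this 
example.

```python
# Hypothetical layout: bias first, then one weight per byte value (0..255).
EXPECTED_WEIGHTS = 256


def parse_model_parameters(text):
    """Parse trained parameters exported as whitespace-separated numbers.

    Returns a (bias, weights) pair. Raises ValueError if the number of
    values does not match the assumed layout.
    """
    values = [float(token) for token in text.split()]
    if len(values) != EXPECTED_WEIGHTS + 1:
        raise ValueError(
            "expected %d values, got %d" % (EXPECTED_WEIGHTS + 1, len(values))
        )
    return values[0], values[1:]
```

Rejecting a file of the wrong size up front means a mismatch between the 
training program and the detector is reported at load time rather than showing 
up later as wrong predictions.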
  
- ''The following will briefly walk you through how the feature and example is 
implemented in this data problem. Please also refer to the attached docx for 
further information with the implemenation in R.<<BR>>''
+ __''The following will briefly walk you through how the feature and example 
are implemented in this data problem. Please also refer to the attached docx for 
further information on the implementation in R.''__
  
- Please also refer to the code repo for details of implementation for training 
or preparing for a model, the neural network and logistic regression learning 
are implemented in R and the following describes the pre processing and 
learning implementation in R.
+ Please also refer to the code repo for details of the implementation for 
training a model. The neural network and logistic regression learning are both 
implemented in R. The following briefly describes the pre-processing and 
learning implementation in R and how to load the model parameters trained by 
the R programs into Tika for mime detection.
+ 
+ Please note again that the training program can be written in any programming 
language; the R implementation is posted as an example. Tika only needs to load 
the well-trained model parameters from the training program and be able to use 
them. The feature in Tika generally works in the 4 steps listed below. It is 
also flexible: you can override the detect method of the TrainedModelDetector 
to define your own selected features if your training uses different features.
+ 
+  1. read the input in bytes
+  1. convert it to the byte histogram
+  1. preprocess and transform the histogram
+  1. predict the decision.
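
The four steps above can be sketched in a few lines of Python. This is an 
illustrative sketch only, not the TrainedModelDetector code: the square-root 
scaling in preprocess, the single-output logistic regression in predict, and 
the weights, bias and threshold arguments are all assumptions chosen to keep 
the example small.

```python
import math


def byte_histogram(data):
    """Steps 1-2: count how often each byte value (0..255) occurs in the input."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    return counts


def preprocess(counts):
    """Step 3 (assumed transform): scale by the largest count, then take sqrt."""
    peak = max(counts)
    if peak == 0:
        return [0.0] * len(counts)  # empty input: all-zero feature vector
    return [math.sqrt(c / peak) for c in counts]


def predict(features, weights, bias):
    """Step 4 (assumed model): logistic regression probability in [0, 1]."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Numerically stable sigmoid: never exponentiates a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def detect(data, weights, bias, threshold=0.5):
    """Run all four steps; True means the input is classified as the target type."""
    features = preprocess(byte_histogram(data))
    return predict(features, weights, bias) >= threshold
```

For example, detect(open(path, "rb").read(), weights, bias) would classify one 
file, given weights and bias produced by a training program. A different 
feature set only changes byte_histogram and preprocess; the prediction step 
stays the same as long as the loaded parameters match the features.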
  
  Project source repository
  
