[Tika Wiki] Update of "ContentMimeDetection" by Lukeliush

Apache Wiki Sun, 10 May 2015 14:33:23 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.


The "ContentMimeDetection" page has been changed by Lukeliush:
https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=12&rev2=13

- Tika - Content based MIME type Detection
+ = Tika - Content based MIME type Detection =
- 
  JIRA issue with the TIKA feature
  
  https://issues.apache.org/jira/browse/TIKA-1582
@@ -26, +25 @@

  
  Please also note that the model has to be ready before it can be used in Tika; 
by "ready", we mean the model has to pass the final knowledge evaluation test. 
As will be seen shortly, in this example Tika implements only the prediction 
phase, so the trained model parameters need to be loaded into Tika for 
prediction or classification. Training can be lengthy and tedious, and it may 
require parallel computation (e.g. MapReduce) when the training data is too 
large to fit in memory; again, this depends on the user's goal.
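
As a rough illustration of what "prediction phase only" means, the sketch 
below applies already-trained logistic regression parameters to one feature 
vector. It is not the Tika or R code; the function names, the 
whitespace-separated weight format, and the bias-first ordering are all 
assumptions made for this example.

```python
import math


def parse_weights(text):
    """Parse whitespace-separated numbers into a list of floats.

    Assumed (hypothetical) layout: bias first, then one weight per feature.
    """
    return [float(token) for token in text.split()]


def predict_probability(weights, features):
    """Return sigmoid(bias + w . x) for a single example."""
    if len(weights) != len(features) + 1:
        raise ValueError("expected one bias plus one weight per feature")
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    # Numerically stable sigmoid: never exponentiate a large positive value.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

No training happens here: the parameters are produced elsewhere (in R, per 
the page) and only read in, which is why the model must be "ready" first.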
  
- ''The following will briefly walk you through how the feature and example is 
implemented in this data problem.''
+ ''The following will briefly walk you through how the feature and example are 
implemented in this data problem. Please refer to the 
[[https://github.com/LukeLiush/filetypeDetection/blob/master/Documenation_NNModelIntegrationWithTika.docx
 |documentation]] for further details on the R implementation.''
  
  Please also refer to the code repo for details of how a model is trained and 
prepared. The neural network and logistic regression learning are implemented 
in R, and the following describes the preprocessing and learning 
implementation in R.
  
@@ -48, +47 @@

  
  The dimensionality for each set is as follows.
  
- m*(256+1)
- 
- , where m indicates the number of training/validation/test examples; 256 is 
the size of features (i.e. byte frequency histogram with is '''not''' 
preprocessed with a companding function) + 1 for the labeled output.
+ ''m*(256+1),'' where m indicates the number of training/validation/test 
examples; 256 is the number of features (i.e. the byte frequency histogram, 
which is '''not''' preprocessed with a companding function), plus 1 for the 
labeled output.
  
  All of the sets are treated as matrices which need to be saved as files; 
those files are loaded into the R program through ‘loadAndProcess.R’;
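
To make the m*(256+1) shape concrete, here is a minimal sketch that builds 
one row per example: 256 raw byte counts followed by the label. The helper 
names, the integer labels, and the CSV layout are assumptions made for this 
example; the page only says the matrices are saved as files, not in which 
format.

```python
def byte_histogram(data):
    """Count occurrences of each byte value 0..255 (no companding applied)."""
    counts = [0] * 256
    for value in data:  # iterating over bytes yields ints in 0..255
        counts[value] += 1
    return counts


def make_row(data, label):
    """One matrix row: 256 features plus 1 labeled output, 257 columns."""
    return byte_histogram(data) + [label]


def build_matrix(examples):
    """Turn (bytes, label) pairs into an m x 257 list of lists."""
    return [make_row(data, label) for data, label in examples]


def to_csv(matrix):
    """Serialize the matrix as comma-separated text (assumed file format)."""
    return "\n".join(",".join(str(v) for v in row) for row in matrix)
```

For instance, `build_matrix([(b"aab", 1), (b"", 0)])` gives a 2 x 257 matrix 
whose first row has a count of 2 in column 97 (the byte value of "a").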
