Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "ContentMimeDetection" page has been changed by Lukeliush:
https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=18&rev2=19

  
  We need to split the dataset into 3 chunks: a training set, a validation set 
and a test set.
  
- The dimensionality for each set is as follows.
+ We convert the stream of bytes into a histogram with 255 bins, each of which 
stores a count of occurrences. You can define your input with a smaller number 
of histogram bins, or with bins selected using domain knowledge; you can also 
apply a feature selection or dimensionality reduction algorithm such as SOM or 
PCA, or apply your own custom functions to the input variables so that the 
model can capture non-linear effects. To begin with, however, we need to 
understand our goal and the data: sometimes we need to visualize the data, and 
we usually start with a simple algorithm to explore it before deciding whether 
a more complex algorithm is needed.
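
As a rough illustration of the byte-histogram step described above, here is a minimal sketch in Python. It is not the Tika implementation: the function name `byte_histogram` and the `num_bins` parameter are hypothetical, and the sketch assumes 256 bins by default (one per possible byte value, 0 to 255), which may differ from the bin count the page's model actually uses. Passing a smaller `num_bins` folds neighbouring byte values together, mirroring the "smaller number of histogram bins" option.

```python
from typing import List

NUM_BINS = 256  # assumption: one bin per possible byte value (0-255)


def byte_histogram(data: bytes, num_bins: int = NUM_BINS) -> List[int]:
    """Count byte occurrences in `data`, grouped into `num_bins` equal-width bins."""
    if not 1 <= num_bins <= 256:
        raise ValueError("num_bins must be between 1 and 256")
    counts = [0] * num_bins
    for b in data:
        # Iterating over bytes yields ints in 0..255; scale each into its bin.
        counts[b * num_bins // 256] += 1
    return counts
```

For example, `byte_histogram(open(path, "rb").read())` would give one feature vector per file, where `path` is whatever file you want to describe.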
  
+ For simplicity, our training data have 255 features, each of which 
corresponds to a byte value, and each training example is labelled with an 
actual output indicating its class.
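
To make the labelled examples and the three-way split mentioned earlier more concrete, here is a small sketch in Python. The helper names `make_example` and `split_dataset`, the 60/20/20 proportions and the fixed seed are all assumptions chosen for illustration; the page does not say how the original split was made.

```python
import random
from typing import List, Sequence, Tuple

Row = List[int]


def make_example(histogram: Sequence[int], label: int) -> Row:
    """Append the class label to a feature vector, giving one row of the dataset."""
    return list(histogram) + [label]


def split_dataset(
    rows: Sequence[Row],
    train_frac: float = 0.6,  # assumed proportion, not from the page
    val_frac: float = 0.2,    # assumed proportion, not from the page
    seed: int = 0,
) -> Tuple[List[Row], List[Row], List[Row]]:
    """Shuffle the rows and cut them into training, validation and test sets."""
    if train_frac < 0 or val_frac < 0 or train_frac + val_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains becomes the test set
    return train, val, test
```

Each returned set is a list of rows with the label in the last position, so it can be written out as a matrix with one example per line.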
- ''m*(256+1),'' where m indicates the number of training/validation/test 
examples; 256 is the number of features (i.e. the byte frequency histogram, 
which is '''not''' preprocessed with a companding function) + 1 for the 
labeled output.
- 
- All of the sets are treated as matrices which need to be saved as files; 
those files are loaded into the R program through ‘loadAndProcess.R’.
  
  '''Pre processing'''
  
@@ -177, +175 @@

  
  The following lists all of the classes for this feature 
(tika\tika-core\src\main\java)
  
+  . org.apache.tika.detect.TrainedModelDetector (abstract)
+  . org.apache.tika.detect.ExampleNNModelDetector
+  . org.apache.tika.detect.TrainedModel (abstract)
+  . org.apache.tika.detect.NNTrainedModel
-   org.apache.tika.detect.TrainedModelDetector (abstract)
- 
-   org.apache.tika.detect.ExampleNNModelDetector
- 
-   org.apache.tika.detect.TrainedModel (abstract)
- 
-   org.apache.tika.detect.NNTrainedModel
  
  Example model file (tika\tika-core\src\main\resources)
  
-   org.apache.tika.detect.tika-example.nnmodel
+  . org.apache.tika.detect.tika-example.nnmodel
  
  Unit test (tika\tika-core\src\test\java)
  
-   org.apache.tika.detect. MimeDetectionWithNNTest
+  . org.apache.tika.detect.MimeDetectionWithNNTest
  
