Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "ContentMimeDetection" page has been changed by Lukeliush:
https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=23&rev2=24

  
  '''Motivation'''
  
- The feature of TIKA-1582 is an extension of TIKA MIME detection based on file 
contents, i.e. the file byte histograms, and this feature provides a solution 
that follows a standard data mining process that extracts the knowledge out of 
the data (bytes). The motivation of this feature is to provide users with an 
option where content based detection approach can be used, the contents can be 
defined in several ways, they can be the entire file bytes, byte n-grams, byte 
histograms, etc. In this feature, the content byte histogram is used.
+ TIKA-1582 extends Tika MIME detection with detection based on file 
contents, here the file's byte histogram. The feature follows a standard data 
mining process that extracts knowledge from the data (bytes). The motivation 
is to give users the option of a content-based detection approach. The 
"contents" can be defined in several ways: the entire file bytes, byte 
n-grams, byte histograms, etc. In this feature, the byte histogram is used as 
an example.
  
  Some files are very large, and building byte histograms for them takes a 
significant amount of time. With domain-specific knowledge or heuristics 
(e.g. a file may contain a few critical regions that are enough to drive 
detection), we can reduce the effort required for knowledge discovery, that 
is, for mining the particular patterns used in type detection.
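
To make the byte-histogram idea concrete, here is a minimal Java sketch. It is not the TIKA-1582 implementation: the class name `ByteHistogramSketch`, the method `histogram`, and the `maxBytes` cap are hypothetical names chosen for illustration. The cap stands in for the "critical region" heuristic by assuming that the region of interest is a prefix of the file.

```java
import java.io.IOException;
import java.io.InputStream;

// Illustrative only: a hypothetical helper, not part of Tika's API.
public class ByteHistogramSketch {

    /**
     * Counts how often each byte value (0-255) occurs in at most
     * maxBytes bytes read from the stream. Limiting the read to a
     * prefix is an assumed heuristic for keeping large files cheap.
     */
    public static int[] histogram(InputStream in, long maxBytes) throws IOException {
        int[] counts = new int[256];
        byte[] buffer = new byte[8192];
        long remaining = maxBytes;
        while (remaining > 0) {
            int toRead = (int) Math.min(buffer.length, remaining);
            int n = in.read(buffer, 0, toRead);
            if (n < 0) {
                break; // end of stream reached before the cap
            }
            for (int i = 0; i < n; i++) {
                counts[buffer[i] & 0xFF]++; // mask to treat the byte as unsigned
            }
            remaining -= n;
        }
        return counts;
    }
}
```

The resulting 256-element array could then be normalized and passed to whatever model performs the detection; that step is outside the scope of this sketch.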
  
