[Tika Wiki] Update of "ContentMimeDetection" by Lukeliush

Apache Wiki Sun, 10 May 2015 13:30:18 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.


The "ContentMimeDetection" page has been changed by Lukeliush:
https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=2&rev2=3

  
  
  
- The feature of TIKA-1582 is an extension of TIKA MIME detection based on file 
contents, i.e. the file byte histograms, and this feature follows a standard 
data mining process that extract the knowledege out of the data (bytes). The 
motivation of this feature is to offer users with an option where content based 
detection approach can be used, the content can be defined in several ways, 
theycan be the entire file bytes, byte n-grams, byte histograms, etc. In this 
feature, the content-byte histogram is used, and the Content based detection 
analyses the byte histogram patterns (i.e. content).  Some files are very huge 
in size, building byte histograms for those files requires significant amount 
time, but it is worth noting that with domain specific knowledge or the 
heurstics (e.g. there might be some crutial and critical regions in the file 
that could help with the detection.), we can further reduce the amount of 
effort required for knowledge discovery or mining particular patterns that we 
can use in the type detection.
+ The TIKA-1582 feature extends Tika MIME detection to use file contents, i.e. 
the file's byte histogram. It follows a standard data mining process that 
extracts knowledge from the data (bytes). The motivation is to offer users a 
content-based detection option. Content can be defined in several ways: the 
entire file bytes, byte n-grams, byte histograms, etc. This feature uses the 
byte histogram.
+ 
+ Some files are very large, and building byte histograms for them takes a 
significant amount of time. With domain-specific knowledge or heuristics (e.g. 
there may be critical regions in the file that help with detection), we can 
reduce the effort required for knowledge discovery, or for mining the 
particular patterns used in type detection.
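
To make the byte histogram and the "critical regions" heuristic concrete, here 
is a minimal Python sketch. It is not the TIKA-1582 implementation: the 
function names `byte_histogram` and `sample_regions`, and the 4096-byte head 
and tail windows, are assumptions chosen only for illustration.

```python
def byte_histogram(data: bytes) -> list:
    """Return a 256-bin histogram of byte values, normalized to sum to 1."""
    counts = [0] * 256
    for value in data:          # iterating over bytes yields ints in 0..255
        counts[value] += 1
    total = len(data)
    if total == 0:
        return [0.0] * 256      # empty input: all bins are zero
    return [count / total for count in counts]


def sample_regions(data: bytes, head: int = 4096, tail: int = 4096) -> bytes:
    """Keep only the first `head` and last `tail` bytes of a large input.

    Hypothetical heuristic: it assumes the start and end of the file are the
    regions that matter for detection, which depends on the file types.
    """
    if len(data) <= head + tail:
        return data
    return data[:head] + data[len(data) - tail:]


if __name__ == "__main__":
    payload = b"GRIB" + bytes(10000) + b"7777"
    histogram = byte_histogram(sample_regions(payload))
    print(len(histogram), round(sum(histogram), 6))
```

Restricting the histogram to a few regions bounds the work per file, at the 
cost of ignoring whatever lies outside those regions.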
  
  Please also note that this content-based MIME detection requires users to 
have some knowledge of data mining and machine learning. No single learning 
algorithm is prescribed for the pattern mining. The knowledge to be mined is a 
classification, and many classification learning algorithms have been invented 
or reinvented; which one is best depends on the goal and the data. Each 
learning algorithm requires a lot of performance testing, and some data are 
linearly separable while others are not. Having a goal, or a set of goals, is 
as important here as it is in performance tuning; we can think of this as a 
tuning problem in which we set goals for scalability, complexity or accuracy. 
We therefore leave the choice of algorithm to users, based on the goals and 
data in their environment. As an example, we have implemented two algorithms 
for mining patterns with the GRB file types: linear logistic regression and a 
neural network. The neural network with back-propagation is somewhat more 
complex to train, while logistic regression is far cheaper in terms of 
complexity, and it turns out that logistic regression also gives good results 
with high accuracy. It is also worth noting that it is always better to 
circumscribe the MIME types to be detected. In the example model we built, we 
attempt to classify GRB files against non-GRB files, and one of the challenges 
is identifying the non-GRB file types, whose class can be enormously large. The 
best practice is to circumscribe a set of types to be classified; again, 
domain-specific knowledge comes into play in defining that set of types for the 
user's environment.
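
As a rough illustration of the cheaper of the two approaches mentioned above, 
the following Python sketch trains a binary logistic regression classifier by 
batch gradient descent on feature vectors such as normalized byte histograms. 
It is not the model shipped with TIKA-1582: the function names, the learning 
rate, the epoch count and the toy two-bin data are all assumptions made for 
the example.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(features, labels, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(features[0])
    m = len(features)
    weights = [0.0] * n
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(z) - y      # derivative of log loss w.r.t. z
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


def predict(weights, bias, x) -> bool:
    """Return True when the positive class (label 1) is more likely."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z) >= 0.5


if __name__ == "__main__":
    # Toy two-bin "histograms": label 1 = target type, label 0 = everything else.
    features = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
    labels = [1, 1, 0, 0]
    weights, bias = train_logistic(features, labels)
    print(predict(weights, bias, [0.7, 0.3]), predict(weights, bias, [0.3, 0.7]))
```

With real data the negative class would be limited to a circumscribed set of 
types, as recommended above, rather than "everything else".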
