Luke sh created TIKA-1582:
-----------------------------

             Summary: Mime Detection based on neural networks with 
Byte-frequency-histogram 
                 Key: TIKA-1582
                 URL: https://issues.apache.org/jira/browse/TIKA-1582
             Project: Tika
          Issue Type: Improvement
          Components: detector, mime
    Affects Versions: 1.7
            Reporter: Luke sh
            Priority: Trivial


Content-based mime type detection is one of the popular approaches to detecting 
mime types; others are based on file extensions and magic numbers. Tika 
currently implements 3 approaches to detecting mime types. 
They are:
1) file extensions
2) magic numbers (the most trustworthy in Tika)
3) content-type (the header in the HTTP response, if present and available)

Content-based mime type detection, however, analyses the distribution of the 
entire stream of bytes, finds the pattern shared by files of the same type, and 
builds a function that groups them into one or several classes so that new 
files can be classified. We believe this feature might broaden the usage of 
Tika by adding a bit more security enforcement to mime type detection. Because 
the model only reflects the patterns it has seen during training, in some 
situations we may choose not to trust types that the model has not been trained 
on. Magic numbers embedded in a file can be copied while the actual content is 
a potentially harmful Trojan program. By also requiring the byte frequency 
pattern to match, we can enhance the security of the detection.

The proposed content-based mime detection to be integrated into Tika is based 
on a machine learning algorithm, namely a neural network trained with 
back-propagation. 

The input: 256 bins (byte values 0-255), each of which stores the count of 
occurrences of the corresponding byte value. The byte frequency histogram is 
normalized to fall in the range between 0 and 1 and is then passed to a 
companding function to enhance the infrequent bytes.
The output of the neural network is a binary decision, 1 or 0.
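
The input described above can be sketched in Java roughly as follows. This is 
only an illustration, not the code in the repository: the class name 
ByteHistogramFeatures, the choice to normalize by the largest bin count, and 
the power-law function x^(1/BETA) with BETA = 1.5 used to boost infrequent 
bytes are all assumptions, since the issue does not specify them.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of Tika): builds the 256-bin feature vector. */
final class ByteHistogramFeatures {

    /** Assumed exponent parameter; any value above 1 lifts small frequencies. */
    static final double BETA = 1.5;

    /** Returns 256 values in [0, 1], one per byte value. */
    static double[] extract(byte[] data) {
        double[] bins = new double[256];
        for (byte b : data) {
            bins[b & 0xFF]++; // mask so the signed byte indexes bins 0..255
        }
        double max = 0;
        for (double count : bins) {
            max = Math.max(max, count);
        }
        if (max == 0) {
            return bins; // empty input: all bins stay at 0
        }
        for (int i = 0; i < bins.length; i++) {
            double normalized = bins[i] / max;            // assumption: scale by the largest bin
            bins[i] = Math.pow(normalized, 1.0 / BETA);   // assumption: power-law boost of rare bytes
        }
        return bins;
    }

    public static void main(String[] args) {
        byte[] sample = "aaaab".getBytes(StandardCharsets.US_ASCII);
        double[] features = extract(sample);
        System.out.println("a = " + features['a'] + ", b = " + features['b']);
    }
}
```

With this choice the most frequent byte always maps to 1, and a byte that 
occurs a quarter as often ends up noticeably above 0.25, which is the intended 
effect of boosting infrequent bytes.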

Note that the proposed feature will be implemented with the GRB file type as 
one example.

In this example, we build a model that is able to distinguish the GRB file type 
from non-GRB file types. Note that the set of non-GRB files is huge and cannot 
be easily defined, so there need to be as many negative training examples as 
possible to form the decision boundary for the non-GRB types.

The neural network approach consists of two stages: training and 
classification.

The training can be done in any programming language. In this feature/research, 
the training of the neural network is implemented in R, and the source can be 
found in my GitHub repository, 
https://github.com/LukeLiush/filetypeDetection. I am also going to post a 
document that describes the use of the program and the syntax/format of the 
input and output.
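
To give a feel for the training stage, here is a small Java sketch. It is not 
the R code from the repository and it does not train a multi-layer network with 
back-propagation; it swaps in plain logistic regression fitted by batch 
gradient descent, which this issue notes further down also does a good job on 
the GRB data. The class name LogisticTrainingSketch, the learning rate, the 
number of epochs and the parameter layout (bias first, then weights) are all 
assumptions made for illustration.

```java
/** Hypothetical sketch: logistic regression fitted by batch gradient descent. */
final class LogisticTrainingSketch {

    /** Returns the parameters: index 0 is the bias, indices 1..n are the weights. */
    static double[] train(double[][] x, int[] y, double learningRate, int epochs) {
        int n = x[0].length;
        double[] theta = new double[n + 1];
        for (int epoch = 0; epoch < epochs; epoch++) {
            double[] gradient = new double[n + 1];
            for (int s = 0; s < x.length; s++) {
                double error = predict(theta, x[s]) - y[s]; // prediction minus label
                gradient[0] += error;
                for (int j = 0; j < n; j++) {
                    gradient[j + 1] += error * x[s][j];
                }
            }
            for (int j = 0; j <= n; j++) {
                theta[j] -= learningRate * gradient[j] / x.length; // averaged update
            }
        }
        return theta;
    }

    /** Sigmoid of the weighted sum: estimated probability of the positive class. */
    static double predict(double[] theta, double[] features) {
        double z = theta[0];
        for (int j = 0; j < features.length; j++) {
            z += theta[j + 1] * features[j];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        // Toy data with 2 features instead of 256; label 1 stands for "GRB".
        double[][] x = {{1.0, 0.0}, {0.9, 0.1}, {0.0, 1.0}, {0.1, 0.9}};
        int[] y = {1, 1, 0, 0};
        double[] theta = train(x, y, 1.0, 2000);
        System.out.println("p(positive | {1,0}) = " + predict(theta, new double[] {1.0, 0.0}));
        System.out.println("p(positive | {0,1}) = " + predict(theta, new double[] {0.0, 1.0}));
    }
}
```

In the real setting each row of x would be a 256-value byte frequency 
histogram, and the negative rows would come from as many non-GRB files as 
possible.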

After training, we need to export the model and import it into Tika. In Tika, 
we create a TrainedModelDetector that reads this model file, containing one or 
more sets of model parameters, or several model files, so that it can detect 
the mime types those models were trained for. Details of the research and the 
usage of this proposed feature will be posted on my GitHub shortly.
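
The classification stage could look roughly like the sketch below. It is not 
the proposed TrainedModelDetector and it does not use any Tika API; the class 
name ImportedModelSketch is hypothetical, the model is assumed to be a single 
logistic unit (a bias plus 256 weights) rather than a full network, and the 0.5 
decision threshold is an assumption. How the parameters are read from the 
exported model file is left out because the file format is not described here.

```java
/** Hypothetical sketch: applies already-trained parameters to a 256-bin histogram. */
final class ImportedModelSketch {
    private final double bias;
    private final double[] weights;

    ImportedModelSketch(double bias, double[] weights) {
        if (weights.length != 256) {
            throw new IllegalArgumentException("expected 256 weights, one per byte value");
        }
        this.bias = bias;
        this.weights = weights.clone();
    }

    /** Estimated probability that the histogram belongs to the trained type. */
    double probability(double[] histogram) {
        if (histogram.length != weights.length) {
            throw new IllegalArgumentException("expected a 256-bin histogram");
        }
        double z = bias;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * histogram[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Binary decision (the 1 or 0 output); the 0.5 threshold is an assumption. */
    boolean matches(double[] histogram) {
        return probability(histogram) >= 0.5;
    }

    public static void main(String[] args) {
        // Made-up parameters for demonstration only: reward the byte 'G' (0x47).
        double[] weights = new double[256];
        weights['G'] = 4.0;
        ImportedModelSketch model = new ImportedModelSketch(-2.0, weights);

        double[] histogram = new double[256];
        histogram['G'] = 1.0;
        System.out.println("matches: " + model.matches(histogram));
    }
}
```

A detector that handles several types would hold one such parameter set per 
mime type and report the types whose models answer 1.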

It is worth noting again that in this research we only worked out one model, 
GRB, as an example to demonstrate the use of this content-based mime detection. 
One of the challenges, again, is that the non-GRB file types cannot be clearly 
defined unless we feed our model with example data for every existing file type 
in the world, which is unrealistic. It is therefore better that the set of 
classes/types is given and defined in advance to narrow the problem domain. 

Another challenge is the size of the training data. Even if we know the types 
we want to classify, getting enough training data to form a model is one of the 
main factors of success. In our example model, GRB data are collected from 
ftp://hydro1.sci.gsfc.nasa.gov/data/, and we found that the GRB data from that 
source all exhibit a similar pattern: a simple neural network structure is able 
to predict well, and even a linear logistic regression does a good job. 
However, if we pass GRB files collected from other sources to the model for 
prediction, the model predicts poorly and unexpectedly. This raises the 
question of whether we need to include all possible training data or only the 
data that are of interest. Including all data is very expensive, so it is 
necessary to introduce some domain knowledge to narrow the problem domain. We 
believe users should know what types they want to classify and should be able 
to get enough training data, although getting the training data can be a 
tedious and expensive process. Again, it is better to use that domain knowledge 
of the set of types present in the users' database and train a model with some 
examples of every type in the database.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)