Luke sh created TIKA-1582:
-----------------------------
Summary: Mime Detection based on neural networks with
Byte-frequency-histogram
Key: TIKA-1582
URL: https://issues.apache.org/jira/browse/TIKA-1582
Project: Tika
Issue Type: Improvement
Components: detector, mime
Affects Versions: 1.7
Reporter: Luke sh
Priority: Trivial
Content-based MIME type detection is one of the popular approaches to detecting MIME types; others are based on file extensions and magic numbers. Tika currently implements three approaches to MIME type detection (a minimal usage sketch follows the list):
1) file extensions
2) magic numbers (the most trustworthy in Tika)
3) content type (the Content-Type header in the HTTP response, if present and available)
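For reference, here is roughly how the existing detection is invoked through the Tika facade; the combination of extension and magic-number detectors is configured in tika-mimetypes.xml, and the file name in the snippet is only an example.

    import java.io.File;
    import java.io.IOException;

    import org.apache.tika.Tika;

    public class DetectExample {
        public static void main(String[] args) throws IOException {
            Tika tika = new Tika();
            // Uses the default detectors (extensions + magic numbers) shipped with Tika
            String type = tika.detect(new File("sample.pdf"));
            System.out.println(type);   // e.g. "application/pdf"
        }
    }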
Content-based MIME type detection, however, analyses the distribution of bytes over the entire stream, finds a similar pattern for files of the same type, and builds a function that can group them into one or several classes in order to classify and predict. It is believed this feature could broaden the usage of Tika and add a measure of security to MIME type detection. Because we build a model that is etched with the patterns it has seen, in some situations we may choose not to trust types that the model has not been trained on. Magic numbers embedded in a file can be copied, while the actual content could be a potentially harmful Trojan program; by placing trust in byte-frequency patterns, we can enhance the security of the detection.
The proposed content-based MIME detection to be integrated into Tika is based on a machine learning algorithm, namely a neural network trained with back-propagation.
The input is a histogram of 256 bins, one for each byte value 0-255, each storing the count of occurrences of that byte. The byte-frequency histograms are normalized to fall in the range between 0 and 1 and are then passed through a companding function to enhance the contribution of the infrequent bytes.
The output of the neural network is a binary decision, 1 or 0.
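As a rough sketch of the input preparation described above, the snippet below builds the 256-bin histogram, normalizes it, and applies a companding step. The choice of normalizing by the largest bin and the power-of-0.5 companding exponent are assumptions made for illustration, not necessarily the values used in the research.

    import java.io.IOException;
    import java.io.InputStream;

    public class ByteHistogram {

        /** Builds a 256-bin byte-frequency histogram, normalized and companded into [0, 1]. */
        public static double[] histogram(InputStream in) throws IOException {
            long[] counts = new long[256];
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    counts[buf[i] & 0xFF]++;                    // count occurrences of each byte value
                }
            }
            long max = 1;
            for (long c : counts) {
                max = Math.max(max, c);                         // largest bin, used for normalization
            }
            double[] features = new double[256];
            for (int i = 0; i < 256; i++) {
                double normalized = counts[i] / (double) max;   // scale into [0, 1]
                features[i] = Math.pow(normalized, 0.5);        // companding: boost infrequent bytes
            }
            return features;
        }
    }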
Note that the proposed feature will be implemented with the GRB file type as one example.
In this example we build a model that is able to separate the GRB file type from non-GRB file types. Note that the set of non-GRB files is huge and cannot be easily defined, so there need to be as many negative training examples as possible to form the decision boundary for the non-GRB types.
The neural network involves two stages of processing: training and classification.
The training can be done in any programming language; in this feature/research, the training of the neural network is implemented in R and the source can be found in my GitHub repository, https://github.com/LukeLiush/filetypeDetection. I am also going to post a document that describes the use of the program and the syntax/format of the input and output.
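As an illustration of the classification stage only (training happens offline in R), a feed-forward pass over the 256 features for a single-hidden-layer network could look like the following; the sigmoid activation, the weight layout, and the 0.5 decision threshold are assumptions for the sketch rather than the exact network configuration used in the research.

    /** Sketch of the classification pass for a trained single-hidden-layer network. */
    public class FeedForward {

        private static double sigmoid(double x) {
            return 1.0 / (1.0 + Math.exp(-x));
        }

        /**
         * @param input 256 companded byte-frequency features
         * @param w1    hidden-layer weights, [hidden][257]; the last column is the bias
         * @param w2    output-layer weights, [hidden + 1]; the last entry is the bias
         * @return true if the network assigns the stream to the trained type
         */
        public static boolean classify(double[] input, double[][] w1, double[] w2) {
            int hidden = w1.length;
            double[] h = new double[hidden];
            for (int j = 0; j < hidden; j++) {
                double sum = w1[j][input.length];            // bias term
                for (int i = 0; i < input.length; i++) {
                    sum += w1[j][i] * input[i];
                }
                h[j] = sigmoid(sum);
            }
            double out = w2[hidden];                         // output bias
            for (int j = 0; j < hidden; j++) {
                out += w2[j] * h[j];
            }
            return sigmoid(out) > 0.5;                       // binary decision: 1 or 0
        }
    }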
After training, we need to export the model and import it into Tika. In Tika, we create a TrainedModelDetector that reads one or more model files containing the model parameters, so it can detect the MIME types covered by those models. Details of the research and the usage of this proposed feature will be posted on my GitHub shortly.
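Since TrainedModelDetector is still a proposal, the sketch below only shows how such a detector could plug into Tika's existing Detector interface, reusing the two sketches above; the class name GrbModelDetector, the constructor taking raw weights instead of parsing a model file, and the application/x-grib media type are placeholders, not the final design.

    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.tika.detect.Detector;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.mime.MediaType;

    public class GrbModelDetector implements Detector {

        private final double[][] hiddenWeights;   // in the real feature these would be read
        private final double[] outputWeights;     // from the model file exported by R

        public GrbModelDetector(double[][] hiddenWeights, double[] outputWeights) {
            this.hiddenWeights = hiddenWeights;
            this.outputWeights = outputWeights;
        }

        @Override
        public MediaType detect(InputStream input, Metadata metadata) throws IOException {
            // The Detector contract requires the stream position to be restored,
            // so the whole stream is marked before the histogram is built.
            input.mark(Integer.MAX_VALUE);
            try {
                double[] features = ByteHistogram.histogram(input);
                boolean match = FeedForward.classify(features, hiddenWeights, outputWeights);
                return match ? MediaType.application("x-grib") : MediaType.OCTET_STREAM;
            } finally {
                input.reset();
            }
        }
    }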
It is worth noting again that in this research we only worked out one model, GRB, as an example to demonstrate the use of this content-based MIME detection.
One of the challenges, again, is that the non-GRB file types cannot be clearly defined unless we feed the model example data for all existing file types in the world, which is unrealistic; it is therefore better that the set of classes/types is given and defined in advance, to narrow the problem domain.
Another challenge is the size of the training data; even if we know the types we want to classify, getting enough training data to build a model is one of the main factors of success. For our example model, GRB data were collected from ftp://hydro1.sci.gsfc.nasa.gov/data/, and we found that the GRB data from that source all exhibit a similar pattern: a simple neural network structure predicts well, and even a linear logistic regression does a good job. However, if we pass GRB files collected from other sources to the model for prediction, the model predicts poorly and unexpectedly. This raises the question of whether we need to include all training data or only the data of interest; including all data is very expensive, so it is necessary to introduce some domain knowledge to narrow the problem domain. We believe users should know which types they want to classify and should be able to get enough training data, although gathering the training data can be a tedious and expensive process. Again, it is better to have that domain knowledge, with the set of types present in the users' database, and to train a model with some examples for every type in the database.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)