[ 
https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383307#comment-14383307
 ] 

Luke sh commented on TIKA-1582:
-------------------------------

Thanks a lot, Nick, for the prompt response.

[Nick]: Have you tried this on Container and/or Compressed file formats?
(eg .doc, .xlsx, .ods, .pages, .ogv, .mp4)
[Luke]: No, I have not :-(, but the idea is different; so far I have only
focused my tests and research on GRB file types. From those tests it is
worth noting again that it is too utopian to come up with one perfect model
that can classify everything correctly; even with GRB file types, I notice
that my trained GRB model cannot predict well on files collected from
sources other than the AMD polar site where my training data was primarily
collected.

Again, the idea is different: I believe that if users know all the types
they are dealing with, they can build a model that is specific to their own
data set, which means the model only needs to trust histograms with
patterns it has seen, i.e. patterns that fall into its positive decision
region. I also want to mention that the set of all file types is very
large; I only worked out one model (i.e. GRB type classification with
training data collected from the AMD polar site) as one example, so users
might need to train their own model with their own data. Again, I have to
mention it is too utopian to come up with a super model; the idea behind
this feature is to minimize the problem domain. Note that getting training
data can be an expensive operation. Besides, even if we had all the files
in the world (impossible though - some data costs money and some is
private), training would be very expensive. One solution is to apply
domain-specific knowledge to minimize the problem domain: if a user only
wants to securely detect 2 types, and only those two types appear in the
data set to be classified, why bother training models for zip or mp4
types? I think this is the motivation behind the development of this
feature. I am not sure this feature will become popular, but it would be
nice to have Tika support content-based mime detection; please kindly let
me know your thoughts.


In the data set on which our model is trained, we find that the GRB files
all present a similar pattern, and each GRB histogram varies only a little.
Even if they varied more, as long as we have enough training data covering
all of the patterns, I believe a neural network can predict well with some
tuning effort. Again, the downside of this approach is that the model's
prediction capability is limited: it only excels on the data
distributions/patterns it has seen. So our model is myopic because of the
data, but on the other hand this probably also makes the detection a bit
more secure: intuitively, attackers might not want to spend much time
mimicking a similar byte histogram in order to trick the mime type
detector, whereas they can easily camouflage a virus with a harmless
extension and magic bytes.
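For readers following along, the byte-frequency-histogram feature described here can be sketched in a few lines. This is only an illustration: the exponent-based companding step and the peak normalization below are assumptions for the sketch, not the exact preprocessing used in TIKA-1582.

```python
from collections import Counter

def byte_histogram(data: bytes, alpha: float = 0.5) -> list:
    """Normalized byte-frequency histogram with a simple companding step.

    256 bins, one per byte value; counts are scaled into [0, 1] by the
    peak count, then passed through x**alpha (alpha < 1) to boost the
    infrequent bytes. The exponent companding function is an assumption
    for illustration only.
    """
    counts = Counter(data)
    peak = max(counts.values()) if counts else 1
    return [(counts.get(b, 0) / peak) ** alpha for b in range(256)]

# Example: histogram of a small byte string; every bin lands in [0, 1].
hist = byte_histogram(b"GRIB2 sample payload")
```

A feature vector like `hist` is what would be fed to the classifier; the companding exponent trades off how much rare bytes are emphasized relative to common ones.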


[Nick]: That said, I could see it working for some formats, so I'd be keen to
see the results of trying it on a large dataset (eg govdocs). Maybe it'd be
worth adding it into Tika Batch and trying a large run to see how it
performs?

[Luke]: Thanks, Nick. I am also working on a document that describes this
feature in a bit more detail; it is coming soon. I will also attach some
test results from the GRB model gathered over the past few weeks. Your
advice and comments are helpful; please do not hesitate to let me know your
thoughts and questions. Thanks a lot again.
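As a rough illustration of the classification stage discussed in this thread (a 256-bin histogram in, a binary GRB / non-GRB decision out), here is a minimal forward pass of a small feed-forward network. The architecture, the toy weights, and the 0.5 threshold are all hypothetical; the actual model in TIKA-1582 is trained in R and loaded into Tika separately.

```python
import math

def predict(hist, w1, b1, w2, b2):
    """Forward pass of a minimal 256 -> H -> 1 network with sigmoid units.

    hist: 256-bin normalized byte-frequency histogram.
    w1/b1: hidden-layer weights (H rows of 256) and biases.
    w2/b2: output-layer weights (length H) and bias.
    Returns the binary decision 1 (positive class) or 0.
    """
    sigm = lambda x: 1.0 / (1.0 + math.exp(-x))
    hidden = [sigm(sum(wi * xi for wi, xi in zip(row, hist)) + bh)
              for row, bh in zip(w1, b1)]
    out = sigm(sum(wo * h for wo, h in zip(w2, hidden)) + b2)
    return 1 if out >= 0.5 else 0

# Toy weights: 2 hidden units over a 256-bin histogram (purely illustrative).
w1 = [[0.01] * 256, [-0.01] * 256]
b1 = [0.0, 0.0]
w2 = [1.0, -1.0]
b2 = 0.0
decision = predict([0.5] * 256, w1, b1, w2, b2)  # classifies this toy input as 1
```

In a real deployment the weights would come from a trained, exported model file rather than being hard-coded.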





> Mime Detection based on neural networks with Byte-frequency-histogram 
> ----------------------------------------------------------------------
>
>                 Key: TIKA-1582
>                 URL: https://issues.apache.org/jira/browse/TIKA-1582
>             Project: Tika
>          Issue Type: Improvement
>          Components: detector, mime
>    Affects Versions: 1.7
>            Reporter: Luke sh
>            Priority: Trivial
>
> Content-based mime type detection is one of the popular approaches to 
> detecting mime types; others are based on file extensions and magic numbers. 
> Currently Tika implements 3 approaches to detecting mime types; 
> they are:
> 1) file extensions
> 2) magic numbers (the most trustworthy in Tika)
> 3) content-type (the header in the HTTP response, if present and available) 
> Content-based mime type detection, however, analyses the distribution of the 
> entire stream of bytes, finds a similar pattern for each type, and builds a 
> function that can group streams into one or several classes so as to 
> classify and predict. It is believed this feature might broaden the usage of 
> Tika with a bit more security enforcement for mime type detection. Because we 
> want to build a model that is etched with the patterns it has seen, in some 
> situations we may not trust types that have not been trained/learned 
> by the model. In some situations, the magic numbers embedded in a file can be 
> copied while the actual content is a potentially harmful Trojan 
> program. By enforcing trust in byte frequency patterns, we are able to 
> enhance the security of the detection.
> The proposed content-based mime detection to be integrated into Tika is based 
> on a machine learning algorithm, i.e. a neural network with back-propagation. 
> The input: 256 bins (byte values 0-255), each of which stores the count of 
> occurrences of that byte; the byte frequency histograms are normalized to 
> fall in the range between 0 and 1, and are then passed to a companding 
> function to enhance the infrequent bytes.
> The output of the neural network is a binary decision, 1 or 0.
> Note, by the way, that the proposed feature will be implemented with the GRB 
> file type as one example.
> In this example, we build a model that can distinguish the GRB file type from 
> non-GRB file types; note that the set of non-GRB files is huge and cannot be 
> easily defined, so there need to be as many negative training examples as 
> possible to form the decision boundary for the non-GRB types.
> The neural network is considered as a two-stage process:
> training and classification.
> The training can be done in any programming language; in this 
> feature/research, the training of the neural network is implemented in R, 
> and the source can be found in my github repository, i.e. 
> https://github.com/LukeLiush/filetypeDetection; I am also going to post a 
> document that describes the use of the program and the syntax/format of the 
> input and output.
> After training, we need to export the model and import it into Tika; in Tika, 
> we create a TrainedModelDetector that reads one or more model files with 
> their model parameters, so it can detect the mime types covered by those 
> models. Details of the research and usage of this proposed feature will be 
> posted on my github shortly.
> It is worth noting again that in this research we only worked out one model - 
> GRB - as one example to demonstrate the use of this content-based mime 
> detection. One of the challenges, again, is that the non-GRB file types 
> cannot be clearly defined unless we feed our model example data for all of 
> the existing file types in the world, which seems too utopian and rather 
> unlikely; so it is better that the set of classes/types is given and defined 
> in advance to minimize the problem domain. 
> Another challenge is the size of the training data; even if we know the types 
> we want to classify, getting enough training data to form a model can also be 
> one of the main factors of success. In our example model, GRB data are 
> collected from ftp://hydro1.sci.gsfc.nasa.gov/data/; we find that the GRB 
> data from that source all exhibit a similar pattern, so a simple neural 
> network structure is able to predict well; even a linear logistic regression 
> does a good job. However, if we pass GRB files collected from other sources 
> to the model for prediction, we find that the model predicts poorly and 
> unexpectedly. This brings up the question of whether we need to include all 
> training data or only the data of interest; including all data is very 
> expensive, so it is necessary to introduce some domain knowledge to minimize 
> the problem domain. We believe users should know what types they want to 
> classify and should be able to get enough training data, although getting 
> the training data can be a tedious and expensive process. Again, it is 
> better to have that domain knowledge of the set of types present in the 
> users' database and to train a model with some examples for every type in 
> the database.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
