Giuseppe Totaro created TIKA-1541:
-------------------------------------

             Summary: StringsParser: a simple strings-based parser for Tika
                 Key: TIKA-1541
                 URL: https://issues.apache.org/jira/browse/TIKA-1541
             Project: Tika
          Issue Type: Improvement
          Components: parser
            Reporter: Giuseppe Totaro


I have written an extremely simple {{StringsParser}}, a parser based on the
{{strings}} command (or a {{strings}}-alternative command), to use instead of
the dummy {{EmptyParser}} for undetected files. It is preliminary work (you
will see a lot of TODOs) and is inspired by the work on
{{TesseractOCRParser}}.

I created a GitHub [repository|https://github.com/giuseppetotaro/StringsParser]
for sharing the code. As a first test, you can clone the repo, build the code
using the {{build.sh}} script, and then run the parser using the {{run.sh}}
script on some [govdocs1|http://digitalcorpora.org/corpora/govdocs] files
(taken from the "016" subset) that are detected as
{{application/octet-stream}}. The latter script launches a simple
{{StringsTest}} class for testing.

I hope you will find the {{StringsParser}} a good solution for extracting ASCII
strings from undetected file types. As far as I understand, many "sophisticated"
forensics tools work in a similar manner for indexing purposes: they run a sort
of {{strings}} command against files that they are not able to detect.
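
To illustrate the kind of extraction meant here, the following is a minimal sketch of {{strings}}-like behaviour written in plain Java. It is not the code from the repository, which delegates to the external {{strings}} command. The class name {{AsciiStrings}}, the method {{extract}}, and the choice of "printable" bytes (0x20 to 0x7E plus tab) are assumptions made for this example only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or of the StringsParser repository.
class AsciiStrings {

    /**
     * Returns every run of at least minLength printable ASCII bytes
     * (0x20-0x7E, plus tab) found in the stream, in order of appearance.
     */
    static List<String> extract(InputStream in, int minLength) throws IOException {
        List<String> result = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            if ((b >= 0x20 && b <= 0x7E) || b == '\t') {
                // Printable byte: extend the current run.
                run.append((char) b);
            } else {
                // Non-printable byte ends the run; keep it only if long enough.
                if (run.length() >= minLength) {
                    result.add(run.toString());
                }
                run.setLength(0);
            }
        }
        // Flush a run that reaches the end of the stream.
        if (run.length() >= minLength) {
            result.add(run.toString());
        }
        return result;
    }
}
```

Calling {{AsciiStrings.extract(stream, 4)}} on a binary file yields the text fragments that an indexer could still use even when the media type is unknown. A minimum length of 4 is a commonly used threshold, but it is only a parameter here.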

In addition to running {{strings}} on undetected files, the {{StringsParser}}
launches the {{file}} command on them and writes its output to
the {{strings:file_output}} property (I noticed that sometimes the {{file}}
command is able to detect the media type for documents not detected by Tika).
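
The step of launching {{file}} and recording its output can be sketched roughly as below. This is an illustration under stated assumptions, not the repository code: {{FileCommandProbe}}, {{run}}, and {{addFileOutput}} are hypothetical names, a plain {{Map}} stands in for Tika's metadata object, and only the property name {{strings:file_output}} comes from the description above. The sketch has no timeout, which a real parser would probably want.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Tika or of the StringsParser repository.
class FileCommandProbe {

    // Property name taken from the issue description.
    static final String FILE_OUTPUT_KEY = "strings:file_output";

    /**
     * Runs an external command and returns its trimmed output, or empty if
     * the command cannot be started or exits with a non-zero status.
     */
    static Optional<String> run(List<String> command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true) // merge stderr into stdout
                    .start();
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (InputStream in = process.getInputStream()) {
                byte[] chunk = new byte[4096];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
            }
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                return Optional.empty();
            }
            return Optional.of(new String(buffer.toByteArray(), StandardCharsets.UTF_8).trim());
        } catch (IOException e) {
            // Typically means the command is not installed or not on the PATH.
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }

    /** Stores the output of "file <path>" under FILE_OUTPUT_KEY, if available. */
    static void addFileOutput(String path, Map<String, String> metadata) {
        run(Arrays.asList("file", path))
                .ifPresent(output -> metadata.put(FILE_OUTPUT_KEY, output));
    }
}
```

Returning an empty {{Optional}} when the command is missing lets the parser degrade quietly on systems where {{file}} is not installed, instead of failing the whole parse.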

Finally, you can find an old discussion about this topic 
[here|http://lucene.472066.n3.nabble.com/Default-MIME-Type-td645215.html]. 
Thanks [~chrismattmann].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
