Giuseppe Totaro created TIKA-1541:
-------------------------------------
Summary: StringsParser: a simple strings-based parser for Tika
Key: TIKA-1541
URL: https://issues.apache.org/jira/browse/TIKA-1541
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Giuseppe Totaro
I have written an extremely simple implementation of {{StringsParser}},
a parser based on the {{strings}} command (or a {{strings}}-alternative
command), to be used instead of the dummy {{EmptyParser}} for undetected
files. It is preliminary work (you can see a lot of TODOs). It is inspired by
the work on {{TesseractOCRParser}}.
I created a GitHub [repository|https://github.com/giuseppetotaro/StringsParser]
for sharing the code. As a first test, you can clone the repo, build the code
using the {{build.sh}} script, and then run the parser using the {{run.sh}}
script on some [govdocs1|http://digitalcorpora.org/corpora/govdocs] files
(taken from the "016" subset) that are detected as
{{application/octet-stream}}. The latter script launches a simple
{{StringsTest}} class for testing.
I hope you will find the {{StringsParser}} a good solution for extracting ASCII
strings from undetected filetypes. As far as I understand, many "sophisticated"
forensics tools work in a similar manner for indexing purposes: they run a sort
of {{strings}} command against files that they are not able to detect.
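
To make the idea concrete, here is a minimal pure-Java sketch of what a
{{strings}}-like pass does: it collects runs of printable ASCII bytes and keeps
those of a minimum length. This is not the code in the repository (which calls
the external command); the class name {{AsciiStrings}} and the method
{{extract}} are hypothetical, and the minimum length of 4 only mirrors the
default of GNU {{strings}}.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of the StringsParser repository.
public class AsciiStrings {

    /** Returns every run of at least minLength printable ASCII bytes (0x20-0x7E or tab). */
    public static List<String> extract(byte[] data, int minLength) {
        List<String> result = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF; // treat the byte as unsigned
            if ((c >= 0x20 && c <= 0x7E) || c == '\t') {
                run.append((char) c);
            } else {
                // A non-printable byte ends the current run.
                if (run.length() >= minLength) {
                    result.add(run.toString());
                }
                run.setLength(0);
            }
        }
        // Flush a run that reaches the end of the input.
        if (run.length() >= minLength) {
            result.add(run.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] sample = {0x00, 'T', 'i', 'k', 'a', 0x01, 'o', 'k', 0x02,
                         'p', 'a', 'r', 's', 'e', 'r'};
        System.out.println(extract(sample, 4)); // prints [Tika, parser]
    }
}
```

The two-character run "ok" is dropped because it is shorter than the minimum
length, which is how such a pass filters out most binary noise.
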
In addition to running {{strings}} on undetected files, the {{StringsParser}}
launches the {{file}} command on undetected files and then writes the output in
the {{strings:file_output}} property (I noticed that sometimes the {{file}}
command is able to detect the media type for documents not detected by Tika).
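
As a rough sketch of this step, the snippet below runs an external command with
{{ProcessBuilder}} and stores its output under the {{strings:file_output}} key.
It is an assumption-laden illustration, not the repository code: the class
{{FileCommandSketch}} and its methods are hypothetical, a plain {{Map}} stands
in for Tika's metadata object, and the {{-b}} (brief) option of {{file}} is
used only to omit the file name from the output.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration, not part of the StringsParser repository.
public class FileCommandSketch {

    /** Runs a command and returns its trimmed output, or empty if it cannot be started. */
    static Optional<String> run(List<String> command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true) // merge stderr into stdout
                    .start();
            // Read everything the process writes, then wait for it to exit.
            byte[] output = process.getInputStream().readAllBytes();
            process.waitFor();
            return Optional.of(new String(output, StandardCharsets.UTF_8).trim());
        } catch (IOException e) {
            // Typically means the command is not installed or not on the PATH.
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }

    /** Stores the command output in the map (a stand-in for Tika metadata), if any. */
    static void annotate(Map<String, String> metadata, String fileCommand, String path) {
        run(List.of(fileCommand, "-b", path))
                .ifPresent(out -> metadata.put("strings:file_output", out));
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        annotate(metadata, "file", args.length > 0 ? args[0] : ".");
        System.out.println(metadata);
    }
}
```

If the {{file}} command is not available, the property is simply left unset,
so the parser can still return the extracted strings.
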
Finally, you can find an old discussion about this topic
[here|http://lucene.472066.n3.nabble.com/Default-MIME-Type-td645215.html].
Thanks [~chrismattmann].
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)