[
https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Giuseppe Totaro updated TIKA-1541:
----------------------------------
Description:
I thought to implement an extremely simple implementation of {{StringsParser}},
a parser based on the {{strings}} command (or {{strings}}-alternative command),
instead of using the dummy {{EmptyParser}} for undetected files. It is a
preliminary work (you can see a lot of todos). It is inspired by the work on
{{TesseractOCRParser}}. You can find the patch in attachment.
[file:////Users/gtotaro/Desktop/TIKA-1541.patch]
I created a GitHub [repository|https://github.com/giuseppetotaro/StringsParser]
for sharing the code. As first test, you can clone the repo, build the code
using the {{build.sh}} script, and then run the parser using the {{run.sh}}
script on some [govdocs1|http://digitalcorpora.org/corpora/govdocs] files
(grabbed from "016" subset) detected as {{application/octet-stream}}. The
latter script launches a simple {{StringsTest}} class for testing.
I hope you will find the {{StringsParser}} a good solution for extracting ASCII
strings from undetected filetypes. As far as I understood, many "sophisticated"
forensics tools work in a similar manner for indexing purposes. They use a sort
of {{strings}} command against files that they are not able to detect.
In addition to run {{strings}} on undetected files, the {{StringsParser}}
launches the {{file}} command on undetected files and then writes the output in
the {{strings:file_output}} property (I noticed that sometimes the {{file}}
command is able to detect the media type for documents not detected by Tika).
Finally, you can fine an old discussion about this topic
[here|http://lucene.472066.n3.nabble.com/Default-MIME-Type-td645215.html].
Thanks [~chrismattmann].
was:
I thought to implement an extremely simple implementation of {{StringsParser}},
a parser based on the {{strings}} command (or {{strings}}-alternative command),
instead of using the dummy {{EmptyParser}} for undetected files. It is a
preliminary work (you can see a lot of todos). It is inspired by the work on
{{TesseractOCRParser}}.
I created a GitHub [repository|https://github.com/giuseppetotaro/StringsParser]
for sharing the code. As first test, you can clone the repo, build the code
using the {{build.sh}} script, and then run the parser using the {{run.sh}}
script on some [govdocs1|http://digitalcorpora.org/corpora/govdocs] files
(grabbed from "016" subset) detected as {{application/octet-stream}}. The
latter script launches a simple {{StringsTest}} class for testing.
I hope you will find the {{StringsParser}} a good solution for extracting ASCII
strings from undetected filetypes. As far as I understood, many "sophisticated"
forensics tools work in a similar manner for indexing purposes. They use a sort
of {{strings}} command against files that they are not able to detect.
In addition to run {{strings}} on undetected files, the {{StringsParser}}
launches the {{file}} command on undetected files and then writes the output in
the {{strings:file_output}} property (I noticed that sometimes the {{file}}
command is able to detect the media type for documents not detected by Tika).
Finally, you can fine an old discussion about this topic
[here|http://lucene.472066.n3.nabble.com/Default-MIME-Type-td645215.html].
Thanks [~chrismattmann].
> StringsParser: a simple strings-based parser for Tika
> -----------------------------------------------------
>
> Key: TIKA-1541
> URL: https://issues.apache.org/jira/browse/TIKA-1541
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Giuseppe Totaro
> Attachments: TIKA-1541.patch
>
>
> I thought to implement an extremely simple implementation of
> {{StringsParser}}, a parser based on the {{strings}} command (or
> {{strings}}-alternative command), instead of using the dummy {{EmptyParser}}
> for undetected files. It is a preliminary work (you can see a lot of todos).
> It is inspired by the work on {{TesseractOCRParser}}. You can find the patch
> in attachment.
> [file:////Users/gtotaro/Desktop/TIKA-1541.patch]
> I created a GitHub
> [repository|https://github.com/giuseppetotaro/StringsParser] for sharing the
> code. As first test, you can clone the repo, build the code using the
> {{build.sh}} script, and then run the parser using the {{run.sh}} script on
> some [govdocs1|http://digitalcorpora.org/corpora/govdocs] files (grabbed from
> "016" subset) detected as {{application/octet-stream}}. The latter script
> launches a simple {{StringsTest}} class for testing.
> I hope you will find the {{StringsParser}} a good solution for extracting
> ASCII strings from undetected filetypes. As far as I understood, many
> "sophisticated" forensics tools work in a similar manner for indexing
> purposes. They use a sort of {{strings}} command against files that they are
> not able to detect.
> In addition to run {{strings}} on undetected files, the {{StringsParser}}
> launches the {{file}} command on undetected files and then writes the output
> in the {{strings:file_output}} property (I noticed that sometimes the
> {{file}} command is able to detect the media type for documents not detected
> by Tika).
> Finally, you can fine an old discussion about this topic
> [here|http://lucene.472066.n3.nabble.com/Default-MIME-Type-td645215.html].
> Thanks [~chrismattmann].
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)