[ https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Giuseppe Totaro updated TIKA-1541:
----------------------------------
Description:
I have implemented an extremely simple version of {{StringsParser}}, a parser based on the {{strings}} command (or a {{strings}}-alternative command), instead of using the dummy {{EmptyParser}} for undetected files. It is preliminary work (you will see a lot of TODOs). It is inspired by the work on {{TesseractOCRParser}}. You can find the patch in the attachment.
[file:////Users/gtotaro/Desktop/TIKA-1541.patch]
I created a GitHub [repository|https://github.com/giuseppetotaro/StringsParser] for sharing the code. As a first test, you can clone the repo, build the code using the {{build.sh}} script, and then run the parser using the {{run.sh}} script on some [govdocs1|http://digitalcorpora.org/corpora/govdocs] files (taken from the "016" subset) detected as {{application/octet-stream}}. The latter script launches a simple {{StringsTest}} class for testing.
I hope you will find the {{StringsParser}} a good solution for extracting ASCII strings from undetected filetypes. As far as I understand, many "sophisticated" forensics tools work in a similar manner for indexing purposes. They run some form of the {{strings}} command against files whose type they cannot detect.
In addition to running {{strings}} on undetected files, the {{StringsParser}} launches the {{file}} command on undetected files and then writes the output in the {{strings:file_output}} property (I noticed that sometimes the {{file}} command is able to detect the media type for documents not detected by Tika).
Finally, you can find an old discussion about this topic [here|http://lucene.472066.n3.nabble.com/Default-MIME-Type-td645215.html].
Thanks [~chrismattmann].

was:
I have implemented an extremely simple version of {{StringsParser}}, a parser based on the {{strings}} command (or a {{strings}}-alternative command), instead of using the dummy {{EmptyParser}} for undetected files.
It is preliminary work (you will see a lot of TODOs). It is inspired by the work on {{TesseractOCRParser}}.
I created a GitHub [repository|https://github.com/giuseppetotaro/StringsParser] for sharing the code. As a first test, you can clone the repo, build the code using the {{build.sh}} script, and then run the parser using the {{run.sh}} script on some [govdocs1|http://digitalcorpora.org/corpora/govdocs] files (taken from the "016" subset) detected as {{application/octet-stream}}. The latter script launches a simple {{StringsTest}} class for testing.
I hope you will find the {{StringsParser}} a good solution for extracting ASCII strings from undetected filetypes. As far as I understand, many "sophisticated" forensics tools work in a similar manner for indexing purposes. They run some form of the {{strings}} command against files whose type they cannot detect.
In addition to running {{strings}} on undetected files, the {{StringsParser}} launches the {{file}} command on undetected files and then writes the output in the {{strings:file_output}} property (I noticed that sometimes the {{file}} command is able to detect the media type for documents not detected by Tika).
Finally, you can find an old discussion about this topic [here|http://lucene.472066.n3.nabble.com/Default-MIME-Type-td645215.html].
Thanks [~chrismattmann].

> StringsParser: a simple strings-based parser for Tika
> -----------------------------------------------------
>
> Key: TIKA-1541
> URL: https://issues.apache.org/jira/browse/TIKA-1541
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Giuseppe Totaro
> Attachments: TIKA-1541.patch
>
>
> I have implemented an extremely simple version of {{StringsParser}}, a
> parser based on the {{strings}} command (or a {{strings}}-alternative
> command), instead of using the dummy {{EmptyParser}} for undetected files.
> It is preliminary work (you will see a lot of TODOs).
> It is inspired by the work on {{TesseractOCRParser}}.
> You can find the patch in the attachment.
> [file:////Users/gtotaro/Desktop/TIKA-1541.patch]
> I created a GitHub
> [repository|https://github.com/giuseppetotaro/StringsParser] for sharing the
> code. As a first test, you can clone the repo, build the code using the
> {{build.sh}} script, and then run the parser using the {{run.sh}} script on
> some [govdocs1|http://digitalcorpora.org/corpora/govdocs] files (taken from
> the "016" subset) detected as {{application/octet-stream}}. The latter script
> launches a simple {{StringsTest}} class for testing.
> I hope you will find the {{StringsParser}} a good solution for extracting
> ASCII strings from undetected filetypes. As far as I understand, many
> "sophisticated" forensics tools work in a similar manner for indexing
> purposes. They run some form of the {{strings}} command against files whose
> type they cannot detect.
> In addition to running {{strings}} on undetected files, the {{StringsParser}}
> launches the {{file}} command on undetected files and then writes the output
> in the {{strings:file_output}} property (I noticed that sometimes the
> {{file}} command is able to detect the media type for documents not detected
> by Tika).
> Finally, you can find an old discussion about this topic
> [here|http://lucene.472066.n3.nabble.com/Default-MIME-Type-td645215.html].
> Thanks [~chrismattmann].

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)