[jira] [Commented] (TIKA-1541) StringsParser: a simple strings-based parser for Tika

Nick Burch (JIRA) Sat, 07 Feb 2015 05:35:51 -0800

    [ https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310711#comment-14310711 ]


Nick Burch commented on TIKA-1541:
----------------------------------

I'm not sure if we want to be activating this by default for 
application/octet-stream - it could surprise users of "unsupported" files, 
could add to the processing time and memory of those files, and means that we'd 
have the unexpected case that "known but unsupported mime type" would have less 
returned than "unknown mime type"!

Once we have the different parser strategy stuff in place (see TIKA-1509), I 
could very much see this being great as a default in the "give me all you can" 
situation (in place of the current EmptyParser).

Until we have that in place, I think we probably should not register it in 
the parsers list. Otherwise, people will suddenly find processing times go up, 
and those who've put time into getting mime types defined + detected for their 
parser-less formats will end up worse off than those who haven't.

> StringsParser: a simple strings-based parser for Tika
> -----------------------------------------------------
>
>                 Key: TIKA-1541
>                 URL: https://issues.apache.org/jira/browse/TIKA-1541
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Giuseppe Totaro
>            Assignee: Chris A. Mattmann
>         Attachments: TIKA-1541.TotaroMattmann.020615.patch.txt, 
> TIKA-1541.patch
>
>
> I have written an extremely simple implementation of {{StringsParser}}, a 
> parser based on the {{strings}} command (or a {{strings}}-alternative 
> command), to use instead of the dummy {{EmptyParser}} for undetected files. 
> It is preliminary work (you can see a lot of todos), inspired by the work on 
> {{TesseractOCRParser}}. The patch is attached.
> I created a GitHub 
> [repository|https://github.com/giuseppetotaro/StringsParser] for sharing the 
> code. As a first test, you can clone the repo, build the code using the 
> {{build.sh}} script, and then run the parser using the {{run.sh}} script on 
> some [govdocs1|http://digitalcorpora.org/corpora/govdocs] files (taken from 
> the "016" subset) detected as {{application/octet-stream}}. The latter script 
> launches a simple {{StringsTest}} class for testing.
> I hope you will find the {{StringsParser}} a good solution for extracting 
> ASCII strings from undetected filetypes. As far as I understand, many 
> "sophisticated" forensics tools work in a similar manner for indexing 
> purposes: they run a {{strings}}-like command against files that they are 
> not able to detect.
> In addition to running {{strings}} on undetected files, the 
> {{StringsParser}} launches the {{file}} command on them and writes its 
> output to the {{strings:file_output}} property (I noticed that sometimes the 
> {{file}} command is able to detect the media type for documents not detected 
> by Tika).
> Finally, you can find an old discussion about this topic 
> [here|http://lucene.472066.n3.nabble.com/Default-MIME-Type-td645215.html]. 
> Thanks [~chrismattmann].
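
For readers unfamiliar with what a strings-style pass produces, the extraction 
described in the quoted issue can be sketched in plain Java. This is only an 
illustration, not the code from the attached patch: it does not invoke the 
external strings or file commands, the class and method names are 
hypothetical, and it assumes "string" means a run of at least 4 printable 
ASCII bytes (4 is the default minimum length of GNU strings).

```java
import java.util.ArrayList;
import java.util.List;

public class AsciiStringsSketch {

    // Hypothetical helper: returns every run of printable ASCII bytes
    // (0x20..0x7E) that is at least minLen characters long.
    static List<String> extract(byte[] data, int minLen) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF; // treat the byte as unsigned
            if (c >= 0x20 && c <= 0x7E) {
                run.append((char) c);
            } else {
                // A non-printable byte ends the current run.
                if (run.length() >= minLen) {
                    out.add(run.toString());
                }
                run.setLength(0);
            }
        }
        // Flush a run that reaches the end of the input.
        if (run.length() >= minLen) {
            out.add(run.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] sample = {
            0x00, 0x01, 'T', 'i', 'k', 'a', 0x02,
            'a', 'b', 0x00, 'p', 'a', 'r', 's', 'e', 'r'
        };
        System.out.println(extract(sample, 4)); // prints [Tika, parser]
    }
}
```

Even a pass this simple reads every byte of the input, which is the 
processing-time concern raised in the comment above for files that currently 
go to the EmptyParser.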



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
