[jira] [Commented] (TIKA-1541) StringsParser: a simple strings-based parser for Tika

Chris A. Mattmann (JIRA) Fri, 06 Feb 2015 18:30:06 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310470#comment-14310470
 ]


Chris A. Mattmann commented on TIKA-1541:
-----------------------------------------

Right [~lfcnassif], and TIKA-1483 has amounted only to discussion so far, 
whereas Giuseppe has actually written code. It's a start and can definitely be 
improved upon. If someone writes improvements that make this code pure Java, 
great, we can consider them once they exist. Right now, this is code 
that's here and a good starting point. I'll work on getting this into the 
sources.

As for extracting strings not being hard enough to warrant an external 
process, I'm not sure what you mean. UNIX strings is more than 
simply an external process - it's a well-maintained and long-standing 
GNU tool - why not simply have a parser that integrates it and demonstrates yet 
another example of the ExternalParser API? I think it's a good thing to have.
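
For anyone who wants to explore the pure-Java route mentioned above, here is a 
minimal sketch of the core idea behind strings: collect runs of printable ASCII 
bytes and keep those that reach a minimum length (GNU strings defaults to 4). 
The class name PureJavaStrings and the method extract are hypothetical and are 
not part of Tika or the attached patch. The sketch assumes single-byte ASCII 
input and does not cover the encoding or offset options of the real tool.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the StringsParser from TIKA-1541.
public class PureJavaStrings {

    /** Returns every run of at least minLength printable ASCII bytes (tab included). */
    public static List<String> extract(byte[] data, int minLength) {
        List<String> result = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF; // treat the byte as unsigned
            if ((c >= 0x20 && c <= 0x7E) || c == '\t') {
                run.append((char) c);
            } else {
                // A non-printable byte ends the current run.
                if (run.length() >= minLength) {
                    result.add(run.toString());
                }
                run.setLength(0);
            }
        }
        // Flush a run that reaches the end of the input.
        if (run.length() >= minLength) {
            result.add(run.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] sample = {0, 1, 'T', 'i', 'k', 'a', 2, 'o', 'k', 0,
                         'p', 'a', 'r', 's', 'e', 'r'};
        for (String s : extract(sample, 4)) {
            System.out.println(s);
        }
    }
}
```

With a minimum length of 4, the sample yields "Tika" and "parser", while the 
two-byte run "ok" is dropped. A real parser would stream the input rather than 
hold the whole file in memory.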

> StringsParser: a simple strings-based parser for Tika
> -----------------------------------------------------
>
>                 Key: TIKA-1541
>                 URL: https://issues.apache.org/jira/browse/TIKA-1541
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Giuseppe Totaro
>         Attachments: TIKA-1541.patch
>
>
> I have written an extremely simple implementation of 
> {{StringsParser}}, a parser based on the {{strings}} command (or a 
> {{strings}}-alternative command), to use instead of the dummy {{EmptyParser}} 
> for undetected files. It is preliminary work (you can see a lot of TODOs). 
> It is inspired by the work on {{TesseractOCRParser}}. The patch is 
> attached.
> I created a GitHub 
> [repository|https://github.com/giuseppetotaro/StringsParser] for sharing the 
> code. As a first test, you can clone the repo, build the code using the 
> {{build.sh}} script, and then run the parser using the {{run.sh}} script on 
> some [govdocs1|http://digitalcorpora.org/corpora/govdocs] files (taken from 
> the "016" subset) detected as {{application/octet-stream}}. The latter script 
> launches a simple {{StringsTest}} class for testing.
> I hope you will find the {{StringsParser}} a good solution for extracting 
> ASCII strings from undetected file types. As far as I understand, many 
> "sophisticated" forensics tools work in a similar manner for indexing 
> purposes. They run a sort of {{strings}} command against files that they are 
> not able to detect.
> In addition to running {{strings}} on undetected files, the {{StringsParser}} 
> launches the {{file}} command on undetected files and then writes the output 
> to the {{strings:file_output}} property (I noticed that sometimes the 
> {{file}} command is able to detect the media type for documents not detected 
> by Tika).
> Finally, you can find an old discussion about this topic 
> [here|http://lucene.472066.n3.nabble.com/Default-MIME-Type-td645215.html]. 
> Thanks [~chrismattmann].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
