[jira] [Commented] (TIKA-1541) StringsParser: a simple strings-based parser for Tika

Chris A. Mattmann (JIRA) Tue, 10 Feb 2015 15:30:06 -0800

    [ https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315173#comment-14315173 ]


Chris A. Mattmann commented on TIKA-1541:
-----------------------------------------

Last patch did the trick, [~gostep].

{noformat}
[INFO] --- maven-clean-plugin:2.4.1:clean (default-clean) @ tika ---
[INFO] 
[INFO] --- maven-remote-resources-plugin:1.2.1:process (default) @ tika ---
[INFO] 
[INFO] --- forbiddenapis:1.7:check (default) @ tika ---
[INFO] Skipping execution for packaging "pom"
[INFO] 
[INFO] --- forbiddenapis:1.7:testCheck (default) @ tika ---
[INFO] Skipping execution for packaging "pom"
[INFO] 
[INFO] --- maven-site-plugin:3.0:attach-descriptor (attach-descriptor) @ tika ---
[INFO] 
[INFO] --- maven-install-plugin:2.3.1:install (default-install) @ tika ---
[INFO] Installing /Users/mattmann/tmp/tika/pom.xml to /Users/mattmann/.m2/repository/org/apache/tika/tika/1.8-SNAPSHOT/tika-1.8-SNAPSHOT.pom
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Tika parent ................................. SUCCESS [  2.130 s]
[INFO] Apache Tika core ................................... SUCCESS [ 25.001 s]
[INFO] Apache Tika parsers ................................ SUCCESS [02:00 min]
[INFO] Apache Tika XMP .................................... SUCCESS [  2.239 s]
[INFO] Apache Tika serialization .......................... SUCCESS [  2.009 s]
[INFO] Apache Tika application ............................ SUCCESS [ 14.115 s]
[INFO] Apache Tika OSGi bundle ............................ SUCCESS [ 17.591 s]
[INFO] Apache Tika server ................................. SUCCESS [ 19.313 s]
[INFO] Apache Tika translate .............................. SUCCESS [  2.305 s]
[INFO] Apache Tika examples ............................... SUCCESS [  5.266 s]
[INFO] Apache Tika Java-7 Components ...................... SUCCESS [  2.547 s]
[INFO] Apache Tika ........................................ SUCCESS [  0.034 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 03:33 min
[INFO] Finished at: 2015-02-10T15:25:24-08:00
[INFO] Final Memory: 81M/1618M
[INFO] ------------------------------------------------------------------------
[chipotle:~/tmp/tika] mattmann% 
{noformat}

Committing now. Great work!

> StringsParser: a simple strings-based parser for Tika
> -----------------------------------------------------
>
>                 Key: TIKA-1541
>                 URL: https://issues.apache.org/jira/browse/TIKA-1541
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Giuseppe Totaro
>            Assignee: Chris A. Mattmann
>         Attachments: TIKA-1541.TotaroMattmann.020615.patch.txt, 
> TIKA-1541.TotaroMattmann.020615.patch.txt, 
> TIKA-1541.TotaroMattmannBurchNassif.020715.patch, 
> TIKA-1541.TotaroMattmannBurchNassif.020815.patch, 
> TIKA-1541.TotaroMattmannBurchNassif.020915.patch, TIKA-1541.patch, 
> testOCTET_header.dbase3
>
>
> I have written an extremely simple implementation of {{StringsParser}}, a 
> parser based on the {{strings}} command (or a {{strings}}-alternative 
> command), to use instead of the dummy {{EmptyParser}} for undetected files. 
> It is preliminary work (you will see a lot of TODOs), inspired by the work 
> on {{TesseractOCRParser}}. The patch is attached.
> I created a GitHub 
> [repository|https://github.com/giuseppetotaro/StringsParser] for sharing the 
> code. As a first test, you can clone the repo, build the code using the 
> {{build.sh}} script, and then run the parser using the {{run.sh}} script on 
> some [govdocs1|http://digitalcorpora.org/corpora/govdocs] files (taken from 
> the "016" subset) detected as {{application/octet-stream}}. The {{run.sh}} 
> script launches a simple {{StringsTest}} class for testing.
> I hope you will find {{StringsParser}} a good solution for extracting 
> ASCII strings from undetected file types. As far as I understand, many 
> "sophisticated" forensics tools work in a similar manner for indexing 
> purposes: they run some form of {{strings}} command against files that they 
> are not able to detect.
> In addition to running {{strings}} on undetected files, {{StringsParser}} 
> launches the {{file}} command on them and writes its output to the 
> {{strings:file_output}} property (I noticed that the {{file}} command is 
> sometimes able to detect the media type of documents not detected by Tika).
> Finally, you can find an old discussion about this topic 
> [here|http://lucene.472066.n3.nabble.com/Default-MIME-Type-td645215.html]. 
> Thanks [~chrismattmann].
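
For readers unfamiliar with {{strings}}, the extraction idea described in the issue can be sketched in plain Java. This is an illustration only and not the code in the attached patches, which (per the description) invoke the external {{strings}} and {{file}} commands. The class name {{AsciiStrings}}, the method {{extract}}, and the choice to count tab as printable are all assumptions made for this sketch; the minimum run length is left to the caller (GNU {{strings}} uses 4 by default).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or of the TIKA-1541 patches.
final class AsciiStrings {
    private AsciiStrings() {}

    /**
     * Returns every run of at least minLen consecutive printable ASCII
     * bytes found in data, in order of appearance.
     */
    static List<String> extract(byte[] data, int minLen) {
        List<String> runs = new ArrayList<>();
        int start = -1; // index where the current run began, -1 if not in a run
        // Iterate one step past the end so a run that reaches EOF is flushed.
        for (int i = 0; i <= data.length; i++) {
            boolean printable = i < data.length && isPrintable(data[i]);
            if (printable) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                if (i - start >= minLen) {
                    runs.add(new String(data, start, i - start,
                            StandardCharsets.US_ASCII));
                }
                start = -1;
            }
        }
        return runs;
    }

    // Assumption: printable means 0x20..0x7E plus horizontal tab.
    // Java bytes are signed, so values above 0x7F are negative and rejected.
    private static boolean isPrintable(byte b) {
        return (b >= 0x20 && b <= 0x7E) || b == '\t';
    }
}
```

Calling {{AsciiStrings.extract(bytes, 4)}} on the contents of an {{application/octet-stream}} file yields the text fragments that could then be handed to an indexer. Shelling out to the real command, as the issue describes, additionally allows a {{strings}}-alternative binary to be substituted.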



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
