[
https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14311832#comment-14311832
]
Chris A. Mattmann commented on TIKA-1541:
-----------------------------------------
Thanks [~gostep]. I tried this out and found that it fails the forbidden API
checker Maven plugin:
{noformat}
[INFO] --- maven-failsafe-plugin:2.10:verify (default) @ tika-core ---
[INFO] Failsafe report directory:
/Users/mattmann/tmp/tika/tika-core/target/failsafe-reports
[INFO]
[INFO] --- maven-install-plugin:2.3.1:install (default-install) @ tika-core ---
[INFO] Reading bundled API signatures: jdk-deprecated
[INFO] Loading classes to check...
[INFO] Scanning for API signatures and dependencies...
[ERROR] Forbidden method invocation:
java.io.InputStreamReader#<init>(java.io.InputStream) [Uses default charset]
[ERROR] in org.apache.tika.parser.strings.StringsParser
(StringsParser.java:234)
[ERROR] Forbidden method invocation:
java.io.InputStreamReader#<init>(java.io.InputStream) [Uses default charset]
[ERROR] in org.apache.tika.parser.strings.StringsParser
(StringsParser.java:287)
[ERROR] Scanned 287 (and 816 related) class file(s) for forbidden API
invocations (in 0.34s), 2 error(s).
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Tika parent ................................. SUCCESS [ 1.629 s]
[INFO] Apache Tika core ................................... SUCCESS [ 18.495 s]
[INFO] Apache Tika parsers ................................ FAILURE [ 4.611 s]
[INFO] Apache Tika XMP .................................... SKIPPED
[INFO] Apache Tika serialization .......................... SKIPPED
[INFO] Apache Tika application ............................ SKIPPED
[INFO] Apache Tika OSGi bundle ............................ SKIPPED
[INFO] Apache Tika server ................................. SKIPPED
[INFO] Apache Tika translate .............................. SKIPPED
[INFO] Apache Tika examples ............................... SKIPPED
[INFO] Apache Tika Java-7 Components ...................... SKIPPED
[INFO] Apache Tika ........................................ SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 25.845 s
[INFO] Finished at: 2015-02-08T21:34:50-08:00
[INFO] Final Memory: 56M/705M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal de.thetaphi:forbiddenapis:1.7:check (default) on
project tika-parsers: Check for forbidden API calls failed, see log. -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please
read the following articles:
[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn <goals> -rf :tika-parsers
[wildcard:~/tmp/tika] mattmann%
{noformat}
Let me know if you can update. Thanks!
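
For reference, a minimal sketch of the usual fix for this class of error: pass an explicit charset to {{InputStreamReader}} instead of relying on the platform default. The class and method names below are hypothetical and not taken from the patch, UTF-8 is only an assumption about the encoding of the command output, and {{StandardCharsets}} requires Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; it is not part of the TIKA-1541 patch.
class CharsetAwareReader {

    // Reads every line from the stream, decoding bytes with the charset
    // supplied by the caller rather than the platform default.
    public static String readAll(InputStream stream, Charset charset) throws IOException {
        StringBuilder out = new StringBuilder();
        // The two-argument constructor names the charset explicitly; the
        // one-argument form is the call flagged in the build log above.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(stream, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the output stream of an external command.
        byte[] bytes = "caf\u00e9\nstrings output\n".getBytes(StandardCharsets.UTF_8);
        // Assumption: UTF-8 is the right encoding for this input.
        System.out.print(readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8));
    }
}
```

Whether UTF-8 is the right choice at {{StringsParser.java:234}} and {{StringsParser.java:287}} depends on what the external commands emit, so the charset may need to be configurable.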
> StringsParser: a simple strings-based parser for Tika
> -----------------------------------------------------
>
> Key: TIKA-1541
> URL: https://issues.apache.org/jira/browse/TIKA-1541
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Giuseppe Totaro
> Assignee: Chris A. Mattmann
> Attachments: TIKA-1541.TotaroMattmann.020615.patch.txt,
> TIKA-1541.TotaroMattmann.020615.patch.txt,
> TIKA-1541.TotaroMattmannBurchNassif.020715.patch, TIKA-1541.patch,
> testOCTET_header.dbase3
>
>
> I have written an extremely simple implementation of
> {{StringsParser}}, a parser based on the {{strings}} command (or a
> {{strings}}-alternative command), to use instead of the dummy {{EmptyParser}}
> for undetected files. It is preliminary work (you can see a lot of TODOs).
> It is inspired by the work on {{TesseractOCRParser}}. You can find the patch
> attached.
> I created a GitHub
> [repository|https://github.com/giuseppetotaro/StringsParser] for sharing the
> code. As a first test, you can clone the repo, build the code using the
> {{build.sh}} script, and then run the parser using the {{run.sh}} script on
> some [govdocs1|http://digitalcorpora.org/corpora/govdocs] files (grabbed from
> the "016" subset) detected as {{application/octet-stream}}. The latter script
> launches a simple {{StringsTest}} class for testing.
> I hope you will find the {{StringsParser}} a good solution for extracting
> ASCII strings from undetected filetypes. As far as I understand, many
> "sophisticated" forensics tools work in a similar manner for indexing
> purposes: they run a sort of {{strings}} command against files whose type
> they are not able to detect.
> In addition to running {{strings}} on undetected files, the {{StringsParser}}
> launches the {{file}} command on undetected files and then writes the output
> to the {{strings:file_output}} property (I noticed that sometimes the
> {{file}} command is able to detect the media type for documents not detected
> by Tika).
> Finally, you can find an old discussion about this topic
> [here|http://lucene.472066.n3.nabble.com/Default-MIME-Type-td645215.html].
> Thanks [~chrismattmann].
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)