[
https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172527#comment-15172527
]
Thamme Gowda N commented on TIKA-1663:
--------------------------------------
[~chrismattmann] [[email protected]] We need a SHA digest of the raw content for
the MEMEX project.
I tried to enable the digesting parser by editing our config file:
{code}
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DigestingParser">
      <parser class="org.apache.tika.parser.DefaultParser">
      </parser>
    </parser>
.....
{code}
This doesn't work, for the obvious reason that the config never says which
digest algorithm to use.
After checking
https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/DigestingParserTest.java,
I found that DigestingParser is a flexible framework that takes constructor
args.
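
To illustrate why a class name alone is not enough, here is a minimal sketch. It assumes the config loader instantiates parsers reflectively through a no-argument constructor, which is an assumption of this example and not something confirmed above. AlgorithmDigester, Md5Digester and ReflectiveLoadingSketch are hypothetical stand-ins, not Tika classes.

```java
// Hypothetical stand-in for a flexible class that needs a constructor argument.
class AlgorithmDigester {
    private final String algorithm;

    AlgorithmDigester(String algorithm) {
        this.algorithm = algorithm;
    }

    String algorithm() {
        return algorithm;
    }
}

// Hypothetical fixed-algorithm variant with a no-argument constructor.
class Md5Digester extends AlgorithmDigester {
    Md5Digester() {
        super("MD5");
    }
}

final class ReflectiveLoadingSketch {

    // Mimics a loader that knows only the class, not any constructor args.
    // Returns null when the class cannot be created without arguments.
    static AlgorithmDigester instantiate(Class<? extends AlgorithmDigester> type) {
        try {
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // No no-argument constructor exists, so the lookup fails.
        System.out.println(instantiate(AlgorithmDigester.class) == null);
        // The fixed-algorithm subclass can be created from the class alone.
        System.out.println(instantiate(Md5Digester.class).algorithm());
    }
}
```

The argument-taking class cannot be created this way, while the subclass with a fixed algorithm can.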
So, I propose two options:
1. We offer a few popular implementations, such as SHA and MD5 parsers, which
don't need constructor args. This would let us activate them by editing the
config XML file instead of the source code.
2. We enhance the Tika configuration framework and these flexible parsers to
accept runtime arguments, so that both flexibility and ease of use are
preserved. For instance, if we can supply the digest algorithm name from the
config file and let the DigestingParser use it when it is instantiated, then we
don't need to edit the source code of applications.
{code}
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DigestingParser">
      <args>
        <digest>MD5</digest>
      </args>
      <parser class="org.apache.tika.parser.DefaultParser">
      </parser>
    </parser>
.....
{code}
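
As a rough sketch of what option 2 could look like on the consuming side, the example below reads the proposed digest element from a config string and uses it to pick a JDK MessageDigest at runtime. The args and digest elements are only the proposal made in this comment, and ConfigDigestSketch is a hypothetical helper. None of this is existing Tika configuration support.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika.
final class ConfigDigestSketch {

    // Returns the text of the first <digest> element, or null if there is none.
    static String readDigestName(String configXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            configXml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("digest");
            return nodes.getLength() == 0
                    ? null
                    : nodes.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read config", e);
        }
    }

    // Hex-encodes the digest of the raw bytes using the named algorithm.
    static String hexDigest(String algorithm, byte[] raw) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(raw)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown digest: " + algorithm, e);
        }
    }

    public static void main(String[] args) {
        // A completed, well-formed version of the proposed fragment.
        String config =
                "<properties><parsers>"
                + "<parser class=\"org.apache.tika.parser.DigestingParser\">"
                + "<args><digest>MD5</digest></args>"
                + "<parser class=\"org.apache.tika.parser.DefaultParser\"/>"
                + "</parser>"
                + "</parsers></properties>";
        String algorithm = readDigestName(config);
        byte[] raw = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(algorithm + " " + hexDigest(algorithm, raw));
    }
}
```

Changing the algorithm then only requires editing the digest value in the config, not the application source.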
I vote for option 2. It is slightly more work, but I feel it is the way to go.
I do not know whether Tika already supports option 2 by accepting runtime
arguments from the config file.
I faced a similar issue with NamedEntityParser, but found a workaround by
using System properties.
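
A System property workaround of that kind, applied to the digest case, might look roughly like the sketch below. The property name sketch.digest.algorithm and the SHA-256 fallback are made up for this illustration. They are not the names used by NamedEntityParser or by any Tika class.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Tika.
final class SystemPropertyDigestSketch {

    // Hypothetical property name chosen for this sketch.
    static final String ALGORITHM_PROPERTY = "sketch.digest.algorithm";

    // Picks the algorithm from a JVM system property, falling back to SHA-256.
    static MessageDigest newDigest() {
        String name = System.getProperty(ALGORITHM_PROPERTY, "SHA-256");
        try {
            return MessageDigest.getInstance(name);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown digest: " + name, e);
        }
    }

    public static void main(String[] args) {
        // Run with -Dsketch.digest.algorithm=MD5 to switch algorithms
        // without touching the source.
        System.out.println(newDigest().getAlgorithm());
    }
}
```

This keeps the source untouched, but the setting lives outside the config file and applies to the whole JVM, which is part of why option 2 looks cleaner.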
> Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata
> -------------------------------------------------------------------
>
> Key: TIKA-1663
> URL: https://issues.apache.org/jira/browse/TIKA-1663
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Attachments: digesting_parser_v1.patch
>
>
> It might be useful to integrate commons' DigestUtils and allow users to
> easily add the MD5 or other supported hashes to the Metadata object.
> Anyone else find this of use?