[ https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871623#action_12871623 ]
Julien Nioche commented on TIKA-433:
------------------------------------
Could do. However, I can't see a place in Tika's code base for non-core
contributions or a sandbox, and I'm not sure that we want to burden Tika with
Hadoop dependencies just for the sake of implementing this. My comment was
actually more about the fact that functionality such as what you described *is*
what Behemoth is all about, i.e. processing documents in various ways using
MapReduce, storing the data in a neutral, stand-off based implementation, and
using that in conjunction with projects such as Solr or Mahout.
I suppose it also depends on whether Tika should focus on its API or also
provide a sandbox. WDYT?
> Tika + Hadoop
> -------------
>
> Key: TIKA-433
> URL: https://issues.apache.org/jira/browse/TIKA-433
> Project: Tika
> Issue Type: New Feature
> Components: general
> Reporter: Grant Ingersoll
> Priority: Minor
>
> Would be great to have a Tika contrib that took in an HDFS location with
> "rich" documents on it and an output format (or output processor) and
> converted the docs to XHTML or Solr or whatever. Seems like it should be
> pretty straightforward to do on the Hadoop side of things. Only tricky part,
> I suppose, is the output format and how to make that pluggable.