Hi Karl,
We (at Zaizi) also had this requirement. We initially addressed it by
creating a sort of "Processor Connector", mainly to semantically
enhance the repository documents before indexing them. We would be
very happy to give this a try and provide feedback, because our current
approach is strictly temporary. Apart from processing the document, we
also had a special requirement, which is to produce different instances
of repository documents, because we populate more than one index at the
same time. We would also need to check how we can do exactly the same
with this processing pipeline.
Apart from this, Karl, we can also take care of the Tika integration
(actually, we have already done it) and eventually take care of
CONNECTORS-954 as well. Because we already use Tika as a "processor
connector", we are also going to modify the Solr connector so that it
does not use the extract update handler, which presents some problems
of its own. Would that be of interest to the community as well?
Cheers,
Rafa
On 11/06/14 16:09, Karl Wright wrote:
Hi folks,
ManifoldCF finally has a pipeline! All tests pass. Now I'm looking for
people to try things out by hand to see if there are any rough edges,
before we get too far along in the 1.7 development cycle to fix them.
Trunk has all the necessary moving parts and documentation as well. There
are two transformation connectors available -- one that does nothing but
pass data through, and one that forces metadata (just like the framework
"Forced metadata" tab). But since you can have more than one of each kind
of connector in a pipeline, this should be enough to exercise things fairly
completely.
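To make the idea of chaining stages concrete, here is a minimal sketch in plain Java. It is not the ManifoldCF API: the names PipelineSketch, Transformer, passThrough, forceMetadata and run are all hypothetical, and a document is reduced to a simple metadata map. It only illustrates how a pass-through stage and a forced-metadata stage can each appear more than once in one pipeline.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

public class PipelineSketch {
    // Hypothetical stage type: takes document metadata, returns transformed metadata.
    interface Transformer extends UnaryOperator<Map<String, String>> {}

    // Stage that does nothing but pass the data through unchanged.
    static Transformer passThrough() {
        return doc -> doc;
    }

    // Stage that forces one metadata field to a fixed value,
    // overwriting any existing value for that key.
    static Transformer forceMetadata(String key, String value) {
        return doc -> {
            Map<String, String> copy = new LinkedHashMap<>(doc);
            copy.put(key, value);
            return copy;
        };
    }

    // Runs the stages in order; the output of each stage feeds the next.
    static Map<String, String> run(List<Transformer> stages, Map<String, String> input) {
        Map<String, String> current = input;
        for (Transformer stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        // Two instances of each kind of stage in a single pipeline.
        List<Transformer> stages = List.of(
            passThrough(),
            forceMetadata("source", "crawler"),
            forceMetadata("lang", "en"),
            passThrough());
        Map<String, String> result = run(stages, Map.of("title", "Report"));
        System.out.println(result);
    }
}
```

In this sketch each stage returns a new map rather than mutating its input, so the order of stages is the only thing that determines the final metadata.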
We still need to address a couple of things in the medium and long term.
First, we need a Tika transformation connector, that extracts metadata from
binary files. There's an existing ticket for that: CONNECTORS-954. If
anyone wants to take a crack at that, please let me know. (Takumi Yoshida
would be the obvious choice.) Second, we need to come up with a strategy
for removing obsolete tabs/features, like the aforementioned general job
Forced Metadata tab. We've got a fair number of such features around now.
These kinds of things cannot be removed without either a comprehensive
automatic upgrade, or loss of backwards compatibility. I am thinking maybe
we break with backwards compatibility and work towards cleaning out
duplicate features for ManifoldCF 2.0.
Thoughts?
Karl