Hi Karl,

On Wed, Jan 11, 2012 at 4:21 AM, Karl Wright <daddy...@gmail.com> wrote:
> Hi Mark,
>
> I think I'd describe this simplified proposal as "pipeline" (vs.
> "Pipeline". Your original description was the latter.) This proposal
> is simpler but does not have the ability to amalgamate content from
> multiple connectors, correct?

Yes.

> As long as it is just modifying the
> content and metadata (as described by RepositoryDocument), it's not
> hard to develop a generic idea of a content processing pipeline, e.g.
> Tika.

Yay!

> There's a question in my mind as to where it belongs. If its purpose
> is to make up for missing code in particular search engines, then I'd
> argue it should be a service available to output connector coders, who
> can then choose how much configurability makes sense from the point of
> view of their target system.

I'm not sure if this question is revisiting the motivation for preferring this in MCF, or a technical question about how to package metadata for different engines that might want it in a different format.

For the former, I'd briefly rehash my answer from earlier in the thread: pipelines are not in every search engine, and many organizations deal with multiple search engines, so having a standard home for that logic would be awesome!

For the latter, how to pass metadata to engines, that's interesting. One almost universal way is to add metadata tags to the header portion of an HTML file. There are also some microformats that some engines understand. Could we just assume, for now, that additional metadata will be jammed into the HTML header, perhaps with an "x-" prefix on the name (a convention some folks like)?

> For instance, since Tika is already part
> of Solr, there would seem little benefit in adding a Tika pipeline
> upstream of Solr as well, but maybe a Google Appliance connector would
> want it and therefore expose it.

Including Tika would be useful for connectors that need to look at binary doc files to do their parsing.
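To make the "jam it into the HTML header" idea concrete, here is a minimal sketch of that convention. This is not MCF or Tika API; the class and method names are hypothetical, and it only illustrates emitting each metadata field as an "x-"-prefixed `<meta>` tag:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetaHeaderSketch {
    // Hypothetical helper: wrap body text and metadata into an HTML page,
    // emitting each metadata field as an "x-"-prefixed <meta> tag so a
    // downstream engine can pick the fields up from the header.
    static String wrapWithMetadata(String bodyText, Map<String, String> metadata) {
        StringBuilder sb = new StringBuilder("<html><head>\n");
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            sb.append("<meta name=\"x-").append(e.getKey())
              .append("\" content=\"").append(escape(e.getValue())).append("\">\n");
        }
        sb.append("</head><body>\n").append(escape(bodyText)).append("\n</body></html>");
        return sb.toString();
    }

    // Minimal HTML escaping for attribute values and body text.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("author", "Mark Bennett");
        meta.put("source", "CMS");
        System.out.println(wrapWithMetadata("Hello, world", meta));
    }
}
```

Engines that ignore unknown `<meta>` names would simply skip these fields, which is part of the appeal of the convention.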
Even if the pipeline then discards Tika's output when it's done, it's still a worthwhile expense *if* it meets the project objective. As an example, the current MCF system looks for links in HTML. But hyperlinks can also appear in Word, Excel, and PDF files. Tika could, in theory, convert those docs so that they can also be scanned for links, and then later discard the converted file.

Another attractive pipeline would have Tika convert non-HTML binary files into some primitive form of HTML and then perhaps discard the original binary. So binary content would be transformed into HTML or TXT in the MCF stage, INSTEAD of having a search engine (or other system) do it. It's still a 1:1 transformation.

> If the pipeline's purpose is to
> include arbitrary business logic, on the other hand, then I think what
> you'd really need is a Pipeline and not a pipeline, if you see what I
> mean.

Given the dismal state of open tools, I'd be excited just to see 1:1 "pipeline" functionality made widely available. I'm regretting, to some extent, bringing in the more complex Pipeline logic, as it may have partially derailed the conversation. I'm one of the authors of the old XPump tool, which was able to do very fancy things but suffered from other issues. But better to have something now than nothing. And I'll ponder the more complex scenarios some more.

> So, my question to you is, what would the main use case(s) be for a
> "pipeline" in your view?

I've given a couple of examples above, of 1:1 transforms. I *KNOW* this is of interest to some folks, but it sounds like I've failed to convince you. I'd ask you to take it on faith, but you don't know me very well, so that'd be asking a lot.

As an academic exercise, suppose it were given that this was a good thing to do; what would then be the easiest way to do it?

A final question for you Karl, since we've both invested some time in discussing something that would normally be very complex to explain to others.
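As a sketch of the link-harvesting stage described above: assume an upstream step (e.g. Tika, whose actual API is not shown here) has already converted a Word/Excel/PDF document into primitive HTML. A downstream stage could then pull out the hyperlinks with plain JDK regex, after which the converted file could be discarded. The class name and the simple `href` regex are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkScanSketch {
    // Assumes an upstream stage (e.g. Tika) already produced primitive HTML
    // from the binary document; this step only harvests the hyperlinks.
    // A naive regex is used here purely for illustration.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String convertedHtml) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(convertedHtml);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p>See <a href=\"http://example.com/a\">A</a> and "
                    + "<a href=\"http://example.com/b\">B</a>.</p>";
        System.out.println(extractLinks(html));
    }
}
```

In a real pipeline a proper HTML parser would be preferable to a regex, but the point is only that link discovery becomes uniform once everything is HTML.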
What open source tools would YOU suggest I look at, for a new home for uber pipeline processing? I think you understand some of the logical functionality I want to model. Some other wish list items:

* Leverage MCF connectors
* A web UI framework for monitoring

I'd say up front that I've considered Nutch, but I don't think it's a good fit, for other reasons. I'm still looking at UIMA. I keep finding justifications for how awesome UIMA is, but less on the technical side; I'm not sure it models a data flow design that well. The other area I looked at was some of the Eclipse process graph stuff, "Business Process Management" I think. There's a TON of open source projects there.

> Karl
>
> On Wed, Jan 11, 2012 at 6:31 AM, Mark Bennett <mbenn...@ideaeng.com> wrote:
> > Hi Karl,
> >
> > Still pondering our last discussion. Wondering if I got things off track.
> >
> > As a start, what if I backtracked a bit, to this:
> >
> > What's the easiest way to do this:
> > * A connector that tweaks metadata from a single source.
> > * Sits between any existing MCF datasource connector and the main MCF engine
> >
> > Before:
> >
> > CMS/DB -> Existing MCF connector -> MCF core -> output
> >
> > After:
> >
> > CMS/DB -> Existing MCF connector -> Metadata tweaker -> MCF core -> output
> >
> > Assume the metadata changes don't have any impact on security, or that no
> > security is being used (public data)
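The quoted Before/After flow could be modeled as a 1:1 transform interface sitting between the connector and the core. This is only a sketch under stated assumptions: `Doc` is a hypothetical stand-in for MCF's RepositoryDocument, and `Tweaker` is an invented interface, not anything in the MCF codebase:

```java
import java.util.HashMap;
import java.util.Map;

public class TweakerSketch {
    // Hypothetical stand-in for MCF's RepositoryDocument: content plus metadata.
    static class Doc {
        String content;
        Map<String, String> metadata = new HashMap<>();
        Doc(String content) { this.content = content; }
    }

    // A 1:1 "pipeline" stage: takes one document, returns one document,
    // matching the "Metadata tweaker" box in the After diagram.
    interface Tweaker {
        Doc tweak(Doc doc);
    }

    public static void main(String[] args) {
        // Example tweaker: normalize a metadata field and stamp the stage name.
        Tweaker lowercaseAuthor = doc -> {
            doc.metadata.computeIfPresent("author", (k, v) -> v.toLowerCase());
            doc.metadata.put("processed-by", "metadata-tweaker");
            return doc;
        };

        Doc doc = new Doc("body text");
        doc.metadata.put("author", "Mark Bennett");
        doc = lowercaseAuthor.tweak(doc);  // sits between connector and MCF core
        System.out.println(doc.metadata);
    }
}
```

Because each stage is 1:1, stages compose trivially by chaining `tweak` calls, which is what keeps this "pipeline" (lowercase) rather than "Pipeline".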