Re: Revisiting: Should Manifold include Pipelines

Karl Wright Mon, 09 Jan 2012 23:50:16 -0800

Hi Mark,

Please see below.

On Mon, Jan 9, 2012 at 9:53 PM, Mark Bennett <mbenn...@ideaeng.com> wrote:
> Hi Karl,
>
> Thanks for the reply, most comments inline.
>
> General comments:
>
> I was wondering if you've used a custom pipeline like FAST ESP or
> Ultraseek's old "patches.py", and if there were any that you liked or
> disliked?  In more recent times the OpenPipeline effort has been a bit
> nascent, I think in part because it lacks some of the connectors.  Coming from
> my background I'm probably a bit biased to thinking of problems in terms of
> a pipeline, and it's also a frequent discussion with some of our more
> challenging clients.
>
> Generally speaking we define the virtual document to be the basic unit of
> retrieval, and it doesn't really matter whether it starts life as a Web
> Page or PDF or Outlook node.  Most "documents" have a create / modified
> date, some type of title, and a few other semi-common meta data fields.
> They do vary by source, but there are mapping techniques.
>
> Having more connector services, or even just more examples, is certainly a
> step in the right direction.
>
> But leaving it at writing custom monolithic connectors has a few
> disadvantages:
> - Not as modular, so discourages code reuse
> - Maintains 100% coding, vs. some mix of configure vs. code
> - Keeps the bar at rather advanced Java programmers, vs. opening up to
> folks that feel more comfortable with "scripting" (of a sort, not
> suggesting a full language)
> - I think folks tend to share more when using "configurable" systems,
> though I have no proof.  It might just be the larger number of people.
> - Sort of the "blank canvas syndrome" as each person tries to grasp all the
> nuances; granted, what I'm suggesting merely presents a smaller blank canvas,
> but maybe with crayons and connect the dots, vs. oil paints.
>

It sounds to me like what you are proposing is a reorganization of the
architecture of ManifoldCF so that documents that are fetched by
repository connectors are only obliquely related to documents indexed
through an output connector.  You are proposing that an indexed
document be possibly assembled from multiple connector sources, but
with arbitrary manipulation of the document content along the way.  Is
this correct?

If so, how would you handle document security?  Each repository
connection today specifies the security context for the documents it
fetches.  It also knows about relationships between those documents
that come from the same connector, and about document versioning for
documents fetched from that source.  How does this translate into a
pipelined world in your view?  Is the security of the final indexed
document the intersection of the security for all the sources of the
indexed document?  Is the version of the indexed document the
concatenation of the versions of all the input documents?
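
To make those two questions concrete, here is a minimal Python sketch of
one possible answer.  It is purely illustrative: SourceDocument,
merge_security and merge_version are hypothetical names, not ManifoldCF
classes, and it assumes each source contributes a set of allow tokens and
an opaque version string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceDocument:
    """One input to an assembled document (hypothetical, not a ManifoldCF type)."""
    connector: str           # name of the repository connection it came from
    version: str             # opaque version string from that connector
    allow_tokens: frozenset  # access tokens permitted to see this input


def merge_security(sources):
    """Intersect allow tokens: a user must be able to see every input."""
    token_sets = [s.allow_tokens for s in sources]
    return frozenset.intersection(*token_sets) if token_sets else frozenset()


def merge_version(sources):
    """Concatenate per-source versions in a stable order, so that a change
    in any input changes the combined version string."""
    ordered = sorted(sources, key=lambda s: s.connector)
    return "+".join(f"{s.connector}:{s.version}" for s in ordered)
```

The sketch leaves out deny tokens, which would presumably need a union
rather than an intersection, and it says nothing about how relationships
between documents from the same connector would be preserved.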

>> > Generally it's the need to deeply parse a document before instructing the
>> > spider what action to take next.
>> >
>>
>> We've looked at this as primarily a connector-specific activity.  For
>> example, you wouldn't want to do such a thing from within documents
>> fetched via JCIFS.  The main use case I can see is in extracting links
>> from web content.
>>
>
> There are so many more things to be extracted in the world.... and things
> that a spider can use.
>
> I don't understand the comment about JCIFS.  Presumably there's still the
> concept of a unit of retrieval, some "document" or "page", with some type
> of title and URL?
>

My example was meant to be instructive, not all-inclusive.  Documents
that are fetched from a Windows shared filesystem do not in general
point at other documents within the same Windows shared filesystem.
There is no point in that case in parsing those documents looking for
links; it just slows the whole system down.  The only kinds of
universal links you might find in a document from any arbitrary source
will likely be web URLs, is all that I'm saying.
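
As a rough illustration of keeping link extraction connector-specific,
the Python sketch below parses content for links only when the source is
the web.  The source_kind labels and the extract_links function are
assumptions made for the example; they are not part of ManifoldCF.

```python
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(source_kind, content):
    """Parse for links only when the source is the web; skip otherwise."""
    if source_kind != "web":
        return []  # e.g. a file share: no parsing cost at all
    collector = _LinkCollector()
    collector.feed(content)
    collector.close()
    return collector.links
```

A file-share document passed to extract_links is returned immediately
with no links, so it never pays the parsing cost.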

>
> We might be talking past each other here, and maybe we're already agreeing.
>
> So I'm a developer and I need a fancy connector that pulls from multiple
> sources.
>
> But then I notice that ManifoldCF already has connectors for all 3 of my
> sources.
>
> So yes, I'd need to write some custom code.  But I want to "plug in"
> existing Manifold connectors, but route their output as input to my
> connector.
>
> Or more likely, I'll be pulling "primary" records from one of the existing
> Manifold connectors, and will then make requests to 1 or 2 other MCF
> connectors to fill in additional details.
>
> Maybe I can do this now?  Maybe it's so trivial to you that it didn't even
> seem like a question???
>

No, this is not trivial now.  Code-driven assembly of access to
multiple connectors and amalgamation into documents would, like I
said, require very significant changes to the ManifoldCF architecture.
For example, your question overlooks several major features of the
way ManifoldCF works at the moment:

- Each repository connector supplies its own means of specifying what
documents should be included in a job, and the means of editing that
specification via the UI.
- Each repository connector knows how to handle document versioning so
that the framework can support incremental crawling properly.
- Each repository connector knows how to generate security information
for the documents it fetches.
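
The three responsibilities above can be outlined as an abstract
interface.  This is only a sketch in Python with hypothetical method
names; the real ManifoldCF connector interfaces are Java and look
different.  It is meant to show what a code-driven pipeline would have
to account for.

```python
from abc import ABC, abstractmethod


class RepositoryConnectorSketch(ABC):
    """Hypothetical outline of the three duties listed above."""

    @abstractmethod
    def select_documents(self, job_spec: dict) -> list:
        """Return identifiers of documents included by the job specification."""

    @abstractmethod
    def document_version(self, doc_id: str) -> str:
        """Return an opaque version string used for incremental crawling."""

    @abstractmethod
    def access_tokens(self, doc_id: str) -> set:
        """Return the security tokens attached to the document."""


def needs_reindex(connector, doc_id, last_seen_versions):
    """Incremental crawl check: refetch only if the version string changed."""
    return last_seen_versions.get(doc_id) != connector.document_version(doc_id)
```

The job specification here is a plain dictionary; in the framework it is
edited through the UI, which is the part a pipeline written in code
would take out of the non-programmer's hands.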

If you adopt a code-driven assembly pipeline, it's hard to see how all
of this would come together.  And yet it must, unless what you are
really looking for is Nutch but with repository connector components.
By having the criteria for what documents to include in a crawl be
part of some pipeline code, you take it out of the non-programmer's
hands.  There may be ways of respecifying what a connector is that
dodge this problem but I certainly don't know what that might look
like yet.

>
> I have the book, I'll go check those sections.
>
> Is there some chance of porting some of this info to the Wiki?  I've
> noticed references to the book in a couple emails, which is fine, I
> reference my book every now and then as well.  But for open source info it'd
> be kind of a bummer to force people to shell out $ for a book.  Not sure
> what type of a deal you have with the publisher.  Or maybe somebody else
> would have to create the equivalent info from scratch in wiki?
>

The other place it is found is in the javadoc for the IProcessActivity
interface.  You can see how it is used by looking at the RSS
connector.

I'm not trying to make people spend money on a book but I honestly
don't have the time to write the same thing two or three times.
Please feel free to come up with contributions to the ManifoldCF
documentation that are based on the book content.  You obviously can't
cut-and-paste, but you can digest the material if you think it would
be helpful to others.

>>
>> (1) Providing a content-extraction and modification pipeline, for
>> those output connectors that are targeting systems that cannot do
>> content extraction on their own.
>> (2) Providing framework-level services that allow "connectors" to be
>> readily constructed along a pipeline model.
>>
>
> A good start, let me add to it:
>
> (3) Easily use other Connectors as both inputs and outputs to a custom
> connector.
>
> I'm not sure whether it's better to have such hybrids mimic a datasource
> connector, an output connector, or maybe slot in as a security connector?
>
> Conceptually the security connectors are both "input" and "output", so
> presumably would be easier to chain?  But it'd be a bit odd to hang off of
> the "security" side of things for a process that just tweaks metadata.

This is where we run into trouble.  From my perspective, the security
of a document involves stuff that can only come from a repository,
combined with security information that comes from one (or more)
authorities. It makes absolutely no sense to have a "document security
pipeline", because there is no enforcement of security within
ManifoldCF itself; that happens elsewhere.

> Also, I don't know if security connectors have access to all of the data
> and the ability to modify metadata / document content.
>
> (4) Inclusion of Tika for filtering, which is often needed.
>
> (5) Ability for a custom connector to inject additional URLs into the
> various queues
>
> (6) Some type of "accountability" for work that has been submitted.  So a
> record comes in on connector A, I then generate requests to connectors B
> and C, and I'd like to be efficiently called back when those other tasks
> are completed or have failed.
>

I think it is now clear that you are thinking of ManifoldCF as Nutch
with connectors, where you primarily code up your crawl in Java and
fire it off that way.  But that's not the problem that ManifoldCF set
out to address.  I'm not averse to making ManifoldCF change
architecturally to support this kind of thing, but I don't think we
should forget ManifoldCF's primary mission along the way.

>
> Being able to configure custom pipelines would be even better, but not a
> deal breaker.  Obviously most Manifold users are Java coders at the moment,
> so reusability could come at a later time.
>

Actually, most ManifoldCF users are *not* Java coders - that's where
our ideas fundamentally differ.  The whole reason there's a crawler UI
in the first place is so someone who doesn't want to code can set up
crawls and run them.

I believe I have a clearer idea of what you are looking for.  Please
correct me if you disagree.  I'll ponder some of the architectural
questions and see if I can arrive at a proposal that meets most of the
goals.

Karl
