As an exercise in understanding, it might be helpful to consider how exactly a document specification in today's ManifoldCF would morph if you wanted a connection to be a pipeline component rather than what it is today.
Right now, the document specification for a job is an XML doc of a form that only the underlying connector understands, which specifies the following kinds of information:

- What documents to include in the crawl (which is meaningful only in the context of an existing underlying connection);
- What parts of those documents to index (e.g. what metadata is included).

The information is used in several places during the crawl:

- At the time seeding is done (the initial documents)
- When a decision is being made to include a document on the queue
- Before a document is going to be fetched
- In order to set up the document for indexing

The repository connector allows you to edit the document specification in the Crawler UI. This is done by the repository connector contributing tabs to the job.

Now, in order for a pipeline to work, most of the activities of the connector will need to be broken out into separate pipeline tasks. For instance, "seeding" would be a different task from "filtering", which would be different from "enqueuing", which would be different from "obtaining security info". I would expect that each pipeline step would have its own UI, so if you were using Connection X to seed, then you would want to specify what documents to seed in the UI for that step, in a manner consistent with the underlying connection. So the connector would need to break up its document specification into multiple pieces, e.g. a "seeding document specification" with a seeding document specification UI. There would be a corresponding specification and UI for "connector document filtering" and for "connector document enqueuing". I suspect there would be a lot of duplication and overlap too, which would be hard to avoid.

The end result of this exercise would be something that would allow more flexibility, at the expense of ease of use.

Karl

On Tue, Jan 10, 2012 at 2:49 AM, Karl Wright <daddy...@gmail.com> wrote:
> Hi Mark,
>
> Please see below.
>
> On Mon, Jan 9, 2012 at 9:53 PM, Mark Bennett <mbenn...@ideaeng.com> wrote:
>> Hi Karl,
>>
>> Thanks for the reply, most comments inline.
>>
>> General comments:
>>
>> I was wondering if you've used a custom pipeline like FAST ESP or
>> Ultraseek's old "patches.py", and if there were any that you liked or
>> disliked? In more recent times the OpenPipeline effort has been a bit
>> nascent, I think in part because it lacks some of the connectors. Coming from
>> my background I'm probably a bit biased toward thinking of problems in terms of
>> a pipeline, and it's also a frequent discussion with some of our more
>> challenging clients.
>>
>> Generally speaking we define the virtual document to be the basic unit of
>> retrieval, and it doesn't really matter whether it starts life as a web
>> page or PDF or Outlook node. Most "documents" have a create / modified
>> date, some type of title, and a few other semi-common metadata fields.
>> They do vary by source, but there are mapping techniques.
>>
>> Having more connector services, or even just more examples, is certainly a
>> step in the right direction.
>>
>> But leaving it at writing custom monolithic connectors has a few
>> disadvantages:
>> - Not as modular, so it discourages code reuse
>> - Maintains 100% coding, vs. some mix of configuration and code
>> - Keeps the bar at rather advanced Java programmers, vs. opening up to
>> folks who feel more comfortable with "scripting" (of a sort, not
>> suggesting a full language)
>> - I think folks tend to share more when using "configurable" systems,
>> though I have no proof.
>> It might just be the larger number of people.
>> - Sort of the "blank canvas syndrome" as each person tries to grasp all the
>> nuances; granted the one I'm suggesting merely presents a smaller blank canvas,
>> but maybe with crayons and connect-the-dots, vs. oil paints.
>>
>
> It sounds to me like what you are proposing is a reorganization of the
> architecture of ManifoldCF so that documents that are fetched by
> repository connectors are only obliquely related to documents indexed
> through an output connector. You are proposing that an indexed
> document be possibly assembled from multiple connector sources, but
> with arbitrary manipulation of the document content along the way. Is
> this correct?
>
> If so, how would you handle document security? Each repository
> connection today specifies the security context for the documents it
> fetches. It also knows about relationships between those documents
> that come from the same connector, and about document versioning for
> documents fetched from that source. How does this translate into a
> pipelined world in your view? Is the security of the final indexed
> document the intersection of the security for all the sources of the
> indexed document? Is the version of the indexed document the
> concatenation of the versions of all the input documents?
>
>
>>> > Generally it's the need to deeply parse a document before instructing the
>>> > spider what action to take next.
>>> >
>>>
>>> We've looked at this as primarily a connector-specific activity. For
>>> example, you wouldn't want to do such a thing from within documents
>>> fetched via JCIFs. The main use case I can see is in extracting links
>>> from web content.
>>>
>>
>> There are so many more things to be extracted in the world... and things
>> that a spider can use.
>>
>> I don't understand the comment about JCIFs. Presumably there's still the
>> concept of a unit of retrieval, some "document" or "page", with some type
>> of title and URL?
>>
>
> My example was meant to be instructive, not all-inclusive. Documents
> that are fetched from a Windows shared filesystem do not in general
> point at other documents within the same Windows shared filesystem.
> There is no point in that case in parsing those documents looking for
> links; it just slows the whole system down. The only kinds of
> universal links you might find in a document from any arbitrary source
> will likely be web URLs, is all that I'm saying.
>
>>
>> We might be talking past each other here, and maybe we're already agreeing.
>>
>> So I'm a developer and I need a fancy connector that pulls from multiple
>> sources.
>>
>> But then I notice that ManifoldCF already has connectors for all 3 of my
>> sources.
>>
>> So yes, I'd need to write some custom code. But I want to "plug in"
>> existing Manifold connectors, but route their output as input to my
>> connector.
>>
>> Or more likely, I'll be pulling "primary" records from one of the existing
>> Manifold connectors, and will then make requests to 1 or 2 other MCF
>> connectors to fill in additional details.
>>
>> Maybe I can do this now? Maybe it's so trivial to you that it didn't even
>> seem like a question???
>>
>
> No, this is not trivial now. Code-driven assembly of access to
> multiple connectors and amalgamation into documents would, like I
> said, require very significant changes to the ManifoldCF architecture.
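To make the request concrete, the "pull primary records from one connector, enrich from two others" pattern Mark describes might look something like the sketch below. This is purely hypothetical: none of these interfaces exist in ManifoldCF today, and every type and method name here is invented for illustration only.

// Purely hypothetical sketch -- these interfaces do NOT exist in ManifoldCF.
// It only illustrates the code-driven assembly under discussion: pull primary
// records from one existing connection, enrich them from two others, and hand
// the merged result to an output connection.
import java.util.List;
import java.util.Map;

interface RepoClient {                       // stand-in for "an open repository connection"
  List<Map<String,String>> fetchPrimaryRecords() throws Exception;
  Map<String,String> lookupDetails(String id) throws Exception;
}

interface OutputClient {                     // stand-in for "an open output connection"
  void ingest(String id, Map<String,String> fields) throws Exception;
}

public class AssembledCrawl {
  public static void crawl(RepoClient primary, RepoClient extraA,
                           RepoClient extraB, OutputClient output) throws Exception {
    for (Map<String,String> record : primary.fetchPrimaryRecords()) {
      String id = record.get("id");
      record.putAll(extraA.lookupDetails(id));   // fill in details from the second source
      record.putAll(extraB.lookupDetails(id));   // ...and from the third
      // Open questions from this thread: whose security applies to the merged
      // document, and what does its composite version string look like?
      output.ingest(id, record);
    }
  }
}

Even in this toy form, the sketch sidesteps the hard parts raised in the next paragraph: incremental versioning, document security, and keeping the inclusion criteria editable by non-programmers.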
> For example, your question overlooks several major features of the
> way ManifoldCF works at the moment:
>
> - Each repository connector supplies its own means of specifying what
> documents should be included in a job, and the means of editing that
> specification via the UI.
> - Each repository connector knows how to handle document versioning so
> that the framework can support incremental crawling properly.
> - Each repository connector knows how to generate security information
> for the documents it fetches.
>
> If you adopt a code-driven assembly pipeline, it's hard to see how all
> of this would come together. And yet it must, unless what you are
> really looking for is Nutch but with repository connector components.
> By having the criteria for what documents to include in a crawl be
> part of some pipeline code, you take it out of the non-programmer's
> hands. There may be ways of respecifying what a connector is that
> dodge this problem, but I certainly don't know what that might look
> like yet.
>
>
>> I have the book, I'll go check those sections.
>>
>> Is there some chance of porting some of this info to the Wiki? I've
>> noticed references to the book in a couple of emails, which is fine; I
>> reference my book every now and then too. But for open source info it'd
>> be kind of a bummer to force people to shell out $ for a book. Not sure
>> what type of a deal you have with the publisher. Or maybe somebody else
>> would have to create the equivalent info from scratch in the wiki?
>>
>
> The other place it is found is in the javadoc for the IProcessActivity
> interface. You can see how it is used by looking at the RSS
> connector.
>
> I'm not trying to make people spend money on a book, but I honestly
> don't have the time to write the same thing two or three times.
> Please feel free to come up with contributions to the ManifoldCF
> documentation that are based on the book content. You obviously can't
> cut-and-paste, but you can digest the material if you think it would
> be helpful to others.
>
>>>
>>> (1) Providing a content-extraction and modification pipeline, for
>>> those output connectors that are targeting systems that cannot do
>>> content extraction on their own.
>>> (2) Providing framework-level services that allow "connectors" to be
>>> readily constructed along a pipeline model.
>>>
>>
>> A good start, let me add to it:
>>
>> (3) Easily use other connectors as both inputs and outputs to a custom
>> connector.
>>
>> I'm not sure whether it's better to have such hybrids mimic a datasource
>> connector, an output connector, or maybe slot in as a security connector?
>>
>> Conceptually the security connectors are both "input" and "output", so
>> presumably would be easier to chain? But it'd be a bit odd to hang off of
>> the "security" side of things for a process that just tweaks metadata.
>
> This is where we run into trouble. From my perspective, the security
> of a document involves stuff that can only come from a repository,
> combined with security information that comes from one (or more)
> authorities. It makes absolutely no sense to have a "document security
> pipeline", because there is no enforcement of security within
> ManifoldCF itself; that happens elsewhere.
>
>> Also, I don't know if security connectors have access to all of the data
>> and the ability to modify metadata / document content.
>>
>> (4) Inclusion of Tika for filtering, which is often needed.
>>
>> (5) Ability for a custom connector to inject additional URLs into the
>> various queues
>>
>> (6) Some type of "accountability" for work that has been submitted. So a
>> record comes in on connector A, I then generate requests to connectors B
>> and C, and I'd like to be efficiently called back when those other tasks
>> are completed or have failed.
>>
>
> I think it is now clear that you are thinking of ManifoldCF as Nutch
> with connectors, where you primarily code up your crawl in Java and
> fire it off that way. But that's not the problem that ManifoldCF set
> out to address. I'm not averse to making ManifoldCF change
> architecturally to support this kind of thing, but I don't think we
> should forget ManifoldCF's primary mission along the way.
>
>>
>> Being able to configure custom pipelines would be even better, but not a
>> deal breaker. Obviously most Manifold users are Java coders at the moment,
>> so re-usability could come at a later time.
>>
>
> Actually, most ManifoldCF users are *not* Java coders - that's where
> our ideas fundamentally differ. The whole reason there's a crawler UI
> in the first place is so someone who doesn't want to code can set up
> crawls and run them.
>
> I believe I have a clearer idea of what you are looking for. Please
> correct me if you disagree. I'll ponder some of the architectural
> questions and see if I can arrive at a proposal that meets most of the
> goals.
>
>
> Karl
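As background for item (5) above and for the pointer to the IProcessActivity javadoc and the RSS connector: within the existing architecture, a repository connector can already hand newly discovered documents back to the framework while it processes a fetched document, roughly as sketched below. The addDocumentReference call follows the IProcessActivity javadoc, but the surrounding class, helper names, and package paths are approximate and reconstructed from memory; the RSS and Web connectors are the real reference implementations.

// Sketch only, not a complete connector. A repository connector's
// processDocuments() implementation receives an IProcessActivity handle; links
// it extracts from a fetched document can be fed back as candidate queue entries.
import java.util.List;

import org.apache.manifoldcf.core.interfaces.ManifoldCFException;    // package paths approximate
import org.apache.manifoldcf.crawler.interfaces.IProcessActivity;    // see the actual javadoc

public class LinkInjectionSketch {
  // parentIdentifier: the document currently being processed.
  // extractedUrls: links pulled out of its content by the connector's own parsing.
  void queueExtractedLinks(IProcessActivity activities, String parentIdentifier,
                           List<String> extractedUrls) throws ManifoldCFException {
    for (String childUrl : extractedUrls) {
      // Each reference becomes a candidate document in the crawl queue; the
      // framework and the job's document specification decide whether it is
      // actually fetched. "link" is a connector-defined relationship type.
      activities.addDocumentReference(childUrl, parentIdentifier, "link");
    }
  }
}

Item (5) asks for roughly this capability from a custom, code-driven component's point of view; item (6), the completion callbacks, has no counterpart in the mechanism sketched here.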