ManifoldCF pipeline design -- heads up to interested parties

Karl Wright Wed, 28 May 2014 07:24:08 -0700

---------- Forwarded message ----------
Hi All,

CONNECTORS-946 covers adding a native, internal ManifoldCF pipeline,
hopefully as part of MCF 1.7.  Before I begin coding, I'd like to hear any
feedback on the design.  At least three people have asked for this
functionality in some form over the last couple of years, so I want to be
sure it meets their likely use cases.


One key question that is still outstanding is whether it makes sense to
have a pipeline be part of a job.  That is the most natural way to do it,
but then the ability to reuse the pipeline itself (not individual stages,
but the whole thing) is limited.

Please comment.

Karl


From: Karl Wright (JIRA) <[email protected]>
Date: Tue, May 27, 2014 at 7:47 PM
Subject: [jira] [Comment Edited] (CONNECTORS-946) Add support for pipeline
connector
To: [email protected]



    [ https://issues.apache.org/jira/browse/CONNECTORS-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010505#comment-14010505 ]

Karl Wright edited comment on CONNECTORS-946 at 5/27/14 11:45 PM:
------------------------------------------------------------------

On second thought, in order to be able to maintain the ability to detect
configuration changes, the Pipeline Connector will have to have a version
string.  This changes the design quite a bit:

- The pipeline connection list for processing is built right in the job
- Each job has an ordered list of pipeline connections it runs on every
document (in a new database table)
- Pipeline connections can have job tabs in the UI (although we have to
figure out something to avoid collisions when the same connection type
appears more than once in one job -- maybe pass in the pipeline connection
name as a parameter to the UI methods)
- There's a TranslationSpecification equivalent to an OutputSpecification
or DocumentSpecification, and a pipeline connector method that explicitly
maps the TranslationSpecification to a version string
- The transformDocument() method accepts the version string, and uses that
where appropriate to control the transformation
- The ingeststatus table has a new sidecar table that holds onto pipeline
connection version strings, for comparison

It's critical that the performance of the ingeststatus table does not
suffer unless there are configured pipeline steps, but I think that would
be relatively straightforward to do, since pipeline version strings will be
directly requested by the worker threads when evaluating whether a document
has changed.
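
To make the version-string idea concrete, here is a minimal sketch of how a
worker thread might decide whether a document needs reprocessing.  Every
name in it (PipelineStage, SidecarVersions, needsReprocessing, stage) is
hypothetical and not part of ManifoldCF; the in-memory map only stands in
for the proposed sidecar table, and the version string is simply the
specification text.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PipelineVersionSketch {
    /** Hypothetical stand-in for one pipeline connection as configured in a job. */
    public interface PipelineStage {
        String connectionName();
        /** Maps the stage's per-job specification to a version string. */
        String versionString();
    }

    /** Hypothetical in-memory stand-in for the sidecar table next to ingeststatus. */
    public static final class SidecarVersions {
        private final Map<String, List<String>> byDocument = new HashMap<>();

        public List<String> lookup(String documentId) {
            return byDocument.get(documentId);
        }

        public void record(String documentId, List<String> versions) {
            byDocument.put(documentId, new ArrayList<>(versions));
        }
    }

    /** Builds a stage whose version string is just its specification text (an assumption). */
    public static PipelineStage stage(final String name, final String specification) {
        return new PipelineStage() {
            public String connectionName() { return name; }
            public String versionString() { return specification; }
        };
    }

    /** Ordered version strings for the job's pipeline; order matters because stages run in sequence. */
    public static List<String> currentVersions(List<PipelineStage> stages) {
        List<String> versions = new ArrayList<>();
        for (PipelineStage s : stages) {
            versions.add(s.connectionName() + "=" + s.versionString());
        }
        return versions;
    }

    /**
     * Returns true when the stored pipeline versions differ from the current ones.
     * With no configured stages the sidecar is never consulted. This fast path
     * deliberately ignores the case where a job drops all of its stages.
     */
    public static boolean needsReprocessing(String documentId,
                                            List<PipelineStage> stages,
                                            SidecarVersions sidecar) {
        if (stages.isEmpty()) {
            return false;
        }
        List<String> stored = sidecar.lookup(documentId);
        return stored == null || !stored.equals(currentVersions(stages));
    }
}
```

The early return for an empty stage list is the point of the sketch: jobs
without pipeline steps never touch the sidecar lookup, so they should not
pay for the feature.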





> Add support for pipeline connector
> ----------------------------------
>
>                 Key: CONNECTORS-946
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-946
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Framework crawler agent
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.7
>
>
> In the Amazon Search Connector, we finally found an example of an output
> connector that needed to do full document processing in order to work.
> This ticket represents work in the framework to create a concept of
> "pipeline connector".  Pipeline connections would receive
> RepositoryDocument objects, and transform them to new RepositoryDocument
> objects.  There would be a single important method:
> {code}
> public void transformDocument(RepositoryDocument rd,
>     ITransformationActivities activities)
>     throws ServiceInterruption, ManifoldCFException;
> {code}
> ... where ITransformationActivities would include a method that would
> send a RepositoryDocument object onward to either the output connection or
> to the next pipeline connection.
> Each pipeline connection would have:
> - A name
> - A description
> - Configuration data
> - An optional prerequisite pipeline connection
> Every output connection would have a new field, which is an optional
> prerequisite pipeline connection.
> This design is based loosely on how mapping connections and authority
> connections interrelate.  An alternate design would involve having per-job
> specification information, but I think this would wind up being way too
> complex for very little benefit, since each pipeline connection/stage would
> be expected to do relatively simple/granular things, not usually involving
> interaction with an external system.
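
As a rough illustration of the "send onward" behaviour described in the
ticket, the sketch below chains stages so that each one hands its result to
the next stage, and the last hands it to the output.  It is not the
ManifoldCF API: Doc, Activities, Stage and chain are simplified,
hypothetical stand-ins for RepositoryDocument, ITransformationActivities
and the pipeline connector, and the checked exceptions are omitted.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PipelineChainSketch {
    /** Simplified stand-in for RepositoryDocument: just a bag of named fields. */
    public static final class Doc {
        public final Map<String, String> fields = new LinkedHashMap<>();
    }

    /** Stand-in for ITransformationActivities: its only job here is to send a document onward. */
    public interface Activities {
        void sendDocument(Doc doc);
    }

    /** Stand-in for a pipeline connector's single important method. */
    public interface Stage {
        void transformDocument(Doc doc, Activities activities);
    }

    /**
     * Wires the stages back to front, so that each stage's Activities forwards
     * to the following stage and the final stage forwards to the output.
     * The returned Activities is the entry point of the whole pipeline.
     */
    public static Activities chain(List<Stage> stages, Activities output) {
        Activities next = output;
        for (int i = stages.size() - 1; i >= 0; i--) {
            final Stage stage = stages.get(i);
            final Activities downstream = next;
            next = doc -> stage.transformDocument(doc, downstream);
        }
        return next;
    }

    /** Example stage: produces a new document with one extra field, then sends it onward. */
    public static Stage addField(final String name, final String value) {
        return (doc, activities) -> {
            Doc copy = new Doc();
            copy.fields.putAll(doc.fields);
            copy.fields.put(name, value);
            activities.sendDocument(copy);
        };
    }
}
```

Because a stage decides whether to call sendDocument at all, the same shape
would also let a stage drop a document, or emit more than one, without the
framework needing to know.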



--
This message was sent by Atlassian JIRA
(v6.2#6252)
