[ https://issues.apache.org/jira/browse/CONNECTORS-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012336#comment-14012336 ]

Karl Wright commented on CONNECTORS-946:
----------------------------------------

Hmm.  The main problem with the chained approach is how connector instances get 
allocated.  Instead of allocating a handle, doing the transformation, and 
releasing the handle, chaining forces the process to look more like this:

- allocate output connector handle and all transformation connector handles
- call the chain
- release output connector handle and all transformation connector handles

Astute observers will note that the order in which we allocate these resources 
is important to prevent deadlock; it must be done in a canonical fashion, not 
just the order in which the connectors appear in the pipeline.  It's also the 
case that if any one connector handle is unavailable, the whole operation must 
be backed off and retried.
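A minimal sketch of that allocation discipline (all names here are hypothetical, 
not actual ManifoldCF API): handles are acquired in a canonical sorted order 
rather than pipeline order, and if any one handle is busy, everything acquired 
so far is released so the caller can back off and retry.

```java
import java.util.*;
import java.util.concurrent.locks.ReentrantLock;

public class HandleAllocator {
    private final Map<String, ReentrantLock> handles = new HashMap<>();

    synchronized ReentrantLock handleFor(String connectorName) {
        return handles.computeIfAbsent(connectorName, k -> new ReentrantLock());
    }

    // Returns the acquired handles, or null if any was busy
    // (meaning the caller must back off and retry the whole chain).
    public List<ReentrantLock> tryAcquireAll(Collection<String> connectorNames) {
        List<String> ordered = new ArrayList<>(connectorNames);
        Collections.sort(ordered);           // canonical order, not pipeline order
        List<ReentrantLock> acquired = new ArrayList<>();
        for (String name : ordered) {
            ReentrantLock lock = handleFor(name);
            if (!lock.tryLock()) {           // one handle unavailable:
                releaseAll(acquired);        // release everything acquired so far
                return null;                 // caller backs off and retries
            }
            acquired.add(lock);
        }
        return acquired;
    }

    public void releaseAll(List<ReentrantLock> acquired) {
        for (ReentrantLock lock : acquired)
            lock.unlock();
    }
}
```

Because every caller sorts the names the same way, two concurrent pipelines can 
never hold each other's handles in opposite orders, which is what rules out the 
deadlock.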

So, what are the alternatives to chaining?  Well, one prime reason chaining 
seems needed is because RepositoryDocument objects have state.  They don't just 
contain fields; they also contain non-resettable streams.  In a chained model, 
where a transformer explicitly calls the next stage in the pipeline, this is 
fine, but in a model where a RepositoryDocument object is just modified by each 
link in the chain, it's not.  The streams would likely be read by one stage and 
then be lost to the rest.  Even a transformer method that could potentially return a new 
RepositoryDocument as the result of transformation would wind up producing 
objects whose lifetime and cleanup were unclear, since the RepositoryDocument 
contract requires the RepositoryDocument creator to make sure streams are 
closed after the RepositoryDocument is used.
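The stream problem is easy to demonstrate in miniature.  A sketch using plain 
java.io (not actual ManifoldCF classes) of what happens when two stages share 
one non-resettable stream:

```java
import java.io.*;

public class StreamStateDemo {

    // Simulates one pipeline stage draining the document's binary stream.
    static int consume(InputStream stream) throws IOException {
        byte[] buffer = new byte[1024];
        int total = 0;
        int n;
        while ((n = stream.read(buffer)) != -1)
            total += n;
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] body = "document body".getBytes("UTF-8");
        InputStream stream = new ByteArrayInputStream(body);

        int firstStage = consume(stream);   // first transformer reads everything
        int secondStage = consume(stream);  // next stage finds the stream exhausted

        System.out.println(firstStage + " then " + secondStage); // 13 then 0
    }
}
```

Whichever stage reads first gets all the bytes; every later stage gets nothing, 
which is why modify-in-place doesn't work for the streamed parts of a 
RepositoryDocument.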

Simply put, it looks like there's no real choice in the matter...
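For concreteness, a toy sketch of the chained model (hypothetical class and 
method names, loosely following the transformDocument signature quoted in the 
issue description below): each transformer builds a new document and explicitly 
pushes it to the next stage, so the creator of each stream stays responsible 
for closing it.

```java
import java.io.*;

interface ITransformationActivities {
    // Sends a document onward to the next pipeline stage or the output.
    void sendDocument(RepositoryDocument rd) throws IOException;
}

class RepositoryDocument {
    private final InputStream binary;   // non-resettable stream
    RepositoryDocument(InputStream binary) { this.binary = binary; }
    InputStream getBinaryStream() { return binary; }
}

class UppercasingTransformer {
    // Chained model: read the incoming stream, build a NEW document,
    // hand it onward, and close the stream we created when the next
    // stage is done with it.
    public void transformDocument(RepositoryDocument rd,
                                  ITransformationActivities activities)
            throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int c;
        while ((c = rd.getBinaryStream().read()) != -1)
            out.write(Character.toUpperCase(c));
        try (InputStream transformed =
                 new ByteArrayInputStream(out.toByteArray())) {
            activities.sendDocument(new RepositoryDocument(transformed));
        }
    }
}
```

The key property is that stream lifetime and cleanup are unambiguous: the stage 
that created a stream closes it once the downstream call returns.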

> Add support for pipeline connector
> ----------------------------------
>
>                 Key: CONNECTORS-946
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-946
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Framework crawler agent
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.7
>
>
> In the Amazon Search Connector, we finally found an example of an output 
> connector that needed to do full document processing in order to work.  This 
> ticket represents work in the framework to create a concept of "pipeline 
> connector".  Pipeline connections would receive RepositoryDocument objects, 
> and transform them to new RepositoryDocument objects.  There would be a 
> single important method:
> {code}
> public void transformDocument(RepositoryDocument rd, 
> ITransformationActivities activities) throws ServiceInterruption, 
> ManifoldCFException;
> {code}
> ... where ITransformationActivities would include a method that would send a 
> RepositoryDocument object onward to either the output connection or to the 
> next pipeline connection.
> Each pipeline connection would have:
> - A name
> - A description
> - Configuration data
> - An optional prerequisite pipeline connection
> Every output connection would have a new field, which is an optional 
> prerequisite pipeline connection.
> This design is based loosely on how mapping connections and authority 
> connections interrelate.  An alternate design would involve having per-job 
> specification information, but I think this would wind up being way too 
> complex for very little benefit, since each pipeline connection/stage would 
> be expected to do relatively simple/granular things, not usually involving 
> interaction with an external system.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
