[ 
https://issues.apache.org/jira/browse/CONNECTORS-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017277#comment-14017277
 ] 

Karl Wright commented on CONNECTORS-946:
----------------------------------------

In branches/CONNECTORS-946, I've implemented the following:

- Support for new connector type, "transformation"
- Integration of "transformation" connections into jobs, save one issue I'll 
discuss below
- Crawler UI rework to handle all this new stuff, including fixes for a problem 
I found that's described below

What's not yet done is the following:

- IncrementalIngester accepting transformer pipeline
- Job UI: transformer activity integration
- Worker Thread: pass transformer pipeline to IncrementalIngester
- API extensions (new transformers, and changes to jobs)
- Extended import/export of jobs

In addition, I found and fixed a UI problem where the DOCTYPE directive was 
being included in the wrong place, which yielded incorrect XML for pretty much 
every page.  Also, forced metadata changes did *not* reset the seeding time to 
0; that is now fixed as well.

The job integration part that's not done, which may cause considerable pain to 
fix, is how the system handles deregistration of a connector while the system 
is active.  For repository and output connections, all the pertinent jobs are 
notified, and they transition to appropriate job states which prevent them from 
running.  For transformer pipelines, the number of states would grow 
combinatorially as the number of transformers in the pipeline grows.

I am considering replacing this system with one that has just one set of 
special states that would apply when any connector in the pipeline is 
unregistered.  Then, either a new job field that contains a count of 
unregistered connectors associated with the job, or code that does a dynamic 
reassessment of each job's state whenever notification occurs, would perform 
the necessary state transitions.  State transitions would now also have to 
occur at job save time, because saving a job can change its pipeline.
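
As a rough illustration of the count-based alternative, here is a minimal 
sketch. All class, method, and state names in it are hypothetical and are not 
taken from the ManifoldCF code base. It also assumes that register and 
deregister notifications for a given connector class strictly alternate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch only; none of these names are ManifoldCF classes.
final class JobSketch {
    // One pair of states, no matter how long the pipeline is.
    enum Status { RUNNABLE, CONNECTOR_MISSING }

    // Connector classes used by the job: repository, each transformer, output.
    private List<String> connectorClasses;
    // The proposed new job field: how many of those are currently unregistered.
    private int unregisteredCount;

    JobSketch(List<String> connectorClasses, Set<String> registered) {
        savePipeline(connectorClasses, registered);
    }

    // Job save time: the pipeline may have changed, so recount from scratch.
    void savePipeline(List<String> newConnectorClasses, Set<String> registered) {
        connectorClasses = new ArrayList<>(newConnectorClasses);
        unregisteredCount = 0;
        for (String c : connectorClasses) {
            if (!registered.contains(c)) {
                unregisteredCount++;
            }
        }
    }

    // Notification that a connector class was deregistered.
    void onDeregistered(String connectorClass) {
        unregisteredCount += occurrences(connectorClass);
    }

    // Notification that a connector class was registered again.
    void onRegistered(String connectorClass) {
        unregisteredCount -= occurrences(connectorClass);
    }

    // A class may appear more than once, e.g. in two pipeline stages.
    private int occurrences(String connectorClass) {
        int n = 0;
        for (String c : connectorClasses) {
            if (c.equals(connectorClass)) {
                n++;
            }
        }
        return n;
    }

    Status status() {
        return unregisteredCount == 0 ? Status.RUNNABLE : Status.CONNECTOR_MISSING;
    }
}
```

With this shape, the job leaves the special state only when the count returns 
to zero, so the number of states stays constant as the pipeline grows. The 
dynamic-reassessment alternative would amount to calling something like 
savePipeline() on every notification instead of keeping a running count.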



> Add support for pipeline connector
> ----------------------------------
>
>                 Key: CONNECTORS-946
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-946
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Framework crawler agent
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.7
>
>
> In the Amazon Search Connector, we finally found an example of an output 
> connector that needed to do full document processing in order to work.  This 
> ticket represents work in the framework to create a concept of "pipeline 
> connector".  Pipeline connections would receive RepositoryDocument objects, 
> and transform them to new RepositoryDocument objects.  There would be a 
> single important method:
> {code}
> public void transformDocument(RepositoryDocument rd, 
> ITransformationActivities activities) throws ServiceInterruption, 
> ManifoldCFException;
> {code}
> ... where ITransformationActivities would include a method that would send a 
> RepositoryDocument object onward to either the output connection or to the 
> next pipeline connection.
> Each pipeline connection would have:
> - A name
> - A description
> - Configuration data
> - An optional prerequisite pipeline connection
> Every output connection would have a new field, which is an optional 
> prerequisite pipeline connection.
> This design is based loosely on how mapping connections and authority 
> connections interrelate.  An alternate design would involve having per-job 
> specification information, but I think this would wind up being way too 
> complex for very little benefit, since each pipeline connection/stage would 
> be expected to do relatively simple/granular things, not usually involving 
> interaction with an external system.
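
The chaining described in the issue text above can be sketched roughly as 
follows. This is a minimal sketch, not the ManifoldCF API: Doc, Activities, and 
Stage are hypothetical, simplified stand-ins for RepositoryDocument, 
ITransformationActivities, and a pipeline connector, and the checked exceptions 
from the proposed signature are omitted.

```java
import java.util.List;

// Hypothetical sketch only; simplified stand-ins, not ManifoldCF classes.
final class PipelineSketch {
    // Stand-in for RepositoryDocument: just a body string here.
    static final class Doc {
        final String body;
        Doc(String body) { this.body = body; }
    }

    // Stand-in for ITransformationActivities: sends a document onward.
    interface Activities {
        void sendDocument(Doc doc);
    }

    // Stand-in for a pipeline connector's single important method.
    interface Stage {
        void transformDocument(Doc doc, Activities activities);
    }

    // Runs doc through stages[index..]; after the last stage, the document
    // goes to the output. Each stage sees only "send onward", so it does not
    // know whether the next hop is another stage or the output connection.
    static void run(List<Stage> stages, int index, Doc doc, Activities output) {
        if (index == stages.size()) {
            output.sendDocument(doc);
            return;
        }
        stages.get(index).transformDocument(
            doc, next -> run(stages, index + 1, next, output));
    }
}
```

A stage that never calls sendDocument() drops the document, and one that calls 
it several times fans it out; whether the real framework would permit either 
is not stated in the issue.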



--
This message was sent by Atlassian JIRA
(v6.2#6252)
