[ 
https://issues.apache.org/jira/browse/CONNECTORS-989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-989.
------------------------------------

       Resolution: Fixed
    Fix Version/s: ManifoldCF 1.7

r1612102

> Support virtual child document model
> ------------------------------------
>
>                 Key: CONNECTORS-989
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-989
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework agents process
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.7
>
>
> In some cases, the documents that are indexed may be virtual children of the 
> documents that are queued.  A good example is an RSS feed, where all of the 
> data being indexed comes from the feed itself.
> In order to implement this, the following changes would be required:
> (1) IProcessActivity.ingestDocument() has a variant which allows you to 
> include a virtual child document identifier in addition to the main document 
> identifier.
> (2) IIncrementalIngester's addOrReplaceDocument receives TWO document keys -- 
> one for main (queued) document identifier, one for child virtual document 
> identifier.
> (3) IIncrementalIngester has two new methods: beginDocument() and 
> endDocument(), both of which take a main (queued) document identifier as an 
> argument.
> (4) ingeststatus table has two additional columns: a state, and a child key.
> (5) The flow is: at beginDocument() time, put all records relating to a 
> document into a "processing" state.  Documents that are seen have their state 
> changed.  Documents never encountered are deleted at the end.
> (6) Incremental decisions not to update an output record STILL will require 
> that the record be touched and its state set.
> (7) DocumentIngest records for the entire set of children will be fetched 
> when the document is queued.
> (8) The getDocumentVersions() method must be modified to allow return of 
> version strings for all children, although there can be "shortcuts" as well 
> (where a single version string applies to all children).
> (9) The decision about whether to refetch a document is based on the returned 
> version strings and on those fetched by the stuffer thread.
> (10) Similarly, processDocuments() receives version strings for all virtual 
> children.
> (11) There is no need to actively reset the state of document records on 
> restart; the current logic should be robust enough to be able to generate the 
> required deletions.
> (12) Deleting a document deletes ALL child virtual documents.  This happens 
> within the incremental ingester.
> (13) Requeuing interval must be computed across all children, taking the 
> minimum, since there's no requirement that an ingeststatus record exist for 
> the parent.
> (14) All other logic, including making sure only one agent operates on a URL 
> at a time, is the same.
> (15) Interrupting the delete phase is safe because the records will be 
> removed the next time the document is processed.
> Analysis:
> - The critical thing is making the non-virtual case no worse.
> - For a virtual child document, instead of one db access, there are two.
> - For document records that are not changed, there are two additional writes 
> that were not needed before.
> - There's an additional index (or the document key index has another 
> subfield).
> - If the queries can be written in such a way as to treat the standard 
> (no child document) case specially, we may be able to avoid much of the 
> impact; the cost would be only two index queries per document, each 
> returning zero rows.
> - If we handle the standard case using the same mechanism, the WorkerThread 
> logic dealing with deletions can go away.
> Summary:
> - Additional database overhead in the non-virtual indexing case consists of 
> one additional write and one additional zero-row query, OR two additional 
> zero-row queries.
> - Additional database overhead in the non-virtual skip case consists of two 
> additional writes, OR two additional zero-row queries.
> - The overhead is low but still significant, and it will impact overall 
> framework performance.
> - The up-sides are as follows: (a) handling an important but infrequent case 
> better; (b) less connector involvement in indexing (e.g., 
> IProcessActivity.deleteDocument() does nothing now, and can be deprecated).
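
The flow in steps (5), (6), and (12) above can be sketched as a small in-memory model. This is only an illustration under stated assumptions, not ManifoldCF code: the class VirtualChildTracker, the State enum, and the touchChild() method are hypothetical names, and a map stands in for the ingeststatus table with its proposed state and child key columns. Only the names beginDocument(), endDocument(), and deleteDocument() are borrowed from the description.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for the ingeststatus table; not ManifoldCF code. */
public class VirtualChildTracker {
    /** Assumed values for the proposed "state" column. */
    enum State { NORMAL, PROCESSING }

    // parent (queued) document id -> child key -> state
    private final Map<String, Map<String, State>> records = new HashMap<>();

    /** Step 5: put every existing record of the parent into the processing state. */
    public void beginDocument(String parentId) {
        Map<String, State> children = records.get(parentId);
        if (children != null) {
            children.replaceAll((childKey, state) -> State.PROCESSING);
        }
    }

    /** Step 6: a child that is indexed OR skipped is still touched and its state set. */
    public void touchChild(String parentId, String childKey) {
        records.computeIfAbsent(parentId, k -> new TreeMap<>()).put(childKey, State.NORMAL);
    }

    /** Step 5: children never seen since beginDocument() are deleted at the end. */
    public List<String> endDocument(String parentId) {
        List<String> removed = new ArrayList<>();
        Map<String, State> children = records.get(parentId);
        if (children == null) {
            return removed;
        }
        Iterator<Map.Entry<String, State>> it = children.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, State> entry = it.next();
            if (entry.getValue() == State.PROCESSING) {
                removed.add(entry.getKey());
                it.remove();
            }
        }
        if (children.isEmpty()) {
            records.remove(parentId);
        }
        return removed;
    }

    /** Step 12: deleting the parent deletes all of its virtual children. */
    public List<String> deleteDocument(String parentId) {
        Map<String, State> children = records.remove(parentId);
        return children == null ? new ArrayList<>() : new ArrayList<>(children.keySet());
    }

    public static void main(String[] args) {
        VirtualChildTracker tracker = new VirtualChildTracker();

        // First crawl of the feed sees two items.
        tracker.beginDocument("feed");
        tracker.touchChild("feed", "item-1");
        tracker.touchChild("feed", "item-2");
        tracker.endDocument("feed");

        // Second crawl sees only item-2, so item-1 is swept.
        tracker.beginDocument("feed");
        tracker.touchChild("feed", "item-2");
        System.out.println(tracker.endDocument("feed"));    // prints [item-1]
        System.out.println(tracker.deleteDocument("feed")); // prints [item-2]
    }
}
```

In this model, skipping the endDocument() call (an interrupted delete phase, step 15) leaves stale records behind, and the next beginDocument()/endDocument() pass for the same parent removes them.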



--
This message was sent by Atlassian JIRA
(v6.2#6252)
