Karl Wright created CONNECTORS-989:
--------------------------------------

             Summary: Support virtual child document model
                 Key: CONNECTORS-989
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-989
             Project: ManifoldCF
          Issue Type: Improvement
          Components: Framework agents process
    Affects Versions: ManifoldCF 1.7
            Reporter: Karl Wright
            Assignee: Karl Wright


In some cases, documents that are indexed may be virtual children of those that 
are queued.  A good example of this is RSS feeds where the data being indexed 
all comes from the feed.

In order to implement this, the following changes would be required:

(1) IProcessActivity.ingestDocument() has a variant which allows you to include 
a virtual child document identifier in addition to the main document identifier.
(2) IIncrementalIngester's addOrReplaceDocument receives TWO document keys -- 
one for main (queued) document identifier, one for child virtual document 
identifier.
(3) IIncrementalIngester has two new methods: beginDocument() and 
endDocument(), both of which take a main (queued) document identifier as an 
argument.
(4) ingeststatus table has two additional columns: a state, and a child key.
(5) The flow is: at beginDocument() time, put all records relating to a 
document into a "processing" state.  Documents that are seen have their state 
changed.  Documents never encountered are deleted at the end.
(6) Incremental decisions not to update an output record STILL will require 
that the record be touched and its state set.
(7) DocumentIngest records for the entire set of children will be fetched when 
the document is queued.
(8) The getDocumentVersions() method must be modified to allow return of 
version strings for all children, although there can be "shortcuts" as well 
(where a single version string applies to all children).
(9) The decision about whether to refetch a document is based on the returned 
version strings and on those fetched by the stuffer thread.
(10) Similarly, processDocuments() receives version strings for all virtual 
children.
(11) There is no need to actively reset the state of document records on 
restart; the current logic should be robust enough to be able to generate the 
required deletions.
(12) Deleting a document deletes ALL child virtual documents.  This happens 
within the incremental ingester.
(13) Requeuing interval must be computed across all children, taking the 
minimum, since there's no requirement that an ingeststatus record exist for the 
parent.
(14) All other logic, including making sure only one agent operates on a URL at 
a time, is the same.
(15) Interrupting the delete phase is safe because next time the doc is 
processed the records will be removed.
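
The flow in (5), (6), (12) and (15) can be sketched roughly as follows.  This is 
only an in-memory illustration, not ManifoldCF code: the class name, the State 
enum, the map standing in for the ingeststatus table, and the simplified method 
signatures are all assumptions made for the example.  It also assumes that the 
deletion "at the end" happens in endDocument().

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the incremental ingester plus ingeststatus table.
final class VirtualChildSketch {
    // Assumed values for the proposed "state" column.
    enum State { PROCESSING, SEEN }

    // parent (queued) document id -> child key -> state
    private final Map<String, Map<String, State>> rows = new HashMap<>();

    // (5) Put every existing record for this parent into the "processing" state.
    void beginDocument(String parentId) {
        Map<String, State> children = rows.get(parentId);
        if (children != null) {
            children.replaceAll((childKey, state) -> State.PROCESSING);
        }
    }

    // (2) Two keys: the queued parent and the virtual child.  Indexing is
    // omitted here; only the bookkeeping is shown.
    void addOrReplaceDocument(String parentId, String childKey) {
        touch(parentId, childKey);
    }

    // (6) Even when the output record is left unchanged, the row must still
    // be touched so that it is not treated as "never encountered".
    void skipUnchangedDocument(String parentId, String childKey) {
        touch(parentId, childKey);
    }

    private void touch(String parentId, String childKey) {
        rows.computeIfAbsent(parentId, k -> new HashMap<>())
            .put(childKey, State.SEEN);
    }

    // (5) Delete children still in "processing".  (15) If this is
    // interrupted, the next begin/end cycle removes whatever is left.
    List<String> endDocument(String parentId) {
        List<String> deleted = new ArrayList<>();
        Map<String, State> children = rows.get(parentId);
        if (children == null) {
            return deleted;
        }
        Iterator<Map.Entry<String, State>> it = children.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, State> entry = it.next();
            if (entry.getValue() == State.PROCESSING) {
                deleted.add(entry.getKey());
                it.remove();
            }
        }
        return deleted;
    }

    // (12) Deleting the parent removes all of its virtual children.
    int deleteDocument(String parentId) {
        Map<String, State> removed = rows.remove(parentId);
        return removed == null ? 0 : removed.size();
    }

    int childCount(String parentId) {
        Map<String, State> children = rows.get(parentId);
        return children == null ? 0 : children.size();
    }
}
```

For an RSS feed with items "a" and "b" on the first crawl and only "a" on the 
second, the second endDocument() call would report "b" as deleted, without the 
connector having to call a delete method itself.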

Analysis:
- The critical thing is making the non-virtual case no worse.
- For a virtual child document, instead of one db access, there are two.
- For document records that are not changed, there are two additional writes 
that were not needed before.
- There's an additional index (or the document key index has another subfield).
- If the queries can be written so as to treat the standard (no child 
document) case specially, we may be able to avoid much of the impact: only two 
index queries per document, each returning zero rows.
- If we handle the standard case using the same mechanism, the WorkerThread 
logic dealing with deletions can go away.

Summary:
- Additional database overhead in the non-virtual indexing case consists of one 
additional write and one additional zero-row query, OR two additional zero-row 
queries.
- Additional database overhead in the non-virtual skip case consists of two 
additional writes, OR two additional zero-row queries.
- The overhead is low but not negligible, and it will impact overall framework 
performance.
- The up-sides are as follows: (a) handling an important but infrequent case 
better; (b) less connector involvement in indexing (e.g., 
IProcessActivity.deleteDocument() does nothing now, and can be deprecated).




--
This message was sent by Atlassian JIRA
(v6.2#6252)
