[jira] [Commented] (CONNECTORS-567) Extended seeding interface which provides document versions

Karl Wright (JIRA) Thu, 15 Nov 2012 02:36:20 -0800

    [ 
https://issues.apache.org/jira/browse/CONNECTORS-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13497916#comment-13497916
 ]


Karl Wright commented on CONNECTORS-567:
----------------------------------------

There are a number of connectors that need to do version checks across many 
threads, not just one, which is why I originally designed the connector 
interface the way I did.

I could imagine supporting both models, however.  The IxxxActivity interfaces 
were invented to allow the crawling model to be extended without breaking 
existing connectors.  All you would have to do (in theory) to support something 
like what you are talking about would be to add a new ISeedingActivity method 
that would record not only a document's discovery, but also its version 
information.
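
A minimal sketch of what such an addition might look like. The names here 
(SeedingActivitySketch, RecordingSeedingActivity, needsVersionFetch) are 
hypothetical stand-ins for illustration and are not the real ISeedingActivity 
API. The assumption is that the framework implements the activity interface 
and connectors only call it, so an added overload leaves older connectors 
untouched.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a seeding activity interface (NOT the real ManifoldCF one).
interface SeedingActivitySketch {
    // Existing-style call: record only that the document was discovered.
    void addSeedDocument(String documentIdentifier);

    // Assumed new overload: record discovery together with a version string.
    void addSeedDocument(String documentIdentifier, String versionString);
}

// Framework-side implementation that remembers which seeds arrived with a version.
class RecordingSeedingActivity implements SeedingActivitySketch {
    // A null value means "discovered, but version unknown".
    private final Map<String, String> seeded = new LinkedHashMap<>();

    @Override
    public void addSeedDocument(String documentIdentifier) {
        // Do not overwrite a version recorded earlier for the same document.
        if (!seeded.containsKey(documentIdentifier)) {
            seeded.put(documentIdentifier, null);
        }
    }

    @Override
    public void addSeedDocument(String documentIdentifier, String versionString) {
        seeded.put(documentIdentifier, versionString);
    }

    // True when no version was supplied at seeding time, including for
    // documents that were not seeded in this pass at all.
    boolean needsVersionFetch(String documentIdentifier) {
        return seeded.get(documentIdentifier) == null;
    }
}
```

In this sketch an older connector keeps calling the one-argument method, and 
only documents without a recorded version would still need a separate version 
lookup.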

However, this is not a trivial change internally, because the flow at the 
moment involves obtaining the version information in the same worker thread 
that would process the document if its version indicated that was needed.  
So dispatch to the worker thread will have already taken place either way, and 
the only real difference would be that somehow we'd decide it was unnecessary 
to call getDocumentVersions() for certain documents.  But you'd still need to 
support getDocumentVersions() for older documents, as you point out, so I'm 
having a bit of a hard time figuring out exactly when a document would be "old 
enough" to call getDocumentVersions().

A much easier model would be to support an all-in-one approach, which might be 
appropriate for something like JDBC.  In that model the seeding query returns 
everything, and getDocumentVersions() and processDocuments() do nothing.
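
As a rough illustration of the all-in-one model, here is a sketch under 
stated assumptions. SeededRow, AllInOneSink and AllInOneConnectorSketch are 
hypothetical names, the in-memory list stands in for a JDBC result set, and 
the method signatures are simplified and do not match the real connector 
interface.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical row returned by a single seeding query: id, version and content together.
final class SeededRow {
    final String id;
    final String version;
    final String body;

    SeededRow(String id, String version, String body) {
        this.id = id;
        this.version = version;
        this.body = body;
    }
}

// Hypothetical sink that receives fully described documents during seeding.
interface AllInOneSink {
    void ingest(String id, String version, String body);
}

class AllInOneConnectorSketch {
    private final List<SeededRow> queryResult;

    AllInOneConnectorSketch(List<SeededRow> queryResult) {
        // Defensive copy; the list stands in for the rows of one seeding query.
        this.queryResult = new ArrayList<>(queryResult);
    }

    // Seeding hands over everything at once and returns the number of documents.
    int seedAndIngest(AllInOneSink sink) {
        for (SeededRow row : queryResult) {
            sink.ingest(row.id, row.version, row.body);
        }
        return queryResult.size();
    }

    // Nothing to report: versions were already supplied during seeding.
    String[] getDocumentVersions(String[] documentIdentifiers) {
        return new String[documentIdentifiers.length];
    }

    // Intentionally empty: content was already delivered during seeding.
    void processDocuments(String[] documentIdentifiers, String[] versions) {
    }
}
```

The trade-off in this sketch is that all repository work happens in the 
seeding call, so nothing is spread across worker threads.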

It may be worth reading ManifoldCF in Action, especially the parts about 
crawling models, since that may help inform your thoughts a bit.

                
> Extended seeding interface which provides document versions
> -----------------------------------------------------------
>
>                 Key: CONNECTORS-567
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-567
>             Project: ManifoldCF
>          Issue Type: Wish
>            Reporter: Maciej Lizewski
>
> There are some cases where the seeding function can provide a document's 
> version from data it already has.
> The current data flow needs one call to addSeedDocuments, then a call to 
> getDocumentVersions, which essentially must fetch the same data, and after 
> that one more call to processDocuments. The last one probably needs a 
> separate call because it needs to fetch the document body; however, seeding 
> and getting versions in many cases work on the very same data (and probably 
> duplicate requests to the repository).
> Reducing the number of requests to the repository by eliminating the 
> getDocumentVersions call for documents whose version was returned by 
> addSeedDocuments could significantly reduce load.
> getDocumentVersions would still be called for older documents (not returned 
> by addSeedDocuments) to check whether they were modified or deleted.
> This is only a proposal...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
