[
https://issues.apache.org/jira/browse/CONNECTORS-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13497916#comment-13497916
]
Karl Wright commented on CONNECTORS-567:
----------------------------------------
There are a number of connectors that need to do version checks across many
threads, not just one, which is why I originally designed the connector
interface the way I did.
I could imagine supporting both models, however. The IxxxActivity interfaces
were invented to allow the crawling model to be extended without breaking
existing connectors. All you would have to do (in theory) to support something
like what you are talking about would be to add a new ISeedingActivity method
that would record not only a document's discovery, but also its version
information.
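As a rough sketch of what such an extension might look like: the interface and class names below are hypothetical stand-ins, not the actual ManifoldCF interfaces, and the in-memory recorder exists only to illustrate the idea. The existing one-argument method stays in place, so connectors that only report discovery would not need to change.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the existing seeding activity: records discovery only.
interface SeedingActivitySketch {
    void addSeedDocument(String documentIdentifier);
}

// Hypothetical extension: records discovery plus a version known at seeding time.
interface VersionedSeedingActivitySketch extends SeedingActivitySketch {
    void addSeedDocument(String documentIdentifier, String versionString);
}

// In-memory recorder, used only to illustrate what the framework would store.
class RecordingSeedingActivity implements VersionedSeedingActivitySketch {
    private final Set<String> discovered = new LinkedHashSet<>();
    private final Map<String, String> seededVersions = new HashMap<>();

    @Override
    public void addSeedDocument(String documentIdentifier) {
        // Old-style call: the document is known, its version is not.
        discovered.add(documentIdentifier);
    }

    @Override
    public void addSeedDocument(String documentIdentifier, String versionString) {
        // New-style call: remember the version alongside the discovery.
        discovered.add(documentIdentifier);
        seededVersions.put(documentIdentifier, versionString);
    }

    Set<String> discovered() { return discovered; }

    boolean hasSeededVersion(String documentIdentifier) {
        return seededVersions.containsKey(documentIdentifier);
    }

    String seededVersion(String documentIdentifier) {
        return seededVersions.get(documentIdentifier);
    }
}
```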
However, this is not a trivial change internally, because the flow at the
moment obtains the version information in the same worker thread that would
go on to process the document if the version indicated processing was needed.
So dispatch to the worker thread will already have taken place either way, and
the only real difference would be that we'd somehow decide it was unnecessary
to call getDocumentVersions() for certain documents. But you'd still need to
support getDocumentVersions() for older documents, as you point out, so I'm
having a bit of a hard time figuring out exactly when a document would be "old
enough" to need a getDocumentVersions() call.
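To make the difficulty concrete, here is a minimal sketch of the decision the worker thread would have to make. It assumes one possible rule, which is only an assumption for illustration: a version recorded during seeding is trusted, and anything else falls back to a fetch. The class, its fields, and the fetcher function (standing in for a getDocumentVersions() call) are all hypothetical and do not reflect the actual framework internals.

```java
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;

// Hypothetical model of the per-document check made inside a worker thread.
class WorkerVersionCheckSketch {
    private final Map<String, String> seededVersions;      // versions recorded at seeding time
    private final Map<String, String> lastIndexedVersions; // versions seen when last processed
    private final Function<String, String> versionFetcher; // stand-in for getDocumentVersions()
    private int fetchCalls = 0;

    WorkerVersionCheckSketch(Map<String, String> seededVersions,
                             Map<String, String> lastIndexedVersions,
                             Function<String, String> versionFetcher) {
        this.seededVersions = seededVersions;
        this.lastIndexedVersions = lastIndexedVersions;
        this.versionFetcher = versionFetcher;
    }

    // Returns true when the document's current version differs from the indexed one.
    boolean needsProcessing(String documentIdentifier) {
        String current = seededVersions.get(documentIdentifier);
        if (current == null) {
            // No version came from seeding, so the repository must still be asked.
            fetchCalls++;
            current = versionFetcher.apply(documentIdentifier);
        }
        return !Objects.equals(current, lastIndexedVersions.get(documentIdentifier));
    }

    int fetchCalls() { return fetchCalls; }
}
```

Note that the dispatch to the worker has already happened by the time needsProcessing() runs; the only saving is the skipped fetch.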
A much easier model would be to support an all-in-one approach, which might be
appropriate for something like JDBC. In that model the seeding query returns
everything, and getDocumentVersions() and processDocuments() do nothing.
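A minimal sketch of that all-in-one shape follows. Everything in it is hypothetical: the row and sink types, the method signatures, and the in-memory list standing in for the result of a seeding query. It is meant only to show where the work would move, not how a real connector is written.

```java
import java.util.List;

// Hypothetical row returned by a seeding query: identifier, version, and content together.
class RowSketch {
    final String id;
    final String version;
    final String body;

    RowSketch(String id, String version, String body) {
        this.id = id;
        this.version = version;
        this.body = body;
    }
}

// Hypothetical destination for fully described documents.
interface IngestSinkSketch {
    void ingest(String id, String version, String body);
}

// Hypothetical connector in which seeding does all the work.
class AllInOneConnectorSketch {
    private final List<RowSketch> queryResult; // stand-in for the rows a seeding query returns

    AllInOneConnectorSketch(List<RowSketch> queryResult) {
        this.queryResult = queryResult;
    }

    // Seeding hands over every document complete, so no later lookups are needed.
    void addSeedDocuments(IngestSinkSketch sink) {
        for (RowSketch row : queryResult) {
            sink.ingest(row.id, row.version, row.body);
        }
    }

    // Nothing to look up: versions were already delivered during seeding.
    String[] getDocumentVersions(String[] documentIdentifiers) {
        return new String[0];
    }

    // Nothing to fetch: bodies were already delivered during seeding.
    void processDocuments(String[] documentIdentifiers) {
    }
}
```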
It may be worth reading ManifoldCF in Action, especially the parts about
crawling models, since that may help inform your thoughts a bit.
> Extended seeding interface which provides document versions
> -----------------------------------------------------------
>
> Key: CONNECTORS-567
> URL: https://issues.apache.org/jira/browse/CONNECTORS-567
> Project: ManifoldCF
> Issue Type: Wish
> Reporter: Maciej Lizewski
>
> There are some cases where the seeding function can provide the document
> version from data it already has.
> The current data flow needs one call to addSeedDocuments, then a call to
> getDocumentVersions, which essentially must fetch the same data, and after
> that one more call to processDocuments. The last one probably needs a
> separate call because it has to fetch the document body; however, seeding and
> getting versions in many cases work on the very same data (and probably
> duplicate requests to the repository).
> Reducing the number of requests to the repository by eliminating the
> getDocumentVersions call for documents whose version was returned by
> addSeedDocuments could significantly reduce load.
> getDocumentVersions would still be called for older documents (not returned
> by addSeedDocuments) to check whether they were modified or deleted.
> This is only a proposal...