[jira] [Commented] (CONNECTORS-567) Extended seeding interface which provides document versions

Maciej Lizewski (JIRA) Thu, 15 Nov 2012 03:40:19 -0800

    [ 
https://issues.apache.org/jira/browse/CONNECTORS-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13497956#comment-13497956
 ]


Maciej Lizewski commented on CONNECTORS-567:
--------------------------------------------

I would also go with two scenarios to maintain compatibility with the current model.

My point is that there are plenty of cases where listing documents also gives 
you information about their versions: a directory listing gives you the file 
modification time, an SQL query can return the document ID and its version, and 
web interfaces (REST, WebService) often support this scenario: a getObjectsList 
call that gives you document IDs and almost always some document information 
such as modification time, version, owner, etc., plus a separate method for 
fetching the whole document.

Your proposal to have all-in-one is not as good because, as I said earlier, 
common interfaces have separate methods for fetching lists and single documents, 
so you would have to first fetch the list and then fetch the content of every 
document. Another reason is that in the real world documents do not change very 
often, and fetching their content every time is unnecessary overhead.

And last but not least, what I mean by "old enough": when you call 
addSeedDocuments there are several scenarios, but in most cases this method can 
provide new documents, updated documents, and often all other documents that 
still exist. There are still some documents that were deleted, and 
addSeedDocuments mostly will not return them. They are injected into the 
reindexing process from the database of previously indexed documents, and when 
getDocumentVersions returns null for them, they are removed. That is clear, and 
this is what I mainly meant: getDocumentVersions could be used to fetch versions 
for documents that are already in our database but that addSeedDocuments did not 
return (either because they were deleted, or because they were not modified 
and addSeedDocuments returns only new and modified documents).

So I was thinking of this (re)indexing process:
1. mark all already indexed documents for re-indexing
2. call addSeedDocuments, which may or may not provide versions for documents
3. call getDocumentVersions for all documents that were not added by 
addSeedDocuments with a version (this means it should also be called for 
documents added by addSeedDocuments without a version; this is the backward 
compatibility)
4. call processDocuments as usual.
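
The four steps above can be sketched roughly as follows. This is only an 
illustration of the proposed flow, not ManifoldCF code: SketchConnector, 
ReindexPlan and ReindexSketch are hypothetical names, versions are assumed to 
be plain strings, and step 4 assumes a document only needs processing when its 
version differs from the stored one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a repository connector; not the ManifoldCF API.
interface SketchConnector {
    // Seeding step: document id -> version. A null version means the
    // listing did not supply one (the backward-compatible case).
    Map<String, String> seed();

    // Version lookup for one document; null means it no longer exists.
    String fetchVersion(String id);
}

// Outcome of one pass, kept as plain lists so it is easy to inspect.
final class ReindexPlan {
    final List<String> versionLookups = new ArrayList<>();
    final List<String> toProcess = new ArrayList<>();
    final List<String> toDelete = new ArrayList<>();
}

final class ReindexSketch {
    // 'indexed' maps each previously indexed id to its stored version.
    static ReindexPlan buildPlan(Map<String, String> indexed, SketchConnector connector) {
        ReindexPlan plan = new ReindexPlan();

        // Step 1: mark every already indexed document for re-indexing.
        Map<String, String> candidates = new LinkedHashMap<>();
        for (String id : indexed.keySet()) {
            candidates.put(id, null);
        }

        // Step 2: seeding may add documents and may attach versions.
        candidates.putAll(connector.seed());

        for (Map.Entry<String, String> entry : candidates.entrySet()) {
            String id = entry.getKey();
            String version = entry.getValue();

            // Step 3: look up a version only when seeding gave none.
            if (version == null) {
                plan.versionLookups.add(id);
                version = connector.fetchVersion(id);
                if (version == null) {
                    // No version at all: the document is gone.
                    if (indexed.containsKey(id)) {
                        plan.toDelete.add(id);
                    }
                    continue;
                }
            }

            // Step 4 (assumed rule): process only when the version differs.
            if (!version.equals(indexed.get(id))) {
                plan.toProcess.add(id);
            }
        }
        return plan;
    }

    // Made-up data: "a" changed, "b" unchanged, "c" deleted, "d" new.
    static ReindexPlan demo() {
        Map<String, String> indexed = new LinkedHashMap<>();
        indexed.put("a", "v1");
        indexed.put("b", "v1");
        indexed.put("c", "v1");

        SketchConnector connector = new SketchConnector() {
            @Override
            public Map<String, String> seed() {
                Map<String, String> seeds = new LinkedHashMap<>();
                seeds.put("a", "v2");  // modified, version came with the listing
                seeds.put("d", null);  // new, listing gave no version
                return seeds;
            }

            @Override
            public String fetchVersion(String id) {
                return "c".equals(id) ? null : "v1";  // "c" was deleted
            }
        };
        return buildPlan(indexed, connector);
    }
}
```

In the demo, only "b", "c" and "d" cost a version lookup; "a" skips it because 
seeding already supplied its version, which is the saving the proposal is after.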

Now, if addSeedDocuments does not provide versions at all, this process is 
pretty much the same as how it works now. If addSeedDocuments provides versions 
for some (or all) documents, those are excluded from the calls to 
getDocumentVersions.

From the connector side the difference could be just in calling an overloaded 
ISeedingActivity::addSeedDocument method with a second argument:
addSeedDocument(idValue) or addSeedDocument(idValue, version)
Of course I understand it means much more hidden work on the other side of this 
interface :)
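
A minimal sketch of how such an overload might look from both sides. 
VersionedSeedingActivity and RecordingSeedingActivity are hypothetical names 
made up for illustration; this is not the real ISeedingActivity interface, 
whose methods and exceptions are not reproduced here. The one-argument form 
simply delegates with a null version, so existing connectors keep working.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical interface modeled on the proposal; NOT the real ISeedingActivity.
interface VersionedSeedingActivity {
    // Proposed overload: the connector passes the version it already has.
    void addSeedDocument(String idValue, String version);

    // Existing one-argument form: no version known at seeding time.
    default void addSeedDocument(String idValue) {
        addSeedDocument(idValue, null);
    }
}

// Framework-side recorder: remembers which seeds arrived with a version.
final class RecordingSeedingActivity implements VersionedSeedingActivity {
    private final Map<String, String> seeds = new LinkedHashMap<>();

    @Override
    public void addSeedDocument(String idValue, String version) {
        seeds.put(idValue, version);
    }

    // True for ids seeded without a version and for ids never seeded at all;
    // both would still need a getDocumentVersions-style lookup.
    boolean needsVersionLookup(String idValue) {
        return seeds.get(idValue) == null;
    }
}

final class SeedingDemo {
    static RecordingSeedingActivity run() {
        RecordingSeedingActivity activity = new RecordingSeedingActivity();
        // Old-style connector call: id only.
        activity.addSeedDocument("doc-1");
        // New-style call: the listing already gave a modification time.
        activity.addSeedDocument("doc-2", "2012-11-15T03:40:19");
        return activity;
    }
}
```

Here "doc-1" would still go through the version lookup, while "doc-2" would be 
excluded from it.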

What do you think about it?
                
> Extended seeding interface which provides document versions
> -----------------------------------------------------------
>
>                 Key: CONNECTORS-567
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-567
>             Project: ManifoldCF
>          Issue Type: Wish
>            Reporter: Maciej Lizewski
>
> There are some cases when the seeding function can provide the document 
> version from data it already has.
> The current data flow needs one call to addSeedDocuments, then a call to 
> getDocumentVersions, which essentially must fetch the same data, and after 
> that one more call to processDocuments. The last one probably needs a 
> separate call because it needs to fetch the document body; however, seeding 
> and getting versions in many cases work on the very same data (and probably 
> duplicate requests to the repository).
> Reducing the number of requests to the repository by eliminating the 
> getDocumentVersions call for documents whose version was returned by 
> addSeedDocuments could significantly reduce load.
> getDocumentVersions would still be called for older documents (not returned 
> by addSeedDocuments) to check whether they were modified or deleted.
> This is only a proposal...

