[ 
https://issues.apache.org/jira/browse/CONNECTORS-1009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098813#comment-14098813
 ] 

Karl Wright commented on CONNECTORS-1009:
-----------------------------------------

Hi Prasad,

Let's look at this one stretch of the standard out dump you made:

{code}
Document ID : workspace://SpacesStore/8e12a887-3fa8-48d6-8516-5bcfad358ba2;3.6 
found version : 3.6:select * from cmis:folder where 
contains('PATH:"/app:company_home/app:shared/cm:content"')
Running checkDocumentNeedsReindexing
Running IncrementalIngester:checkFetchDocument
Running IncrementalIngester:basicSpecification.getOutputCount()
IncrementalIngester:oldDocumentVersion : 3.6:select * from cmis:folder where 
contains('PATH:"/app:company_home/app:shared/cm:content"')
IncrementalIngester:oldDocumentVersion : 3.6:select * from cmis:folder where 
contains('PATH:"/app:company_home/app:shared/cm:content"') newDocumentVersion : 
3.6:select * from cmis:folder where 
contains('PATH:"/app:company_home/app:shared/cm:content"')
IncrementalIngester:oldParameterVersion :       newParameterVersion : 
IncrementalIngester:oldOutputVersion :  newOutputVersion : 
Adding to scanDocuments : 
workspace://SpacesStore/8e12a887-3fa8-48d6-8516-5bcfad358ba2;3.6
Document ID : workspace://SpacesStore/8e12a887-3fa8-48d6-8516-5bcfad358ba2;3.5 
found version : 3.6:select * from cmis:folder where 
contains('PATH:"/app:company_home/app:shared/cm:content"')
{code}

In both cases I can't tell whether it decided to reindex or not.  The version 
strings are identical although the document IDs are different: in one case the 
ID clearly refers to version 3.5, and in the other case it refers to 
version 3.6.  So this is what might be happening:

- the nodeID describes a document revision rather than a single document 
with multiple revisions, as you surmised;
- getting the latest version of a document revision (using "document = 
document.getObjectOfLatestVersion(false);") doesn't do the right thing in this 
case; it maps the current nodeID to the later nodeID.  So when we compute the 
version string, we always compute it as the version label of the 
*current* revision, not the revision described by the ID.
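
To make the suspected behavior concrete, here is a minimal, self-contained sketch.  It is not the connector's code: the class, the map of labels, and both methods are hypothetical stand-ins, and the only things taken from the log are the two node IDs and the "label:query" shape of the version string.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the suspected behavior; NOT the connector's code.
class VersionStringDemo {
    static final String ID_35 =
        "workspace://SpacesStore/8e12a887-3fa8-48d6-8516-5bcfad358ba2;3.5";
    static final String ID_36 =
        "workspace://SpacesStore/8e12a887-3fa8-48d6-8516-5bcfad358ba2;3.6";

    // Revision node ID -> version label (stand-in for the repository).
    static final Map<String, String> LABELS = new LinkedHashMap<>();
    static {
        LABELS.put(ID_35, "3.5");
        LABELS.put(ID_36, "3.6");
    }

    // Mimics mapping every revision to the latest one before reading the label.
    // The nodeId argument is deliberately ignored, which is the suspected problem.
    static String versionViaLatest(String nodeId, String query) {
        return LABELS.get(ID_36) + ":" + query;
    }

    // Reads the label of the revision that the ID actually names.
    static String versionOfRevision(String nodeId, String query) {
        return LABELS.get(nodeId) + ":" + query;
    }

    public static void main(String[] args) {
        String q = "select * from cmis:folder";
        System.out.println(versionViaLatest(ID_35, q));
        System.out.println(versionViaLatest(ID_36, q));
        System.out.println(versionOfRevision(ID_35, q));
    }
}
```

With the first method both IDs end up with the same version string, which matches what the dump shows; with the second, the 3.5 revision would keep its own label.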

So, I'll have to think about what the solution will be.  Removing the mapping 
to latest version might well be the right thing to do in all cases.  Indeed, 
what I think we *really* want to do is to request only the latest versions in 
the CMIS seeding query itself, if there's a way to do that.  If there were a 
document ID available rather than a revision ID, that would be preferred, but 
I don't know if there is one, so this will require careful thought.
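
If the query itself can't be restricted, one fallback would be to filter the discovered IDs on our side.  The sketch below is only an illustration of that idea, not a proposed patch: it assumes the Alfresco-style "id;label" format seen in this ticket, assumes labels are dotted numbers, and every class and method name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; assumes node IDs of the form "<base id>;<version label>".
class LatestRevisionFilter {
    // Part before the last ';', assumed here to be stable across revisions.
    static String baseId(String nodeId) {
        int i = nodeId.lastIndexOf(';');
        return i < 0 ? nodeId : nodeId.substring(0, i);
    }

    // Part after the last ';', or "" when the ID carries no label.
    static String label(String nodeId) {
        int i = nodeId.lastIndexOf(';');
        return i < 0 ? "" : nodeId.substring(i + 1);
    }

    // Compares dotted numeric labels component by component, so "3.10" > "3.9".
    // Assumes purely numeric components; anything else throws NumberFormatException.
    static int compareLabels(String a, String b) {
        String[] x = a.isEmpty() ? new String[0] : a.split("\\.");
        String[] y = b.isEmpty() ? new String[0] : b.split("\\.");
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    // Keeps one node ID per base ID: the one with the highest label.
    static List<String> latestOnly(List<String> nodeIds) {
        Map<String, String> best = new LinkedHashMap<>();
        for (String id : nodeIds) {
            String key = baseId(id);
            String current = best.get(key);
            if (current == null || compareLabels(label(id), label(current)) > 0) {
                best.put(key, id);
            }
        }
        return new ArrayList<>(best.values());
    }
}
```

Given the 3.5 and 3.6 IDs from the dump, latestOnly would keep only the 3.6 one.  Whether the part before the ';' is really stable across revisions in every repository is exactly the open question, so this would need checking before relying on it.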

I won't be able to look at this any further until Sunday evening.




> Cmis Repository Connector does not handle Document updating properly
> --------------------------------------------------------------------
>
>                 Key: CONNECTORS-1009
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1009
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: CMIS connector
>            Reporter: Prasad Perera
>            Priority: Minor
>         Attachments: std_logs.txt, std_prints.diff
>
>
> As a part of the fix for CONNECTORS-1004, it seems CmisRepositoryConnector 
> does not handle document updating properly.
> Case Scenario:
> * Create a continuous crawling job using CmisRepositoryConnector.
> * Update a document on the repository end.
> * The document keeps being submitted to the OutputConnector at each crawling 
> interval even though it was not updated afterwards.
> One possible fix is: @ CmisRepositoryConnector:processDocument,
>  activities.ingestDocumentWithException(nodeId, version, documentURI, rd);
> The documentURI should point to the old document URI (now it points to the 
> latest documentURI discovered, which may be confusing the document 
> references?)
> Also, In ECM systems, for example in Alfresco, the documentIDs are formulated 
> with the version number as well.
> Ex: workspace://SpacesStore/8e12a887-3fa8-48d6-8516-5bcfad358ba2;1.0 --> 
> version 1.0
> workspace://SpacesStore/8e12a887-3fa8-48d6-8516-5bcfad358ba2;1.1 --> version 
> 1.1
> When we set up a query to crawl a repository folder, we discover content by 
> referring to the child nodes. Because of that, it now seems to queue all the 
> document versions and submit them to the OutputConnector, thus producing 
> duplicate documents at the output (search) side.
> Is there a way to avoid this problem? It would be great if the connector could 
> just take the latest document version and submit it as an update.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
