[ 
https://issues.apache.org/jira/browse/CONNECTORS-1009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14097619#comment-14097619
 ] 

Prasad Perera commented on CONNECTORS-1009:
-------------------------------------------

Hello Karl,

Actually yes, this is the issue. The nodeID is different for each version of 
the document, which creates the problem. 
For example, I can point out two scenarios: one using a cmis:document query 
and one using a cmis:folder query.

With a query such as: select * from cmis:document where 
contains('PATH:"/app:company_home/app:shared/cm:content/*"'),
we re-crawl the nodeID that we saw at seeding time. When the document is 
updated we have no problem, as we keep track of the old nodeID and submit the 
new version to be re-indexed.

However, with a query such as: select * from cmis:folder where 
contains('PATH:"/app:company_home/app:shared/cm:content/"'),
when the document is updated with a new version, we add the new nodeID to the 
queue. With that, for some reason, we always fetch and submit the document for 
both the old and the new node IDs. I could get rid of the re-submission by 
changing

ingestDocumentWithException(nodeId, version, documentURI, rd);

to 

ingestDocumentWithException(nodeId, version, oldDocumentURI, rd);

(where oldDocumentURI is the old document URI). Yet, since we discover the 
child nodes of folders with

    if (baseTypeId.equals(CMIS_FOLDER_BASE_TYPE)) {
        // adding all the children for a folder
        Folder folder = (Folder) cmisObject;
        ItemIterable<CmisObject> children = folder.getChildren();
        for (CmisObject child : children) {
            activities.addDocumentReference(child.getId(), nodeId,
                    RELATIONSHIP_CHILD);
        }
    }

we end up with both the new and the old node IDs in the queue, each pointing 
to the latest version of the document, and therefore with two copies of the 
document on the search side as well.
I still cannot see why we re-submit the document over and over again when 
the document was updated only once. It seems 
activities.checkDocumentNeedsReindexing returns true on every re-crawl once 
the document has been updated and both the new and the old node IDs have been 
ingested with the same latest document URI.
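
To make the duplicate-ID problem concrete, here is a minimal sketch (not 
connector code) of how the old and new node IDs could be collapsed to one 
entry. It assumes the Alfresco ID format quoted below, where the version 
label follows a ';'. The class and method names (NodeIdSketch, 
versionSeriesKey, lastPerSeries) are hypothetical and exist only for this 
illustration. CMIS also defines a cmis:versionSeriesId property, which may be 
a cleaner key than string parsing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: hypothetical helper, not part of CmisRepositoryConnector.
class NodeIdSketch {

    // Strip the ";<versionLabel>" suffix, if present, so that every version
    // of a document maps to the same key. Assumes the Alfresco-style format
    // "workspace://SpacesStore/<uuid>;<version>".
    static String versionSeriesKey(String nodeId) {
        int sep = nodeId.lastIndexOf(';');
        return sep < 0 ? nodeId : nodeId.substring(0, sep);
    }

    // Keep one node ID per key; a later ID in the list replaces an earlier one.
    // Assumption: the caller lists older versions before newer ones.
    static List<String> lastPerSeries(List<String> nodeIds) {
        Map<String, String> byKey = new LinkedHashMap<>();
        for (String id : nodeIds) {
            byKey.put(versionSeriesKey(id), id);
        }
        return new ArrayList<>(byKey.values());
    }

    public static void main(String[] args) {
        List<String> queued = Arrays.asList(
                "workspace://SpacesStore/8e12a887-3fa8-48d6-8516-5bcfad358ba2;1.0",
                "workspace://SpacesStore/8e12a887-3fa8-48d6-8516-5bcfad358ba2;1.1");
        // Prints the de-duplicated list of node IDs.
        System.out.println(lastPerSeries(queued));
    }
}
```

With something along these lines, the child-discovery loop would only queue 
one ID per document instead of one per version. Whether the connector 
framework should do this, or the connector itself, is an open question.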

I wonder whether this kind of nodeID change with each document version is 
specific to Alfresco.
I would prefer that the document update be detected properly and that the 
latest version be submitted only once, which is the expected behaviour.
I may be able to produce some extra stdout logs if needed.




> Cmis Repository Connector does not handle Document updating properly
> --------------------------------------------------------------------
>
>                 Key: CONNECTORS-1009
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1009
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: CMIS connector
>            Reporter: Prasad Perera
>            Priority: Minor
>
> As a part of the fix for CONNECTORS-1004, it seems CmisRepositoryConnector 
> does not handle document updating properly.
> Case scenario:
> * Create a continuous crawling job using CmisRepositoryConnector.
> * Update a document on the repository end.
> * The document keeps being submitted to the OutputConnector at each crawling 
> interval, though it was not updated afterwards.
> One possible fix is in CmisRepositoryConnector:processDocument:
>  activities.ingestDocumentWithException(nodeId, version, documentURI, rd);
> The documentURI should point to the old document URI (now it points to the 
> latest documentURI discovered, which may confuse the document references?).
> Also, in ECM systems, for example in Alfresco, the document IDs are formed 
> with the version number as well.
> Ex: workspace://SpacesStore/8e12a887-3fa8-48d6-8516-5bcfad358ba2;1.0 --> 
> version 1.0
> workspace://SpacesStore/8e12a887-3fa8-48d6-8516-5bcfad358ba2;1.1 --> version 
> 1.1
> When we set up a query to crawl a repository folder, we discover content by 
> following the child nodes. Because of that, it now seems to queue all the 
> document versions and submit them to the OutputConnector, thus producing 
> duplicate documents on the output (search) side.
> Is there a way to avoid this problem? It would be great if the repository 
> could just take the latest document version and submit it as an update.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
