[
https://issues.apache.org/jira/browse/CONNECTORS-1009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14097084#comment-14097084
]
Karl Wright commented on CONNECTORS-1009:
-----------------------------------------
Hi Prasad,
I don't know much about CMIS, but it is essential for ManifoldCF that the
document identifier describe the current version of each document. There
are really two possible models:
(1) The document identifier doesn't change as the document is revised, but the
data and metadata of the document DO change;
(2) A new document identifier is created on every document revision, but the
only reachable document versions are the most current ones.
The CMIS connector uses the nodeID as the document identifier:
{code}
String nodeId = documentIdentifiers[i];
{code}
It is not clear to me that all CMIS implementations behave the same way, but as
long as they operate in a manner consistent with (1) or (2) above, that is fine.
HOWEVER, continuous crawling will not work well with (2), because no document
cleanup happens in that mode, so superseded versions are never removed.
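To illustrate what behaving like model (1) could look like for identifiers
that embed a version label, here is a minimal sketch. It assumes the
Alfresco-style format shown in the issue description (a trailing ";" followed
by the version label); other CMIS repositories may format identifiers
differently. The class and method names are hypothetical and this is not what
the CMIS connector currently does.

```java
// Hypothetical helper, not part of ManifoldCF or the CMIS connector.
// Assumption: a versioned identifier looks like "<stable id>;<version label>",
// as in the Alfresco examples from the issue description.
class VersionedNodeId {

    // Returns the identifier without its trailing ";<version label>" part.
    // Identifiers that contain no ';' are returned unchanged.
    static String stripVersion(String nodeId) {
        int sep = nodeId.lastIndexOf(';');
        return sep < 0 ? nodeId : nodeId.substring(0, sep);
    }
}
```

With such a mapping, every version of a document would reduce to the same
identifier, so a revision would look like a change to one document rather than
the appearance of a new one.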
As for documents being *duplicated* on the search side, the document's URL is
in fact the key for the search side. So if you have multiple instances of the
document, which all share the same URL, you will not get duplicates. You will
only get duplicates if each *version* of the document has a different nodeID
and URL in CMIS -- and that should also get cleaned up if you run any kind of
crawl other than continuous.
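The effect of keying on the URL can be sketched with a plain map. This is only
a toy model: UrlKeyedIndex is a hypothetical stand-in for the search side, not
the API of ManifoldCF or of any output connector, and the URLs used with it
are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a search index whose key is the document URL.
class UrlKeyedIndex {
    private final Map<String, String> contentByUrl = new LinkedHashMap<>();

    // Ingesting a document whose URL is already present replaces the old entry.
    void ingest(String url, String content) {
        contentByUrl.put(url, content);
    }

    // Content currently stored for the URL, or null if none.
    String get(String url) {
        return contentByUrl.get(url);
    }

    // Number of distinct documents the search side would report.
    int size() {
        return contentByUrl.size();
    }
}
```

Ingesting two versions under one URL leaves a single entry holding the later
content, while ingesting them under two different URLs leaves two entries,
which is the duplicate case described above.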
Does this describe what you are seeing?
> Cmis Repository Connector does not handle Document updating properly
> --------------------------------------------------------------------
>
> Key: CONNECTORS-1009
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1009
> Project: ManifoldCF
> Issue Type: Bug
> Components: CMIS connector
> Reporter: Prasad Perera
> Priority: Minor
>
> As a part of the fix for CONNECTORS-1004, it seems CmisRepositoryConnector
> does not handle document updating properly.
> Case Scenario:
> * Create a continuous crawling job using CmisRepositoryConnector.
> * Update a document on repository end.
> * The document keeps being submitted to the OutputConnector at each crawling
> interval, even though it was not updated afterwards.
> One possible fix is in CmisRepositoryConnector:processDocument:
> activities.ingestDocumentWithException(nodeId, version, documentURI, rd);
> The documentURI should point to the old document URI (now it points to the
> latest documentURI discovered, which may be confusing the document
> references?)
> Also, in ECM systems, for example in Alfresco, the document IDs include the
> version number as well.
> Ex:
> workspace://SpacesStore/8e12a887-3fa8-48d6-8516-5bcfad358ba2;1.0 --> version 1.0
> workspace://SpacesStore/8e12a887-3fa8-48d6-8516-5bcfad358ba2;1.1 --> version 1.1
> When we set up a query to crawl a repository folder, we discover content by
> referring to the child nodes. Because of that, it now seems to queue all the
> document versions and submit them to the OutputConnector, thus producing
> duplicate documents at the output (search) side.
> Is there a way to avoid this problem? It would be great if the repository
> could just take the latest document version and submit it as an update.