Aeham Abushwashi created CONNECTORS-1153:
--------------------------------------------

             Summary: Documents crawled using manifoldcf 1.6 or earlier are 
needlessly recrawled after upgrade to 1.7 or later
                 Key: CONNECTORS-1153
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1153
             Project: ManifoldCF
          Issue Type: Bug
    Affects Versions: ManifoldCF 1.8, ManifoldCF 1.7
            Reporter: Aeham Abushwashi


After upgrading to ManifoldCF 1.7 or later, pre-existing documents are recrawled
and re-indexed even if they have not changed in any way since their last
pre-upgrade crawl. The impact can be significant for large ManifoldCF
deployments with millions of static documents.

There appear to be three contributing factors:
1. The empty transformation version of a legacy document is different from the
initial value of "0+0!", so the two do not compare as equal in
PipelineObjectWithVersions#buildAddPipeline and
IncrementalIngester#checkFetchDocument
2. Output versions are compared incorrectly in
PipelineObjectWithVersions#buildAddPipeline: oldOutputVersion is compared to a
VersionContext object instead of the version string, which can be obtained by
calling VersionContext#getVersionString. If
IPipelineSpecification#getStageDescriptionString continues to return a
VersionContext object, renaming the method could be useful
3. In PipelineObjectWithVersions#buildAddPipeline, a null value for
newAuthorityNameString is not treated the same as an empty string, as it is
in other methods
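
The three factors above can be illustrated with a minimal Java sketch. It is not
ManifoldCF source code: the class VersionCompareSketch, the helper methods, and
the nested VersionContext stand-in are hypothetical, and the normalization steps
are only one assumed way the comparisons could be made consistent. Only the
"0+0!" initial value and the getVersionString accessor come from the report.

```java
import java.util.Objects;

// Illustrative sketch only; none of this is ManifoldCF source code.
final class VersionCompareSketch {

    // Hypothetical stand-in for ManifoldCF's VersionContext. Only the
    // getVersionString() accessor is named in the report.
    static final class VersionContext {
        private final String versionString;

        VersionContext(String versionString) {
            this.versionString = versionString;
        }

        String getVersionString() {
            return versionString;
        }
    }

    // Initial transformation version quoted in the report.
    static final String INITIAL_TRANSFORMATION_VERSION = "0+0!";

    // Factor 1 (assumed remedy): map a legacy empty or null transformation
    // version onto the initial value before comparing.
    static String normalizeTransformationVersion(String version) {
        return (version == null || version.isEmpty())
            ? INITIAL_TRANSFORMATION_VERSION
            : version;
    }

    // Factor 3 (assumed remedy): treat a null authority name as "".
    static String normalizeAuthorityName(String authorityName) {
        return authorityName == null ? "" : authorityName;
    }

    // Returns true when nothing relevant changed, so a re-index can be skipped.
    static boolean isUnchanged(String oldTransformationVersion,
                               String newTransformationVersion,
                               String oldOutputVersion,
                               VersionContext newOutputVersion,
                               String oldAuthorityName,
                               String newAuthorityName) {
        boolean sameTransformation =
            normalizeTransformationVersion(oldTransformationVersion)
                .equals(normalizeTransformationVersion(newTransformationVersion));

        // Factor 2: compare against the version string, not the wrapper object.
        // oldOutputVersion.equals(newOutputVersion) would always be false here.
        boolean sameOutput =
            Objects.equals(oldOutputVersion, newOutputVersion.getVersionString());

        boolean sameAuthority =
            normalizeAuthorityName(oldAuthorityName)
                .equals(normalizeAuthorityName(newAuthorityName));

        return sameTransformation && sameOutput && sameAuthority;
    }
}
```

Under these assumptions, a legacy record with an empty transformation version,
an unchanged output version string, and a null new authority name is reported
as unchanged, so it would not be re-indexed.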




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
