[ 
https://issues.apache.org/jira/browse/CONNECTORS-735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13693059#comment-13693059
 ] 

Karl Wright commented on CONNECTORS-735:
----------------------------------------

In order to keep track of the history of a document in the queue, a fairly 
expensive schema change would be required.  This would be necessary if we were 
going to keep track of when a document had first appeared in the queue.  
Furthermore, ManifoldCF doesn't guarantee at this point that the document won't 
be removed and re-added.  Keeping track of that would require a serious amount 
of history.  So I don't think this part of the request is feasible.

In order to ingest the date of last check, an ingestion would need to happen on 
every check, which would completely defeat the purpose of incremental crawling.
So that part of the request is infeasible as well.

It *is* possible to add the current indexing date.  Indeed, this could be done 
in many different ways, including through a Tika plugin.  We would never want 
to include the current indexing date in the ManifoldCF version info, however, 
since that too would defeat incremental crawling.
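
To illustrate the distinction, here is a minimal sketch in Python.  It is not 
ManifoldCF code; the names version_string, crawl, versions and index are all 
hypothetical, and a content hash stands in for whatever version info a real 
connector would compute.  The point is only that the indexing date is attached 
as metadata at ingestion time, while the version string depends on the content 
alone, so an unchanged document is still skipped on recrawl.

```python
import hashlib
from datetime import datetime, timezone


def version_string(content: bytes) -> str:
    # Assumption: the version is derived from the content only.
    # The clock is deliberately NOT part of it.
    return hashlib.sha256(content).hexdigest()


def crawl(doc_id, content, versions, index, now=None):
    """Check one document; return True if it was (re)ingested."""
    version = version_string(content)
    if versions.get(doc_id) == version:
        # Unchanged since the last crawl: no ingestion, so the stored
        # indexing date is left alone.
        return False
    stamp = now or datetime.now(timezone.utc)
    # The indexing date travels as ordinary metadata on the document.
    index[doc_id] = {"content": content, "indexing_date": stamp.isoformat()}
    versions[doc_id] = version
    return True
```

If the timestamp were folded into version_string instead, every check would 
produce a new version and every document would be reingested on every crawl.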

                
> Include crawling date as metadata in OutputConnector
> ----------------------------------------------------
>
>                 Key: CONNECTORS-735
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-735
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Framework core
>    Affects Versions: ManifoldCF 1.2
>            Reporter: Stephane Gamard
>
> While dating documents is a nightmare (not all connectors get their dates in 
> the same way), it might be interesting to leverage the crawler to date some 
> volatile media (such as the web). 
> In the case of web crawling there are 3 dates that can certainly be inferred 
> from the crawler's activity: 
> - Date the page first appeared in the queue (somewhat loosely equivalent to a 
> created date)
> - Date the page was last checked by the crawler (might not reflect a version 
> update, content could still be exactly the same)
> - Date of last update (since the URL exists in the queue, it might have 
> changed over time and the crawler might know about this). 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
