[ 
https://issues.apache.org/jira/browse/CONNECTORS-916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003357#comment-14003357
 ] 

Karl Wright commented on CONNECTORS-916:
----------------------------------------

bq. But why do we need to keep the entire document? I thought that if a job 
sends some documents successfully, MCF does not need to keep these documents 
any more (so MCF deletes the document data from disk at the end of 
notifyOfJobCompletion()).

As you pointed out, there can be errors when uploading documents to Amazon in 
batches.  If the connector accumulates documents while telling ManifoldCF that 
each one was accepted, there is no way to force ManifoldCF to resend any 
document to the connector if the upload to Amazon fails later.

But if the connector keeps a local file-based image of what should be sent to 
Amazon, and tries to update Amazon at the end of each job run, then the upload 
can be retried many times without any loss of data.  The rule is that the 
connector must keep around *all* of the data in the chunk that was refused by 
Amazon, and allow that data to be partially replaced in the next crawl.  It is 
also really important that any Amazon errors are reported well enough that 
someone can figure out which document caused the upload to Amazon to fail, and 
why, so that the problem can be fixed.
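
To make the rule concrete, here is a minimal sketch of such a local image. 
It is only an illustration under stated assumptions, not the connector's 
actual code: the names PendingChunk, BatchUploader and UploadRefusedException 
are hypothetical, the uploader interface stands in for whatever call really 
talks to Amazon, and it uses no ManifoldCF APIs. Each document is stored in 
its own file keyed by document ID, so a re-crawled document overwrites its 
older copy, and files are deleted only after the whole chunk was accepted.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: a disk-backed chunk of documents awaiting upload. */
class PendingChunk {

    /** Hypothetical stand-in for the real upload call; throws if the batch is refused. */
    interface BatchUploader {
        void upload(Map<String, String> documentsById) throws UploadRefusedException;
    }

    /** Carries the offending document ID and the reason, so the failure can be diagnosed. */
    static class UploadRefusedException extends Exception {
        final String documentId;

        UploadRefusedException(String documentId, String reason) {
            super("Upload refused for document '" + documentId + "': " + reason);
            this.documentId = documentId;
        }
    }

    private final Path dir;

    PendingChunk(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
    }

    // The file name is derived from the document ID, so the same ID always maps
    // to the same file. Base64url is an assumption of this sketch; a real
    // connector would also have to cope with very long IDs and with
    // case-insensitive file systems.
    private Path fileFor(String documentId) {
        String name = Base64.getUrlEncoder().withoutPadding()
            .encodeToString(documentId.getBytes(StandardCharsets.UTF_8));
        return dir.resolve(name);
    }

    /** Stores a document, replacing any older copy left over from a refused chunk. */
    void addOrReplace(String documentId, String content) throws IOException {
        Files.write(fileFor(documentId), content.getBytes(StandardCharsets.UTF_8));
    }

    /** Reads back every document that is still waiting to be uploaded. */
    Map<String, String> load() throws IOException {
        Map<String, String> docs = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                byte[] idBytes = Base64.getUrlDecoder().decode(file.getFileName().toString());
                String id = new String(idBytes, StandardCharsets.UTF_8);
                docs.put(id, new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
            }
        }
        return docs;
    }

    /**
     * Tries to upload the whole chunk. If the uploader throws, no file is
     * deleted, so the same data (possibly partially replaced by the next
     * crawl) can be retried later.
     */
    void flush(BatchUploader uploader) throws IOException, UploadRefusedException {
        Map<String, String> docs = load();
        if (docs.isEmpty()) {
            return;
        }
        uploader.upload(docs);
        // Only reached when the upload was accepted.
        for (String id : docs.keySet()) {
            Files.deleteIfExists(fileFor(id));
        }
    }
}
```

In this sketch flush() would be called once at the end of a job run. A 
refused chunk stays on disk untouched, and the exception message names the 
document and the reason, which is the kind of reporting described above.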


> Amazon CloudSearch output connector
> -----------------------------------
>
>                 Key: CONNECTORS-916
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-916
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Amazon CloudSearch output connector
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Takumi Yoshida
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.7
>
>         Attachments: 0507.diff, 0520.diff, 0520_2.diff, 1.patch, 2.diff, 
> 3.diff, AmazonCloudSearchParam.java, AmazonCloudSearchSpecs.java, 
> exception_handling.diff, exception_handling_2.diff, licenselist.txt
>
>
> I wrote some code snippets for an output connector for Amazon CloudSearch.
> I would like you to review my code. With it you can crawl a web site and feed 
> HTML pages to Amazon CloudSearch.
> However, it is not yet complete, for the following reasons:
> - There is no code for the configuration page.
> - The only supported file type is HTML.
> Thank you for your time,
>  Takumi Yoshida



--
This message was sent by Atlassian JIRA
(v6.2#6252)
