[
https://issues.apache.org/jira/browse/CONNECTORS-916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003357#comment-14003357
]
Karl Wright commented on CONNECTORS-916:
----------------------------------------
bq. But why do we need to keep the entire document? I thought that if a job sends
some documents successfully, MCF does not need to keep those documents any more
(so MCF deletes the document data from disk at the end of notifyOfJobCompletion()).
As you pointed out, there can be errors when uploading documents in batch to
Amazon. If the connector accumulates documents while telling ManifoldCF that each
document was accepted, there is no way to force ManifoldCF to resend a document
to the connector if the upload to Amazon fails later.

But if the connector keeps a local file-based image of what should be sent to
Amazon, and tries to update Amazon at the end of each job run, then the update can
be retried many times without any loss of data. The rule is that the connector
must keep around *all* of the data in any chunk that was refused by Amazon, and
allow that data to be partially replaced in the next crawl. It would also be
really important to make sure that any Amazon errors are reported well enough
that someone can figure out which document caused the upload to Amazon to fail,
and why, so that the problem can be fixed.
> Amazon CloudSearch output connector
> -----------------------------------
>
> Key: CONNECTORS-916
> URL: https://issues.apache.org/jira/browse/CONNECTORS-916
> Project: ManifoldCF
> Issue Type: New Feature
> Components: Amazon CloudSearch output connector
> Affects Versions: ManifoldCF 1.7
> Reporter: Takumi Yoshida
> Assignee: Karl Wright
> Fix For: ManifoldCF 1.7
>
> Attachments: 0507.diff, 0520.diff, 0520_2.diff, 1.patch, 2.diff,
> 3.diff, AmazonCloudSearchParam.java, AmazonCloudSearchSpecs.java,
> exception_handling.diff, exception_handling_2.diff, licenselist.txt
>
>
> I wrote some code snippets for an output connector for Amazon CloudSearch.
> I would like you to review my code. You can crawl a web site and feed HTML pages
> to Amazon CloudSearch.
> But it is not completely finished, for the following reasons:
> - no code has been written yet for the configuration page.
> - the only supported file type is HTML.
> Thank you for your time,
> Takumi Yoshida
--
This message was sent by Atlassian JIRA
(v6.2#6252)