[
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694542#comment-14694542
]
Karl Wright commented on CONNECTORS-1162:
-----------------------------------------
bq. We fetch documents from Kafka as a stream, so we cannot add a document URI
in the addSeedDocuments method.
Is there anything in Kafka corresponding to a document ID? Can you fetch
documents by ID?
bq. So, I think that I can store messages temporarily in a HashMap keyed by a
unique hashcode for each message.
This won't work, because you simply can't presume that the same connector
instance that does the seeding will be asked to do the processing.
bq. However, when something happens and the job restarts, we lose the HashMap
object because another KafkaRepositoryConnector object is created.
It's worse than that: in a multi-process clustered environment, document
processing requests are not even guaranteed to take place on any one
particular machine. So this cannot work.
bq. Do you have any suggestions to work around this problem? Can we ingest
documents directly in the addSeedDocuments method?
No, you can't ingest at seed document time.
Can you be more specific about the Kafka API for fetching documents? I can't
believe that you can only get *all* the documents in a single stream. There
*must* be a way to specify individual document IDs.
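For what it's worth, the Java kafka-clients consumer API does let you address a
single message by its (topic, partition, offset) triple, which could serve as a
stable document identifier. Here is a minimal sketch under that assumption; the
broker address, topic name, partition, and offset are placeholders, not values
from this issue:
{code:java}
// Sketch only: assumes a recent Java kafka-clients consumer (2.x or later).
// The broker address, topic, partition, and offset are placeholder values.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class FetchByOffset {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    // (topic, partition, offset) uniquely addresses one message.
    TopicPartition tp = new TopicPartition("crawl-topic", 0);
    long offset = 42L;

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.assign(Collections.singletonList(tp));
      consumer.seek(tp, offset);
      ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
      for (ConsumerRecord<String, String> record : records) {
        if (record.offset() == offset) {
          System.out.println("Fetched: " + record.value());
          break;
        }
      }
    }
  }
}
{code}
If the connector encoded that triple into the ManifoldCF document identifier at
seeding time, any worker node could re-fetch the message in processDocuments
without keeping per-instance state such as a HashMap.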
> Apache Kafka Output Connector
> -----------------------------
>
> Key: CONNECTORS-1162
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
> Project: ManifoldCF
> Issue Type: Wish
> Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
> Reporter: Rafa Haro
> Assignee: Karl Wright
> Labels: gsoc, gsoc2015
> Fix For: ManifoldCF 2.3
>
> Attachments: 1.JPG, 2.JPG
>
>
> Kafka is a distributed, partitioned, replicated commit log service. It
> provides the functionality of a messaging system, but with a unique design. A
> single Kafka broker can handle hundreds of megabytes of reads and writes per
> second from thousands of clients.
> Apache Kafka is being used for a number of use cases. One of them is to use
> Kafka as a feeding system for streaming BigData processes, in both Apache
> Spark and Hadoop environments. A Kafka output connector could be used for
> streaming or dispatching crawled documents or metadata and putting them into
> a BigData processing pipeline.
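As a rough illustration of what such an output connector would do, the sketch
below publishes a crawled document to a Kafka topic with the Java kafka-clients
producer; the broker address, topic name, and document fields are illustrative
assumptions only:
{code:java}
// Sketch only: assumes a recent Java kafka-clients producer.
// The broker address, topic name, and document fields are illustrative.
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DispatchCrawledDocument {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Key the record by the crawled document's URI; carry content/metadata as the value.
      String documentUri = "http://example.com/page.html";              // placeholder
      String documentJson = "{\"title\":\"Example\",\"body\":\"...\"}"; // placeholder
      producer.send(new ProducerRecord<>("crawled-documents", documentUri, documentJson));
      producer.flush();
    }
  }
}
{code}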
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)