[
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695444#comment-14695444
]
Rafa Haro commented on CONNECTORS-1162:
---------------------------------------
Hi [~tugbadogan]. The aim of using Kafka as a repository connector within
ManifoldCF is really to cover use cases where Kafka transports something that
can be reconstructed as a "Document" that you would like to index or push to an
output connector. So the intention shouldn't be to preserve the Kafka message
structure inside the repository connector (i.e. a Kafka Record shouldn't be
equivalent to a RepositoryDocument); that doesn't make any sense, at least to
me. In the repository connector I would do something like a reduce stage and
join together all the fields belonging to the same "document". For that to
work, the topic of every Kafka message must correspond to the Document URI or
identifier, which is a convention you have to impose on the integrator. You can
seed the topics, and at seeding time in the next job you should create
different document identifiers if new messages have arrived for the same
topic/document.
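To make the idea concrete, here is a minimal, untested sketch of that reduce
stage using the plain Kafka consumer API. The class and method names, the group
id, and the five-second poll window are placeholders I made up, and it assumes
each record's key names a document field while its value holds the field
content:
{code:java}
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Hypothetical reduce stage: collapse all records of one topic into a single
// field map, using the topic name itself as the document identifier.
public class TopicToDocumentReducer {

  public static Map<String, List<String>> reduceTopic(String topic, String bootstrapServers) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "mcf-kafka-reducer");  // placeholder group id
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");  // re-read the topic from the start
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

    Map<String, List<String>> fields = new HashMap<>();
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList(topic));
      // Assumption: each record key names a document field, the value is its content.
      ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
      for (ConsumerRecord<String, String> record : records) {
        fields.computeIfAbsent(record.key(), k -> new ArrayList<>()).add(record.value());
      }
    }
    // The caller would build one RepositoryDocument per topic from this map.
    return fields;
  }
}
{code}
A single poll obviously isn't a complete drain of the topic; a real connector
would loop until it reaches the end offsets, but the shape of the aggregation
is the point here.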
Now, a situation that could happen is that you won't be able to rebuild the
whole document from the data already consumed for a topic plus the new data
coming from the stream. In that case, an OutputConnector should allow you to
update the document rather than replace it. Is that possible, [~daddywri]?
> Apache Kafka Output Connector
> -----------------------------
>
> Key: CONNECTORS-1162
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
> Project: ManifoldCF
> Issue Type: Wish
> Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
> Reporter: Rafa Haro
> Assignee: Karl Wright
> Labels: gsoc, gsoc2015
> Fix For: ManifoldCF 2.3
>
> Attachments: 1.JPG, 2.JPG
>
>
> Kafka is a distributed, partitioned, replicated commit log service. It
> provides the functionality of a messaging system, but with a unique design. A
> single Kafka broker can handle hundreds of megabytes of reads and writes per
> second from thousands of clients.
> Apache Kafka is used for a number of use cases. One of them is to use Kafka
> as a feeding system for streaming BigData processes, in both Apache Spark and
> Hadoop environments. A Kafka output connector could be used to stream or
> dispatch crawled documents or metadata and put them into a BigData processing
> pipeline.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)