[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695444#comment-14695444 ]

Rafa Haro commented on CONNECTORS-1162:
---------------------------------------

Hi [~tugbadogan]. The aim of using Kafka as a repository connector within 
ManifoldCF is really to cover use cases where Kafka somehow transports 
something that can be reconstructed as a "Document" that you would like to 
index or push to an output connector. So the intention shouldn't be to 
preserve Kafka's message structure inside the repository connector (i.e. a 
Kafka Record shouldn't be equivalent to a RepositoryDocument); that doesn't 
make any sense (to me at least). In the repository connector, I would do 
something like a reduce stage and join together all the fields belonging to 
the same "document". For that to work, the topic of every Kafka message must 
correspond to the document URI or identifier, which is a convention you have 
to impose on the integrator. You can seed the topics, and you should create 
different document identifiers if new messages for the same topic/document 
have arrived by seeding time in the next job, as sketched below.
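Concretely, a minimal sketch of that reduce stage with the plain Kafka 
consumer API, assuming the convention that each message carries one field as 
its key/value pair (broker address, group id and topic names are 
placeholders):

{code:java}
import java.time.Duration;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TopicReduceSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "mcf-kafka-repo");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

    // One logical "document" per topic: reduce every record for a topic
    // into a single field map before building a RepositoryDocument from it.
    Map<String, Map<String, String>> docsByTopic = new HashMap<>();

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Arrays.asList("doc-topic-1", "doc-topic-2"));
      ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
      for (ConsumerRecord<String, String> record : records) {
        // record.key() names a field, record.value() carries its content
        docsByTopic
            .computeIfAbsent(record.topic(), t -> new HashMap<>())
            .put(record.key(), record.value());
      }
    }

    // Each entry can now become one document whose identifier is the topic.
    docsByTopic.forEach((topic, fields) ->
        System.out.println("Document " + topic + " -> " + fields));
  }
}
{code}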

Now, a situation that could happen is that you are not going to be able to 
rebuild the whole document from the data already consumed for a topic plus 
the new data coming from the stream. In that situation, an OutputConnector 
should allow you to update the document rather than replace it. Is that 
possible, [~daddywri]?
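If replacement turns out to be the only option, the connector itself would 
have to fold the freshly consumed fields into the previously indexed version 
before re-sending the whole document. A hypothetical helper (not an existing 
ManifoldCF API) to show what I mean:

{code:java}
import java.util.HashMap;
import java.util.Map;

public class DocumentMergeSketch {
  // Hypothetical helper: merge freshly consumed fields into the previously
  // indexed version instead of replacing the document wholesale.
  static Map<String, String> mergeFields(Map<String, String> indexed,
                                         Map<String, String> fresh) {
    Map<String, String> merged = new HashMap<>(indexed);
    merged.putAll(fresh); // new values win; fields absent from the stream survive
    return merged;
  }

  public static void main(String[] args) {
    Map<String, String> indexed = new HashMap<>(Map.of("title", "Old", "body", "text"));
    Map<String, String> fresh = Map.of("title", "New");
    System.out.println(mergeFields(indexed, fresh)); // {title=New, body=text}
  }
}
{code}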

> Apache Kafka Output Connector
> -----------------------------
>
>                 Key: CONNECTORS-1162
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
>             Project: ManifoldCF
>          Issue Type: Wish
>    Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
>            Reporter: Rafa Haro
>            Assignee: Karl Wright
>              Labels: gsoc, gsoc2015
>             Fix For: ManifoldCF 2.3
>
>         Attachments: 1.JPG, 2.JPG
>
>
> Kafka is a distributed, partitioned, replicated commit log service. It 
> provides the functionality of a messaging system, but with a unique design. A 
> single Kafka broker can handle hundreds of megabytes of reads and writes per 
> second from thousands of clients.
> Apache Kafka is being used for a number of use cases. One of them is to use 
> Kafka as a feeding system for streaming BigData processes, in both Apache 
> Spark and Hadoop environments. A Kafka output connector could be used for 
> streaming or dispatching crawled documents or metadata, putting them into a 
> BigData processing pipeline.
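For the dispatch path described in the issue, a minimal sketch with the plain 
Kafka producer API (broker address, topic name, and the JSON payload 
convention are all placeholders):

{code:java}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CrawledDocDispatchSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Key: the document URI; value: serialized document content/metadata.
      producer.send(new ProducerRecord<>("crawled-docs",
          "http://example.com/doc1",
          "{\"title\":\"Example\",\"body\":\"...\"}"));
      producer.flush();
    }
  }
}
{code}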



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
