[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695667#comment-14695667 ]

Rafa Haro commented on CONNECTORS-1162:
---------------------------------------

Hi [~daddywri], please don't misunderstand me. I wasn't trying to add new 
requirements, I was just trying to shed light on the problem that Tugba 
reported. What I meant was that, with Kafka, it would theoretically be 
possible to receive only part of a document at seeding time (a set of Kafka 
messages, but not the whole document). A solution for this could be to index 
what you receive in one job and then update the document in the final index 
(let's call it OutputConnector instead of index). With Kafka it would 
probably be impossible to retrieve the whole document if you needed to 
reindex, so in that situation the OutputConnector API would have to support 
updates instead of reindexing. I was just asking if that is possible, not 
requiring it :-). Anyway, Kafka does not seem to be suitable for a 
Repository Connector. That was the reason I created the issue only for an 
Output Connector, if I remember correctly.
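
To make the update idea concrete, here is a minimal sketch of emitting a 
keyed partial-document message with the standard kafka-clients producer 
API; this is not the connector code itself. The topic name, document key, 
and JSON payload are all hypothetical. The point is only that keying every 
fragment by document id would let a downstream consumer merge partials into 
one document instead of requiring a full reindex:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class PartialUpdateSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer =
                     new KafkaProducer<>(props)) {
                // Key each message with the document identifier so a
                // downstream consumer can fold successive fragments into
                // the same indexed document rather than reindexing it.
                String docKey = "doc-42";                    // hypothetical id
                String fragment = "{\"title\":\"partial\"}"; // one fragment
                producer.send(
                    new ProducerRecord<>("documents", docKey, fragment));
                producer.flush();
            }
        }
    }

On the consumer side the merge would have to happen by document key, which 
is exactly the update-instead-of-reindex semantics I was asking about.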



> Apache Kafka Output Connector
> -----------------------------
>
>                 Key: CONNECTORS-1162
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
>             Project: ManifoldCF
>          Issue Type: Wish
>    Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
>            Reporter: Rafa Haro
>            Assignee: Karl Wright
>              Labels: gsoc, gsoc2015
>             Fix For: ManifoldCF 2.3
>
>         Attachments: 1.JPG, 2.JPG
>
>
> Kafka is a distributed, partitioned, replicated commit log service. It 
> provides the functionality of a messaging system, but with a unique design. A 
> single Kafka broker can handle hundreds of megabytes of reads and writes per 
> second from thousands of clients.
> Apache Kafka is being used for a number of use cases. One of them is to use 
> Kafka as a feeding system for streaming Big Data processes, in both Apache 
> Spark and Hadoop environments. A Kafka output connector could be used for 
> streaming or dispatching crawled documents or metadata, putting them into a 
> Big Data processing pipeline.



