[
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595486#comment-14595486
]
Karl Wright commented on CONNECTORS-1162:
-----------------------------------------
Hi Tugba,
Without some careful analysis, I can't recommend a specific approach. I will
try to find time to read the Kafka documentation later today.
Some general principles though:
(1) If there is a way to check on the status of a document in Kafka, then you
can turn an asynchronous "send" into a synchronous one, simply by checking on
the status of the document in addOrReplaceDocument() until you find out what
happened to it before returning to the caller. The right way to do this
depends on *how* you check on the document status; you obviously don't want to
busy-wait. If Kafka has a notion of callback (like ZooKeeper), then no
busy-waiting is needed, just careful object construction with synchronizers. If the
status check requires a call from your connector, then you will want to sleep
for a second or so at a time and check.
(2) If there is NO way to check on the document status, then really the only
option is to send the document and forget it.
(3) If Kafka is modeling a queue, where documents are "sent" by adding them to
a queue, but are really only sent in a batch when the queue gets large
enough, then you will want to model it on the Amazon Cloud Search connector,
which demonstrates how to handle that case properly.
Thanks -- I will have more detailed commentary later.
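Principle (1) above can be sketched in plain Java. This is a minimal, hypothetical illustration of the callback-plus-synchronizer pattern, not the actual Kafka producer API: the AsyncSender class and SendCallback interface here are stand-ins invented for the example, and a CountDownLatch plays the role of the synchronizer, so the caller blocks without busy-waiting until the completion callback fires.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SyncSendSketch {

  // Hypothetical completion callback, standing in for whatever
  // notification mechanism the target system offers.
  interface SendCallback {
    void onCompletion(boolean success, Exception e);
  }

  // Hypothetical asynchronous sender: delivers the document on a
  // background thread and invokes the callback when done.
  static class AsyncSender {
    private final ExecutorService pool = Executors.newSingleThreadExecutor();

    void send(String document, SendCallback cb) {
      pool.submit(() -> cb.onCompletion(true, null));
    }

    void close() { pool.shutdown(); }
  }

  // Turn the asynchronous send into a synchronous one: no busy-waiting,
  // the caller just blocks on the latch until the callback fires
  // (or a timeout elapses).
  static boolean sendSynchronously(AsyncSender sender, String doc)
      throws InterruptedException {
    CountDownLatch latch = new CountDownLatch(1);
    final boolean[] result = new boolean[1];
    sender.send(doc, (success, e) -> {
      result[0] = success && e == null;
      latch.countDown();
    });
    // Bound the wait so a lost callback cannot hang the connector forever.
    if (!latch.await(30, TimeUnit.SECONDS))
      return false;
    return result[0];
  }

  public static void main(String[] args) throws InterruptedException {
    AsyncSender sender = new AsyncSender();
    boolean ok = sendSynchronously(sender, "crawled-document-content");
    System.out.println(ok ? "DELIVERED" : "FAILED");
    sender.close();
  }
}
```

An addOrReplaceDocument() implementation built this way can report the real outcome of the send to the ManifoldCF framework instead of returning before the document's fate is known.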
> Apache Kafka Output Connector
> -----------------------------
>
> Key: CONNECTORS-1162
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
> Project: ManifoldCF
> Issue Type: Wish
> Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
> Reporter: Rafa Haro
> Assignee: Karl Wright
> Labels: gsoc, gsoc2015
> Fix For: ManifoldCF 1.10, ManifoldCF 2.2
>
> Attachments: 1.JPG, 2.JPG
>
>
> Kafka is a distributed, partitioned, replicated commit log service. It
> provides the functionality of a messaging system, but with a unique design. A
> single Kafka broker can handle hundreds of megabytes of reads and writes per
> second from thousands of clients.
> Apache Kafka is being used for a number of use cases. One of them is to use
> Kafka as a feeding system for streaming BigData processes, in both Apache
> Spark and Hadoop environments. A Kafka output connector could be used for
> streaming or dispatching crawled documents or metadata, putting them into a
> BigData processing pipeline.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)