[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695379#comment-14695379 ]

Tugba Dogan commented on CONNECTORS-1162:
-----------------------------------------

I think the Kafka API doesn't have a method to fetch a document by its 
document identifier, because Kafka is designed mainly as a messaging queue 
rather than a store of documents addressed by path or ID. But if we want to 
fetch documents one by one, we can use message offsets as their document IDs: 
we can seek to an offset and fetch a single message from the queue. This 
might solve our problem, but I think it will be somewhat slower than a 
continuous read of the streaming data.
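
For example, here is a minimal sketch of that offset-based fetch, assuming 
the new KafkaConsumer API from the linked JavaDoc (the broker address, topic 
name, partition, and offset below are illustrative, not part of the 
connector):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class SingleMessageFetch {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer =
                 new KafkaConsumer<>(props)) {
          TopicPartition tp = new TopicPartition("crawled-docs", 0); // assumed topic
          long documentId = 42L; // message offset used as the document ID

          // Assign the partition manually and seek to the requested offset.
          consumer.assign(Collections.singletonList(tp));
          consumer.seek(tp, documentId);

          // poll() returns a batch starting at the sought offset; keep only
          // the record whose offset matches the requested document ID.
          ConsumerRecords<String, String> records = consumer.poll(1000L);
          for (ConsumerRecord<String, String> record : records) {
            if (record.offset() == documentId) {
              System.out.println("Fetched document: " + record.value());
              break;
            }
          }
        }
      }
    }

Because poll() always returns a batch rather than a single record, a 
per-document seek like this discards most of each batch, which is why it 
should be slower than a continuous read.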

As you can see in the JavaDoc of the KafkaConsumer, there is no method to get 
a single message. Instead, there is a poll method that returns 
ConsumerRecords containing all of the messages from the offset it starts at:
http://kafka.apache.org/083/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html

I thought we might fetch the data, store it in a cache, and use it later in 
the processDocuments method.
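
A rough sketch of that caching idea, assuming an offset-keyed in-memory map 
(the OffsetCache class and its wiring into processDocuments are hypothetical, 
just to illustrate the shape of the approach):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class OffsetCache {
      private final Map<Long, String> cache = new HashMap<>();

      // Drain one poll() batch into the cache, keyed by message offset,
      // so the batch is read from Kafka only once.
      public void fill(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(1000L);
        for (ConsumerRecord<String, String> record : records) {
          cache.put(record.offset(), record.value());
        }
      }

      // What processDocuments could call with the offset-based document ID.
      public String lookup(long documentId) {
        return cache.get(documentId);
      }
    }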

> Apache Kafka Output Connector
> -----------------------------
>
>                 Key: CONNECTORS-1162
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
>             Project: ManifoldCF
>          Issue Type: Wish
>    Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
>            Reporter: Rafa Haro
>            Assignee: Karl Wright
>              Labels: gsoc, gsoc2015
>             Fix For: ManifoldCF 2.3
>
>         Attachments: 1.JPG, 2.JPG
>
>
> Kafka is a distributed, partitioned, replicated commit log service. It 
> provides the functionality of a messaging system, but with a unique design. A 
> single Kafka broker can handle hundreds of megabytes of reads and writes per 
> second from thousands of clients.
> Apache Kafka is being used for a number of use cases. One of them is to use 
> Kafka as a feeding system for streaming BigData processes, in both Apache 
> Spark and Hadoop environments. A Kafka output connector could be used for 
> streaming or dispatching crawled documents or metadata, putting them into a 
> BigData processing pipeline.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
