[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646925#comment-14646925
 ] 

Karl Wright commented on CONNECTORS-1162:
-----------------------------------------

bq. Are we going to get topic messages from the beginning, or only from when the job 
started?

It is standard practice for a job to represent all documents in a repository, 
unless there is an explicit way in the UI to limit the documents taken based on 
timestamp.  I don't think such a UI feature is necessary for the first version 
of the Kafka connector, though.

bq. Also, I want to ask how I can store the offset value so that consumption can 
resume when another job starts.

I assume that you mean, "how do I get ManifoldCF to crawl only the new 
documents that were created since the last job run?"  If that is correct, then 
have a look at the javadoc for the addSeedDocuments() method:

{code}
  /** Queue "seed" documents.  Seed documents are the starting places for 
crawling activity.  Documents
  * are seeded when this method calls appropriate methods in the passed in 
ISeedingActivity object.
  *
  * This method can choose to find repository changes that happen only during 
the specified time interval.
  * The seeds recorded by this method will be viewed by the framework based on 
what the
  * getConnectorModel() method returns.
  *
  * It is not a big problem if the connector chooses to create more seeds than 
are
  * strictly necessary; it is merely a question of overall work required.
  *
  * The end time and seeding version string passed to this method may be 
interpreted for greatest efficiency.
  * For continuous crawling jobs, this method will
  * be called once, when the job starts, and at various periodic intervals as 
the job executes.
  *
  * When a job's specification is changed, the framework automatically resets 
the seeding version string to null.  The
  * seeding version string may also be set to null on each job run, depending 
on the connector model returned by
  * getConnectorModel().
  *
  * Note that it is always ok to send MORE documents rather than less to this 
method.
  * The connector will be connected before this method can be called.
  *@param activities is the interface this method should use to perform 
whatever framework actions are desired.
  *@param spec is a document specification (that comes from the job).
  *@param seedTime is the end of the time range of documents to consider, 
exclusive.
  *@param lastSeedVersionString is the last seeding version string for this 
job, or null if the job has no previous seeding version string.
  *@param jobMode is an integer describing how the job is being run, whether 
continuous or once-only.
  *@return an updated seeding version string, to be stored with the job.
  */
  public String addSeedDocuments(ISeedingActivity activities, Specification 
spec,
    String lastSeedVersion, long seedTime, int jobMode)
    throws ManifoldCFException, ServiceInterruption;
{code}

For "lastSeedVersion", your connector will initially receive null.  You should 
return a seeding version string that MCF will store.  On the next job run, that 
string you returned is passed back in as "lastSeedVersion".  You can put 
whatever you like in that string, such as the date of the last crawl, or offset 
value, or whatever makes sense for your repository.
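
To make that concrete, here is a minimal sketch of how a Kafka repository 
connector might round-trip an offset through the seeding version string.  The 
class name, the fetchMessageIdsFrom() helper, and the plain-number version string 
format are illustrative assumptions on my part, not part of the ManifoldCF or 
Kafka APIs:

{code}
import java.util.Collections;
import java.util.List;

import org.apache.manifoldcf.agents.interfaces.ServiceInterruption;
import org.apache.manifoldcf.core.interfaces.ManifoldCFException;
import org.apache.manifoldcf.core.interfaces.Specification;
import org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector;
import org.apache.manifoldcf.crawler.interfaces.ISeedingActivity;

public class KafkaRepositoryConnector extends BaseRepositoryConnector
{
  @Override
  public String addSeedDocuments(ISeedingActivity activities, Specification spec,
    String lastSeedVersion, long seedTime, int jobMode)
    throws ManifoldCFException, ServiceInterruption
  {
    // First run: lastSeedVersion is null, so start from offset 0 (an assumption;
    // you could equally start from the topic's current end).
    // Later runs: resume from the offset returned at the end of the previous run.
    long startOffset = (lastSeedVersion == null) ? 0L : Long.parseLong(lastSeedVersion);

    // Hypothetical helper standing in for the real Kafka consumer logic:
    // read message identifiers from the configured topic, starting at startOffset.
    List<String> messageIds = fetchMessageIdsFrom(startOffset);
    for (String id : messageIds)
      activities.addSeedDocument(id);

    // Whatever string is returned here is stored with the job and handed back
    // unchanged as lastSeedVersion on the next run.
    return Long.toString(startOffset + messageIds.size());
  }

  // Placeholder; a real connector would poll the Kafka topic here.
  private List<String> fetchMessageIdsFrom(long startOffset)
  {
    return Collections.emptyList();
  }
}
{code}

The only contract that matters is the round trip: the string you return is stored 
by the framework and passed back in unchanged on the next run, so any encoding 
that lets you resume (offset, timestamp, etc.) will do.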

Hope this helps.

> Apache Kafka Output Connector
> -----------------------------
>
>                 Key: CONNECTORS-1162
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
>             Project: ManifoldCF
>          Issue Type: Wish
>    Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
>            Reporter: Rafa Haro
>            Assignee: Karl Wright
>              Labels: gsoc, gsoc2015
>             Fix For: ManifoldCF 2.3
>
>         Attachments: 1.JPG, 2.JPG
>
>
> Kafka is a distributed, partitioned, replicated commit log service. It 
> provides the functionality of a messaging system, but with a unique design. A 
> single Kafka broker can handle hundreds of megabytes of reads and writes per 
> second from thousands of clients.
> Apache Kafka is being used for a number of use cases. One of them is to use 
> Kafka as a feeding system for streaming BigData processes, in both Apache 
> Spark and Hadoop environments. A Kafka output connector could be used to 
> stream or dispatch crawled documents or metadata into a BigData processing 
> pipeline.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)