[
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646925#comment-14646925
]
Karl Wright commented on CONNECTORS-1162:
-----------------------------------------
bq. Are we going to get topic messages from the beginning, or only from when the
job starts?
It is standard practice for a job to represent all documents in a repository,
unless there is an explicit way in the UI to limit the documents taken based on
timestamp. I don't think such a UI feature is necessary for the first version
of the Kafka connector, though.
bq. Also, I want to ask how I can store the offset value so that consumption can
resume when another job starts.
I assume that you mean, "how do I get ManifoldCF to crawl only the new
documents that were created since the last job run?" If that is correct, then
have a look at the javadoc for the addSeedDocuments() method:
{code}
/** Queue "seed" documents. Seed documents are the starting places for
crawling activity. Documents
* are seeded when this method calls appropriate methods in the passed in
ISeedingActivity object.
*
* This method can choose to find repository changes that happen only during
the specified time interval.
* The seeds recorded by this method will be viewed by the framework based on
what the
* getConnectorModel() method returns.
*
* It is not a big problem if the connector chooses to create more seeds than
are
* strictly necessary; it is merely a question of overall work required.
*
* The end time and seeding version string passed to this method may be
interpreted for greatest efficiency.
* For continuous crawling jobs, this method will
* be called once, when the job starts, and at various periodic intervals as
the job executes.
*
* When a job's specification is changed, the framework automatically resets
the seeding version string to null. The
* seeding version string may also be set to null on each job run, depending
on the connector model returned by
* getConnectorModel().
*
* Note that it is always ok to send MORE documents rather than less to this
method.
* The connector will be connected before this method can be called.
*@param activities is the interface this method should use to perform
whatever framework actions are desired.
*@param spec is a document specification (that comes from the job).
*@param seedTime is the end of the time range of documents to consider,
exclusive.
*@param lastSeedVersionString is the last seeding version string for this
job, or null if the job has no previous seeding version string.
*@param jobMode is an integer describing how the job is being run, whether
continuous or once-only.
*@return an updated seeding version string, to be stored with the job.
*/
public String addSeedDocuments(ISeedingActivity activities, Specification
spec,
String lastSeedVersion, long seedTime, int jobMode)
throws ManifoldCFException, ServiceInterruption;
{code}
For "lastSeedVersion", your connector will initially receive null. You should
return a seeding version string that MCF will store. On the next job run, that
string you returned is passed back in as "lastSeedVersion". You can put
whatever you like in that string, such as the date of the last crawl, or offset
value, or whatever makes sense for your repository.
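For a Kafka-based repository connector, that could mean recording the last
consumed offset. Here is a minimal sketch of the idea; KafkaMessage and
fetchMessagesFromOffset() are purely illustrative names (not part of ManifoldCF
or the Kafka client API), and only the ISeedingActivity.addSeedDocument() call
is real MCF API:
{code}
// Hypothetical sketch: KafkaMessage and fetchMessagesFromOffset() are
// illustrative placeholders, not real ManifoldCF or Kafka client classes.
@Override
public String addSeedDocuments(ISeedingActivity activities, Specification spec,
  String lastSeedVersion, long seedTime, int jobMode)
  throws ManifoldCFException, ServiceInterruption
{
  // null means there is no previous seeding version string (first run, or the
  // job specification changed), so start consuming from offset 0.
  long startOffset = (lastSeedVersion == null) ? 0L : Long.parseLong(lastSeedVersion);

  // Consume messages beginning at startOffset and register each one as a seed.
  long nextOffset = startOffset;
  for (KafkaMessage message : fetchMessagesFromOffset(startOffset))
  {
    activities.addSeedDocument(message.getDocumentIdentifier());
    nextOffset = message.getOffset() + 1;
  }

  // MCF stores this string with the job and passes it back as lastSeedVersion
  // on the next run, so the connector resumes where it left off.
  return Long.toString(nextOffset);
}
{code}
The important part is simply that whatever state you need in order to resume
goes into the returned string; MCF persists it between job runs.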
Hope this helps.
> Apache Kafka Output Connector
> -----------------------------
>
> Key: CONNECTORS-1162
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
> Project: ManifoldCF
> Issue Type: Wish
> Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
> Reporter: Rafa Haro
> Assignee: Karl Wright
> Labels: gsoc, gsoc2015
> Fix For: ManifoldCF 2.3
>
> Attachments: 1.JPG, 2.JPG
>
>
> Kafka is a distributed, partitioned, replicated commit log service. It
> provides the functionality of a messaging system, but with a unique design. A
> single Kafka broker can handle hundreds of megabytes of reads and writes per
> second from thousands of clients.
> Apache Kafka is used for a number of use cases. One of them is to use Kafka
> as a feeding system for streaming BigData processes, in both Apache Spark and
> Hadoop environments. A Kafka output connector could be used to stream or
> dispatch crawled documents or metadata and put them into a BigData processing
> pipeline.