Re: how to ensure that rivers are equally distributed across nodes in the cluster

[email protected] Mon, 31 Mar 2014 07:17:15 -0700

Hi Loïc,

the gatherer plugin is still at a very early stage (pre-alpha) and not ready
for use, because I work on it in my spare time.


https://github.com/jprante/elasticsearch-gatherer

Jörg




On Mon, Mar 31, 2014 at 4:09 PM, loïc moriamé <[email protected]> wrote:

> Hi Jörg!
> Do you have any update on your über plugin?
>
> I'm really interested, because I want to connect my MS SQL database to
> Elasticsearch.
> I can't modify the software, but I want a "near real time"
> integration between my MS SQL database and ES.
>
> So I hope I can use your work.
>
> On Thursday, 26 December 2013 at 13:37:36 UTC+1, Jörg Prante wrote:
>
>> Rivers were originally introduced for demo purposes, to quickly load some
>> data into ES and build showcases from Twitter or Wikipedia data.
>>
>> The Elasticsearch team is now in favor of Logstash.
>>
>> I am starting this gatherer plugin for my use cases where I am not able to
>> use Logstash. I have very complex streams, e.g. ISO 2709 record formats with
>> several hundred custom transformations in the data, which I reduce to
>> primitive key/value streams and RDF triples. I also plan to build RDF feeds
>> for semantic web/linked data platforms, where ES is the search engine.
>>
>> The gatherer "uber" plugin should work like this:
>>
>> - it can be installed on one or more nodes and provides a common bulk
>> indexing framework
>>
>> - a gatherer plugin registers in the cluster state (on node level)
>>
>> - there are standard capabilities, but a gatherer plugin capability can
>> be extended in a live cluster by submitting code for inputs, codecs, and
>> filters, picked up by a custom class loader (for example, JDBC, and a
>> driver jar, and tabular key/value output)
>>
>> - a gatherer plugin sits idle and accepts jobs in the form of JSON commands
>> (defining the selection of inputs, codecs, and filters), for example, an
>> SQL command
>>
>> - if a gatherer is told to distribute jobs fairly and is too busy
>> (measured by its active job queue length), it forwards jobs to other
>> gatherers (another distribution method is crontab-like scheduling); the
>> results of the jobs (ok, failed, retry) are also registered in the cluster
>> state (an internal index may be better, because there can be tens of
>> thousands of such jobs)
>>
>> - a client can ask for the state of all the gatherers and all the job
>> results
>>
>> - all jobs can be partitioned and processed in parallel for maximum
>> throughput
>>
>> - the gatherer also collects metrics/statistics on the jobs that completed
>> successfully
>>
>> Another thing I find important is to enable scripting for processing the
>> data streams (JSR 223 scripting, especially Groovy, Jython, JRuby,
>> Rhino/Nashorn).
>>
>> Right now there is no repo; I plan to kick-start it in early 2014.
>>
>> Jörg
>>
>
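
To illustrate the fair-distribution rule from the quoted list above (a busy
gatherer forwards jobs to other gatherers, judged by active job queue length),
here is a minimal sketch in Python. It is not taken from the gatherer plugin:
the class Gatherer, the method submit, the threshold max_queue, and the job
field names are all hypothetical and only meant to show the idea.

```python
import json
from collections import deque


class Gatherer:
    """Hypothetical stand-in for one gatherer node; not the plugin's API."""

    def __init__(self, name, max_queue=2):
        self.name = name
        self.max_queue = max_queue  # assumed threshold for "too busy"
        self.queue = deque()        # active job queue
        self.peers = []             # other gatherers known to this one

    def is_busy(self):
        return len(self.queue) >= self.max_queue

    def submit(self, job_json):
        """Accept a JSON job, or forward it to the least busy peer.

        Returns the name of the gatherer that ended up with the job.
        """
        job = json.loads(job_json)
        if self.is_busy() and self.peers:
            target = min(self.peers, key=lambda g: len(g.queue))
            # Forward only if the peer really has a shorter queue.
            if len(target.queue) < len(self.queue):
                target.queue.append(job)
                return target.name
        self.queue.append(job)
        return self.name


# Hypothetical job command; the field names are illustrative only.
job = json.dumps({
    "input": "jdbc",
    "codec": "tabular",
    "filters": [],
    "sql": "select * from orders",
})

a, b = Gatherer("node-a"), Gatherer("node-b")
a.peers, b.peers = [b], [a]
placements = [a.submit(job) for _ in range(4)]
print(placements)
```

With the assumed threshold of 2, the first two jobs stay on node-a and the
next two are forwarded to node-b. A real implementation would also have to
record job results (ok, failed, retry) somewhere shared, which this sketch
leaves out.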

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoGVzHp8vCsf%2BY1%2B9fVy%2BatkQ%2ByPejoMDex_CPwB-mwAsA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
