Hi Jörg! Do you have any update on your über plugin? I'm really interested, because I want to connect my MSSQL database to Elasticsearch. I can't modify the software, but I want "near real time" integration between my MSSQL DB and ES.
So I hope I can use your work.

On Thursday, December 26, 2013, 13:37:36 UTC+1, Jörg Prante wrote:
>
> Rivers were once introduced for demo purposes, to load some data into ES quickly and build showcases from Twitter or Wikipedia data.
>
> The Elasticsearch team is now in favor of Logstash.
>
> I started this gatherer plugin for my use cases, where I am not able to use Logstash. I have very complex streams, e.g. ISO 2709 record formats with some hundred custom transformations in the data, which I reduce to primitive key/value streams and RDF triples. I also plan to build RDF feeds for semantic web / linked data platforms where ES is the search engine.
>
> The gatherer "uber" plugin should work like this:
>
> - it can be installed on one or more nodes and provides a common bulk indexing framework
>
> - a gatherer plugin registers in the cluster state (at node level)
>
> - there are standard capabilities, but a gatherer plugin's capabilities can be extended in a live cluster by submitting code for inputs, codecs, and filters, picked up by a custom class loader (for example, JDBC with a driver jar and tabular key/value output)
>
> - a gatherer plugin idles and accepts jobs in the form of JSON commands (defining the selection of inputs, codecs, and filters), for example an SQL command
>
> - if a gatherer is told to distribute jobs fairly and is too busy (active job queue length), it forwards them to other gatherers (other methods are crontab-like scheduling), and the results of the jobs (ok, failed, retry) are also registered in the cluster state (maybe an internal index is better, because there can be tens of thousands of such jobs)
>
> - a client can ask for the state of all gatherers and all job results
>
> - all jobs can be partitioned and processed in parallel for maximum throughput
>
> - the gatherer also creates metrics/statistics for successfully completed jobs
>
> Another thing I find important is to enable scripting for processing the data streams (JSR 223 scripting, especially Groovy, Jython, JRuby, Rhino/Nashorn).
>
> Right now there is no repo; I plan to kickstart the repo in early 2014.
>
> Jörg
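To make sure I understand the "jobs as JSON commands" idea for my MSSQL case: I imagine a job definition looking something like the sketch below. All field names here are my own guesses for illustration, not something from your post.

```json
{
  "type": "jdbc",
  "jdbc": {
    "url": "jdbc:sqlserver://localhost:1433;databaseName=mydb",
    "user": "es_reader",
    "sql": "select id, name, updated_at from products where updated_at > ?",
    "schedule": "0 */5 * * * ?"
  },
  "codec": "tabular-keyvalue",
  "index": "products"
}
```

If the gatherer picks up the JDBC driver jar via its class loader and runs this on a schedule against an `updated_at` column, that would cover my "near real time" requirement without touching the source application.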
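On the JSR 223 point: even before the plugin exists, the shape of such a per-record script filter can be sketched in plain Java. This is only a sketch under my own assumptions (the class and method names are mine); it applies a script to one field value if a JavaScript engine happens to be on the classpath, and falls back to plain Java otherwise so it runs on any JVM.

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

public class ScriptFilter {

    // Hypothetical per-record transformation: uppercase one field value.
    // If a JSR 223 JavaScript engine (Rhino, Nashorn, ...) is available,
    // the transformation is expressed as a script; otherwise we fall back
    // to plain Java with the same effect.
    static String transform(String value) {
        ScriptEngine js = new ScriptEngineManager().getEngineByName("javascript");
        if (js != null) {
            try {
                js.put("v", value);
                return String.valueOf(js.eval("v.toUpperCase()"));
            } catch (ScriptException e) {
                // script failed; fall through to the plain-Java fallback
            }
        }
        return value.toUpperCase();
    }

    public static void main(String[] args) {
        System.out.println(transform("mssql record")); // MSSQL RECORD
    }
}
```

Swapping in Groovy, Jython, or JRuby would only change the engine name passed to `getEngineByName` plus the jars on the classpath, which is exactly what makes JSR 223 attractive for extending a live cluster.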
