Re: Ideal method of indexing data into ES from SQL Server?

[email protected] Fri, 07 Mar 2014 14:21:49 -0800

1. Yes, rivers will be deprecated soon. But, and that is the good news,
even without rivers, I will continue to offer a plugin-based solution for
pulling data from external sources, for instance JDBC. The working name of
this project is the "gatherer", which can organize data fetch processes by
delegating "jobs" to gatherer nodes (similar to web crawler jobs).

1.1. From an architectural point of view, it is easier to plug an ES "push"
indexer routine at that tier where data is inserted/updated into the RDBMS,
to execute the action twice, one for RDBMS, second for ES. Beside
performance and scalability aspects, this is much better suited to detect
data anomalies (deletions, updates, or data parts that should not be
indexed). Poll is inferior to push - the number of polling clients is
strictly limited by server capacity while push has no limits at all.

1.2. Not sure why update requests should be something special. If you want
instant indexing or even real-time indexing, you must tweak your middle
tier or RDBMS to do so. You do not need a queue unless you are unsure about
how to organize a store-and-forward process for reindexing data, or reset
the indexer process to a safe commit point.

1.3. Yes, RabbitMQ is just another part in such a push queue. This might
help in case you want to achieve reliability by loosely coupled systems
(RDBMS, queue, ES are all independent) that can be managed separately. If
you don't need persistent streams, you also don't need message queues.

2. The JDBC river, like all rivers, can run in many instances, distributed
over as many ES nodes as you like. But this is a nightmare to manage. River
failovers in case of node outages are not seamless, data from RDBMS source
might be dropped (this is the motivation for message queues). The JDBC
river can fetch JDBC result rows as fast as possible, even concurrently.
And it uses bulk indexing and distributes the load over all nodes that hold
the index. The point is, a river instance is a singleton, and if this
singleton fails, you're in trouble. There is no coordination between river
instances in polling data or to reset to safe points. Each river instance
stands alone for itself, and that does not scale, it does not fit into the
picture of all other ES components that are aware of their distributed
nature.

3. Yes, this can be modeled. By using SQL column names with dot notation
and array elements, you can perform hierarchical construction of JSON docs,
very flexible, perfect for denormalization. If you have a concrete example,
I might be able to suggest more.

Jörg

On Fri, Mar 7, 2014 at 4:35 PM, Joshua P <[email protected]> wrote:

> Hey everyone,
>
> I'm trying to figure out the best way to implement Elasticsearch in our
> company.
>
> We currently have a SQL Server that houses all of our data and want to get
> some better search and analytics functionality so ES seems like an awesome
> option.
>
> I've explored using the JDBC river to index data and I've successfully set
> up elasticsearch with some test data.
>
> There are some things I'm not clear on:
>
>    1. The status of rivers in Elasticsearch. Another thread suggested
>    that they were convenience feature added at the beginning to get data out
>    of CouchDB and into ES easily. Are they going to be deprecated soon? It was
>    suggested by on of the ES developers that companies should be coming up
>    with their own solutions for indexing data.
>       1. On that note, what does a custom indexing solution look like in
>       a scenario like mine? Does this mean leveraging the provided APIs to 
> create
>       a custom solution that polls the SQL server for changes?
>       2. Should I be modifying our middle tier to also queue up update
>       requests as a user modifies data?
>       3. I've read RabbitMQ can be used for queuing up bulk requests, but
>       isn't this just done via another river and therefore has the same 
> concerns
>       as above?
>    2. Does using the JDBC river scale well as the size of your data
>    grows? I might have this wrong, but rivers don't allow you to take
>    advantage of the distributed nature of ES?
>    3. The syntax on the JDBC river for creating nested documents in the
>    ES index. For example I have an PrimaryAddress table and an AddressAlias
>    table which contains alternate addresses/names for a single PrimaryAddress.
>    Can this be modeled in an ES index? The number of AddressAliases isn't the
>    same for each PrimaryAddress
>
> Thanks!
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/cbadcb61-a0c0-4907-9182-5075566b8a7d%40googlegroups.com<https://groups.google.com/d/msgid/elasticsearch/cbadcb61-a0c0-4907-9182-5075566b8a7d%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFQoJvjtzX9%3D8%3Df0h3s3W7QedoCqVUUjacGRLwc9Yyt7A%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Re: Ideal method of indexing data into ES from SQL Server?

Reply via email to