1. Yes, rivers will be deprecated soon. But, and that is the good news, even without rivers, I will continue to offer a plugin-based solution for pulling data from external sources, for instance JDBC. The working name of this project is the "gatherer", which can organize data fetch processes by delegating "jobs" to gatherer nodes (similar to web crawler jobs).
1.1. From an architectural point of view, it is easier to plug an ES "push" indexer routine at that tier where data is inserted/updated into the RDBMS, to execute the action twice, one for RDBMS, second for ES. Beside performance and scalability aspects, this is much better suited to detect data anomalies (deletions, updates, or data parts that should not be indexed). Poll is inferior to push - the number of polling clients is strictly limited by server capacity while push has no limits at all. 1.2. Not sure why update requests should be something special. If you want instant indexing or even real-time indexing, you must tweak your middle tier or RDBMS to do so. You do not need a queue unless you are unsure about how to organize a store-and-forward process for reindexing data, or reset the indexer process to a safe commit point. 1.3. Yes, RabbitMQ is just another part in such a push queue. This might help in case you want to achieve reliability by loosely coupled systems (RDBMS, queue, ES are all independent) that can be managed separately. If you don't need persistent streams, you also don't need message queues. 2. The JDBC river, like all rivers, can run in many instances, distributed over as many ES nodes as you like. But this is a nightmare to manage. River failovers in case of node outages are not seamless, data from RDBMS source might be dropped (this is the motivation for message queues). The JDBC river can fetch JDBC result rows as fast as possible, even concurrently. And it uses bulk indexing and distributes the load over all nodes that hold the index. The point is, a river instance is a singleton, and if this singleton fails, you're in trouble. There is no coordination between river instances in polling data or to reset to safe points. Each river instance stands alone for itself, and that does not scale, it does not fit into the picture of all other ES components that are aware of their distributed nature. 3. Yes, this can be modeled. By using SQL column names with dot notation and array elements, you can perform hierarchical construction of JSON docs, very flexible, perfect for denormalization. If you have a concrete example, I might be able to suggest more. Jörg On Fri, Mar 7, 2014 at 4:35 PM, Joshua P <[email protected]> wrote: > Hey everyone, > > I'm trying to figure out the best way to implement Elasticsearch in our > company. > > We currently have a SQL Server that houses all of our data and want to get > some better search and analytics functionality so ES seems like an awesome > option. > > I've explored using the JDBC river to index data and I've successfully set > up elasticsearch with some test data. > > There are some things I'm not clear on: > > 1. The status of rivers in Elasticsearch. Another thread suggested > that they were convenience feature added at the beginning to get data out > of CouchDB and into ES easily. Are they going to be deprecated soon? It was > suggested by on of the ES developers that companies should be coming up > with their own solutions for indexing data. > 1. On that note, what does a custom indexing solution look like in > a scenario like mine? Does this mean leveraging the provided APIs to > create > a custom solution that polls the SQL server for changes? > 2. Should I be modifying our middle tier to also queue up update > requests as a user modifies data? > 3. I've read RabbitMQ can be used for queuing up bulk requests, but > isn't this just done via another river and therefore has the same > concerns > as above? > 2. Does using the JDBC river scale well as the size of your data > grows? I might have this wrong, but rivers don't allow you to take > advantage of the distributed nature of ES? > 3. The syntax on the JDBC river for creating nested documents in the > ES index. For example I have an PrimaryAddress table and an AddressAlias > table which contains alternate addresses/names for a single PrimaryAddress. > Can this be modeled in an ES index? The number of AddressAliases isn't the > same for each PrimaryAddress > > Thanks! > > -- > You received this message because you are subscribed to the Google Groups > "elasticsearch" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. > To view this discussion on the web visit > https://groups.google.com/d/msgid/elasticsearch/cbadcb61-a0c0-4907-9182-5075566b8a7d%40googlegroups.com<https://groups.google.com/d/msgid/elasticsearch/cbadcb61-a0c0-4907-9182-5075566b8a7d%40googlegroups.com?utm_medium=email&utm_source=footer> > . > For more options, visit https://groups.google.com/d/optout. > -- You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFQoJvjtzX9%3D8%3Df0h3s3W7QedoCqVUUjacGRLwc9Yyt7A%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.
