Hi Peter,

It sounds like Elasticsearch is a great fit for both of your use cases. The trick is setting up a cluster (or perhaps one per use case) that meets your needs. Elasticsearch scales seamlessly both horizontally and vertically, and will take full advantage of whatever hardware you give it. Figuring out how many servers you need, and how powerful they need to be, will take some time and experimentation, but I have no doubt that Elasticsearch can handle your data.
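One knob that matters a lot for batch-heavy workloads is the index refresh interval, which controls how quickly newly indexed documents become searchable. As a minimal sketch, here is the settings body involved - the index name "documents" and the 30s value are just illustrative assumptions, not something from your setup:

```python
import json

# Sketch: settings body for tuning how often an index refreshes, i.e. how
# quickly newly indexed documents become visible to search. The index name
# and the 30s value are illustrative; the body would be sent via the index
# settings API (PUT /documents/_settings).
refresh_settings = {
    "index": {
        # Default is "1s"; raising it trades search freshness for
        # indexing throughput during large batch loads.
        "refresh_interval": "30s"
    }
}

print(json.dumps(refresh_settings))
```

Raising the interval (or disabling refresh during a large load and refreshing manually afterwards) is a common way to trade a little search freshness for indexing throughput.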
Since you are already using Lucene, you will be familiar with many of Elasticsearch's capabilities. Searching newly indexed data is not a problem, since the refresh rate is configurable, but there are performance trade-offs. In many cases Elasticsearch can even function as the primary data store, eliminating the need to keep your data in two places.

Have you seen the new aggregations feature coming in Elasticsearch 1.0 yet? It sounds like it could help with your second use case: http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-aggregations.html

My company, StackSearch, provides hosted Elasticsearch (on both Amazon EC2 and Rackspace, in all of their respective data centers) at http://qbox.io. We also provide consulting services for creating and managing data flow strategies, and we are an official reseller of Elasticsearch (the company) support contracts. Please let me know if there is anything we can do to help you.

On Thursday, December 19, 2013 1:46:09 PM UTC-6, [email protected] wrote:

> We are currently evaluating alternatives for two of our use cases because we are slowly hitting the roof performance-wise. Elasticsearch looks like a pretty good candidate for us! However, I wondered if someone out there in this group could tell me if it really is the right choice for us.
>
> Our first use case currently manages 40 million+ documents (maps) that could easily be structured as JSON documents, with an average size of 10-20 KB per document. Documents are identified by a unique id prefixed with some kind of "partition" identifier, a la <partition-name>-<UUID>. Those logical partitions are not balanced and contain anywhere from 50,000-100,000 up to a few million documents. Partitions typically grow slowly, but in large batches: if a partition grows, thousands of documents are added at once.
> Once a partition is populated, around 25% of the documents within the partition are updated around 3 times a day. Each partition must also be read in batches of around 40,000-50,000 documents around 3 times a day. Documents are fetched by id, so we have to hit the database with a list of a few thousand ids. Currently our ids are evenly spread within a partition (due to the usage of UUIDs). We plan to change this, however, so that data that is often read together has ids that are close to each other (in an alphabetical sense).
>
> We are currently using a combination of MySQL and Lucene with a pretty trivial MySQL schema - basically a primary key and a blob where the documents are stored. We then index the documents with Lucene. The Lucene index is queried by the application for document ids, which are then fetched from the database. For indexing we use many of the Lucene gems in order to provide rich query possibilities, so we need full power for manual indexing configuration (via code extensions?) and query building/parsing. The catch is that one of our requirements is to search immediately for freshly stored or updated documents - but we have some (!) time, since there's no user sitting on the other side staring at the screen :o) We currently index right after storing, which is typically again done in batches of around 40,000-50,000 documents, and indexing currently takes a few seconds.
>
> Our second use case is time-range-aware aggregations over many hundreds of millions of rows, e.g. "How many clicks did we have in the last 31 days?", returned as a series of data grouped by day. Data is again structured in partitions, where in this case a partition is a combination of numeric values (a composite primary key). We are currently using denormalized MySQL tables with some strategic indices to support typical WHERE and GROUP BY clauses.
> We have to insert/update up to a few million rows per hour, where 98% of all incoming rows are updates and 2% are new rows. Queries must have very low latency, and along with the aggregation queries we will have many documents concurrently accessed by primary key.
>
> Thanks in advance for your advice!

-- 
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/0c0fd9a2-0858-42ef-b226-6c8528576d16%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
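Since the first use case indexes in batches of 40,000-50,000 documents, the bulk API is the natural fit on the Elasticsearch side. A rough sketch of assembling its newline-delimited payload - the index name, type, ids, and document contents here are made-up placeholders, and the result would be sent to POST /_bulk:

```python
import json

# Sketch: building a bulk API request body. Each document contributes two
# lines: an action line (index/type/id) followed by the document source.
docs = [
    {"_id": "partition-a-0001", "body": {"title": "first doc"}},
    {"_id": "partition-a-0002", "body": {"title": "second doc"}},
]

lines = []
for doc in docs:
    # Action line telling Elasticsearch where to put the next source line.
    lines.append(json.dumps({"index": {"_index": "documents", "_type": "doc", "_id": doc["_id"]}}))
    # The document itself.
    lines.append(json.dumps(doc["body"]))

# The bulk API requires newline-delimited JSON with a trailing newline.
bulk_payload = "\n".join(lines) + "\n"

print(bulk_payload)
```

In practice you would chunk a 40,000-document batch into several bulk requests rather than one giant payload, and tune the chunk size experimentally.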
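To make the aggregations suggestion concrete for the second use case ("clicks in the last 31 days, grouped by day"), here is a sketch of the kind of search request body the new aggregations feature supports. The field names "timestamp" and "clicks" are assumptions about the schema; the body would be sent to the _search endpoint of whatever index holds the click data:

```python
import json

# Sketch: a date_histogram aggregation answering "how many clicks per day
# over the last 31 days?". Field names are illustrative assumptions.
clicks_per_day = {
    "size": 0,  # we only want the aggregation buckets, not the matching hits
    "query": {
        # Restrict to the last 31 days, rounded to day boundaries.
        "range": {"timestamp": {"gte": "now-31d/d"}}
    },
    "aggs": {
        "per_day": {
            # One bucket per day...
            "date_histogram": {"field": "timestamp", "interval": "day"},
            # ...with the click counts summed inside each bucket.
            "aggs": {
                "total_clicks": {"sum": {"field": "clicks"}}
            }
        }
    }
}

print(json.dumps(clicks_per_day, indent=2))
```

This replaces the denormalized-table-plus-GROUP-BY approach with a single query, and nesting further sub-aggregations gives you the per-partition breakdowns.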
