We are currently evaluating alternatives for two of our use cases because we are slowly hitting the performance ceiling. Elasticsearch looks like a pretty good candidate for us! However, I wondered if someone out there in this group could tell me whether it really is the right choice for us.
Our first use case currently manages 40+ million documents (maps) that could easily be structured as JSON documents, with an average size of 10-20 KB per document. Documents are identified by a unique id prefixed with a kind of "partition" identifier, a la <partition-name>-<UUID>. These logical partitions are not balanced and contain anywhere from 50,000-100,000 up to a few million documents. Partitions typically grow slowly, but in large batches - when a partition grows, thousands of documents are added at once. Once a partition is populated, around 25% of the documents within it are updated roughly 3 times a day. Each partition must also be read in batches of around 40,000-50,000 documents roughly 3 times a day. Documents are fetched by id, so we have to hit the database with lists of a few thousand ids. Currently our ids are evenly spread within a partition (due to the use of UUIDs). We plan to change this, however, so that data that is often read together has ids that are close to each other (in an alphabetical sense).

We are currently using a combination of MySQL and Lucene with a pretty trivial MySQL schema - basically a primary key and a blob where the documents are stored. We then index the documents with Lucene. The application queries the Lucene index for document ids, which are then fetched from the database. For indexing we use many of Lucene's gems in order to provide rich query possibilities, so we need full control over manual indexing configuration (via code extensions?) and query building / parsing. The catch is that one of our requirements is to be able to search for freshly stored or updated documents immediately - but we do have some (!) time, since there's no user sitting on the other side staring at the screen :o) We currently index right after storing, which again is typically done in batches of around 40,000-50,000 documents, and indexing currently takes a few seconds.
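To make the discussion concrete, here is a minimal sketch (plain Python, no client library) of how the batched id-based reads and batch writes described above could map onto Elasticsearch's multi-get (_mget) and bulk (_bulk) APIs. The index name "documents" and the helper names are made up for illustration; only the id scheme <partition>-<UUID> comes from our setup.

```python
import json

def mget_body(partition, uuids):
    """Request body for GET /documents/_mget: fetch a batch of docs by id."""
    return {"docs": [{"_id": "%s-%s" % (partition, u)} for u in uuids]}

def bulk_payload(partition, docs):
    """Newline-delimited payload for POST /documents/_bulk.

    `docs` is an iterable of (uuid, source-dict) pairs; each document
    becomes an action line followed by its source line.
    """
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_id": "%s-%s" % (partition, doc_id)}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline
```

With batches of 40,000-50,000 documents we would presumably split the bulk payload into several requests rather than send one giant body.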
Our second use case is time-range-aware aggregations over many hundreds of millions of rows - e.g. "How many clicks did we have in the last 31 days?", returned as a series of data points grouped by day. Data is again structured in partitions, where in this case a partition is a combination of numeric values (a composite primary key). We are currently using denormalized MySQL tables with some strategic indices to support the typical WHERE and GROUP BY clauses. We have to insert/update up to a few million rows per hour, where 98% of all incoming rows are updates and 2% are new rows. Queries must have very low latency, and alongside the aggregation queries we will have many documents accessed concurrently by primary key.

Thanks in advance for your advice!
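The clicks-per-day question above looks like it would map onto Elasticsearch's date_histogram aggregation. Below is a sketch of the request body we imagine sending - the field name "timestamp" is an assumption, and the exact name of the interval parameter has varied between Elasticsearch versions, so please correct me if this is off:

```python
def clicks_per_day_query(days=31):
    """Search request body: click counts per day over the last `days` days."""
    return {
        "size": 0,  # we only need the aggregation buckets, not the hits
        "query": {"range": {"timestamp": {"gte": "now-%dd/d" % days}}},
        "aggs": {
            "clicks_per_day": {
                # older Elasticsearch versions name this parameter "interval"
                "date_histogram": {"field": "timestamp", "calendar_interval": "day"}
            }
        },
    }
```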
