Honestly I do not remember how the dimension exclusion vs dimension inclusion stuff works. I have to look it up every time. If you look at any segment for that datasource in the Coordinator Console, it should give you a list of dimensions and metrics. Do they match what you expect?
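If memory serves, it works roughly like this: when "dimensions" is non-empty, only the listed columns are indexed as dimensions; when "dimensions" is empty, every parsed column except the timestamp column, the metric input columns, and anything listed in "dimensionExclusions" becomes a dimension. So the exclusion-based form of the spec quoted below would look something like this (column names are taken from that spec; treat this as a sketch from memory, not gospel):

```json
"dimensionsSpec" : {
  "dimensions" : [],
  "dimensionExclusions" : ["userID", "requestID", "ipAddress", "URL", "referrer", "userAgent"]
}
```

The spec below uses the explicit-inclusion form, so userID should not be getting indexed as a dimension, but checking the segment's dimension list in the console is the quickest way to confirm.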
On Sun, Oct 21, 2018 at 9:17 PM alex.rnv...@gmail.com <alex.rnv...@gmail.com> wrote:
>
> On 2018/10/19 14:42:18, Charles Allen <charles.al...@snap.com.INVALID> wrote:
> > This is a good callout. Those numbers still seem very slow. One item I'm
> > curious about is whether you are dropping the id when you index, or if
> > the id is also being indexed into the druid segments.
> >
> > With how druid does indexing, it dictionary-encodes all the dimension
> > values. So the cardinality of rows is a factor of QueryGranularity and
> > the cardinality of dimension value tuples per query granularity
> > "bucket". This allows dynamic slice and dice on the data. But if it is
> > accidentally including a dimension with very high cardinality (like ID)
> > in the dictionary encoding, then it is not able to make efficient use of
> > roll-up.
> >
> > In order to facilitate dynamic slice and dice, the theta sketches need
> > to have *some* kind of object stored per dimension tuple per query
> > granularity bucket (but only if the tuple appears in that bucket). So
> > you can reduce the number of things that get read off of disk by trying
> > to increase the rollup. Usually this is done by dropping or reducing
> > high cardinality dimensions, but it can also be done by coarsening the
> > query granularity.
> >
> > Another trick is to use topN or timeseries queries. In general, those
> > query types admit better optimizations since they have a very limited
> > scope of use cases.
> >
> > Now, as to the Theta Sketches themselves, I am not as familiar with the
> > Theta Sketches code paths. It is possible there are performance gains to
> > be had.
> >
> > Hope this helps,
> > Charles Allen
> >
> >
> > On Fri, Oct 19, 2018 at 3:38 AM alex.rnv...@gmail.com <alex.rnv...@gmail.com> wrote:
> >
> > > Hi Druid devs,
> > > I am testing Druid for our specific count-distinct estimation case.
> > > Data was ingested via the Hadoop indexer.
> > > When simplified, it has the following schema:
> > >     timestamp | key | country | theta-sketch<id> | event-counter
> > > So there are 2 dimensions, one counter metric, and one theta sketch
> > > metric. Data granularity is a DAY. The data source in deep storage is
> > > 150-200GB per day.
> > >
> > > I was doing some test runs with our small test cluster (4 Historical
> > > nodes, 8 CPU, 64GB RAM, 500GB SSD). I admit that with this RAM-SSD
> > > ratio and number of nodes it is not going to be fast. The question,
> > > though, is about theta-sketch performance compared to counter
> > > aggregation. The difference is an order of magnitude. E.g., a GroupBy
> > > query for a single key, aggregated over 7 days:
> > >     event-counters - 30 seconds
> > >     theta-sketches - 7 minutes
> > >
> > > Theta Sketch aggregation implies more work than summing up numbers, of
> > > course. But the Theta Sketch documentation says that the union
> > > operation is very fast.
> > >
> > > I did some profiling of one of the Historical nodes. Most of the CPU
> > > time is spent in
> > > io.druid.query.aggregation.datasketches.theta.SketchObjectStrategy.fromByteBuffer(ByteBuffer, int),
> > > which I think is moving Sketch objects from off-heap to the managed
> > > heap. To be precise, time is spent in the sketch library methods
> > >     com.yahoo.memory.WritableMemoryImpl.region
> > >     com.yahoo.memory.Memory.wrap
> > >
> > > I do not think anything is wrong with this code; the question is why
> > > it is called so many times. Which leads to the main question: I do not
> > > really understand how a theta-sketch is stored in a columnar database.
> > > Assuming it is stored the same way as a counter, it means that for
> > > every combination of "key" and "country" (the dimensions from above)
> > > there is a theta sketch structure to be stored. In our case "key"
> > > cardinality is quite high, hence so many Sketch structure accesses in
> > > Java. Looks extremely inefficient. Again, it is just an assumption.
> > > Please excuse me if I am wrong here.
> > >
> > > If you continue thinking in this direction, in terms of performance it
> > > makes sense to store one Theta sketch for every dimension value, so
> > > instead of having cardinality(key) * cardinality(countries) entries
> > > there will be cardinality(key) + cardinality(countries) sketches. In
> > > this case it looks like an index, not a part of the columnar storage
> > > itself.
> > > Queries over the 2 dimensions are easy, as there is only one
> > > INTERSECTION to be done. It all looks like a natural thing to do for
> > > sketches, as there would be a win in terms of both storage and query
> > > performance.
> > > My question is whether I am right or wrong in my assumptions. If my
> > > understanding is not correct and sketches are already stored in an
> > > optimal way, could someone give advice on speeding up computations on
> > > a single Historical node? Otherwise, I wanted to ask if there is an
> > > attempt or discussion to use sketches in the way I described.
> > > Thanks in advance.
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> > > For additional commands, e-mail: dev-h...@druid.apache.org
> > >
> >
>
> Charles,
> thank you for the quick reply.
> My data schema part in the hadoop indexing spec looks like this (we copy
> data from TSV):
>
> "dataSchema" : {
>   "dataSource" : "theta-poc",
>   "granularitySpec" : {
>     "type" : "uniform",
>     "segmentGranularity" : "day",
>     "queryGranularity" : "none",
>     "intervals" : ["2018-07-14/2018-07-15"]
>   },
>   "parser" : {
>     "type" : "hadoopyString",
>     "parseSpec" : {
>       "format" : "tsv",
>       "dimensionsSpec" : {
>         "dimensions" : [
>           "country",
>           "region",
>           "supplierID",
>           "segmentID"
>         ]
>       },
>       "timestampSpec" : {
>         "format" : "auto",
>         "column" : "time"
>       },
>       "columns" : [
>         "time",
>         "ipAddress",
>         "country",
>         "region",
>         "URL",
>         "referrer",
>         "userAgent",
>         "userID",
>         "supplierID",
>         "segmentID",
>         "requestID"
>       ],
>       "listDelimiter" : ","
>     }
>   },
>   "metricsSpec" : [
>     {
>       "name" : "count",
>       "type" : "count"
>     },
>     {
>       "type" : "thetaSketch",
>       "name" : "user_unique_theta",
>       "fieldName" : "userID",
>       "isInputThetaSketch" : false,
>       "size" : 16384
>     }
>   ]
> }
>
> So, most of the columns are skipped; the high-cardinality column is
> userID, which we only store as a sketch. The dimension columns have these
> cardinalities:
>     country - 250
>     region - 5K (the most frequent value is empty, though)
>     supplierID - 1K
>     segmentID - 100K
> There are dependency relations between country and region, and between
> supplierID and segmentID, but the total number of permutations is still
> quite high. Please let me know if there is something suspicious here.
>
> I tried a Timeseries query instead of GroupBy, and it is indeed faster.
> It looks like it makes sense for us to have another monthly dataset,
> instead of aggregating data from the daily-granularity dataset.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
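P.S. For anyone following along, the timeseries form of the 7-day single-key query would look roughly like the following. The filter dimension/value and the output name "unique_users" are made up for illustration; the dataSource, interval, and sketch metric name come from the spec above.

```json
{
  "queryType" : "timeseries",
  "dataSource" : "theta-poc",
  "granularity" : "all",
  "intervals" : ["2018-07-14/2018-07-21"],
  "filter" : {
    "type" : "selector",
    "dimension" : "segmentID",
    "value" : "12345"
  },
  "aggregations" : [
    {
      "type" : "thetaSketch",
      "name" : "unique_users",
      "fieldName" : "user_unique_theta"
    }
  ]
}
```

With "granularity" set to "all", the sketches for the whole interval are unioned into a single result row instead of being grouped per dimension-value combination, which is presumably where the speedup over GroupBy comes from.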