Re: Group by + where clause

Sarnath Fri, 11 Dec 2015 17:50:34 -0800

Hi Luke,

A few points:


1)
As I mentioned above, the KV pairs corresponding to an aggregation are
stored as one Elasticsearch document. ES indexes all fields and takes
care of the REST API DSL.
The KV pairs are the same as what Kylin stores in HBase. Kylin, as per my
understanding, splits the KV pairs between the rowkey and the columns: the
dimensions go to the rowkey and the metrics go to the columns. I believe
that is why Kylin does a full scan for the query Seshu posted. ES does not
differentiate between metrics and dimensions; it indexes everything. Hence
the range queries mentioned by Seshu should also run pretty fast with ES.
We will experiment with that and report back here as well.

2)
We don't do SQL to REST API conversion yet. The entire REST API DSL is
provided by ES, so we don't have to do any work on the REST API side.

3)
In the blog, we only make a claim about the fluctuation in performance when
filtering group-by queries on different dimensions. We don't claim anything
about performance itself. But we will get there soon.

4)
Druid, as I understood from Julian's email, does not build a cube. It stores
raw data in sorted order so that OLAP queries (group by) can be answered
fast without building a cube.
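
To make point 1 a bit more concrete, here is a rough sketch (not our actual
code) of how one pre-aggregated cell could be flattened into a single ES
document, and how a filter on a dimension plus a range on a metric could be
expressed in the ES query DSL. The field names (country, year, total_sales,
order_count) and both helper functions are made up for illustration; the
sketch also assumes the dimension fields are indexed as exact, not-analyzed
values so that a term filter matches them.

```python
# Hypothetical schema: the thread does not specify field names.

def make_cube_document(dimensions, metrics):
    """Flatten one pre-aggregated cell into a single flat document.

    Dimensions and metrics end up as sibling fields, so the search engine
    can index all of them without treating the two kinds differently.
    """
    doc = dict(dimensions)
    doc.update(metrics)
    return doc


def build_filter_query(dimension_filters, metric_ranges):
    """Build an Elasticsearch query DSL body as a plain dict.

    Each dimension becomes an exact-match `term` clause and each metric
    becomes a `range` clause, all combined under `bool.filter`.
    """
    clauses = [{"term": {field: value}}
               for field, value in dimension_filters.items()]
    clauses += [{"range": {field: bounds}}
                for field, bounds in metric_ranges.items()]
    return {"query": {"bool": {"filter": clauses}}}


doc = make_cube_document(
    {"country": "in", "year": 2015},
    {"total_sales": 1250.0, "order_count": 40},
)
query = build_filter_query(
    {"country": "in"},
    {"total_sales": {"gte": 1000}},
)
```

The query dict would be sent as the JSON body of a search request. Because
the metric is just another indexed field, the range clause does not need any
special handling compared to the dimension clause.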

Best,
Sarnath
On Dec 12, 2015 5:43 AM, "Luke Han" <[email protected]> wrote:

> Would you mind sharing more detail about how you index these
> aggregations and how your queries are converted to the ES API?
>
> BTW, is this similar to what Druid does?
>
>
> >Multiple indexing is what we take advantage of. ES, by default indexes on
> >all fields of a document. We store a multidimensional aggregation as an ES
> >document whose fields are the various dimensions and metrics associated
> >with the aggregation.
>
>
> Best Regards!
> ---------------------
>
> Luke Han
>
> On Sat, Dec 12, 2015 at 3:05 AM, Sarnath <[email protected]> wrote:
>
> > >>>> Sorted indexes are a viable approach to OLAP storage — Druid[1] does
> > it, and so does SAP HANA. The idea is that if you sort and compress your
> > data it becomes very compact, so you can do very fast scans. So fast that
> > you don’t need to pre-aggregate it.
> >
> > Yes, the problem (which I think you have covered below) is that you can
> > only sort on a column of interest... And you can sort again on other
> > columns among all rows where the first column has the same value.... But
> > then, if you were to filter by second column - you will still need to
> > scan entire table. Very similar to the analogy in our blog. (search for
> > all English words whose second letter is 'a')
> > And, as your filtering query becomes complex, it becomes very difficult.
> > I believe Druid is optimized for time series analytics (how much by
> > minute, hour, day etc..). Not sure about multidimensional aggregations...
> >
> > >>>> Elasticsearch is an index but it is not an OLAP index - their use
> > case does not call for compressing numeric data, and they optimize for
> > point lookups rather than scans.
> >
> > We use ES only to serve pre-aggregated cube data and not to index the raw
> > data to produce OLAP cubes.
> >
> > >>>>> The best OLAP indexes are able to combine multiple indexes. E.g.
> > take two not-very-selective conditions and make a selective condition.
> > The poorer ones can only use one index, so to get coverage you need to
> > build more indexes.
> >
> > Can you elaborate on Not-so-selective condition? I am a bit lost on the
> > context.
> >
> > Multiple indexing is what we take advantage of. ES, by default indexes on
> > all fields of a document. We store a multidimensional aggregation as an
> > ES document whose fields are the various dimensions and metrics
> > associated with the aggregation. Thus the cube can be sliced and diced on
> > any dimension and filtered on metrics as well.. And again, this indexing
> > is completely different from indexing on raw data or table data. We are
> > dealing with data cubes here.
> >
> > Best,
> > Sarnath
> >
>
