ZacBlanco commented on PR #554:
URL: https://github.com/apache/datasketches-java/pull/554#issuecomment-2089000548

   > With a string-size of 64 bytes and 1T rows, you're looking at 64T of data. Across 100M groups that's roughly 640k of data per group and 100K items per sketch in the group.
   >    Doesn't this assume you are analyzing only one column?
   
   Yes, that example was just for one column. Whether we create sketches for one column or several depends on the user's query; I was just using one column to demonstrate a simple memory-usage scenario.
   
   > It is my understanding that it is at this group level where you want to 
analyze the distribution of items, where you might have a KLL sketch per 
column, thus many sketches just for this group. And, of course, all the 100M 
groups would be configured with sketches as well implying a huge number of 
sketches for the entire query analysis -- thus the concern about memory usage.
   
   Yes, this is correct. You could imagine this situation arising when a user writes something like:
   
   ```sql
   SELECT kll_sketch(<column_x>), kll_sketch(<column_y>), ... FROM <table> 
GROUP BY <column1>, <column2>, ...
   ```
   
   and `<table>` is the hypothetically very large table. I couldn't tell you exactly why a user might write a query like this, but it's something we have to support. We don't want the engine to crash if something like this were submitted.
   
   > Can you give me an example of how you use the distribution information 
from these sketches along with the user chosen predicates to help you with 
query optimization?
   >    What are you looking for in the distributions from these sketches?
   >    What decisions can you (or your query planning model ) make from these 
distributions?
   
   I believe there may be some confusion about how I described the uses for the KLL sketches. Let me try to clarify with some more examples.
   
   I used the `GROUP BY` query above as an example of something a DB user could write. How they utilize the KLL sketch is entirely up to them. But in implementing support for generating KLL sketches for users in the database, we need memory tracking for this scenario, which is why the implementation of the KLL sketch aggregation function had code similar to the following (referencing my initial response on this PR):
   
   ```java
   memoryUsage -= sketch.getSerializedSizeBytes();
   sketch.update(item);
   memoryUsage += sketch.getSerializedSizeBytes();
   ```
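   To make that pattern concrete, here is a self-contained toy illustration of the delta-accounting idea. `SizedSketch` and `ListBackedSketch` are hypothetical stand-ins for illustration only, not the datasketches-java API; the real code operates on `KllItemsSketch`:

   ```java
   import java.util.ArrayList;
   import java.util.List;

   // Hypothetical stand-in for any sketch that can report its serialized size.
   interface SizedSketch {
       void update(double item);
       int getSerializedSizeBytes();
   }

   // Toy "sketch" that stores every item, so its size is easy to reason about.
   class ListBackedSketch implements SizedSketch {
       private final List<Double> items = new ArrayList<>();
       public void update(double item) { items.add(item); }
       public int getSerializedSizeBytes() { return Integer.BYTES + items.size() * Double.BYTES; }
   }

   public class MemoryTrackingDemo {
       static long memoryUsage = 0;

       // Subtract the old size, apply the update, then add the new size,
       // so memoryUsage accumulates only the net growth of the sketch.
       static void trackedUpdate(SizedSketch sketch, double item) {
           memoryUsage -= sketch.getSerializedSizeBytes();
           sketch.update(item);
           memoryUsage += sketch.getSerializedSizeBytes();
       }

       public static void main(String[] args) {
           SizedSketch sketch = new ListBackedSketch();
           memoryUsage += sketch.getSerializedSizeBytes(); // account for the initial size
           for (int i = 0; i < 1000; i++) {
               trackedUpdate(sketch, i);
           }
           System.out.println("tracked bytes: " + memoryUsage);
       }
   }
   ```

   The important property is that the two size queries bracket every single update, which is exactly what makes `getSerializedSizeBytes` a hot path.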
   
   Unfortunately, due to engine design, we can't easily special-case function implementations for queries with and without `GROUP BY` clauses. If a query has no `GROUP BY`, memory is less of a concern: most aggregation functions' state rarely exceeds roughly 1 megabyte, so poor memory-usage estimates are unlikely to cause issues. However, the memory-size tracking we had to implement caused poor performance even for queries without a `GROUP BY` clause. Since the root cause of that poor performance was `KllItemsSketch#getSerializedSizeBytes`, I was motivated to open this PR.
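   For context, one engine-side mitigation along these lines (purely illustrative, not what this PR does) is to amortize the size query instead of paying it on every row. `SketchWithSize`, `CountingSketch`, and the refresh interval below are hypothetical names, not the datasketches-java API:

   ```java
   // Hypothetical mitigation: refresh the tracked size only every N updates,
   // accepting a slightly stale memory figure in exchange for far fewer calls
   // to the (expensive) size query.
   interface SketchWithSize {
       void update(double item);
       int getSerializedSizeBytes();
   }

   // Toy sketch whose serialized size grows by one stored item per update.
   class CountingSketch implements SketchWithSize {
       private int n = 0;
       public void update(double item) { n++; }
       public int getSerializedSizeBytes() { return Integer.BYTES + n * Double.BYTES; }
   }

   public class AmortizedSizeTracker {
       private final SketchWithSize sketch;
       private final int refreshInterval;   // updates between size refreshes
       private int updatesSinceRefresh = 0;
       private long trackedBytes;

       public AmortizedSizeTracker(SketchWithSize sketch, int refreshInterval) {
           this.sketch = sketch;
           this.refreshInterval = refreshInterval;
           this.trackedBytes = sketch.getSerializedSizeBytes();
       }

       // Apply the update; only re-read the serialized size once per interval.
       public long update(double item) {
           sketch.update(item);
           if (++updatesSinceRefresh >= refreshInterval) {
               trackedBytes = sketch.getSerializedSizeBytes();
               updatesSinceRefresh = 0;
           }
           return trackedBytes; // may lag the true size by up to refreshInterval updates
       }

       public static void main(String[] args) {
           AmortizedSizeTracker tracker = new AmortizedSizeTracker(new CountingSketch(), 1000);
           long bytes = 0;
           for (int i = 0; i < 10_000; i++) {
               bytes = tracker.update(i);
           }
           System.out.println("tracked bytes: " + bytes);
       }
   }
   ```

   The trade-off is bounded staleness of the memory estimate; making the size query itself cheap, as this PR does, avoids that compromise entirely.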
   
   The hypothetical scenario of a large `GROUP BY` query described above is a little different from how we (the DB engine authors) intend to use the KLL sketches. We aren't running `GROUP BY` queries when we generate statistics for a table, so the amount of memory required for the aggregation is far lower and less of a concern, but query runtime performance is. Essentially, generating statistics for a table amounts to running a query like:
   
   ```sql
   SELECT kll_sketch(<column1>), kll_sketch(<column2>), ... FROM <table>
   ```
   
   The resulting sketches from the query are stored in a location where they 
can be queried later by the optimizer. Suppose a user comes along and writes 
the query
   
   ```sql
   SELECT * FROM <table> WHERE <column1> <= 5
   ```
   
   The optimizer will analyze the query and generate a "Filter" operator representing `<column1> <= 5`. The optimizer attempts to estimate the output row count of every node in the SQL query plan. For the "Filter" operator, it retrieves the sketch stored for `<column1>` and uses it to compute the proportion of rows from the table scan that satisfy the filter `<= 5`.
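   As a rough, self-contained illustration of that estimate (using a plain sample array in place of a real KLL sketch; `normalizedRank` is a toy stand-in for the sketch's rank query, not the datasketches-java API):

   ```java
   import java.util.Arrays;

   public class SelectivityDemo {
       // Toy stand-in for a quantile sketch's rank query:
       // the fraction of sampled values <= bound.
       static double normalizedRank(double[] sample, double bound) {
           long count = Arrays.stream(sample).filter(v -> v <= bound).count();
           return (double) count / sample.length;
       }

       // Selectivity estimate: rank of the predicate bound times the table's row count.
       static long estimateFilteredRows(double[] columnSample, double bound, long tableRows) {
           return Math.round(normalizedRank(columnSample, bound) * tableRows);
       }

       public static void main(String[] args) {
           double[] sample = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
           // Predicate: column1 <= 5 -> half the sample qualifies.
           System.out.println(estimateFilteredRows(sample, 5.0, 1_000_000L)); // prints 500000
       }
   }
   ```

   A real KLL sketch answers the same rank question with bounded error while storing only a small fraction of the column's values.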
   
   To motivate why we want accurate filter estimates, take the following scenario: a user has a very complex query with multiple (>2) `JOIN` statements. You can improve query performance by rearranging the order in which tables are joined, joining the tables with smaller result sets earlier. So, having an accurate estimate of the number of records that satisfy the filter on a table scan can significantly improve query runtime.
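   To put toy numbers on that: suppose two 1M-row tables A and B, a 1K-row table C, and hypothetical join selectivities of 10^-6 for both joins. Using the standard textbook cardinality estimate, the intermediate result sizes under the two join orders differ by three orders of magnitude:

   ```java
   public class JoinOrderDemo {
       // Standard textbook cardinality estimate: |X JOIN Y| ~ |X| * |Y| * selectivity.
       static double joinRows(long left, long right, double selectivity) {
           return left * right * selectivity;
       }

       public static void main(String[] args) {
           long a = 1_000_000, b = 1_000_000, c = 1_000;
           double sel = 1e-6; // hypothetical selectivity for both joins

           double abFirst = joinRows(a, b, sel); // ~ 1,000,000 intermediate rows
           double bcFirst = joinRows(b, c, sel); // ~ 1,000 intermediate rows

           System.out.println("(A JOIN B) first: " + abFirst + " intermediate rows");
           System.out.println("(B JOIN C) first: " + bcFirst + " intermediate rows");
       }
   }
   ```

   A mis-estimated filter feeds wrong inputs into exactly this kind of comparison, which is how a bad join order gets picked.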


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

