ZacBlanco commented on PR #554: URL: https://github.com/apache/datasketches-java/pull/554#issuecomment-2089000548
> > With a string-size of 64 bytes and 1T rows, you're looking at 64T of data. Across 100M groups that's roughly 640k of data per group and 100K items per sketch in the group.
>
> Doesn't this assume you are analyzing only one column?

Yes, that example was just for one column. It depends on the user query whether we're creating sketches for one column or for multiple; I was just using one column to demonstrate a simple scenario for memory usage.

> It is my understanding that it is at this group level where you want to analyze the distribution of items, where you might have a KLL sketch per column, thus many sketches just for this group. And, of course, all the 100M groups would be configured with sketches as well implying a huge number of sketches for the entire query analysis -- thus the concern about memory usage.

Yes, this is correct. You could imagine this situation arising when a user writes something like:

```sql
SELECT kll_sketch(<column_x>), kll_sketch(<column_y>), ...
FROM <table>
GROUP BY <column1>, <column2>, ...
```

where `<table>` is the hypothetically very large table. I couldn't tell you exactly why a user might write a query like this, but it's something we have to support. We don't want the engine to crash if such a query were submitted.

> Can you give me an example of how you use the distribution information from these sketches along with the user chosen predicates to help you with query optimization?
>
> What are you looking for in the distributions from these sketches?
>
> What decisions can you (or your query planning model) make from these distributions?

I believe there may be some confusion with how I described uses for the KLL sketches. Let me try to clarify with some more examples. I used the `GROUP BY` query above as an example of something the DB user could write. How they utilize the resulting KLL sketch is entirely up to them.
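To make the memory concern concrete, a `GROUP BY` aggregation of this shape conceptually keeps one sketch per (group, column) pair and must account for their combined size as rows arrive. Below is a minimal, self-contained illustration of that bookkeeping. The `Sketch` class is a hypothetical in-memory stand-in for `KllItemsSketch<String>` with a toy size model (fixed header plus per-item payload); all names and sizes here are illustrative, not the engine's actual code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupByMemoryDemo {
    // Hypothetical stand-in for KllItemsSketch<String>. The size model is a
    // toy (32-byte header + 64 bytes per item), not the real sketch layout.
    static class Sketch {
        final List<String> items = new ArrayList<>();
        void update(String item) { items.add(item); }
        long getSerializedSizeBytes() { return 32 + 64L * items.size(); }
    }

    // One sketch per (group key, aggregated column) pair.
    final Map<String, Sketch[]> groups = new HashMap<>();
    final int numAggColumns;
    long memoryUsage; // engine-side accounting of total sketch state

    GroupByMemoryDemo(int numAggColumns) { this.numAggColumns = numAggColumns; }

    void accept(String groupKey, String[] values) {
        Sketch[] sketches = groups.computeIfAbsent(groupKey, k -> {
            Sketch[] s = new Sketch[numAggColumns];
            for (int i = 0; i < numAggColumns; i++) {
                s[i] = new Sketch();
                memoryUsage += s[i].getSerializedSizeBytes();
            }
            return s;
        });
        for (int i = 0; i < numAggColumns; i++) {
            // Delta accounting: subtract the old size, update, add the new size.
            memoryUsage -= sketches[i].getSerializedSizeBytes();
            sketches[i].update(values[i]);
            memoryUsage += sketches[i].getSerializedSizeBytes();
        }
    }

    public static void main(String[] args) {
        // Two aggregated columns, as in kll_sketch(<column_x>), kll_sketch(<column_y>).
        GroupByMemoryDemo state = new GroupByMemoryDemo(2);
        state.accept("g1", new String[] {"a", "b"});
        state.accept("g2", new String[] {"c", "d"});
        state.accept("g1", new String[] {"e", "f"});
        // 2 groups x 2 columns = 4 live sketches; with 100M groups the count explodes.
        System.out.println(state.groups.size() * state.numAggColumns);
        System.out.println(state.memoryUsage);
    }
}
```

With the toy size model this prints `4` and `512` (four sketches, each with a 32-byte header, holding six items total at 64 bytes each); the point is that both the sketch count and the size-accounting work scale with the number of groups.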
But in implementing support for generating KLL sketches for users of the database, we need memory tracking for this scenario, which is why the KLL sketch aggregation function implementation had some code similar to the following (referencing my initial response on this PR):

```java
memoryUsage -= sketch.getSerializedSizeBytes();
sketch.update(item);
memoryUsage += sketch.getSerializedSizeBytes();
```

Unfortunately, due to engine design, we can't easily special-case function implementations for queries with and without `GROUP BY` clauses. If a query has no `GROUP BY`, memory isn't as large a concern: most aggregation functions' state rarely exceeds roughly 1 MB, so poor memory-usage estimates are less likely to cause issues. However, because the memory size tracking has to be implemented either way, it caused poor performance even for queries without a `GROUP BY` clause. Since the root cause of the poor performance was `KllItemsSketch#getSerializedSizeBytes`, I was motivated to open this PR.

The hypothetical large `GROUP BY` query described above is a little different from how we (the DB engine authors) intend to use the KLL sketches. We aren't running `GROUP BY` queries when we generate statistics for a table, so the amount of memory required for the aggregation is far lower and less of a concern, but query runtime performance is. Essentially, generating statistics for a table amounts to running a query like

```sql
SELECT kll_sketch(<column1>), kll_sketch(<column2>), ... FROM <table>
```

The resulting sketches are stored in a location where the optimizer can query them later. Suppose a user then comes along and writes the query

```sql
SELECT * FROM <table> WHERE <column1> <= 5
```

The optimizer will analyze the query and generate a "Filter" operator representing `<column1> <= 5`.
The optimizer attempts to estimate the output row count of every node in the SQL query plan. For the "Filter" operator, the optimizer will fetch the sketch stored for `<column1>` and use it to compute the proportion of rows returned by the table scan that satisfy the filter `<= 5`.

To motivate why we want accurate filter estimates, take the following scenario: a user has a very complex query with multiple (>2) `JOIN`s. You can improve query performance by re-arranging the order in which tables are joined, joining tables with smaller result sets earlier. So, having an accurate estimate of the number of records that satisfy the filter on a table scan can significantly improve query runtime.
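In code, the Filter estimate is essentially a normalized-rank lookup: the fraction of sketched values that are `<= 5`, multiplied by the scan's row count. The real optimizer would ask the stored sketch for this (e.g. via `KllDoublesSketch#getRank`); the self-contained example below substitutes an exact count over a small in-memory sample purely for illustration, so the `rank` helper and all values are hypothetical.

```java
public class SelectivityDemo {
    // Stand-in for a sketch's rank query: fraction of items <= v.
    // A real KLL sketch answers this approximately, without storing all items.
    static double rank(double[] sample, double v) {
        int count = 0;
        for (double x : sample) {
            if (x <= v) count++;
        }
        return (double) count / sample.length;
    }

    public static void main(String[] args) {
        // Values of <column1> as captured by the statistics query.
        double[] column1 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        double selectivity = rank(column1, 5.0); // filter: <column1> <= 5
        long scanRows = 1_000_000L;
        // Estimated output row count of the Filter operator.
        System.out.println(Math.round(selectivity * scanRows));
    }
}
```

Here half the sample satisfies the predicate, so a 1M-row scan is estimated to emit 500000 rows; it is this per-operator cardinality estimate that the join-reordering decision consumes.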
