[GitHub] [druid] suneet-s commented on a change in pull request #9317: ANY Aggregator should not skip null values implementation

GitBox Tue, 11 Feb 2020 09:09:28 -0800

suneet-s commented on a change in pull request #9317: ANY Aggregator should not 
skip null values implementation
URL: https://github.com/apache/druid/pull/9317#discussion_r377773517


 ##########
 File path: docs/querying/aggregations.md
 ##########
 @@ -238,7 +238,7 @@ Note that queries with first/last aggregators on a segment 
created with rollup e
 
 (Double/Float/Long/String) ANY aggregator cannot be used in ingestion spec, 
and should only be specified as part of queries.
 
-If `druid.generic.useDefaultValueForNull=true` aggregation can returns the 
default value for null and does not prefer "non-null" values over the default 
value for null. If `druid.generic.useDefaultValueForNull=false`, then 
aggregation will returns any non-null value.
+If `druid.generic.useDefaultValueForNull=true` aggregation can returns the 
default value for null and does not prefer "non-null" values over the default 
value for null. If `druid.generic.useDefaultValueForNull=false`, then 
aggregation will returns any value including null.
 
 Review comment:
   I think we should re-write this to explain why someone would use this 
aggregator similar to how it's explained in the snowflake docs - 
https://docs.snowflake.net/manuals/sql-reference/functions/any_value.html
   I'm not sure where the correct place in the docs is to explain this - since 
technically this is the spec for the native query and we have another page with 
a spec for sql.
   
   Here's my suggestion for the why:
   
   ```
   ANY aggregator can be used to simplify and optimize the performance of GROUP 
BY statements. A common problem for many queries is that the result of a query 
with a GROUP BY clause can only contain expressions used in the GROUP BY clause 
itself, or results of aggregate functions
   
   select customer.id , customer.name , sum(orders.value)
       from customer
       join orders on customer.id = orders.customer_id
       group by customer.id , customer.name;
   
   Since we know that each customer.id can have only one name, this can be 
optimized as
   
   select customer.id , ANY(customer.name) , sum(orders.value)
       from customer
       join orders on customer.id = orders.customer_id
       group by customer.id ;
   ```
   
   I should also point out, with the current implementation of aggregators, 
there is no advantage to using an ANY aggregator vs a min aggregator, but maybe 
that will change in the future 🤷‍♂ 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] suneet-s commented on a change in pull request #9317: ANY Aggregator should not skip null values implementation

Reply via email to