khwilson commented on PR #44184:
URL: https://github.com/apache/arrow/pull/44184#issuecomment-2874763959

   If I understand the code correctly, the mean for decimals works by first computing the sum with the underlying `+` of the Decimal type, then computing the count of values as a long, and performing an integer division of the two. As seen in this commit, that `+` operator essentially ignores precision. I think this is a reasonable implementation, except:
   
   * There should be a checked variant;
   * Potentially, you'd want to add some extra scale, in line with the binary division operation (see the sketch below).
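
   To make sure we're describing the same flow, here is a minimal Python sketch of those semantics (purely illustrative, not Arrow's API; it models decimals as scaled integers, which is how Decimal128/256 are stored, and the choice of 4 extra digits is arbitrary):

   ```python
   from decimal import Decimal

   # Illustration only: a decimal is an unscaled integer plus a scale,
   # i.e. value = unscaled / 10**scale.
   values = [Decimal("1.00"), Decimal("2.00"), Decimal("2.00")]
   scale = 2
   unscaled = [int(v.scaleb(scale)) for v in values]  # [100, 200, 200]

   # 1. Sum via the type's `+`: just adds the unscaled integers,
   #    without checking that the result still fits the declared precision.
   total = sum(unscaled)  # 500

   # 2. Count as a plain (long) integer.
   count = len(unscaled)  # 3

   # 3. Integer division at the same scale truncates the result.
   print(Decimal(total // count).scaleb(-scale))  # 1.66

   # Adding extra scale before dividing keeps more fractional digits,
   # which is what the second bullet above is getting at.
   extra = 4
   print(Decimal((total * 10**extra) // count).scaleb(-(scale + extra)))  # 1.666666
   ```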
   
   For comparison, I checked how a few other DBMSs handle the mean:
   * PostgreSQL: drops all precision and scale information in the mean
   * DuckDB: converts to a double
   * MariaDB: appears to just add 4 to the scale and precision
   
   For product, it's not very common for databases to provide a product aggregator of any kind. I took a quick look at ANSI SQL, PostgreSQL, MySQL, and ClickHouse, none of which provide one. The "standard" workaround for such aggregates is `exp(sum(ln(abs(col))))` with a sign correction based on the parity of the negative values, e.g. `(1 - 2 * (sum(case when col < 0 then 1 else 0 end) % 2)) * exp(sum(ln(abs(col))))`, but that requires converting to a double (and zeros have to be special-cased). Moreover, the product of basically any column with at least 100 values will overflow a decimal's precision.
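
   For concreteness, here's what that workaround amounts to outside of SQL, as a small Python sketch (just an illustration of the technique, not a proposed kernel):

   ```python
   import math

   def product_via_logs(values):
       # exp/ln workaround for a product aggregate: only works after
       # converting to double, and zeros and the sign need special handling.
       if any(v == 0 for v in values):
           return 0.0
       negatives = sum(1 for v in values if v < 0)
       sign = -1.0 if negatives % 2 else 1.0
       return sign * math.exp(sum(math.log(abs(float(v))) for v in values))

   print(product_via_logs([2, -3, 4]))  # roughly -24, up to double rounding error
   ```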
   
   So for product I'd propose either demoting to double (which gives a semantically correct value but loses precision) or maxing out the width of the decimal type. Of course, there should also be a checked variant (but I'm deferring that until after this commit is complete).
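
   As a back-of-the-envelope check on the overflow point (assuming the usual limits of 38 digits for Decimal128 and 76 for Decimal256):

   ```python
   import math

   def values_before_overflow(digits_per_value, max_precision):
       # Multiplying n values with d significant digits each needs roughly
       # n * d digits, so overflow hits after about max_precision / d values.
       return math.floor(max_precision / digits_per_value)

   print(values_before_overflow(2, 38))  # ~19 values exhaust Decimal128
   print(values_before_overflow(2, 76))  # ~38 values exhaust Decimal256
   ```

   So even at the maximum width, the headroom is only a few dozen factors before a checked product would have to error.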

