khwilson commented on PR #44184: URL: https://github.com/apache/arrow/pull/44184#issuecomment-2874763959
If I understand the code correctly, the mean for decimals is computed by first summing the values with the Decimal type's underlying `+`, then counting the values as a long, and finally performing an integer division of the two. As seen in this commit, that `+` operator essentially ignores precision. I think this is a reasonable implementation, except:

* There should be a checked variant.
* Potentially, you'd want to add some extra scale, in line with the binary division operation.

For comparison, I checked how a few other DBMSs handle the mean:

* postgresql: drops all precision and scale information in the mean
* duckdb: converts to a double
* mariadb: appears to just add 4 to scale and precision

For product, it's not very common for databases to provide a product aggregator of any kind. I quickly looked at ANSI SQL, postgres, mysql, and clickhouse, none of which provide one. The "standard" product implementation for such aggregates is `(2 * sum(col % 2) - 1) * exp(sum(ln(abs(col))))`, but that requires converting to a double. Moreover, basically any column with at least 100 values will overflow a decimal's precision. So for product I'd propose either demoting to double (which gives a semantically correct value but loses precision) or maxing out the width of the decimal type. Of course, there should also be a checked variant (but I'm deferring that until after this commit is complete).