icexelloss opened a new issue, #35515:
URL: https://github.com/apache/arrow/issues/35515

   ### Describe the enhancement requested
   
   A non-decomposable aggregation is an aggregation that cannot be split into 
consume/merge/finalize steps. This is often the case when the logic is written 
with external Python libraries (NumPy, pandas, statsmodels, etc.) and either 
cannot be decomposed or is not worth the effort to decompose (these are often 
one-off functions rather than reusable ones). This PR implements support for 
non-decomposable aggregation UDFs.
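   
   To make the distinction concrete, here is a minimal sketch in plain Python. It is not Arrow's kernel API: `MeanState` and its method names are hypothetical and only mirror the consume/merge/finalize split described above. The mean decomposes cleanly, while the median does not, because combining per-batch medians gives a different answer than the median of all the data.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class MeanState:
    """Hypothetical partial state for a decomposable mean."""
    total: float = 0.0
    count: int = 0

    def consume(self, batch):
        # Fold one batch into the partial state.
        self.total += sum(batch)
        self.count += len(batch)

    def merge(self, other):
        # Combine two partial states without revisiting the data.
        self.total += other.total
        self.count += other.count

    def finalize(self):
        return self.total / self.count


batches = [[1.0, 2.0, 3.0], [4.0, 100.0]]

# Decomposable: each batch is consumed independently, then states are merged.
merged = MeanState()
for batch in batches:
    state = MeanState()
    state.consume(batch)
    merged.merge(state)
decomposed_mean = merged.finalize()

# Non-decomposable: the exact median needs every value at once.
all_values = [value for batch in batches for value in batch]
exact_median = median(all_values)

# Merging per-batch medians does not reproduce the exact median.
median_of_medians = median(median(batch) for batch in batches)
```

   In this sketch the mean only ever holds a running total and a count, whereas the exact median has to materialize `all_values`.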
   
   The major issue with a non-decomposable UDF is that it needs to see all the 
data at once, unlike a scalar UDF, which only needs to see one batch at a time. 
On its own, this makes a non-decomposable UDF not very useful, since it is the 
same as collecting all the data into a pd.DataFrame and applying the UDF to it. 
However, one very useful application of non-decomposable UDFs is with segmented 
aggregation. To recap, segmented aggregation works on ordered data and is 
passed one logical chunk at a time (e.g., all data with the same date). With 
segmented aggregation and a non-decomposable aggregation UDF, the user can 
apply any custom aggregation logic over a large stream of ordered data, with 
the memory overhead of a single segment.
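   
   The idea can be sketched in plain Python, independent of the eventual Arrow API. The names `ordered_rows` and `segmented_aggregate` are hypothetical, and the sketch assumes the input is already ordered by the segment key (here, a date string). Only one segment is held in memory at a time, and the UDF (here `statistics.median`) sees that whole segment at once.

```python
from itertools import groupby
from statistics import median


def ordered_rows():
    """Hypothetical stream of (date, value) rows, already ordered by date."""
    yield from [
        ("2023-01-01", 1.0),
        ("2023-01-01", 5.0),
        ("2023-01-01", 3.0),
        ("2023-01-02", 10.0),
        ("2023-01-02", 20.0),
    ]


def segmented_aggregate(rows, udf):
    """Apply `udf` to each run of consecutive rows sharing the same key.

    groupby only groups adjacent rows, so it relies on the ordering
    assumption; each segment is materialized, aggregated, then dropped.
    """
    for key, segment in groupby(rows, key=lambda row: row[0]):
        values = [value for _, value in segment]
        yield key, udf(values)


results = list(segmented_aggregate(ordered_rows(), median))
```

   Any function that takes the full list of values for a segment could be passed in place of `median`, including one built on NumPy or pandas.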
   
   ### Component(s)
   
   C++, Python


