d80tb7 commented on a change in pull request #24981: [WIP][SPARK-27463][PYTHON]
Support Dataframe Cogroup via Pandas UDFs- Arrow Stream Impl
URL: https://github.com/apache/spark/pull/24981#discussion_r302745939
##########
File path: python/pyspark/worker.py
##########
@@ -379,6 +426,21 @@ def map_batch(batch):
return func, None, ser, ser
+def parse_grouped_arg_offsets(arg_offsets):
Review comment:
Yes, very true. I've added some comments to explain what's going on here
(note that this is now moved to an inner function, as you suggested in another
comment). That said, I feel all this is a bit too complicated and I'd like to
revisit it. Unfortunately, I don't think we can change this without affecting
the way that the arg_offsets for the grouped_map functions work, and I didn't
want to make changes to that at this point.
I think the complexity arises because:
- We have to pack a lot of information (the indexes of all key and value
fields) into the arg_offsets array. This was already leading to a somewhat
complex encoding for the grouped map case and I've had to make it more complex
here as we now need to deal with multiple dataframes.
- We receive the dataframe(s) from the Arrow stream as Arrow tables, which
are then decomposed into lists of pandas tables and then reconstituted as
pandas dataframes (using the indexes extracted from arg_offsets) at a later
point. Following this progression in the code is a bit tricky, not least
because there's some generated code in the middle which makes it all work.
I understand why the above has been done, but it does feel a bit
overcomplicated. @icexelloss Do you have any thoughts on this?
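To make the encoding concrete, here is a minimal standalone sketch of what a
`parse_grouped_arg_offsets` helper could look like. The exact wire format in
the PR may differ; this assumes one plausible flat layout in which each
dataframe contributes a group of the form `[group_length, num_keys,
key_indexes..., value_indexes...]`, repeated once per dataframe:

```python
def parse_grouped_arg_offsets(arg_offsets):
    """Split a flat arg_offsets list into per-dataframe (keys, values) pairs.

    Assumed layout (illustrative, not necessarily the PR's exact format):
    for each dataframe, a group of the form
        [group_length, num_keys, key_indexes..., value_indexes...]
    where group_length counts the entries that follow it within the group.
    """
    parsed = []
    idx = 0
    while idx < len(arg_offsets):
        group_len = arg_offsets[idx]            # entries remaining in this group
        idx += 1
        group = arg_offsets[idx:idx + group_len]
        num_keys = group[0]                     # how many of those are key indexes
        keys = group[1:num_keys + 1]
        values = group[num_keys + 1:]
        parsed.append((keys, values))
        idx += group_len                        # jump to the next dataframe's group
    return parsed


# Two dataframes: first has key column 0 and value columns 1, 2;
# second has key column 0 and value column 1.
print(parse_grouped_arg_offsets([4, 1, 0, 1, 2, 3, 1, 0, 1]))
# -> [([0], [1, 2]), ([0], [1])]
```

Something like this keeps the grouped-map (single group) and cogroup (two
groups) cases behind one parser, at the cost of the length-prefixed encoding
the comment above is complaining about.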