d80tb7 commented on a change in pull request #24981: [WIP][SPARK-27463][PYTHON] 
Support Dataframe Cogroup via Pandas UDFs- Arrow Stream Impl
URL: https://github.com/apache/spark/pull/24981#discussion_r302745939
 
 

 ##########
 File path: python/pyspark/worker.py
 ##########
 @@ -379,6 +426,21 @@ def map_batch(batch):
     return func, None, ser, ser
 
 
+def parse_grouped_arg_offsets(arg_offsets):
 
 Review comment:
   Yes, very true.  I've added some comments to explain what's going on here 
(note that this has now moved to an inner function, as you suggested in another 
comment).  That said, I feel all this is a bit too complicated and I'd like to 
revisit it. Unfortunately, I don't think we can change this without affecting 
the way that the arg_offsets for the grouped_map functions work, and I didn't 
want to make changes to that at this point.
   
   I think the complexity arises because:
   
   - We have to pack a lot of information (the indexes of all key and value 
fields) into the arg_offsets array.  This was already leading to a somewhat 
complex encoding for the grouped map case and I've had to make it more complex 
here as we now need to deal with multiple dataframes.
   
   - We receive the dataframe(s) from the Arrow stream as arrow tables, which 
are then decomposed into lists of pandas tables and then reconstituted as 
pandas dataframes (using the indexes extracted from arg_offsets) at a later 
point.  Following this progression in the code is a bit tricky, not least 
because there's some generated code in the middle which makes it all work.
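
To make the two points above concrete, here is a rough sketch of the kind of encoding I mean. It assumes a length-prefixed layout repeated once per dataframe, which may not match exactly what the PR does; `split_grouped_offsets` and `select_columns` are hypothetical names used only for illustration, and plain Python lists stand in for the pandas columns.

```python
def split_grouped_offsets(arg_offsets):
    """Split a flat offsets list into (key_offsets, value_offsets) per dataframe.

    Assumed (hypothetical) layout, repeated once per dataframe:
        [n, k, key_0 .. key_{k-1}, value_0 ..]
    where n is the number of entries that follow it for this dataframe
    and k is the number of key offsets.
    """
    parsed = []
    idx = 0
    while idx < len(arg_offsets):
        block_len = arg_offsets[idx]
        block = arg_offsets[idx + 1: idx + 1 + block_len]
        num_keys = block[0]
        key_offsets = block[1: 1 + num_keys]
        value_offsets = block[1 + num_keys:]
        parsed.append((key_offsets, value_offsets))
        idx += 1 + block_len
    return parsed


def select_columns(columns, offsets):
    """Pick columns out of the flat, decomposed column list by offset."""
    return [columns[i] for i in offsets]


# Two dataframes flattened into one column list (lists stand in for columns).
columns = [["a", "b"], [1, 2], ["a", "c"], [3, 4]]

# Left: key column 0, value columns 0 and 1.
# Right: key column 2, value columns 2 and 3.
arg_offsets = [4, 1, 0, 0, 1, 4, 1, 2, 2, 3]

(left_keys, left_values), (right_keys, right_values) = split_grouped_offsets(arg_offsets)
left_frame_columns = select_columns(columns, left_values)
right_frame_columns = select_columns(columns, right_values)
```

Even in this simplified form, the reader has to carry the layout in their head to follow the slicing, which is the complexity I'm worried about.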
   
   I understand why the above has been done, but it does feel a bit 
overcomplicated.  @icexelloss Do you have any thoughts on this?

