zero323 edited a comment on issue #26118: [SPARK-24915][Python] Fix Row 
handling with Schema.
URL: https://github.com/apache/spark/pull/26118#issuecomment-546617204
 
 
   @HyukjinKwon To be honest I have mixed feelings about this. It looks 
sensible as a _temporary workaround_, but I am not fond of the idea of 
enforcing the notion of `Row` being an unordered dictionary-like object (though 
with compact dicts as the standard, that doesn't matter that much), especially 
when it is close to becoming completely obsolete. 
   
   Personally I'd prefer to wait a moment and see where the discussion on 
SPARK-22232 goes. If the resolution is the introduction of a legacy mode, then 
the scope of this particular change could be conditioned on it and on the 
Python version.
   
   If not, I'd like to see some profiling data first (especially memory; 
timings might actually be better for now, as we skip all the nasty `obj[n]`, 
but that's not very meaningful*).
   
   I've done some rough testing, and conversion to dict (with the simple 
optimization suggested below) is roughly six times slower than conversion to 
`tuple`. I'd expect that there is also a significant memory overhead from 
dictionary conversion, as we effectively create a full copy of the data with 
the associated names. If that suspicion is confirmed, that would be a huge 
overhead, and it shouldn't be incurred if it is not necessary.
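   
   A minimal sketch of how such a comparison could be set up, using only the 
standard library. The field names, the record width of 100, and the helper 
names (`to_tuple`, `to_dict`, `measure`, `container_sizes`) are made up for 
illustration and are not taken from the PR; actual ratios will depend on the 
Python version and the shape of the data.

```python
import sys
import timeit

# Hypothetical wide record: 100 fields with made-up names and values.
names = ["col_{}".format(i) for i in range(100)]
values = list(range(100))


def to_tuple(vals):
    # Positional conversion: copies references only, no names involved.
    return tuple(vals)


def to_dict(field_names, vals):
    # Name-keyed conversion: builds a hash table holding names and values.
    return dict(zip(field_names, vals))


def measure(number=20000):
    # Wall-clock time (in seconds) for `number` conversions of each kind.
    t_tuple = timeit.timeit(lambda: to_tuple(values), number=number)
    t_dict = timeit.timeit(lambda: to_dict(names, values), number=number)
    return t_tuple, t_dict


def container_sizes():
    # sys.getsizeof counts the container itself, not the objects it references.
    return sys.getsizeof(to_tuple(values)), sys.getsizeof(to_dict(names, values))


if __name__ == "__main__":
    t_tuple, t_dict = measure()
    size_tuple, size_dict = container_sizes()
    print("tuple: {:.4f}s, dict: {:.4f}s".format(t_tuple, t_dict))
    print("tuple: {} bytes, dict: {} bytes".format(size_tuple, size_dict))
```

   Note that `sys.getsizeof` only gives a lower bound on the per-record cost; 
a tool such as `tracemalloc` would be needed for a fuller memory picture.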
   
   ----
   \* Is there any reason why we do this:
   
   
https://github.com/apache/spark/blob/2115bf61465b504bc21e37465cb34878039b5cb8/python/pyspark/sql/types.py#L615
   
   instead of just `tuple(obj)`? That's a huge performance bottleneck with 
wide schemas. Depending on the resolution of this one, that's something to fix, 
don't you think?
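   
   To illustrate the concern, here is a small sketch with a stand-in class. 
`NamedRecord` is hypothetical and is not PySpark's `Row`; it only assumes that 
looking up a field by name does a linear scan over the field names, so building 
a tuple via `obj[n]` for every name costs time quadratic in the number of 
fields, while `tuple(obj)` is a single linear pass.

```python
class NamedRecord(tuple):
    """Hypothetical stand-in for a Row-like tuple with name-based lookup."""

    def __new__(cls, fields, values):
        record = super().__new__(cls, values)
        record.fields = list(fields)
        return record

    def __getitem__(self, item):
        if isinstance(item, (int, slice)):
            return super().__getitem__(item)
        # Assumed behaviour: a linear scan over the field names per lookup.
        return super().__getitem__(self.fields.index(item))


# Made-up wide schema for illustration.
names = ["col_{}".format(i) for i in range(500)]
record = NamedRecord(names, range(500))

# Name-by-name conversion: one linear scan per field, quadratic overall.
by_name = tuple(record[n] for n in names)

# Positional conversion: a single pass, valid when the record's order
# already matches the schema order.
direct = tuple(record)
```

   Both conversions yield the same plain `tuple` here because the record's 
fields are already in schema order; that equivalence is the assumption the 
question above rests on.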
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
