[GitHub] [spark] zero323 commented on issue #26118: [SPARK-24915][Python] Fix Row handling with Schema.

GitBox Mon, 25 Nov 2019 10:11:29 -0800

zero323 commented on issue #26118: [SPARK-24915][Python] Fix Row handling with Schema.
URL: https://github.com/apache/spark/pull/26118#issuecomment-558274617
 
 
   > I did some performance tests (Details in [this gist](https://gist.github.com/qudade/dc9d01f55d27d65ab66d68e3b8d1588d)).
   > 
   
   Thank you!
   
   > As expected, for `Row`s that are created using kwargs (has `__from_dict__`) AND where fields are ordered alphabetically the performance is worse (~15% at 15 fields, ~25% at 150 fields) and [memory](https://gist.github.com/qudade/dc9d01f55d27d65ab66d68e3b8d1588d#gistcomment-3080607) [consumption](https://gist.github.com/qudade/dc9d01f55d27d65ab66d68e3b8d1588d#gistcomment-3080608) increases.
   
   That seems acceptable in my opinion: not great, but given the diminishing importance of `Row`, it is not the most serious concern, I guess.
   
   > In my experience, not being able to create a dataframe from dict-like 
`Row`s was a time-consuming annoyance. The value added by this PR is to enable 
this.
   
   I will just point out that `dict`s, `OrderedDict`s, plain `tuple`s or `namedtuple`s are much more efficient input structures when a schema is provided. The biggest value of `Row` is that it provides a named structure to communicate results (and let's be honest, it doesn't do it very well). But that's just a side note.
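
   To illustrate the point about plain input structures, here is a minimal sketch, assuming a hypothetical two-field schema (`name`, `age`). It shows how dicts, `OrderedDict`s, namedtuples and plain tuples can all be reduced to plain tuples in schema order without going through `Row`. The helper `to_schema_tuple`, the `Person` type and the field names are made up for illustration; this is not how PySpark handles these inputs internally.

```python
from collections import OrderedDict, namedtuple

# Hypothetical schema: field names in the order the DataFrame should have.
SCHEMA_FIELDS = ["name", "age"]

# Hypothetical namedtuple type, used only for this illustration.
Person = namedtuple("Person", ["name", "age"])


def to_schema_tuple(record, fields=SCHEMA_FIELDS):
    """Return the record's values as a plain tuple in schema order."""
    if isinstance(record, dict):
        # Covers dict and OrderedDict: values are looked up by field name,
        # so the key order of the input does not matter.
        return tuple(record[f] for f in fields)
    if hasattr(record, "_fields"):
        # namedtuple: also looked up by name.
        return tuple(getattr(record, f) for f in fields)
    # Plain tuple: assumed to already be in schema order.
    return tuple(record)


records = [
    {"age": 30, "name": "Alice"},
    OrderedDict([("name", "Bob"), ("age", 25)]),
    Person(name="Carol", age=41),
    ("Dave", 52),
]

rows = [to_schema_tuple(r) for r in records]
```

   A list of tuples like `rows` could then be passed to `spark.createDataFrame` together with a schema that lists the same fields in the same order.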


