zero323 commented on issue #26118: [SPARK-24915][Python] Fix Row handling with Schema.
URL: https://github.com/apache/spark/pull/26118#issuecomment-558274617

> I did some performance tests (Details in [this gist](https://gist.github.com/qudade/dc9d01f55d27d65ab66d68e3b8d1588d)).

> Thank you!

> As expected, for `Row`s that are created using kwargs (has `__from_dict__`) AND where fields are ordered alphabetically the performance is worse (~15% at 15 fields, ~25% at 150 fields) and [memory](https://gist.github.com/qudade/dc9d01f55d27d65ab66d68e3b8d1588d#gistcomment-3080607) [consumption](https://gist.github.com/qudade/dc9d01f55d27d65ab66d68e3b8d1588d#gistcomment-3080608) increases.

That seems acceptable in my opinion - not great, but given the diminishing importance of `Row` it is not the most serious concern I guess.

> In my experience, not being able to create a dataframe from dict-like `Row`s was a time-consuming annoyance. The value added by this PR is to enable this.

I will just point out that `dicts`, `OrderedDicts`, plain `tuples` or `namedtuples` are much more efficient input structures when a schema is provided. The biggest value of `Row` is that it provides a named structure to communicate results (and let's be honest - it doesn't do it very well). But that's just a side note.
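
To illustrate the point about input structures, here is a minimal sketch of the same two records expressed as tuples, namedtuples and dicts. The field names, the `Person` type and the helpers `to_schema_order` and `make_dataframe` are hypothetical and exist only for this example. `to_schema_order` is a simplified stand-in for aligning a record with a schema, not PySpark's implementation, and the sketch does not measure the performance difference mentioned above.

```python
from collections import namedtuple

# Hypothetical schema field order, chosen for illustration; it is
# deliberately not alphabetical.
FIELD_NAMES = ("name", "age")

Person = namedtuple("Person", FIELD_NAMES)

# The same two records in three of the input structures named above.
as_tuples = [("Alice", 30), ("Bob", 25)]
as_namedtuples = [Person("Alice", 30), Person("Bob", 25)]
as_dicts = [{"age": 30, "name": "Alice"}, {"age": 25, "name": "Bob"}]


def to_schema_order(record, field_names=FIELD_NAMES):
    """Return the record's values as a tuple in schema order.

    dict-like records are matched by field name; tuples and namedtuples are
    assumed to be in schema order already and are matched by position.
    """
    if isinstance(record, dict):
        return tuple(record[name] for name in field_names)
    return tuple(record)


def make_dataframe(spark, records):
    """Build a DataFrame from the records with an explicit schema.

    Assumes pyspark is installed and `spark` is an active SparkSession.
    """
    from pyspark.sql.types import (
        IntegerType, StringType, StructField, StructType,
    )

    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
    ])
    return spark.createDataFrame(records, schema)
```

With an active session, `make_dataframe(spark, as_dicts)` and `make_dataframe(spark, as_tuples)` should describe the same rows, since the explicit schema fixes the column order either way.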
