zero323 commented on issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include `__from_dict__` flag
URL: https://github.com/apache/spark/pull/20280#issuecomment-546596300

@BryanCutler The idea of controlling the sorting behavior via configuration seems legit, but I wouldn't go as far as adding a subclass, unless we plan to have diverging implementations later on (though I think in such a case it would make more sense to have a common base class, or to dynamically provide one implementation or the other).

In other words, we could re-sort the input if:

- we get `kwargs`,
- Python is 3.6 or later (Python < 3.6 should stay as-is, i.e. keep sorting, to avoid confusion), and
- the corresponding config option is set to true.

```python
import sys

class Row(tuple):
    def __new__(self, *args, **kwargs):
        if args and kwargs:
            raise ValueError("Can not use both args "
                             "and kwargs to create Row")
        if kwargs:
            # Keep sorting on Python < 3.6 (current behavior), and on 3.6+
            # only when the legacy option is enabled. `conf` is a stand-in
            # for however we end up reading the configuration.
            if sys.version_info < (3, 6) or conf.get("python.useLegacyRowBehaviour"):
                names, values = zip(*sorted(kwargs.items()))
            else:
                names, values = list(kwargs.keys()), list(kwargs.values())
            row = tuple.__new__(self, values)
            row.__fields__ = names
            row.__from_dict__ = True
            return row
        else:
            # create row class or objects
            return tuple.__new__(self, args)
```

I don't think that providing a separate path for `OrderedDict` makes much sense, as, despite outstanding deprecations, `dict` can be used in most of the contexts where `Row` is expected, and even if that weren't the case, users can always unpack the dict:

```python
d: OrderedDict = ...
Row(**d)
```

If anything, there should be a separate path for `OrderedDict`s in `_infer_schema`\*:

https://github.com/apache/spark/blob/2115bf61465b504bc21e37465cb34878039b5cb8/python/pyspark/sql/types.py#L1043-L1044

Side notes:

- If `Row` is to be rewritten, could we replace the dunders (`__fields__`, `__from_dict__`) with leading-underscore names? Dunders are a code smell here, and even if they are unlikely to break in the future, they are quite confusing for the reader.
- [I asked on dev](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Deprecate-Python-lt-3-6-in-Spark-3-0-td28168.html) about deprecating Python < 3.6 support. The feedback provided there might be relevant here.
- What should be done about `Row.asDict`? In general, in Python 3.6+ the ordering will be preserved, but maybe we should be explicit and use `OrderedDict`?

----------------

\* In general, schema inference will require some work to make it consistent with the `Row` behavior. For example, 3.6+ classes do preserve attribute definition order ([PEP 520](https://www.python.org/dev/peps/pep-0520/)), so the current approach

https://github.com/apache/spark/blob/2115bf61465b504bc21e37465cb34878039b5cb8/python/pyspark/sql/types.py#L1059

would be undesirable in 3.7 without legacy mode.
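To make the proposed field-ordering rule concrete, here is a minimal, self-contained sketch. The `legacy` flag stands in for the proposed config option (a hypothetical name, not an existing Spark setting), and `row_fields` isolates just the field-ordering logic from the `Row.__new__` sketch above:

```python
import sys

def row_fields(legacy, **kwargs):
    """Return (names, values) the way the sketched Row.__new__ would.

    `legacy` is a hypothetical stand-in for the proposed config option.
    """
    if sys.version_info < (3, 6) or legacy:
        # Current behavior: fields are re-sorted alphabetically.
        names, values = zip(*sorted(kwargs.items()))
    else:
        # Proposed 3.6+ behavior: keyword declaration order is preserved.
        names, values = tuple(kwargs), tuple(kwargs.values())
    return names, values

# On Python 3.6+ without legacy mode, declaration order survives:
print(row_fields(False, b=1, a=2))  # (('b', 'a'), (1, 2))
# With legacy mode, fields are re-sorted, as Row does today:
print(row_fields(True, b=1, a=2))   # (('a', 'b'), (2, 1))
```

Note that sorting reorders the values along with the names, so `Row(b=1, a=2)` and `Row(a=2, b=1)` compare equal under the legacy behavior but not under the order-preserving one.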
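The footnote about schema inference can be demonstrated directly. On CPython 3.6+ an object's `__dict__` preserves assignment order (and PEP 520 guarantees definition order for class bodies), so sorting the keys, as the linked `_infer_schema` line does, silently discards that order. A small standalone illustration (`Point` is a toy class, not Spark code):

```python
class Point:
    def __init__(self):
        # Attributes deliberately assigned in non-alphabetical order.
        self.y = 1
        self.x = 2

p = Point()
# On CPython 3.6+, the instance __dict__ preserves assignment order:
print(list(vars(p)))   # ['y', 'x']
# Sorting, as the current inference code does, reorders the fields:
print(sorted(vars(p))) # ['x', 'y']
```

So without a legacy mode, inferring a schema from such an object in 3.7 would yield a column order different from what the class author wrote.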
