zero323 commented on issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include __from_dict__ flag
URL: https://github.com/apache/spark/pull/20280#issuecomment-546596300
 
 
   @BryanCutler The idea of controlling sorting behavior via configuration seems legit, but I wouldn't go so far as adding a subclass, unless we plan to have diverging implementations later in time (though I think that in such a case it would make more sense to have a common base class, or to dynamically provide one implementation or the other).
   
   In other words, we could re-sort the input if:
   
   - we get `kwargs`,
   - Python is 3.6 or later (Python < 3.6 should stay as-is, i.e. always sort, to avoid confusion), and
   - the corresponding config option is set to true.
   
    ```python
    import sys

    class Row(tuple):
        def __new__(cls, *args, **kwargs):
            if args and kwargs:
                raise ValueError("Cannot use both args "
                                 "and kwargs to create Row")

            if kwargs:
                # Keep the legacy sorted order on Python < 3.6 (where
                # kwargs order is not guaranteed) or when the legacy
                # config option is enabled; "conf" stands for some
                # runtime config accessor.
                if sys.version_info < (3, 6) or conf.get("python.useLegacyRowBehaviour"):
                    names, values = zip(*sorted(kwargs.items()))
                else:
                    names, values = list(kwargs.keys()), list(kwargs.values())
                row = tuple.__new__(cls, values)
                row.__fields__ = names
                row.__from_dict__ = True
                return row

            else:
                # create row class or objects
                return tuple.__new__(cls, args)
    ```
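   
   Under that sketch (with the illustrative `python.useLegacyRowBehaviour` key and an assumed `conf` accessor), the observable difference on Python 3.6+ would be roughly:
   
    ```python
    # Hypothetical behaviour of the sketch above on Python 3.6+:
    row = Row(b=1, a=2)
    # legacy option enabled:  __fields__ == ('a', 'b'), tuple is (2, 1)  (sorted)
    # legacy option disabled: __fields__ == ['b', 'a'], tuple is (1, 2)  (kwargs order)
    ```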
   
   I don't think that providing a separate path for `OrderedDict` makes much sense, as, despite outstanding deprecations, `dict` can be used in most of the contexts where `Row` is expected, and even if that weren't the case, users can always unpack the dict:
   
   ```python
   d: OrderedDict = ...
   Row(**d)
   ```
   
   If anything, there should be a separate path for `OrderedDict` in `_infer_schema`\*:
   
   
https://github.com/apache/spark/blob/2115bf61465b504bc21e37465cb34878039b5cb8/python/pyspark/sql/types.py#L1043-L1044
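   
   For illustration, a rough sketch of what such a branch could look like (`_infer_items` is a made-up helper, not the actual `_infer_schema` internals):
   
    ```python
    from collections import OrderedDict

    def _infer_items(row):
        # Sketch only: choose the (name, value) pairs schema
        # inference should iterate over.
        if isinstance(row, OrderedDict):
            # preserve the caller-supplied field order
            return list(row.items())
        elif isinstance(row, dict):
            # current behaviour: sort plain dict keys
            return sorted(row.items())
        raise TypeError("expected a dict-like row")
    ```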
   
   
   Side notes:
   
   - If `Row` is to be rewritten, could we replace the dunders (`__fields__`, `__from_dict__`) with a single leading underscore? Dunders are a code smell, and even if they are unlikely to break in the future, they are quite confusing for the reader.
   - [I asked on dev](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Deprecate-Python-lt-3-6-in-Spark-3-0-td28168.html) about deprecating Python < 3.6 support. Feedback provided there might be relevant here.
   - What should be done about `Row.asDict`? In general, Python 3.6+ will preserve ordering, but maybe we should be explicit and use `OrderedDict`? (See the sketch below.)
   
   ----------------
   \* In general, schema inference will require some work to make it consistent with `Row` behavior. For example, 3.6+ classes do preserve attribute definition order ([PEP 520](https://www.python.org/dev/peps/pep-0520/)), so the current approach
   
   
https://github.com/apache/spark/blob/2115bf61465b504bc21e37465cb34878039b5cb8/python/pyspark/sql/types.py#L1059
   
   would be undesirable in 3.7 without legacy mode.
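   
   To illustrate (a toy example, not Spark code) how sorting discards the order that 3.6+ preserves:
   
    ```python
    class Point:
        def __init__(self):
            self.y = 1  # defined first
            self.x = 2

    list(vars(Point()).items())    # [('y', 1), ('x', 2)] -- definition order (3.6+)
    sorted(vars(Point()).items())  # [('x', 2), ('y', 1)] -- order lost
    ```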
