zero323 commented on a change in pull request #27109:
[SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas'
sub-package
URL: https://github.com/apache/spark/pull/27109#discussion_r363472878
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -31,23 +31,23 @@
from pyspark import copy_func, since, _NoValue
from pyspark.rdd import RDD, _load_from_socket, _local_iterator_from_socket, \
- ignore_unicode_prefix, PythonEvalType
-from pyspark.serializers import ArrowCollectSerializer, BatchedSerializer, PickleSerializer, \
+ ignore_unicode_prefix
+from pyspark.serializers import BatchedSerializer, PickleSerializer, \
UTF8Deserializer
from pyspark.storagelevel import StorageLevel
from pyspark.traceback_utils import SCCallSiteSync
from pyspark.sql.types import _parse_datatype_json_string
from pyspark.sql.column import Column, _to_seq, _to_list, _to_java_column
from pyspark.sql.readwriter import DataFrameWriter
from pyspark.sql.streaming import DataStreamWriter
-from pyspark.sql.types import IntegralType
from pyspark.sql.types import *
-from pyspark.util import _exception_message
+from pyspark.sql.pandas.conversion import PandasConversionMixin
+from pyspark.sql.pandas.map_ops import PandasMapOpsMixin
__all__ = ["DataFrame", "DataFrameNaFunctions", "DataFrameStatFunctions"]
-class DataFrame(object):
+class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
Review comment:
In general, I am trying to get a better feel for the overall purpose of this
refactoring.
As of now, there is no indication that any of these mixins will ever be used
outside the current context (`DataFrame` and `GroupedData`). That impression is
further reinforced by the explicit type checks
([here](https://github.com/apache/spark/blob/cfd78393e76f454503e7cf5416f6d56f1efffd0a/python/pyspark/sql/pandas/group_ops.py#L96)
and
[here](https://github.com/apache/spark/blob/cfd78393e76f454503e7cf5416f6d56f1efffd0a/python/pyspark/sql/pandas/map_ops.py#L64)).
So this doesn't really seem like a canonical use of a mixin, especially when
the core `DataFrame` is not designed for extensibility.
> Ah you mean API usages like:
>
> df.pandas.mapInPandas(...)
That's one possible approach, though not the one I was thinking about. I
assumed (though I am not sure, as the amount of code moved, excluding docs,
messages and some static stuff, is negligible and tightly coupled with
`DataFrame` anyway) that the point is maintainability.
So a possible approach is either direct delegation

    def __init__(self, ...):
        ...
        self._pandasMapOpsMixin = PandasMapOpsMixin(self)
        ...

    def mapInPandas(self, udf):
        return self._pandasMapOpsMixin.mapInPandas(udf)

or indirect (by overriding `__getattr__`).
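
To make the two variants concrete, here is a minimal, self-contained sketch of delegation by composition. It is only an illustration under stated assumptions: `PandasMapOps`, `DirectFrame`, `IndirectFrame`, and the `rows` attribute are hypothetical stand-ins, not PySpark classes, and the `mapInPandas` body is a toy placeholder for the real logic.

```python
class PandasMapOps:
    """Hypothetical helper that owns the pandas-specific operations."""

    def __init__(self, df):
        self._df = df  # back-reference to the owning frame

    def mapInPandas(self, udf):
        # Toy stand-in for the real logic: apply udf to every row and
        # wrap the result in the same frame type as the owner.
        return type(self._df)([udf(row) for row in self._df.rows])


class DirectFrame:
    """Direct variant: one explicit forwarding method per operation."""

    def __init__(self, rows):
        self.rows = list(rows)
        self._pandas_map_ops = PandasMapOps(self)

    def mapInPandas(self, udf):
        return self._pandas_map_ops.mapInPandas(udf)


class IndirectFrame:
    """Indirect variant: forward selected names through __getattr__."""

    _delegated = ("mapInPandas",)

    def __init__(self, rows):
        self.rows = list(rows)
        self._pandas_map_ops = PandasMapOps(self)

    def __getattr__(self, name):
        # __getattr__ runs only when normal attribute lookup fails,
        # so regular attributes and methods are unaffected.
        if name in IndirectFrame._delegated:
            return getattr(self._pandas_map_ops, name)
        raise AttributeError(name)


doubled = DirectFrame([1, 2, 3]).mapInPandas(lambda x: x * 2)
shifted = IndirectFrame([1, 2]).mapInPandas(lambda x: x + 1)
```

The direct variant keeps the public API explicit and visible to documentation tools and IDEs, at the cost of one forwarding method per operation. The `__getattr__` variant avoids that boilerplate but hides the delegated methods from static inspection.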