Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
dongjoon-hyun commented on PR #46129: URL: https://github.com/apache/spark/pull/46129#issuecomment-2068736545 `classic` sounds like a too limited wording because it has no clear meaning and not-extensible in a long-term perspective. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
HyukjinKwon closed pull request #46129: [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic URL: https://github.com/apache/spark/pull/46129 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
HyukjinKwon commented on PR #46129: URL: https://github.com/apache/spark/pull/46129#issuecomment-2068279142 Merged to master. I will followup if there are more comments to address. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
HyukjinKwon commented on code in PR #46129: URL: https://github.com/apache/spark/pull/46129#discussion_r1573466682 ## python/pyspark/sql/tests/connect/test_connect_plan.py: ## @@ -333,6 +333,11 @@ def test_observe(self): from pyspark.sql.connect.observation import Observation class MockDF(DataFrame): Review Comment: This might be a breaking change if somebody inherits `pyspark.sql.DataFrame` before, and it has it's own `__init__`. However, `__init__` is not really an API, and users shouldn't really customize/use/invoke them directly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
HyukjinKwon commented on PR #46129: URL: https://github.com/apache/spark/pull/46129#issuecomment-2067819605 Should be ready for a look. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
HyukjinKwon commented on code in PR #46129: URL: https://github.com/apache/spark/pull/46129#discussion_r1573465793 ## python/pyspark/sql/classic/dataframe.py: ## @@ -0,0 +1,1974 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os +import json +import sys +import random +import warnings +from collections.abc import Iterable +from functools import reduce +from typing import ( +Any, +Callable, +Dict, +Iterator, +List, +Optional, +Sequence, +Tuple, +Type, +Union, +cast, +overload, +TYPE_CHECKING, +) + +from pyspark import _NoValue +from pyspark.resource import ResourceProfile +from pyspark._globals import _NoValueType +from pyspark.errors import ( +PySparkTypeError, +PySparkValueError, +PySparkIndexError, +PySparkAttributeError, +) +from pyspark.util import ( +is_remote_only, +_load_from_socket, +_local_iterator_from_socket, +) +from pyspark.serializers import BatchedSerializer, CPickleSerializer, UTF8Deserializer +from pyspark.storagelevel import StorageLevel +from pyspark.traceback_utils import SCCallSiteSync +from pyspark.sql.column import Column, _to_seq, _to_list, _to_java_column +from pyspark.sql.readwriter import DataFrameWriter, DataFrameWriterV2 +from pyspark.sql.streaming import DataStreamWriter +from pyspark.sql.types import ( +StructType, +Row, +_parse_datatype_json_string, +) +from pyspark.sql.dataframe import ( +DataFrame as ParentDataFrame, +DataFrameNaFunctions as ParentDataFrameNaFunctions, +DataFrameStatFunctions as ParentDataFrameStatFunctions, +) +from pyspark.sql.utils import get_active_spark_context, toJArray +from pyspark.sql.pandas.conversion import PandasConversionMixin +from pyspark.sql.pandas.map_ops import PandasMapOpsMixin + +if TYPE_CHECKING: +from py4j.java_gateway import JavaObject +from pyspark.core.rdd import RDD +from pyspark.core.context import SparkContext +from pyspark._typing import PrimitiveType +from pyspark.pandas.frame import DataFrame as PandasOnSparkDataFrame +from pyspark.sql._typing import ( +ColumnOrName, +ColumnOrNameOrOrdinal, +LiteralType, +OptionalPrimitiveType, +) +from pyspark.sql.context import SQLContext +from pyspark.sql.session import SparkSession +from pyspark.sql.group import GroupedData +from pyspark.sql.observation import Observation + + +class DataFrame(ParentDataFrame, PandasMapOpsMixin, PandasConversionMixin): +def __new__( +cls, +jdf: "JavaObject", +sql_ctx: Union["SQLContext", "SparkSession"], +) -> "DataFrame": +self = object.__new__(cls) +self.__init__(jdf, sql_ctx) # type: ignore[misc] +return self + +def __init__( +self, +jdf: "JavaObject", +sql_ctx: Union["SQLContext", "SparkSession"], +): +from pyspark.sql.context import SQLContext + +self._sql_ctx: Optional["SQLContext"] = None + +if isinstance(sql_ctx, SQLContext): +assert not os.environ.get("SPARK_TESTING") # Sanity check for our internal usage. +assert isinstance(sql_ctx, SQLContext) +# We should remove this if-else branch in the future release, and rename +# sql_ctx to session in the constructor. This is an internal code path but +# was kept with a warning because it's used intensively by third-party libraries. +warnings.warn("DataFrame constructor is internal. Do not directly use it.") +self._sql_ctx = sql_ctx +session = sql_ctx.sparkSession +else: +session = sql_ctx +self._session: "SparkSession" = session + +self._sc: "SparkContext" = sql_ctx._sc +self._jdf: "JavaObject" = jdf +self.is_cached = False +# initialized lazily +self._schema: Optional[StructType] = None +self._lazy_rdd: Optional["RDD[Row]"] = None +# Check whether _repr_html is supported or not, we use it to avoid calling _jdf twice +# by __repr__ and _repr_html_ while eager evaluation opens. +self._support_repr_html = False + +@property +def
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
HyukjinKwon commented on code in PR #46129: URL: https://github.com/apache/spark/pull/46129#discussion_r1573465384 ## python/pyspark/sql/connect/session.py: ## @@ -325,7 +325,7 @@ def active(cls) -> "SparkSession": active.__doc__ = PySparkSession.active.__doc__ -def table(self, tableName: str) -> DataFrame: +def table(self, tableName: str) -> ParentDataFrame: Review Comment: Here is one example of the error: ``` python/pyspark/sql/classic/dataframe.py:276: error: Argument 1 of "exceptAll" is incompatible with supertype "DataFrame"; supertype defines the argument type as "DataFrame" [override] ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
HyukjinKwon commented on code in PR #46129: URL: https://github.com/apache/spark/pull/46129#discussion_r1573465371 ## python/pyspark/sql/connect/session.py: ## @@ -325,7 +325,7 @@ def active(cls) -> "SparkSession": active.__doc__ = PySparkSession.active.__doc__ -def table(self, tableName: str) -> DataFrame: +def table(self, tableName: str) -> ParentDataFrame: Review Comment: Seems like the arguments cannot be more specific type, and return types can't be wider types (https://mypy.readthedocs.io/en/stable/common_issues.html#incompatible-overrides). So it complains about the argument. Let me just keep them all as parent dataframe for simplicity because those types aren't user-facing anyway. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
HyukjinKwon commented on code in PR #46129: URL: https://github.com/apache/spark/pull/46129#discussion_r1573095371 ## python/pyspark/sql/classic/dataframe.py: ## @@ -0,0 +1,1974 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os +import json +import sys +import random +import warnings +from collections.abc import Iterable +from functools import reduce +from typing import ( +Any, +Callable, +Dict, +Iterator, +List, +Optional, +Sequence, +Tuple, +Type, +Union, +cast, +overload, +TYPE_CHECKING, +) + +from pyspark import _NoValue +from pyspark.resource import ResourceProfile +from pyspark._globals import _NoValueType +from pyspark.errors import ( +PySparkTypeError, +PySparkValueError, +PySparkIndexError, +PySparkAttributeError, +) +from pyspark.util import ( +is_remote_only, +_load_from_socket, +_local_iterator_from_socket, +) +from pyspark.serializers import BatchedSerializer, CPickleSerializer, UTF8Deserializer +from pyspark.storagelevel import StorageLevel +from pyspark.traceback_utils import SCCallSiteSync +from pyspark.sql.column import Column, _to_seq, _to_list, _to_java_column +from pyspark.sql.readwriter import DataFrameWriter, DataFrameWriterV2 +from pyspark.sql.streaming import DataStreamWriter +from pyspark.sql.types import ( +StructType, +Row, +_parse_datatype_json_string, +) +from pyspark.sql.dataframe import ( +DataFrame as ParentDataFrame, +DataFrameNaFunctions as ParentDataFrameNaFunctions, +DataFrameStatFunctions as ParentDataFrameStatFunctions, +) +from pyspark.sql.utils import get_active_spark_context, toJArray +from pyspark.sql.pandas.conversion import PandasConversionMixin +from pyspark.sql.pandas.map_ops import PandasMapOpsMixin + +if TYPE_CHECKING: +from py4j.java_gateway import JavaObject +from pyspark.core.rdd import RDD +from pyspark.core.context import SparkContext +from pyspark._typing import PrimitiveType +from pyspark.pandas.frame import DataFrame as PandasOnSparkDataFrame +from pyspark.sql._typing import ( +ColumnOrName, +ColumnOrNameOrOrdinal, +LiteralType, +OptionalPrimitiveType, +) +from pyspark.sql.context import SQLContext +from pyspark.sql.session import SparkSession +from pyspark.sql.group import GroupedData +from pyspark.sql.observation import Observation + + +class DataFrame(ParentDataFrame, PandasMapOpsMixin, PandasConversionMixin): +def __new__( +cls, +jdf: "JavaObject", +sql_ctx: Union["SQLContext", "SparkSession"], +) -> "DataFrame": +self = object.__new__(cls) +self.__init__(jdf, sql_ctx) # type: ignore[misc] +return self + +def __init__( +self, +jdf: "JavaObject", +sql_ctx: Union["SQLContext", "SparkSession"], +): +from pyspark.sql.context import SQLContext + +self._sql_ctx: Optional["SQLContext"] = None + +if isinstance(sql_ctx, SQLContext): +assert not os.environ.get("SPARK_TESTING") # Sanity check for our internal usage. +assert isinstance(sql_ctx, SQLContext) +# We should remove this if-else branch in the future release, and rename +# sql_ctx to session in the constructor. This is an internal code path but +# was kept with a warning because it's used intensively by third-party libraries. +warnings.warn("DataFrame constructor is internal. Do not directly use it.") +self._sql_ctx = sql_ctx +session = sql_ctx.sparkSession +else: +session = sql_ctx +self._session: "SparkSession" = session + +self._sc: "SparkContext" = sql_ctx._sc +self._jdf: "JavaObject" = jdf +self.is_cached = False +# initialized lazily +self._schema: Optional[StructType] = None +self._lazy_rdd: Optional["RDD[Row]"] = None +# Check whether _repr_html is supported or not, we use it to avoid calling _jdf twice +# by __repr__ and _repr_html_ while eager evaluation opens. +self._support_repr_html = False + +@property +def
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
HyukjinKwon commented on code in PR #46129: URL: https://github.com/apache/spark/pull/46129#discussion_r1573095413 ## python/pyspark/sql/classic/dataframe.py: ## @@ -0,0 +1,1974 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os +import json +import sys +import random +import warnings +from collections.abc import Iterable +from functools import reduce +from typing import ( +Any, +Callable, +Dict, +Iterator, +List, +Optional, +Sequence, +Tuple, +Type, +Union, +cast, +overload, +TYPE_CHECKING, +) + +from pyspark import _NoValue +from pyspark.resource import ResourceProfile +from pyspark._globals import _NoValueType +from pyspark.errors import ( +PySparkTypeError, +PySparkValueError, +PySparkIndexError, +PySparkAttributeError, +) +from pyspark.util import ( +is_remote_only, +_load_from_socket, +_local_iterator_from_socket, +) +from pyspark.serializers import BatchedSerializer, CPickleSerializer, UTF8Deserializer +from pyspark.storagelevel import StorageLevel +from pyspark.traceback_utils import SCCallSiteSync +from pyspark.sql.column import Column, _to_seq, _to_list, _to_java_column +from pyspark.sql.readwriter import DataFrameWriter, DataFrameWriterV2 +from pyspark.sql.streaming import DataStreamWriter +from pyspark.sql.types import ( +StructType, +Row, +_parse_datatype_json_string, +) +from pyspark.sql.dataframe import ( +DataFrame as ParentDataFrame, +DataFrameNaFunctions as ParentDataFrameNaFunctions, +DataFrameStatFunctions as ParentDataFrameStatFunctions, +) +from pyspark.sql.utils import get_active_spark_context, toJArray +from pyspark.sql.pandas.conversion import PandasConversionMixin +from pyspark.sql.pandas.map_ops import PandasMapOpsMixin + +if TYPE_CHECKING: +from py4j.java_gateway import JavaObject +from pyspark.core.rdd import RDD +from pyspark.core.context import SparkContext +from pyspark._typing import PrimitiveType +from pyspark.pandas.frame import DataFrame as PandasOnSparkDataFrame +from pyspark.sql._typing import ( +ColumnOrName, +ColumnOrNameOrOrdinal, +LiteralType, +OptionalPrimitiveType, +) +from pyspark.sql.context import SQLContext +from pyspark.sql.session import SparkSession +from pyspark.sql.group import GroupedData +from pyspark.sql.observation import Observation + + +class DataFrame(ParentDataFrame, PandasMapOpsMixin, PandasConversionMixin): +def __new__( +cls, +jdf: "JavaObject", +sql_ctx: Union["SQLContext", "SparkSession"], +) -> "DataFrame": +self = object.__new__(cls) +self.__init__(jdf, sql_ctx) # type: ignore[misc] +return self + +def __init__( +self, +jdf: "JavaObject", +sql_ctx: Union["SQLContext", "SparkSession"], +): +from pyspark.sql.context import SQLContext + +self._sql_ctx: Optional["SQLContext"] = None + +if isinstance(sql_ctx, SQLContext): +assert not os.environ.get("SPARK_TESTING") # Sanity check for our internal usage. +assert isinstance(sql_ctx, SQLContext) +# We should remove this if-else branch in the future release, and rename +# sql_ctx to session in the constructor. This is an internal code path but +# was kept with a warning because it's used intensively by third-party libraries. +warnings.warn("DataFrame constructor is internal. Do not directly use it.") +self._sql_ctx = sql_ctx +session = sql_ctx.sparkSession +else: +session = sql_ctx +self._session: "SparkSession" = session + +self._sc: "SparkContext" = sql_ctx._sc +self._jdf: "JavaObject" = jdf +self.is_cached = False +# initialized lazily +self._schema: Optional[StructType] = None +self._lazy_rdd: Optional["RDD[Row]"] = None +# Check whether _repr_html is supported or not, we use it to avoid calling _jdf twice +# by __repr__ and _repr_html_ while eager evaluation opens. +self._support_repr_html = False + +@property +def
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
HyukjinKwon commented on code in PR #46129: URL: https://github.com/apache/spark/pull/46129#discussion_r1573095371 ## python/pyspark/sql/classic/dataframe.py: ## @@ -0,0 +1,1974 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os +import json +import sys +import random +import warnings +from collections.abc import Iterable +from functools import reduce +from typing import ( +Any, +Callable, +Dict, +Iterator, +List, +Optional, +Sequence, +Tuple, +Type, +Union, +cast, +overload, +TYPE_CHECKING, +) + +from pyspark import _NoValue +from pyspark.resource import ResourceProfile +from pyspark._globals import _NoValueType +from pyspark.errors import ( +PySparkTypeError, +PySparkValueError, +PySparkIndexError, +PySparkAttributeError, +) +from pyspark.util import ( +is_remote_only, +_load_from_socket, +_local_iterator_from_socket, +) +from pyspark.serializers import BatchedSerializer, CPickleSerializer, UTF8Deserializer +from pyspark.storagelevel import StorageLevel +from pyspark.traceback_utils import SCCallSiteSync +from pyspark.sql.column import Column, _to_seq, _to_list, _to_java_column +from pyspark.sql.readwriter import DataFrameWriter, DataFrameWriterV2 +from pyspark.sql.streaming import DataStreamWriter +from pyspark.sql.types import ( +StructType, +Row, +_parse_datatype_json_string, +) +from pyspark.sql.dataframe import ( +DataFrame as ParentDataFrame, +DataFrameNaFunctions as ParentDataFrameNaFunctions, +DataFrameStatFunctions as ParentDataFrameStatFunctions, +) +from pyspark.sql.utils import get_active_spark_context, toJArray +from pyspark.sql.pandas.conversion import PandasConversionMixin +from pyspark.sql.pandas.map_ops import PandasMapOpsMixin + +if TYPE_CHECKING: +from py4j.java_gateway import JavaObject +from pyspark.core.rdd import RDD +from pyspark.core.context import SparkContext +from pyspark._typing import PrimitiveType +from pyspark.pandas.frame import DataFrame as PandasOnSparkDataFrame +from pyspark.sql._typing import ( +ColumnOrName, +ColumnOrNameOrOrdinal, +LiteralType, +OptionalPrimitiveType, +) +from pyspark.sql.context import SQLContext +from pyspark.sql.session import SparkSession +from pyspark.sql.group import GroupedData +from pyspark.sql.observation import Observation + + +class DataFrame(ParentDataFrame, PandasMapOpsMixin, PandasConversionMixin): +def __new__( +cls, +jdf: "JavaObject", +sql_ctx: Union["SQLContext", "SparkSession"], +) -> "DataFrame": +self = object.__new__(cls) +self.__init__(jdf, sql_ctx) # type: ignore[misc] +return self + +def __init__( +self, +jdf: "JavaObject", +sql_ctx: Union["SQLContext", "SparkSession"], +): +from pyspark.sql.context import SQLContext + +self._sql_ctx: Optional["SQLContext"] = None + +if isinstance(sql_ctx, SQLContext): +assert not os.environ.get("SPARK_TESTING") # Sanity check for our internal usage. +assert isinstance(sql_ctx, SQLContext) +# We should remove this if-else branch in the future release, and rename +# sql_ctx to session in the constructor. This is an internal code path but +# was kept with a warning because it's used intensively by third-party libraries. +warnings.warn("DataFrame constructor is internal. Do not directly use it.") +self._sql_ctx = sql_ctx +session = sql_ctx.sparkSession +else: +session = sql_ctx +self._session: "SparkSession" = session + +self._sc: "SparkContext" = sql_ctx._sc +self._jdf: "JavaObject" = jdf +self.is_cached = False +# initialized lazily +self._schema: Optional[StructType] = None +self._lazy_rdd: Optional["RDD[Row]"] = None +# Check whether _repr_html is supported or not, we use it to avoid calling _jdf twice +# by __repr__ and _repr_html_ while eager evaluation opens. +self._support_repr_html = False + +@property +def
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
HyukjinKwon commented on code in PR #46129: URL: https://github.com/apache/spark/pull/46129#discussion_r1573094548 ## python/pyspark/sql/utils.py: ## @@ -302,6 +302,33 @@ def wrapped(*args: Any, **kwargs: Any) -> Any: return cast(FuncT, wrapped) +def dispatch_df_method(f: FuncT) -> FuncT: +""" +For the usecases of direct DataFrame.union(df, ...), it checks if self +is a Connect DataFrame or Classic DataFrame, and dispatches. +""" + +@functools.wraps(f) +def wrapped(*args: Any, **kwargs: Any) -> Any: +if is_remote() and "PYSPARK_NO_NAMESPACE_SHARE" not in os.environ: +from pyspark.sql.connect.dataframe import DataFrame as ConnectDataFrame + +if isinstance(args[0], ConnectDataFrame): +return getattr(ConnectDataFrame, f.__name__)(*args, **kwargs) +else: +from pyspark.sql.classic.dataframe import DataFrame as ClassicDataFrame Review Comment: It should be covered by `is_remote` which is `True` when `is_remote_only` is `True`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
HyukjinKwon commented on code in PR #46129: URL: https://github.com/apache/spark/pull/46129#discussion_r1573094799 ## python/pyspark/sql/connect/session.py: ## @@ -325,7 +325,7 @@ def active(cls) -> "SparkSession": active.__doc__ = PySparkSession.active.__doc__ -def table(self, tableName: str) -> DataFrame: +def table(self, tableName: str) -> ParentDataFrame: Review Comment: this was the way MyPy least complained IIRC. Let me take a look again .. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
ueshin commented on code in PR #46129: URL: https://github.com/apache/spark/pull/46129#discussion_r1572787315 ## python/pyspark/sql/dataframe.py: ## @@ -139,51 +123,29 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin): created via using the constructor. """ -def __init__( -self, +# HACK ALERT!! this is to reduce the backward compatibility concern, and returns +# Spark Classic DataFrame by default. This is NOT an API, and NOT supposed to +# be directly invoked. DO NOT use this constructor. +_sql_ctx: Optional["SQLContext"] +_session: "SparkSession" +_sc: "SparkContext" +_jdf: "JavaObject" +is_cached: bool +_schema: Optional[StructType] +_lazy_rdd: Optional["RDD[Row]"] +_support_repr_html: bool + +def __new__( +cls, jdf: "JavaObject", sql_ctx: Union["SQLContext", "SparkSession"], -): -from pyspark.sql.context import SQLContext - -self._sql_ctx: Optional["SQLContext"] = None - -if isinstance(sql_ctx, SQLContext): -assert not os.environ.get("SPARK_TESTING") # Sanity check for our internal usage. -assert isinstance(sql_ctx, SQLContext) -# We should remove this if-else branch in the future release, and rename -# sql_ctx to session in the constructor. This is an internal code path but -# was kept with a warning because it's used intensively by third-party libraries. -warnings.warn("DataFrame constructor is internal. Do not directly use it.") -self._sql_ctx = sql_ctx -session = sql_ctx.sparkSession -else: -session = sql_ctx -self._session: "SparkSession" = session - -self._sc: "SparkContext" = sql_ctx._sc -self._jdf: "JavaObject" = jdf -self.is_cached = False -# initialized lazily -self._schema: Optional[StructType] = None -self._lazy_rdd: Optional["RDD[Row]"] = None -# Check whether _repr_html is supported or not, we use it to avoid calling _jdf twice -# by __repr__ and _repr_html_ while eager evaluation opens. -self._support_repr_html = False - -@property -def sql_ctx(self) -> "SQLContext": -from pyspark.sql.context import SQLContext +) -> "DataFrame": +from pyspark.sql.classic.dataframe import DataFrame -warnings.warn( -"DataFrame.sql_ctx is an internal property, and will be removed " -"in future releases. Use DataFrame.sparkSession instead." -) -if self._sql_ctx is None: -self._sql_ctx = SQLContext._get_or_create(self._sc) -return self._sql_ctx +return DataFrame.__new__(DataFrame, jdf, sql_ctx) @property +@dispatch_df_method Review Comment: The dispatch for `property` seems not working. ```py >>> class A: ... @property ... @dispatch_df_method ... def a(self): ... return 1 >>> >>> a = A() >>> A.a(a) Traceback (most recent call last): File "", line 1, in TypeError: 'property' object is not callable ``` I don't think we need this for `property` as this usage of `property` won't work anyway? ## python/pyspark/sql/classic/dataframe.py: ## @@ -0,0 +1,1974 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os +import json +import sys +import random +import warnings +from collections.abc import Iterable +from functools import reduce +from typing import ( +Any, +Callable, +Dict, +Iterator, +List, +Optional, +Sequence, +Tuple, +Type, +Union, +cast, +overload, +TYPE_CHECKING, +) + +from pyspark import _NoValue +from pyspark.resource import ResourceProfile +from pyspark._globals import _NoValueType +from pyspark.errors import ( +PySparkTypeError, +PySparkValueError, +PySparkIndexError, +PySparkAttributeError, +) +from pyspark.util import ( +is_remote_only, +_load_from_socket, +_local_iterator_from_socket, +) +from pyspark.serializers import BatchedSerializer, CPickleSerializer, UTF8Deserializer +from pyspark.storagelevel import
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
HyukjinKwon commented on PR #46129: URL: https://github.com/apache/spark/pull/46129#issuecomment-2067140476 Will fix up the tests soon. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
HyukjinKwon commented on code in PR #46129: URL: https://github.com/apache/spark/pull/46129#discussion_r1572006465 ## python/pyspark/sql/connect/dataframe.py: ## @@ -2306,7 +2183,7 @@ def _test() -> None: ) (failure_count, test_count) = doctest.testmod( -pyspark.sql.connect.dataframe, +pyspark.sql.dataframe, Review Comment: It does inherit the docstrings but `doctest` cannot. So I manually switch it to `pyspark.sql.dataframe` to run the doctest. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
HyukjinKwon commented on code in PR #46129: URL: https://github.com/apache/spark/pull/46129#discussion_r1571898280 ## python/pyspark/sql/classic/dataframe.py: ## @@ -0,0 +1,1952 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os +import json +import sys +import random +import warnings +from collections.abc import Iterable +from functools import reduce +from typing import ( +Any, +Callable, +Dict, +Iterator, +List, +Optional, +Sequence, +Tuple, +Type, +Union, +cast, +overload, +TYPE_CHECKING, +) + +from pyspark import _NoValue +from pyspark._globals import _NoValueType +from pyspark.errors import ( +PySparkTypeError, +PySparkValueError, +PySparkIndexError, +PySparkAttributeError, +) +from pyspark.util import ( +is_remote_only, +_load_from_socket, +_local_iterator_from_socket, +) +from pyspark.serializers import BatchedSerializer, CPickleSerializer, UTF8Deserializer +from pyspark.storagelevel import StorageLevel +from pyspark.traceback_utils import SCCallSiteSync +from pyspark.sql.column import Column, _to_seq, _to_list, _to_java_column +from pyspark.sql.readwriter import DataFrameWriter, DataFrameWriterV2 +from pyspark.sql.streaming import DataStreamWriter +from pyspark.sql.types import ( +StructType, +Row, +_parse_datatype_json_string, +) +from pyspark.sql.dataframe import ( +DataFrame as ParentDataFrame, +DataFrameNaFunctions as ParentDataFrameNaFunctions, +DataFrameStatFunctions as ParentDataFrameStatFunctions, +) +from pyspark.sql.utils import get_active_spark_context, toJArray +from pyspark.sql.pandas.conversion import PandasConversionMixin +from pyspark.sql.pandas.map_ops import PandasMapOpsMixin + +if TYPE_CHECKING: +from py4j.java_gateway import JavaObject +from pyspark.core.rdd import RDD +from pyspark.core.context import SparkContext +from pyspark._typing import PrimitiveType +from pyspark.pandas.frame import DataFrame as PandasOnSparkDataFrame +from pyspark.sql._typing import ( +ColumnOrName, +ColumnOrNameOrOrdinal, +LiteralType, +OptionalPrimitiveType, +) +from pyspark.sql.context import SQLContext +from pyspark.sql.session import SparkSession +from pyspark.sql.group import GroupedData +from pyspark.sql.observation import Observation + + +class DataFrame(ParentDataFrame, PandasMapOpsMixin, PandasConversionMixin): +def __new__( +cls, +jdf: "JavaObject", +sql_ctx: Union["SQLContext", "SparkSession"], +) -> "DataFrame": +self = object.__new__(cls) +self.__init__(jdf, sql_ctx) # type: ignore[misc] +return self + +def __init__( +self, +jdf: "JavaObject", +sql_ctx: Union["SQLContext", "SparkSession"], +): +from pyspark.sql.context import SQLContext + +self._sql_ctx: Optional["SQLContext"] = None + +if isinstance(sql_ctx, SQLContext): +assert not os.environ.get("SPARK_TESTING") # Sanity check for our internal usage. +assert isinstance(sql_ctx, SQLContext) +# We should remove this if-else branch in the future release, and rename +# sql_ctx to session in the constructor. This is an internal code path but +# was kept with a warning because it's used intensively by third-party libraries. +warnings.warn("DataFrame constructor is internal. Do not directly use it.") +self._sql_ctx = sql_ctx +session = sql_ctx.sparkSession +else: +session = sql_ctx +self._session: "SparkSession" = session + +self._sc: "SparkContext" = sql_ctx._sc +self._jdf: "JavaObject" = jdf +self.is_cached = False +# initialized lazily +self._schema: Optional[StructType] = None +self._lazy_rdd: Optional["RDD[Row]"] = None +# Check whether _repr_html is supported or not, we use it to avoid calling _jdf twice +# by __repr__ and _repr_html_ while eager evaluation opens. +self._support_repr_html = False + +@property +def sql_ctx(self) -> "SQLContext": +from
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
HyukjinKwon commented on code in PR #46129: URL: https://github.com/apache/spark/pull/46129#discussion_r1571898280 ## python/pyspark/sql/classic/dataframe.py: ## @@ -0,0 +1,1952 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os +import json +import sys +import random +import warnings +from collections.abc import Iterable +from functools import reduce +from typing import ( +Any, +Callable, +Dict, +Iterator, +List, +Optional, +Sequence, +Tuple, +Type, +Union, +cast, +overload, +TYPE_CHECKING, +) + +from pyspark import _NoValue +from pyspark._globals import _NoValueType +from pyspark.errors import ( +PySparkTypeError, +PySparkValueError, +PySparkIndexError, +PySparkAttributeError, +) +from pyspark.util import ( +is_remote_only, +_load_from_socket, +_local_iterator_from_socket, +) +from pyspark.serializers import BatchedSerializer, CPickleSerializer, UTF8Deserializer +from pyspark.storagelevel import StorageLevel +from pyspark.traceback_utils import SCCallSiteSync +from pyspark.sql.column import Column, _to_seq, _to_list, _to_java_column +from pyspark.sql.readwriter import DataFrameWriter, DataFrameWriterV2 +from pyspark.sql.streaming import DataStreamWriter +from pyspark.sql.types import ( +StructType, +Row, +_parse_datatype_json_string, +) +from pyspark.sql.dataframe import ( +DataFrame as ParentDataFrame, +DataFrameNaFunctions as ParentDataFrameNaFunctions, +DataFrameStatFunctions as ParentDataFrameStatFunctions, +) +from pyspark.sql.utils import get_active_spark_context, toJArray +from pyspark.sql.pandas.conversion import PandasConversionMixin +from pyspark.sql.pandas.map_ops import PandasMapOpsMixin + +if TYPE_CHECKING: +from py4j.java_gateway import JavaObject +from pyspark.core.rdd import RDD +from pyspark.core.context import SparkContext +from pyspark._typing import PrimitiveType +from pyspark.pandas.frame import DataFrame as PandasOnSparkDataFrame +from pyspark.sql._typing import ( +ColumnOrName, +ColumnOrNameOrOrdinal, +LiteralType, +OptionalPrimitiveType, +) +from pyspark.sql.context import SQLContext +from pyspark.sql.session import SparkSession +from pyspark.sql.group import GroupedData +from pyspark.sql.observation import Observation + + +class DataFrame(ParentDataFrame, PandasMapOpsMixin, PandasConversionMixin): +def __new__( +cls, +jdf: "JavaObject", +sql_ctx: Union["SQLContext", "SparkSession"], +) -> "DataFrame": +self = object.__new__(cls) +self.__init__(jdf, sql_ctx) # type: ignore[misc] +return self + +def __init__( +self, +jdf: "JavaObject", +sql_ctx: Union["SQLContext", "SparkSession"], +): +from pyspark.sql.context import SQLContext + +self._sql_ctx: Optional["SQLContext"] = None + +if isinstance(sql_ctx, SQLContext): +assert not os.environ.get("SPARK_TESTING") # Sanity check for our internal usage. +assert isinstance(sql_ctx, SQLContext) +# We should remove this if-else branch in the future release, and rename +# sql_ctx to session in the constructor. This is an internal code path but +# was kept with a warning because it's used intensively by third-party libraries. +warnings.warn("DataFrame constructor is internal. Do not directly use it.") +self._sql_ctx = sql_ctx +session = sql_ctx.sparkSession +else: +session = sql_ctx +self._session: "SparkSession" = session + +self._sc: "SparkContext" = sql_ctx._sc +self._jdf: "JavaObject" = jdf +self.is_cached = False +# initialized lazily +self._schema: Optional[StructType] = None +self._lazy_rdd: Optional["RDD[Row]"] = None +# Check whether _repr_html is supported or not, we use it to avoid calling _jdf twice +# by __repr__ and _repr_html_ while eager evaluation opens. +self._support_repr_html = False + +@property +def sql_ctx(self) -> "SQLContext": +from
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
zhengruifeng commented on code in PR #46129: URL: https://github.com/apache/spark/pull/46129#discussion_r1571874148 ## python/pyspark/sql/classic/dataframe.py: ## @@ -0,0 +1,1952 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os +import json +import sys +import random +import warnings +from collections.abc import Iterable +from functools import reduce +from typing import ( +Any, +Callable, +Dict, +Iterator, +List, +Optional, +Sequence, +Tuple, +Type, +Union, +cast, +overload, +TYPE_CHECKING, +) + +from pyspark import _NoValue +from pyspark._globals import _NoValueType +from pyspark.errors import ( +PySparkTypeError, +PySparkValueError, +PySparkIndexError, +PySparkAttributeError, +) +from pyspark.util import ( +is_remote_only, +_load_from_socket, +_local_iterator_from_socket, +) +from pyspark.serializers import BatchedSerializer, CPickleSerializer, UTF8Deserializer +from pyspark.storagelevel import StorageLevel +from pyspark.traceback_utils import SCCallSiteSync +from pyspark.sql.column import Column, _to_seq, _to_list, _to_java_column +from pyspark.sql.readwriter import DataFrameWriter, DataFrameWriterV2 +from pyspark.sql.streaming import DataStreamWriter +from pyspark.sql.types import ( +StructType, +Row, +_parse_datatype_json_string, +) +from pyspark.sql.dataframe import ( +DataFrame as ParentDataFrame, +DataFrameNaFunctions as ParentDataFrameNaFunctions, +DataFrameStatFunctions as ParentDataFrameStatFunctions, +) +from pyspark.sql.utils import get_active_spark_context, toJArray +from pyspark.sql.pandas.conversion import PandasConversionMixin +from pyspark.sql.pandas.map_ops import PandasMapOpsMixin + +if TYPE_CHECKING: +from py4j.java_gateway import JavaObject +from pyspark.core.rdd import RDD +from pyspark.core.context import SparkContext +from pyspark._typing import PrimitiveType +from pyspark.pandas.frame import DataFrame as PandasOnSparkDataFrame +from pyspark.sql._typing import ( +ColumnOrName, +ColumnOrNameOrOrdinal, +LiteralType, +OptionalPrimitiveType, +) +from pyspark.sql.context import SQLContext +from pyspark.sql.session import SparkSession +from pyspark.sql.group import GroupedData +from pyspark.sql.observation import Observation + + +class DataFrame(ParentDataFrame, PandasMapOpsMixin, PandasConversionMixin): +def __new__( +cls, +jdf: "JavaObject", +sql_ctx: Union["SQLContext", "SparkSession"], +) -> "DataFrame": +self = object.__new__(cls) +self.__init__(jdf, sql_ctx) # type: ignore[misc] +return self + +def __init__( +self, +jdf: "JavaObject", +sql_ctx: Union["SQLContext", "SparkSession"], +): +from pyspark.sql.context import SQLContext + +self._sql_ctx: Optional["SQLContext"] = None + +if isinstance(sql_ctx, SQLContext): +assert not os.environ.get("SPARK_TESTING") # Sanity check for our internal usage. +assert isinstance(sql_ctx, SQLContext) +# We should remove this if-else branch in the future release, and rename +# sql_ctx to session in the constructor. This is an internal code path but +# was kept with a warning because it's used intensively by third-party libraries. +warnings.warn("DataFrame constructor is internal. Do not directly use it.") +self._sql_ctx = sql_ctx +session = sql_ctx.sparkSession +else: +session = sql_ctx +self._session: "SparkSession" = session + +self._sc: "SparkContext" = sql_ctx._sc +self._jdf: "JavaObject" = jdf +self.is_cached = False +# initialized lazily +self._schema: Optional[StructType] = None +self._lazy_rdd: Optional["RDD[Row]"] = None +# Check whether _repr_html is supported or not, we use it to avoid calling _jdf twice +# by __repr__ and _repr_html_ while eager evaluation opens. +self._support_repr_html = False + +@property +def sql_ctx(self) -> "SQLContext": +from
Re: [PR] [SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic [spark]
HyukjinKwon commented on PR #46129: URL: https://github.com/apache/spark/pull/46129#issuecomment-2065807249 cc @ueshin @zhengruifeng @allisonwang-db @xinrong-meng @itholic @hvanhovell @grundprinzip -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org