zhengruifeng opened a new pull request, #46975:
URL: https://github.com/apache/spark/pull/46975

   ### What changes were proposed in this pull request?
   Fix an internal raw-data leak in `YearMonthIntervalType`:
   
   1. PySpark Classic: collection of `YearMonthIntervalType` values now fails explicitly instead of returning the raw internal integer.
   2. PySpark Connect: this type was never supported in the Python-side Arrow conversion, so just refine the error message.
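   
   The Classic-side change can be sketched with a minimal standalone mock (the class names mirror pyspark's, but this is an illustration under assumed simplifications, not the actual implementation):
   
   ```python
   from typing import Any, Dict
   
   
   class PySparkNotImplementedError(NotImplementedError):
       """Simplified stand-in for pyspark.errors.PySparkNotImplementedError."""
   
       def __init__(self, error_class: str, message_parameters: Dict[str, str]):
           self.error_class = error_class
           self.message_parameters = message_parameters
           super().__init__(
               f"[{error_class}] {message_parameters['feature']} is not implemented."
           )
   
   
   class YearMonthIntervalType:
       """Sketch of the fixed type: the interval is stored internally as a
       single integer of total months (INTERVAL '10-8' -> 10 * 12 + 8 = 128)."""
   
       def fromInternal(self, obj: Any) -> Any:
           # Raising here keeps the raw month count from leaking to users.
           raise PySparkNotImplementedError(
               error_class="NOT_IMPLEMENTED",
               message_parameters={"feature": "YearMonthIntervalType.fromInternal"},
           )
   ```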
   
   
   ### Why are the changes needed?
   The raw internal representation (an integer count of months) should not be leaked to users.
   
   
   ### Does this PR introduce _any_ user-facing change?
   1. PySpark Classic (before):
   ```
   In [2]: spark.sql("SELECT INTERVAL '10-8' YEAR TO MONTH AS interval").first()[0]
   Out[2]: 128
   ```
   (128 is the raw internal month count: 10 * 12 + 8.)
   
   1. PySpark Classic (after):
   ```
   In [1]: spark.sql("SELECT INTERVAL '10-8' YEAR TO MONTH AS interval").first()[0]
   ---------------------------------------------------------------------------
   PySparkNotImplementedError                Traceback (most recent call last)
   Cell In[1], line 1
   ----> 1 spark.sql("SELECT INTERVAL '10-8' YEAR TO MONTH AS interval").first()[0]
   
   ...
   
   File ~/Dev/spark/python/pyspark/sql/types.py:641, in YearMonthIntervalType.fromInternal(self, obj)
       640 def fromInternal(self, obj: Any) -> Any:
   --> 641     raise PySparkNotImplementedError(
       642         error_class="NOT_IMPLEMENTED",
       643         message_parameters={"feature": "YearMonthIntervalType.fromInternal"},
       644     )
   
   PySparkNotImplementedError: [NOT_IMPLEMENTED] YearMonthIntervalType.fromInternal is not implemented.
   ```
   
   
   2. PySpark Connect (before):
   ```
   In [1]: spark.sql("SELECT INTERVAL '10-8' YEAR TO MONTH AS interval").first()[0]
   ---------------------------------------------------------------------------
   KeyError                                  Traceback (most recent call last)
   Cell In[1], line 1
   ----> 1 spark.sql("SELECT INTERVAL '10-8' YEAR TO MONTH AS interval").first()[0]
   
   ...
   
   File ~/Dev/spark/python/pyspark/sql/connect/conversion.py:542, in ArrowTableToRowsConversion.convert(table, schema)
       536 assert schema is not None and isinstance(schema, StructType)
       538 field_converters = [
       539     ArrowTableToRowsConversion._create_converter(f.dataType) for f in schema.fields
       540 ]
   --> 542 columnar_data = [column.to_pylist() for column in table.columns]
       544 rows: List[Row] = []
       545 for i in range(0, table.num_rows):
   
   File ~/.dev/miniconda3/envs/spark_dev_312/lib/python3.12/site-packages/pyarrow/table.pxi:1327, in pyarrow.lib.ChunkedArray.to_pylist()
   
   File ~/.dev/miniconda3/envs/spark_dev_312/lib/python3.12/site-packages/pyarrow/table.pxi:1256, in pyarrow.lib.ChunkedArray.chunk()
   
   File ~/.dev/miniconda3/envs/spark_dev_312/lib/python3.12/site-packages/pyarrow/public-api.pxi:208, in pyarrow.lib.pyarrow_wrap_array()
   
   File ~/.dev/miniconda3/envs/spark_dev_312/lib/python3.12/site-packages/pyarrow/array.pxi:3711, in pyarrow.lib.get_array_class_from_type()
   
   KeyError: 21
   ```
   
   2. PySpark Connect (after):
   ```
   In [1]: spark.sql("SELECT INTERVAL '10-8' YEAR TO MONTH AS interval").first()[0]
   
   ...
   
   File ~/Dev/spark/python/pyspark/sql/pandas/types.py:293, in from_arrow_type(at, prefer_timestamp_ntz)
       291     spark_type = NullType()
       292 else:
   --> 293     raise PySparkTypeError(
       294         error_class="UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION",
       295         message_parameters={"data_type": str(at)},
       296     )
       297 return spark_type
   
   PySparkTypeError: [UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION] month_interval is not supported in conversion to Arrow.
   ```
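   
   The Connect-side refinement can be approximated with a small mock of `from_arrow_type` (string type names stand in for real pyarrow DataType objects, and the two supported mappings are illustrative; the actual function dispatches on many more types):
   
   ```python
   class PySparkTypeError(TypeError):
       """Simplified stand-in for pyspark.errors.PySparkTypeError."""
   
       def __init__(self, error_class: str, message_parameters: dict):
           super().__init__(
               f"[{error_class}] {message_parameters['data_type']} "
               "is not supported in conversion to Arrow."
           )
   
   
   def from_arrow_type(at: str) -> str:
       # Only a couple of mappings for illustration; the real code covers
       # the full set of Spark-supported Arrow types.
       supported = {"int64": "LongType", "string": "StringType"}
       if at in supported:
           return supported[at]
       # Unsupported Arrow types (e.g. month_interval) now fail with a clear
       # error message instead of a bare KeyError deep inside pyarrow.
       raise PySparkTypeError(
           error_class="UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION",
           message_parameters={"data_type": at},
       )
   ```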
   
   ### How was this patch tested?
   Added a test.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

