ashrithb opened a new pull request, #53377:
URL: https://github.com/apache/spark/pull/53377

   ### What changes were proposed in this pull request?
   
   This PR fixes the `test_to_feather` test failure with PyArrow 22.0.0 by 
making `PlanMetrics`, `MetricValue`, and `PlanObservedMetrics` 
JSON-serializable.
   
   **Changes:**
   1. Added `to_dict()` method to `MetricValue` class 
(`python/pyspark/sql/metrics.py`)
   2. Added `to_dict()` method to `PlanMetrics` class 
(`python/pyspark/sql/metrics.py`)
   3. Added `to_dict()` method to `PlanObservedMetrics` class 
(`python/pyspark/sql/connect/client/core.py`)
   4. Modified `SparkConnectClient.to_pandas()` to convert metrics to 
dictionaries before storing in `pdf.attrs`
   5. Removed the `@unittest.skipIf` workaround from `test_to_feather` test
   
   ### Why are the changes needed?
   
   PyArrow 22.0.0 changed its behavior to serialize pandas `DataFrame.attrs` to 
JSON metadata when writing Feather files. Since PySpark Spark Connect was 
storing raw `PlanMetrics` objects in `pdf.attrs["metrics"]`, this caused a 
`TypeError`: TypeError: Object of type PlanMetrics is not JSON serializable
   when serializing list item 0
   when serializing dict item 'metrics'
   when serializing dict item 'attributes'
   
   
   This broke the 
`pyspark.pandas.tests.connect.io.test_parity_feather.FeatherParityTests.test_to_feather`
 test and any user code using `DataFrame.to_feather()` after calling 
`toPandas()` on a Spark Connect DataFrame.
   
   ### Does this PR introduce any user-facing change?
   
   Minimal. The metrics stored in `pdf.attrs["metrics"]` are now dictionaries 
instead of `PlanMetrics` objects. This is technically a change, but:
   
   1. `pdf.attrs` is not a documented public API for accessing metrics
   2. Users should access metrics through `ExecutionInfo` instead
   3. The dict format is more portable (JSON-serializable, printable, easier to 
inspect)
   
   If any users were directly accessing `pdf.attrs["metrics"][i].name`, they 
would need to change to `pdf.attrs["metrics"][i]["name"]`. However, this usage 
pattern is undocumented and not any official API or right way to access them, 
they would be accessing this through `ExecutionInfo.metrics` anyways.
   
   ### How was this patch tested?
   
   - Unit tests for `MetricValue.to_dict()` verifying correct structure and 
JSON serializability
   - Unit tests for `PlanMetrics.to_dict()` verifying nested metrics conversion
   - Integration tests confirming:
     - Raw `PlanMetrics` objects fail with PyArrow 22.0.0 (expected, confirms 
the bug)
     - Dict-converted metrics work with PyArrow 22.0.0 (confirms the fix), also 
ran with previous versions of PyArrow to confirm backward compatibility
     - Data roundtrips correctly through feather format
   - Removed the `@unittest.skipIf(not has_arrow_21_or_below, "SPARK-54068")` 
workaround so the test now runs on all PyArrow versions
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to