This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 8ac083c08ef8 [SPARK-55674][PYTHON] Optimize 0-column table conversion in Spark Connect
8ac083c08ef8 is described below
commit 8ac083c08ef8ca4b1d7d1baa84b20fbc119adbeb
Author: Yicong-Huang <[email protected]>
AuthorDate: Wed Feb 25 15:56:27 2026 +0900
[SPARK-55674][PYTHON] Optimize 0-column table conversion in Spark Connect
### What changes were proposed in this pull request?
Replace `pa.Table.from_struct_array(pa.array([{}] * len(data), type=pa.struct([])))` with `pa.Table.from_batches([pa.RecordBatch.from_pandas(data)])` in `connect/session.py` when handling 0-column pandas DataFrames. This is an O(1) operation, regardless of how many rows there are.
### Why are the changes needed?
The original approach constructs `len(data)` Python dict objects (`[{}] *
len(data)`), which is O(n). `pa.RecordBatch.from_pandas` is an O(1) operation
regardless of the number of rows, as it reads row
count directly from pandas index metadata without allocating per-row
Python objects.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #54468 from Yicong-Huang/SPARK-55674/followup/unify-zero-column-pandas-arrow-fix.
Authored-by: Yicong-Huang <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
python/pyspark/sql/connect/session.py | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/python/pyspark/sql/connect/session.py b/python/pyspark/sql/connect/session.py
index 384d10c2ae58..f9a360ec6054 100644
--- a/python/pyspark/sql/connect/session.py
+++ b/python/pyspark/sql/connect/session.py
@@ -622,8 +622,9 @@ class SparkSession:
             safecheck = configs["spark.sql.execution.pandas.convertToArrowArraySafely"]
             # Handle the 0-column case separately to preserve row count.
+            # pa.RecordBatch.from_pandas preserves num_rows via pandas index metadata.
             if len(data.columns) == 0:
-                _table = pa.Table.from_struct_array(pa.array([{}] * len(data), type=pa.struct([])))
+                _table = pa.Table.from_batches([pa.RecordBatch.from_pandas(data)])
             else:
                 _table = pa.Table.from_batches(
                     [
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]