HyukjinKwon opened a new pull request #30098:
URL: https://github.com/apache/spark/pull/30098
### What changes were proposed in this pull request?
Some tests fail with PyArrow 2.0.0+:
```
======================================================================
ERROR [0.774s]: test_grouped_over_window_with_key (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 595, in test_grouped_over_window_with_key
    .select('id', 'result').collect()
  File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 588, in collect
    sock_info = self._jdf.collectToPython()
  File "/__w/spark/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/__w/spark/spark/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.PythonException:
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 601, in main
    process()
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 593, in process
    serializer.dump_stream(out_iter, outfile)
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 255, in dump_stream
    return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 81, in dump_stream
    for batch in iterator:
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 248, in init_stream_yield_batches
    for series in iterator:
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 426, in mapper
    return f(keys, vals)
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 170, in <lambda>
    return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 158, in wrapped
    result = f(key, pd.concat(value_series, axis=1))
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 68, in wrapper
    return f(*args, **kwargs)
  File "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 590, in f
    "{} != {}".format(expected_key[i][1], window_range)
AssertionError: {'start': datetime.datetime(2018, 3, 15, 0, 0), 'end': datetime.datetime(2018, 3, 20, 0, 0)} != {'start': datetime.datetime(2018, 3, 15, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>), 'end': datetime.datetime(2018, 3, 20, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>)}
```
https://github.com/apache/spark/runs/1278917457
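The assertion above fails because the expected window key is timezone-naive while the actual window boundaries carry UTC `tzinfo` when PyArrow 2.0.0+ is installed. As a standalone illustration of why the comparison fails (the datetime values are taken from the error message above):

```python
import datetime

# Values from the assertion error above: the expected key is timezone-naive,
# while the actual window boundary is timezone-aware under PyArrow 2.0.0+.
expected = datetime.datetime(2018, 3, 15, 0, 0)
actual = datetime.datetime(2018, 3, 15, 0, 0, tzinfo=datetime.timezone.utc)

# Equality between naive and aware datetimes is always False in Python,
# so the test's key comparison fails.
print(expected == actual)  # False
```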
This PR proposes to set an upper bound on the PyArrow version in the GitHub Actions build. The pin should be removed once PyArrow 2.0.0+ is properly supported (SPARK-33189).
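The change itself only touches the CI configuration. Purely as an illustration of the same constraint expressed in Python (a hypothetical guard, not part of this PR), the intent is to reject PyArrow 2.0.0+ until SPARK-33189 is resolved:

```python
# Hypothetical sketch, not this PR's change (which pins the version in the
# GitHub Actions config): fail fast if an unsupported PyArrow is installed.
from distutils.version import LooseVersion

import pyarrow

if LooseVersion(pyarrow.__version__) >= LooseVersion("2.0.0"):
    raise RuntimeError(
        "PyArrow %s is not supported yet; install pyarrow<2.0.0"
        % pyarrow.__version__
    )
```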
### Why are the changes needed?
To make the build pass.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
The GitHub Actions build in this PR will test it out.