itholic commented on PR #42798:
URL: https://github.com/apache/spark/pull/42798#issuecomment-1705851281
@zhengruifeng I think the problem is that Pandas computes the concat
without sorting, so the result can be different when the index is not sorted, as
below:
## Problem
**Pandas**
```python
>>> pdf
A B
4 a 1
3 b 2
2 c 3
>>> pdf.sum()
A abc
B 6
dtype: object
```
**Pandas API on Spark**
```python
>>> psdf
A B
4 a 1
3 b 2
2 c 3
>>> psdf.sum()
A cba # we internally sorted the index, so the result is different from Pandas
B 6
dtype: object
```
## Solution
I think for now we can pick one of the three options below:
1. We can document the warning note as below:
```
The result for string type column is non-deterministic since the
implementation depends on `collect_list` API from PySpark which is
non-deterministic as well.
```
2. We can `collect_list` both value and index, and sort by the indices
before `concat_ws` as you suggested, and document the warning note as below:
```
The result for string type column can be different from Pandas when the
index is not sorted, since we always sort by index before computing: the
implementation depends on the `collect_list` API from PySpark, which is
non-deterministic.
```
3. We keep not supporting string type columns, as is the case so far, and add a
note explaining why they are not supported, as below:
```
String type column is not supported for now, because it might yield
non-deterministic results unlike in Pandas.
```
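To make the trade-off in option 2 concrete, here is a minimal pure-Python sketch of the idea, not the actual PySpark implementation. It simulates a `collect_list`-style aggregation that returns rows in arbitrary order, then sorts by index before concatenating. The names `rows` and `concat_sorted_by_index` are hypothetical and used only for illustration.

```python
import random

# Hypothetical (index, value) rows mirroring column A from the example above.
rows = [(4, "a"), (3, "b"), (2, "c")]


def concat_sorted_by_index(pairs):
    """Concatenate values after ordering the (index, value) pairs by index."""
    return "".join(value for _, value in sorted(pairs, key=lambda p: p[0]))


# Simulate the arbitrary arrival order of a collect_list-style aggregation.
shuffled = rows[:]
random.shuffle(shuffled)

# Sorting by index makes the result deterministic regardless of arrival order.
result = concat_sorted_by_index(shuffled)

# Pandas concatenates in row order, without sorting the index.
pandas_like = "".join(value for _, value in rows)

print(result)       # cba
print(pandas_like)  # abc
```

The sorted result is stable across runs, but it still differs from the Pandas result whenever the original index is not already sorted, which is what the warning note in option 2 would need to say.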
WDYT? Also cc @HyukjinKwon, @ueshin, @xinrong-meng: what strategy should we take
for this situation? I believe that the same rules should apply to similar cases
that already exist or may arise in the future.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]