[
https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-27921:
---------------------------------
Description:
This JIRA aims to improve Python test coverage, in particular for
{{ExtractPythonUDFs}}.
This rule has caused many regressions or issues such as SPARK-27803,
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
We should convert the *.sql test cases that can be affected by the
{{ExtractPythonUDFs}} rule, like
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql]
Namely, most plan-related test cases might have to be converted.
*Here is the rough contribution guide to follow:*
Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if
you're able to do this:
{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata
--------
OrderedDict([(b'pandas',
b'{"index_columns": [{"kind": "range", "name": null, "start": '
b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
b'mpy_type": "int64", "metadata": null}], "creator": {"library'
b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')])
{code}
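The version check above can also be scripted. The following is only a minimal sketch: the helper names ({{parse_version}}, {{meets_minimum}}, {{check_environment}}) are made up for illustration and are not part of Spark, and the version parsing is deliberately simplistic (it only looks at the leading numeric components).

```python
import importlib

# Minimum versions quoted in the guide above.
MINIMUMS = {"pandas": "0.23.2", "pyarrow": "0.12.1"}


def parse_version(text):
    """Turn '0.23.4' into (0, 23, 4), ignoring suffixes such as 'dev5' or 'rc1'."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two dotted version strings numerically, component by component."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment(minimums=MINIMUMS):
    """Return {package: (installed_version_or_None, ok)} for each required package."""
    report = {}
    for name, minimum in minimums.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = (None, False)
            continue
        version = getattr(module, "__version__", "0")
        report[name] = (version, meets_minimum(version, minimum))
    return report


if __name__ == "__main__":
    for name, (version, ok) in check_environment().items():
        status = "OK" if ok else "missing or too old"
        print(name, version or "not installed", status)
```

Running the script prints one line per package. It does not replace the {{pyarrow.Table.from_pandas}} call shown above, which additionally confirms that the two libraries work together.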
1. Copy the 'xxx.sql' file to {{udf/udf-xxx.sql}}.
2. Keep the comments and state that this file was copied from {{xxx.sql}}, for
now.
3. Run the commands below to generate the golden file and stage the result:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
git add .
{code}
4. Insert {{udf(...)}} into each statement. It is not required to add more
combinations, and there is no strict rule about where to insert it.
5. Run the test again to regenerate the golden file, then inspect the changes:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
git diff
# or diff xxx.sql.out udf/xxx.sql.out
{code}
6. Compare the results with those of the original file, {{xxx.sql}}. If there
is no notable diff, open a PR.
7. If there are notable differences, file or find the corresponding JIRA and
skip the affected tests with a comment.
8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}
9. When you open a PR, please attach {{diff xxx.sql.out udf/xxx.sql.out}} in
the PR description with the template below:
{code:java}
<details><summary>Diff comparing to 'xxx.sql'</summary>
<p>
```diff
... # here you put 'git diff' results
```
</p>
</details>
{code}
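Steps 5 and 9 can be combined in a small script. This is only a sketch under stated assumptions: {{golden_file_diff}} and {{pr_description_block}} are hypothetical helper names, the file names are the same placeholders used in the guide, and the sample query text is invented. It uses Python's standard {{difflib}} instead of the {{diff}} command, so the output may not match {{diff}} or {{git diff}} byte for byte.

```python
import difflib

# Built at runtime so the fence characters do not clash with surrounding markup.
FENCE = "`" * 3


def golden_file_diff(original_text, udf_text,
                     original_name="xxx.sql.out", udf_name="udf/xxx.sql.out"):
    """Return a unified diff between two golden-file contents (empty if identical)."""
    lines = difflib.unified_diff(
        original_text.splitlines(),
        udf_text.splitlines(),
        fromfile=original_name,
        tofile=udf_name,
        lineterm="",
    )
    return "\n".join(lines)


def pr_description_block(diff_text, original_sql="xxx.sql"):
    """Wrap a diff in the collapsible template from step 9."""
    return "\n".join([
        "<details><summary>Diff comparing to '%s'</summary>" % original_sql,
        "<p>",
        FENCE + "diff",
        diff_text,
        FENCE,
        "</p>",
        "</details>",
    ])


if __name__ == "__main__":
    # Invented placeholder contents, not real golden-file output.
    original = "SELECT a FROM t\n1\n"
    converted = "SELECT udf(a) FROM t\n1\n"
    print(pr_description_block(golden_file_diff(original, converted)))
```

In practice the two strings would be read from the generated {{.sql.out}} files, and the printed block pasted into the PR description.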
Note that the registered UDFs all return strings, so some differences are
expected.
Note that this JIRA targets plan-specific cases in general.
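To illustrate why string-returning UDFs lead to expected differences, here is a plain-Python stand-in. It is an assumption for illustration only: the {{udf}} function below is not Spark's test UDF, and the real UDFs run inside Spark, where implicit casts may hide or change some of these effects.

```python
def udf(value):
    """Plain-Python stand-in for a string-returning UDF; None (SQL NULL) passes through."""
    return None if value is None else str(value)


values = [2, 10, 1]

# Without the UDF the values are compared as numbers.
as_ints = sorted(values)

# With the UDF every value becomes a string, so ordering is lexicographic.
as_strings = sorted(udf(v) for v in values)

print(as_ints)     # [1, 2, 10]
print(as_strings)  # ['1', '10', '2']
```

A query whose output depends on ordering or on the result type can therefore produce a different golden file once {{udf(...)}} is inserted, even when the plan itself is correct.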
> Convert applicable *.sql tests into UDF integrated test base
> ------------------------------------------------------------
>
> Key: SPARK-27921
> URL: https://issues.apache.org/jira/browse/SPARK-27921
> Project: Spark
> Issue Type: Umbrella
> Components: PySpark, SQL
> Affects Versions: 3.0.0
> Reporter: Hyukjin Kwon
> Priority: Major
>