zhengruifeng opened a new pull request, #42828:
URL: https://github.com/apache/spark/pull/42828

   ### What changes were proposed in this pull request?
   
   - Make `getitem` work with duplicated columns
   - Disallow bool type index
   - Disallow negative index
   
   
   ### Why are the changes needed?
   1, The SQL `ORDER BY` ordinal feature works with duplicated columns:
   ```
   In [4]: df = spark.sql("SELECT * FROM VALUES (1, 1.1, 'a'), (2, 2.2, 'b'), (4, 4.4, 'c') AS TAB(a, a, a)")
   
   In [5]: df.createOrReplaceTempView("v")
   
   In [6]: spark.sql("SELECT * FROM v ORDER BY 1, 2").show()
   +---+---+---+
   |  a|  a|  a|
   +---+---+---+
   |  1|1.1|  a|
   |  2|2.2|  b|
   |  4|4.4|  c|
   +---+---+---+
   ```
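   As a plain-Python analogy (a toy sketch, not Spark code), ordinal ordering keys off column positions, so duplicated column names never come into play:

   ```python
# Toy rows mirroring the TAB(a, a, a) example above; the column names are
# duplicated, but ORDER BY 1, 2 only cares about positions 0 and 1.
rows = [(2, 2.2, "b"), (4, 4.4, "c"), (1, 1.1, "a")]

# Sort by the first and second columns by position, like ORDER BY 1, 2.
ordered = sorted(rows, key=lambda r: (r[0], r[1]))
print(ordered)  # [(1, 1.1, 'a'), (2, 2.2, 'b'), (4, 4.4, 'c')]
   ```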
   
   To support it in DataFrame APIs, we need to make `getitem` work with duplicated columns.
   
   
   2 & 3: the existing support for bool and negative indexes appears to be unintentional.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes.

   1, Make `getitem` work with duplicated columns
   before
   ```
   In [6]: df = spark.sql("SELECT * FROM VALUES (1, 1.1, 'a'), (2, 2.2, 'b'), (4, 4.4, 'c') AS TAB(a, a, a)")
   
   In [7]: df.orderBy(1, 2).show()
   ---------------------------------------------------------------------------
   AnalysisException                         Traceback (most recent call last)
   Cell In[7], line 1
   ----> 1 df.orderBy(1, 2).show()
   
   ...
   
   AnalysisException: [AMBIGUOUS_REFERENCE] Reference `a` is ambiguous, could be: [`TAB`.`a`, `TAB`.`a`, `TAB`.`a`].
   
   ```
   
   after
   
   ```
   In [1]: df = spark.sql("SELECT * FROM VALUES (1, 1.1, 'a'), (2, 2.2, 'b'), (4, 4.4, 'c') AS TAB(a, a, a)")
   
   In [2]: df[0]
   Out[2]: Column<'a'>
   
   In [3]: df[1]
   Out[3]: Column<'a'>
   
   In [4]: df.orderBy(1, 2).show()
   +---+---+---+
   |  a|  a|  a|
   +---+---+---+
   |  1|1.1|  a|
   |  2|2.2|  b|
   |  4|4.4|  c|
   +---+---+---+
   ```
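   A minimal sketch (hypothetical, not the actual PySpark implementation) of why positional lookup sidesteps the ambiguity that name-based lookup hits:

   ```python
class TinyFrame:
    """Hypothetical stand-in for a DataFrame whose column names may repeat."""

    def __init__(self, columns):
        self.columns = columns  # e.g. ["a", "a", "a"]

    def by_name(self, name):
        # Name-based lookup is ambiguous when the name is duplicated.
        matches = [i for i, c in enumerate(self.columns) if c == name]
        if len(matches) > 1:
            raise ValueError(f"Reference `{name}` is ambiguous")
        return matches[0]

    def __getitem__(self, item):
        # Positional lookup never consults names, so duplicates are fine.
        return item


df = TinyFrame(["a", "a", "a"])
try:
    df.by_name("a")
except ValueError as e:
    print(e)      # Reference `a` is ambiguous
print(df[1])      # 1: the second column, resolved purely by position
   ```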
   
   
   2, Disallow bool type index
   before
   ```
   In [1]: df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", "name"],)
   
   In [2]: df[False]
   Out[2]: Column<'age'>
   
   In [3]: df[True]
   Out[3]: Column<'name'>
   ```
   
   after
   ```
   In [2]: df[True]
   ---------------------------------------------------------------------------
   PySparkTypeError                          Traceback (most recent call last)
   ...
   PySparkTypeError: [NOT_COLUMN_OR_FLOAT_OR_INT_OR_LIST_OR_STR] Argument `item` should be a column, float, integer, list or string, got bool.
   ```
   
   3, Disallow negative index
   before
   ```
   In [1]: df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", "name"],)
   
   In [4]: df[-1]
   Out[4]: Column<'name'>
   
   In [5]: df[-2]
   Out[5]: Column<'age'>
   ```
   
   after
   ```
   In [3]: df[-1]
   ---------------------------------------------------------------------------
   IndexError                                Traceback (most recent call last)
   ...
   IndexError: Column index must be non-negative but got -1
   ```
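   The two checks can be sketched together. Note that in Python `bool` is a subclass of `int`, so the bool check must run before the negative-int check; the helper below is a hypothetical illustration, not the actual PySpark code:

   ```python
def validate_column_index(item):
    """Hypothetical validator mirroring the new behavior described above."""
    # bool is a subclass of int, so it must be rejected first;
    # otherwise df[True] would silently behave like df[1].
    if isinstance(item, bool):
        raise TypeError(
            "Argument `item` should be a column, float, integer, "
            "list or string, got bool."
        )
    if isinstance(item, int) and item < 0:
        raise IndexError(f"Column index must be non-negative but got {item}")
    return item


validate_column_index(1)        # OK
# validate_column_index(True)   -> TypeError
# validate_column_index(-1)     -> IndexError
   ```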
   
   
   
   ### How was this patch tested?
   Added unit tests.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   NO


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
