[jira] [Created] (SPARK-38395) Pyspark issue in resolving column when there is dot (.)

Krishna Sangeeth KS (Jira) Wed, 02 Mar 2022 09:57:05 -0800

Krishna Sangeeth KS created SPARK-38395:
-------------------------------------------


             Summary: Pyspark issue in resolving column when there is dot (.)
                 Key: SPARK-38395
                 URL: https://issues.apache.org/jira/browse/SPARK-38395
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.0.0
         Environment: Issue found in Mac OS Catalina, Pyspark 3.0
            Reporter: Krishna Sangeeth KS


Pyspark apply in pandas have some difficult in resolving columns when there is 
dot in the column name. 

Here is an example that I have which reproduces the issue.  Example taken by 
modifying doctest example 
[here|https://github.com/apache/spark/blob/branch-3.0/python/pyspark/sql/pandas/group_ops.py#L237-L248]
{code:python}
df1 = spark.createDataFrame(
[(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 1, 3.0), (20000102, 2, 
4.0)],
("abc|database|10.159.154|xef", "id", "v1"))
df2 = spark.createDataFrame(
[(20000101, 1, "x"), (20000101, 2, "y")],
("abc|database|10.159.154|xef", "id", "v2"))
def asof_join(l, r):
    return pd.merge_asof(l, r, on="abc|database|10.159.154|xef", by="id")
df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
asof_join, schema="`abc|database|10.159.154|xef` int, id int, v1 double, v2 
string").show(){code}
This gives the below error
{code:python}
AnalysisException                         Traceback (most recent call last)
<ipython-input-126-b1807bb28ae3> in <module>
      8     return pd.merge_asof(l, r, on="abc|database|10.159.154|xef", 
by="id")
      9 df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
---> 10 asof_join, schema="`abc|database|10.159.154|xef` int, id int, v1 
double, v2 string").show()

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/pandas/group_ops.py
 in applyInPandas(self, func, schema)
    295         udf = pandas_udf(
    296             func, returnType=schema, 
functionType=PythonEvalType.SQL_COGROUPED_MAP_PANDAS_UDF)
--> 297         all_cols = self._extract_cols(self._gd1) + 
self._extract_cols(self._gd2)
    298         udf_column = udf(*all_cols)
    299         jdf = self._gd1._jgd.flatMapCoGroupsInPandas(self._gd2._jgd, 
udf_column._jc.expr())

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/pandas/group_ops.py
 in _extract_cols(gd)
    303     def _extract_cols(gd):
    304         df = gd._df
--> 305         return [df[col] for col in df.columns]
    306 
    307 

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/pandas/group_ops.py
 in <listcomp>(.0)
    303     def _extract_cols(gd):
    304         df = gd._df
--> 305         return [df[col] for col in df.columns]
    306 
    307 

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/dataframe.py in 
__getitem__(self, item)
   1378         """
   1379         if isinstance(item, basestring):
-> 1380             jc = self._jdf.apply(item)
   1381             return Column(jc)
   1382         elif isinstance(item, Column):

~/anaconda3/envs/py37/lib/python3.7/site-packages/py4j/java_gateway.py in 
__call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307         for temp_arg in temp_args:

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/utils.py in 
deco(*a, **kw)
    135                 # Hide where the exception came from that shows a 
non-Pythonic
    136                 # JVM exception message.
--> 137                 raise_from(converted)
    138             else:
    139                 raise

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/utils.py in 
raise_from(e)

AnalysisException: Cannot resolve column name "abc|database|10.159.154|xef" 
among (abc|database|10.159.154|xef, id, v1); did you mean to quote the 
`abc|database|10.159.154|xef` column?;

{code}
As we can see the column is present there in the `among` list.

When i replace `.` (dot) with `_` (underscore) the code actually works.
{code:python}
df1 = spark.createDataFrame(
[(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 1, 3.0), (20000102, 2, 
4.0)],
("abc|database|10_159_154|xef", "id", "v1"))
df2 = spark.createDataFrame(
[(20000101, 1, "x"), (20000101, 2, "y")],
("abc|database|10_159_154|xef", "id", "v2"))
def asof_join(l, r):
    return pd.merge_asof(l, r, on="abc|database|10_159_154|xef", by="id")
df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
asof_join, schema="`abc|database|10_159_154|xef` int, id int, v1 double, v2 
string").show()
{code}
{code:java}
+---------------------------+---+---+---+
|abc|database|10_159_154|xef| id| v1| v2|
+---------------------------+---+---+---+
|                   20000101|  1|1.0|  x|
|                   20000102|  1|3.0|  x|
|                   20000101|  2|2.0|  y|
|                   20000102|  2|4.0|  y|
+---------------------------+---+---+---+
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Created] (SPARK-38395) Pyspark issue in resolving column when there is dot (.)

Reply via email to