[ 
https://issues.apache.org/jira/browse/SPARK-26240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26240.
----------------------------------
    Resolution: Incomplete

Resolving this due to no feedback from reporter.
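
For the record, the AnalysisException quoted below itself suggests using an alias to rename the offending attribute. A minimal sketch of that alias-based rename, using a hypothetical `sanitize_column_name` helper that replaces the characters the error message lists as invalid (" ,;{}()\n\t=") with underscores; the Spark calls are shown as comments since they assume a running pyspark shell:

```python
# Characters Parquet rejects in attribute names, per the
# AnalysisException message in the report below.
INVALID_PARQUET_CHARS = " ,;{}()\n\t="


def sanitize_column_name(name):
    """Replace every Parquet-illegal character in a column name with '_'."""
    return "".join("_" if ch in INVALID_PARQUET_CHARS else ch for ch in name)


# In the pyspark shell (illustrative, paths hypothetical):
#   df = spark.read.parquet('/path/to/file.parquet')
#   df = df.select([df[c].alias(sanitize_column_name(c)) for c in df.columns])
#   df.show()

print(sanitize_column_name("Unnamed: 14"))  # Unnamed:_14
```

Whether the alias route avoids the failure on the affected 2.2.1 build was not confirmed by the reporter.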

> [pyspark] Updating illegal column names with withColumnRenamed does not 
> change the schema, causing pyspark.sql.utils.AnalysisException
> ------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-26240
>                 URL: https://issues.apache.org/jira/browse/SPARK-26240
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>         Environment: Ubuntu 16.04 LTS (x86_64/deb)
>  
>            Reporter: Ying Wang
>            Priority: Major
>
> I am unfamiliar with the internals of Spark, but I tried to ingest a Parquet 
> file with illegal column headers. After calling df = 
> df.withColumnRenamed($COLUMN_NAME, $NEW_COLUMN_NAME) and then df.show(), 
> pyspark errored out, with the failing attribute in the error being the old 
> column name.
> Steps to reproduce:
> - Create a Parquet file from Pandas using this dataframe schema:
> ```python
> In [10]: df.info()
> <class 'pandas.core.frame.DataFrame'>
> Int64Index: 1000 entries, 0 to 999
> Data columns (total 16 columns):
> Record_ID 1000 non-null int64
> registration_dttm 1000 non-null object
> id 1000 non-null int64
> first_name 984 non-null object
> last_name 1000 non-null object
> email 984 non-null object
> gender 933 non-null object
> ip_address 1000 non-null object
> cc 709 non-null float64
> country 1000 non-null object
> birthdate 803 non-null object
> salary 932 non-null float64
> title 803 non-null object
> comments 179 non-null object
> Unnamed: 14 10 non-null object
> Unnamed: 15 9 non-null object
> dtypes: float64(2), int64(2), object(12)
> memory usage: 132.8+ KB
> ```
> - Open a pyspark shell with `pyspark` and read in the Parquet file with 
> `spark.read.format('parquet').load('/path/to/file.parquet')`.
> - Call `spark_df.show()` and note the error with column 'Unnamed: 14'.
> - Rename the column, replacing the illegal space character with an 
> underscore: `spark_df = spark_df.withColumnRenamed('Unnamed: 14', 'Unnamed:_14')`.
> - Call `spark_df.show()` again, and note that the error message still refers 
> to attribute 'Unnamed: 14':
> ```python
> >>> df = spark.read.parquet('/home/yingw787/Downloads/userdata1.parquet')
> >>> newdf = df.withColumnRenamed('Unnamed: 14', 'Unnamed:_14')
> >>> newdf.show()
> Traceback (most recent call last):
>  File 
> "/home/yingw787/anaconda2/envs/scratch/lib/python3.6/site-packages/pyspark/sql/utils.py",
>  line 63, in deco
>  return f(*a, **kw)
>  File 
> "/home/yingw787/anaconda2/envs/scratch/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py",
>  line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o32.showString.
> : org.apache.spark.sql.AnalysisException: Attribute name "Unnamed: 14" 
> contains invalid character(s) among " ,;{}()\n\t=". Please use alias to 
> rename it.;
> ...
> ```
> I would have thought there would be a way to read in Parquet files such that 
> illegal column names can be changed after the Spark dataframe is generated, 
> so this looks like unintended behavior. Please let me know if I am wrong.
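
As an aside on where such names come from: 'Unnamed: N' is the name pandas assigns to empty header cells (e.g. trailing commas in a CSV header row), which is a common way these Parquet-illegal column names arise in the first place. A minimal sketch with illustrative data:

```python
import io

import pandas as pd

# Two trailing commas in the header row yield two columns with no name;
# pandas fills them in as 'Unnamed: 2' and 'Unnamed: 3'.
csv = io.StringIO("id,first_name,,\n1,Amanda,x,y\n")
df = pd.read_csv(csv)
print(list(df.columns))  # ['id', 'first_name', 'Unnamed: 2', 'Unnamed: 3']
```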



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
