[https://issues.apache.org/jira/browse/SPARK-27442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400478#comment-17400478]
Dror Speiser commented on SPARK-27442:
--------------------------------------
Hey, I'm going over the Parquet format specification (the GitHub page and the
Thrift file), and I don't see any mention of valid or invalid characters for
field names in schema elements. Was this a restriction in earlier format
specifications?
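For context, the restriction appears to come from Spark rather than the format
itself: the error message quoted below lists the rejected characters. A minimal
sketch of that name check (the helper name here is hypothetical; Spark's actual
check lives in its Parquet schema conversion code):

```python
# Characters listed in the AnalysisException message:
# "invalid character(s) among \" ,;{}()\\n\\t=\""
INVALID_CHARS = set(" ,;{}()\n\t=")

def has_invalid_chars(field_name: str) -> bool:
    """Return True if Spark's check would reject this field name."""
    return any(ch in INVALID_CHARS for ch in field_name)

print(has_invalid_chars("my column"))   # True: contains a space
print(has_invalid_chars("my_column"))   # False: no rejected characters
print(has_invalid_chars("a=b"))         # True: contains '='
```

Note this set says nothing about what the Parquet format itself permits, which
is exactly the question above.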
> ParquetFileFormat fails to read column named with invalid characters
> --------------------------------------------------------------------
>
> Key: SPARK-27442
> URL: https://issues.apache.org/jira/browse/SPARK-27442
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 2.0.0, 2.4.1
> Reporter: Jan Vršovský
> Priority: Minor
>
> When reading a Parquet file whose column names contain characters considered
> invalid, the reader fails with the exception:
> Name: org.apache.spark.sql.AnalysisException
> Message: Attribute name "..." contains invalid character(s) among "
> ,;{}()\n\t=". Please use alias to rename it.
> Spark should not be able to write such files, but it should be able to read
> them (and allow the user to correct them). However, possible workarounds (such
> as using an alias to rename the column, or forcing another schema) do not
> work, since the check is also applied on the read path.
> (Possible fix: remove the superfluous
> {{ParquetWriteSupport.setSchema(requiredSchema, hadoopConf)}} call from
> {{buildReaderWithPartitionValues}}?)
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]