[jira] [Commented] (SPARK-6189) Pandas to DataFrame conversion should check field names for periods

Kalle Jepsen (JIRA) Mon, 30 Mar 2015 05:29:52 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386625#comment-14386625
 ]


Kalle Jepsen commented on SPARK-6189:
-------------------------------------

I do not really understand why the column names have to be accessible directly 
as attributes anyway. What advantage does this have over indexing? It 
basically restricts us to the ASCII character set for column names, doesn't it? 
Data in the wild may have all kinds of weird field names, including special 
characters, umlauts, accents and whatnot. Automatic renaming isn't very nice 
either, for the very reason already pointed out by mgdadv. Also, we cannot 
simply replace all illegal characters with underscores: the fields {{'ä.ö'}} and 
{{'ä.ü'}} would both be renamed to {{'___'}}. Besides, leading underscores have 
a somewhat special meaning in Python, potentially resulting in further 
confusion.
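
To illustrate the collision, here is a minimal sketch. The helper 
{{naive_sanitize}} is hypothetical and not part of Spark or Pandas; it 
assumes "illegal" means anything outside ASCII letters, digits and the 
underscore.

```python
import re

def naive_sanitize(name):
    # Hypothetical helper: replace every character outside [A-Za-z0-9_]
    # with an underscore, one underscore per replaced character.
    return re.sub(r'[^A-Za-z0-9_]', '_', name)

# 'ä.ö' and 'ä.ü', written with escapes so the source encoding does not matter.
names = ['\u00e4.\u00f6', '\u00e4.\u00fc']
renamed = [naive_sanitize(n) for n in names]

# Both distinct inputs end up with the same name, so one column would
# silently shadow the other.
collision = len(set(renamed)) < len(set(names))
```

Here {{renamed}} is {{['___', '___']}} and {{collision}} is {{True}}.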

I think {{df\['a.b'\]}} should definitely work, even if the column names 
contain non-ASCII characters. In addition, a warning should be issued when 
creating the DataFrame, informing the user that direct column access via 
attribute name will not work with the given column names.
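
Such a warning could look roughly like the sketch below. Both function names 
are made up for illustration and nothing here is existing Spark API. It 
assumes that only ASCII identifiers that are not Python keywords count as 
attribute-safe, which is the stricter Python 2 rule.

```python
import keyword
import re
import warnings

# Assumption: attribute-safe means an ASCII identifier (Python 2 rule).
_ATTR_SAFE = re.compile(r'[A-Za-z_][A-Za-z0-9_]*\Z')

def attribute_unsafe_columns(columns):
    """Return the column names that cannot be reached as df.<name>."""
    return [c for c in columns
            if not _ATTR_SAFE.match(c) or keyword.iskeyword(c)]

def warn_on_unsafe_columns(columns):
    """Hypothetical hook to call when the DataFrame is created."""
    unsafe = attribute_unsafe_columns(columns)
    if unsafe:
        warnings.warn(
            "Columns %r are not accessible as attributes; "
            "use df['name'] indexing instead." % (unsafe,),
            UserWarning)
    return unsafe

unsafe = attribute_unsafe_columns(['GNP', 'GNP.deflator', '\u00e4.\u00f6'])
```

The names are left untouched; the user is only told that indexing is the way 
to reach the flagged columns.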

> Pandas to DataFrame conversion should check field names for periods
> -------------------------------------------------------------------
>
>                 Key: SPARK-6189
>                 URL: https://issues.apache.org/jira/browse/SPARK-6189
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> Issue I ran into:  I imported an R dataset in CSV format into a Pandas 
> DataFrame and then used toDF() to convert that into a Spark DataFrame.  The R 
> dataset had a column with a period in it (column "GNP.deflator" in the 
> "longley" dataset).  When I tried to select it using the Spark DataFrame DSL, 
> I could not because the DSL thought the period was selecting a field within 
> GNP.
> Also, since "GNP" is another field's name, it gives an error which could be 
> obscure to users, complaining:
> {code}
> org.apache.spark.sql.AnalysisException: GetField is not valid on fields of 
> type DoubleType;
> {code}
> We should either handle periods in column names or check during loading and 
> warn/fail gracefully.
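
A check at load time, as suggested in the issue description, might be 
sketched as follows. The function {{find_period_conflicts}} is hypothetical 
and the column {{'x.y'}} is invented for the example. The sketch only 
inspects the names and assumes the part before the first period is what the 
DSL would try to resolve as a column.

```python
def find_period_conflicts(columns):
    """Map each column name containing a period to the existing column
    that its prefix matches, or to None if no such column exists."""
    existing = set(columns)
    conflicts = {}
    for name in columns:
        if '.' in name:
            prefix = name.split('.', 1)[0]
            # A matching prefix is the case from the report: "GNP" exists,
            # so "GNP.deflator" reads like a field access on column GNP.
            conflicts[name] = prefix if prefix in existing else None
    return conflicts

report = find_period_conflicts(['GNP.deflator', 'GNP', 'x.y'])
```

Here {{report}} is {{{'GNP.deflator': 'GNP', 'x.y': None}}}. A loader could 
warn on any entry, and fail or warn more loudly when the value is not 
{{None}}.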



