[
https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357610#comment-14357610
]
mgdadv commented on SPARK-6189:
-------------------------------
While the dot is legal in R and SQL, I don't think there is a nice way of
making it legal in Python. So at least in the Spark Python code, I think
something should be done about it.
I just realized that the automatic renaming can cause problems if the new name
already exists. For example, what if GNP_deflator is already in the data set
and GNP.deflator then gets renamed to it?
I think the best thing to do is to leave the names alone and just print a
warning message for the user. I have changed the patch accordingly.
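As a rough illustration of that kind of check (this is a sketch, not the code
in the patch; the function name warn_on_dotted_columns and the underscore
suggestion are assumptions made for the example), the column names could be
scanned before conversion like this:

```python
import warnings


def warn_on_dotted_columns(columns):
    """Warn about column names containing a period; return the offending names.

    `columns` is any iterable of column names, e.g. a pandas `df.columns`.
    Hypothetical helper for illustration only; nothing is renamed.
    """
    names = [str(c) for c in columns]
    dotted = [n for n in names if "." in n]
    for name in dotted:
        renamed = name.replace(".", "_")
        if renamed in names:
            # Automatic renaming would clash with an existing column.
            detail = "renaming it to %r would collide with an existing column" % renamed
        else:
            detail = "consider renaming it, e.g. to %r" % renamed
        warnings.warn("Column name %r contains a period; %s" % (name, detail))
    return dotted
```

Calling it with ["GNP.deflator", "GNP_deflator", "GNP"] would warn once about
GNP.deflator and point out that GNP_deflator is already taken.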
Here is some example code for pyspark:
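import StringIO  # Python 2 module; on Python 3 use io.StringIO instead
import pandas as pd
# Build a pandas DataFrame with the columns "a.b", "a" and "c"
df = pd.read_csv(StringIO.StringIO("a.b,a,c\n101,102,103\n201,202,203"))
# Convert it to a Spark DataFrame (sqlCtx is an existing SQLContext)
spdf = sqlCtx.createDataFrame(df)
spdf.take(2)
# Filtering on the plain column "a" works
spdf[spdf.a==102].take(2)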
So far this works, but the following fails, because spdf.a.b is read as field b
of column a rather than as the column "a.b":
spdf[spdf.a.b==101].take(2)
In pandas df.a.b doesn't work either, but the column can be accessed via the
string "a.b", i.e.:
df["a.b"]
> Pandas to DataFrame conversion should check field names for periods
> -------------------------------------------------------------------
>
> Key: SPARK-6189
> URL: https://issues.apache.org/jira/browse/SPARK-6189
> Project: Spark
> Issue Type: Improvement
> Components: DataFrame, SQL
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
> Priority: Minor
>
> Issue I ran into: I imported an R dataset in CSV format into a Pandas
> DataFrame and then used toDF() to convert that into a Spark DataFrame. The R
> dataset had a column with a period in it (column "GNP.deflator" in the
> "longley" dataset). When I tried to select it using the Spark DataFrame DSL,
> I could not because the DSL thought the period was selecting a field within
> GNP.
> Also, since "GNP" is another field's name, it gives an error which could be
> obscure to users, complaining:
> {code}
> org.apache.spark.sql.AnalysisException: GetField is not valid on fields of
> type DoubleType;
> {code}
> We should either handle periods in column names or check during loading and
> warn/fail gracefully.