Jason Piper created SPARK-13455:
-----------------------------------

             Summary: Periods in dataframe column names breaks df.drop(<string>)
                 Key: SPARK-13455
                 URL: https://issues.apache.org/jira/browse/SPARK-13455
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 1.6.0
         Environment: Spark 1.6.0 installed via homebrew
            Reporter: Jason Piper
            Priority: Minor


Calling the .drop method with a string on a DataFrame that contains a column 
name with a period in it raises an AnalysisException, even when the column 
being dropped is a different one. This doesn't happen when dropping using the 
column object itself.

{code}
>>> import json
>>> ds = {'a': "test", "b.no": "testagain"}
>>> df = sqlContext.jsonRDD(sc.parallelize([json.dumps(ds)]))
>>> df.drop('a')
{code}

yields

{code}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/pyspark/sql/dataframe.py", line 1347, in drop
    jdf = self._jdf.drop(col)
  File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/pyspark/sql/utils.py", line 51, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve 'b.no' given input columns a, b.no;"
{code}

whereas dropping by column object works:

{code}
>>> df.drop(df.a)
DataFrame[b.no: string]
{code}

The current workaround, if you want to drop a column using a string, is:

{code}
>>> df.drop(df.select("a")[0])
DataFrame[b.no: string]
{code}
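
Another possible workaround, not part of the original report, is to select every column except the one being dropped, with each name wrapped in backticks so that Spark SQL reads the period literally instead of as a field accessor. The sketch below only builds the quoted names. The helpers quote_column and columns_without are hypothetical names introduced for illustration, and doubling an embedded backtick to escape it is an assumption that may not hold on every Spark version.

```python
def quote_column(name):
    """Wrap a column name in backticks so a period is read literally."""
    # Assumption: a backtick inside the name is escaped by doubling it.
    return "`" + name.replace("`", "``") + "`"


def columns_without(all_columns, to_drop):
    """Return the backtick-quoted names of every column except to_drop."""
    return [quote_column(c) for c in all_columns if c != to_drop]


# Column names taken from the example DataFrame in this report.
kept = columns_without(["a", "b.no"], "a")
print(kept)
```

With the DataFrame above, this would presumably be used as df.select(*columns_without(df.columns, 'a')), which avoids passing the unquoted name 'b.no' to the analyzer.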


