[ 
https://issues.apache.org/jira/browse/SPARK-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-9278.
---------------------------------
    Resolution: Not A Problem

I tried to reproduce the code above.

{code}
import pandas

pdf = pandas.DataFrame({'pk': ['a']*5+['b']*5+['c']*5, 'k': ['a', 'e', 'i', 
'o', 'u']*3, 'v': range(15)})
sdf = spark.createDataFrame(pdf)
sdf.filter('FALSE').write.partitionBy('pk').saveAsTable('foo', 
format='parquet', path='/tmp/tmptable')
sdf.filter(sdf.pk == 'a').write.partitionBy('pk').insertInto('foo')
foo = spark.table('foo')
foo.show()
{code}

It seems it now produces an exception, as below:

{code}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/readwriter.py", line 606, in insertInto
    self._jwrite.mode("overwrite" if overwrite else 
"append").insertInto(tableName)
  File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
1133, in __call__
  File ".../spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"insertInto() can't be used together with 
partitionBy(). Partition columns have already be defined for the table. It is 
not necessary to use partitionBy().;"
{code}

I am resolving this per ...

{quote}
If the issue seems clearly obsolete and applies to issues or components that 
have changed radically since it was opened, resolve as Not a Problem
{quote}

Please reopen this if I was mistaken.
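
As a side note, the reporter's observation that reordering the columns avoids the bad rows is consistent with {{insertInto}} resolving columns by position rather than by name. The following plain-Python sketch (no Spark required; the names {{table_schema}}, {{source_columns}}, etc. are illustrative, not Spark API) shows why position-based matching silently corrupts data when the source column order differs from the table schema:

```python
# Table schema as created by saveAsTable: pk, k, v
table_schema = ['pk', 'k', 'v']

# A source row whose columns arrive in a different order.
source_columns = ['k', 'v', 'pk']
source_row = ('a', 0, 'a_pk')

# Position-based resolution (insertInto-style): column names are
# ignored, so values land in the wrong columns.
position_based = dict(zip(table_schema, source_row))
# {'pk': 'a', 'k': 0, 'v': 'a_pk'}  -- corrupted

# Name-based resolution: values are looked up by column name,
# so the reordered source still produces correct rows.
name_based = {col: source_row[source_columns.index(col)]
              for col in table_schema}
# {'pk': 'a_pk', 'k': 'a', 'v': 0}  -- correct

print(position_based)
print(name_based)
```

This is why reordering the DataFrame's columns to match the table schema before calling {{insertInto}} works around the problem.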

> DataFrameWriter.insertInto inserts incorrect data
> -------------------------------------------------
>
>                 Key: SPARK-9278
>                 URL: https://issues.apache.org/jira/browse/SPARK-9278
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>         Environment: Linux, S3, Hive Metastore
>            Reporter: Steve Lindemann
>            Assignee: Cheng Lian
>            Priority: Critical
>
> After creating a partitioned Hive table (stored as Parquet) via the 
> DataFrameWriter.createTable command, subsequent attempts to insert additional 
> data into new partitions of this table result in inserting incorrect data 
> rows. Reordering the columns in the data to be written seems to avoid this 
> issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
