[ https://issues.apache.org/jira/browse/SPARK-13062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121590#comment-15121590 ]
Vincent Warmerdam commented on SPARK-13062:
-------------------------------------------
Target version -> my bad. Apologies!
Is there a better way to update the schema of an existing parquet file? That's
what caused me to try this pattern.
Even if the parquet action is invalid, it seems undesirable that the library
allows the action, throws away the data on HDFS as a consequence, and then
raises an error that the data cannot be found.
SPARK-4538 describes the deletion of metadata (which I believe lives in
`_metadata`); this bug reports the deletion of the parquet data itself (the
`*.gz.parquet` files). The metadata seems to remain intact when doing this.
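For the record, the workaround I've settled on is to write the re-schemaed data
to a temporary path and only swap it into place once the write succeeds, so a
failed write can never take the original data with it. A minimal sketch (the
temp path and the use of the Hadoop FileSystem API via PySpark's py4j internals
are my own choices here, not an endorsed recipe):
{code}
# Sketch: change the schema of a parquet dataset without overwriting it in place.
tmp_path = '/tmp/ddf1_tmp'   # hypothetical scratch location
dst_path = '/tmp/ddf1'

# Write the re-schemaed copy somewhere else first.
sqlCtx.read.load(dst_path, schema=ddf2.schema)\
    .write.parquet(tmp_path, mode='overwrite')

# Only after the write succeeded, swap the directories.
# Note: sc._jvm and sc._jsc are private PySpark internals.
Path = sc._jvm.org.apache.hadoop.fs.Path
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.delete(Path(dst_path), True)   # recursive delete of the old data
fs.rename(Path(tmp_path), Path(dst_path))
{code}
The worst case with this pattern is a stale temp directory rather than lost data.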
> Overwriting same file with new schema destroys original file.
> -------------------------------------------------------------
>
> Key: SPARK-13062
> URL: https://issues.apache.org/jira/browse/SPARK-13062
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.5.2
> Reporter: Vincent Warmerdam
>
> I am using Hadoop with Spark 1.5.2. Using pyspark, let's create two
> dataframes.
> {code}
> import pandas as pd
>
> ddf1 = sqlCtx.createDataFrame(pd.DataFrame({'time': [1, 2, 3],
>                                             'thing': ['a', 'b', 'b']}))
> ddf2 = sqlCtx.createDataFrame(pd.DataFrame({'time': [4, 5, 6, 7],
>                                             'thing': ['a', 'b', 'a', 'b'],
>                                             'name': ['pi', 'ca', 'chu', '!']}))
> ddf1.printSchema()
> ddf2.printSchema()
> ddf1.write.parquet('/tmp/ddf1', mode='overwrite')
> ddf2.write.parquet('/tmp/ddf2', mode='overwrite')
> sqlCtx.read.load('/tmp/ddf1', schema=ddf2.schema).show()
> sqlCtx.read.load('/tmp/ddf2', schema=ddf1.schema).show()
> {code}
> Spark does a nice thing here: you can consistently read the same data with
> different schemas, and columns missing from the data come back as null.
> {code}
> root
> |-- thing: string (nullable = true)
> |-- time: long (nullable = true)
> root
> |-- name: string (nullable = true)
> |-- thing: string (nullable = true)
> |-- time: long (nullable = true)
> +----+-----+----+
> |name|thing|time|
> +----+-----+----+
> |null| a| 1|
> |null| b| 3|
> |null| b| 2|
> +----+-----+----+
> +-----+----+
> |thing|time|
> +-----+----+
> | b| 7|
> | b| 5|
> | a| 4|
> | a| 6|
> +-----+----+
> {code}
> But here comes something naughty. Imagine that I want to update `ddf1` to
> the new schema and save it on HDFS. I'll first write it to a new path.
> {code}
> sqlCtx.read.load('/tmp/ddf1', schema=ddf1.schema)\
>     .write.parquet('/tmp/ddf1_again', mode='overwrite')
> {code}
> Nothing seems to go wrong.
> {code}
> > sqlCtx.read.load('/tmp/ddf1_again', schema=ddf2.schema).show()
> +----+-----+----+
> |name|thing|time|
> +----+-----+----+
> |null| a| 1|
> |null| b| 2|
> |null| b| 3|
> +----+-----+----+
> {code}
> But what happens when I rewrite the file in place with a new schema? The
> main difference is that I am now writing to the same path, not a different
> one.
> {code}
> sqlCtx.read.load('/tmp/ddf1_again', schema=ddf2.schema)\
>     .write.parquet('/tmp/ddf1_again', mode='overwrite')
> {code}
> I get this big error.
> {code}
> Py4JJavaError: An error occurred while calling o97.parquet.
> : org.apache.spark.SparkException: Job aborted.
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
> at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
> at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
> at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
> at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
> at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
> at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
> at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:259)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/ddf1_again/part-r-00000-052bae31-47bf-487e-873e-2753248ab7eb.gz.parquet
> ...
> {code}
> The error message seems odd: it cannot find the file. It turns out that the
> file existed before the command, but not after. Oh noes!
> {code}
> > sqlCtx.read.load('/tmp/ddf1_again', schema=ddf2.schema).show()
> +----+-----+----+
> |name|thing|time|
> +----+-----+----+
> +----+-----+----+
> {code}
> The schema still seems to be intact, but the data has been removed without
> prompting the user. I'm assuming something goes awry with the metadata when
> I rewrite a parquet file to the same path with a different schema, but the
> thrown error reads as if nothing happened to the files on HDFS, while in
> fact the data is gone.
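> My best guess at the mechanics (unverified): `mode='overwrite'` deletes the
> destination directory before the job runs, while the read side is lazy, so
> the scan feeding the write only touches its input files after they are
> already gone. If that is right, forcing the rows to materialize before the
> write should sidestep it. A sketch, not verified against 1.5.2:
> {code}
> # Unverified workaround sketch: pin the rows in memory before the overwrite
> # deletes the source files. Still fragile: if cached partitions are evicted,
> # recomputation would read from the (now deleted) files.
> df = sqlCtx.read.load('/tmp/ddf1_again', schema=ddf2.schema).cache()
> df.count()  # an action, to actually populate the cache
> df.write.parquet('/tmp/ddf1_again', mode='overwrite')
> {code}
> Writing to a fresh path and renaming afterwards remains the safer pattern.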