[ 
https://issues.apache.org/jira/browse/SPARK-15393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311488#comment-15311488
 ] 

Jie Huang commented on SPARK-15393:
-----------------------------------

Yes. you are right. it really depends on the use case. If there is no data and 
the user still wants to get the schema info, we have to output certain schema 
file. 

> Writing empty Dataframes doesn't save any _metadata files
> ---------------------------------------------------------
>
>                 Key: SPARK-15393
>                 URL: https://issues.apache.org/jira/browse/SPARK-15393
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Jurriaan Pruis
>            Priority: Critical
>
> Writing empty dataframes is broken on latest master.
> It omits the metadata and sometimes throws the following exception (when 
> saving as parquet):
> {code}
> 8-May-2016 22:37:14 WARNING: 
> org.apache.parquet.hadoop.ParquetOutputCommitter: could not write summary 
> file for file:/some/test/file
> java.lang.NullPointerException
>     at 
> org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:456)
>     at 
> org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:420)
>     at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
>     at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
>     at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:220)
>     at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:144)
>     at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:115)
>     at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:115)
>     at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>     at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:115)
>     at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
>     at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
>     at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
>     at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>     at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>     at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>     at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
>     at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
>     at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:417)
>     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:252)
>     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:234)
>     at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:626)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>     at py4j.Gateway.invoke(Gateway.java:280)
>     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>     at py4j.GatewayConnection.run(GatewayConnection.java:211)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> It only saves an _SUCCESS file (which is also incorrect behaviour, because it 
> raised an exception).
> This means that loading it again will result in the following error:
> {code}
> Unable to infer schema for ParquetFormat at /some/test/file. It must be 
> specified manually;'
> {code}
> It looks like this problem was introduced in 
> https://github.com/apache/spark/pull/12855 (SPARK-10216).
> After reverting those changes I could save the empty dataframe as parquet and 
> load it again without Spark complaining or throwing any exceptions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to