[ https://issues.apache.org/jira/browse/SPARK-18789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829906#comment-15829906 ]
Eugen Prokhorenko edited comment on SPARK-18789 at 1/19/17 1:22 PM:
--------------------------------------------------------------------
Hello, I would like to investigate this issue and, if I manage to get closer to its root cause, share whatever results I find. Currently, I'm just trying to write a data frame to HDFS in ORC format. Once I'm sure that works, I will proceed to saving a data frame with the problematic schema. Here's the repository in which I'm doing the investigation: https://github.com/eugenzyx/SparkOrc (Spark 2.0.2, Scala 2.11.8)

was (Author: eugenzyx): Hello, I would like to investigate this issue and, if I manage to get closer to its root cause, share whatever results I find. Currently, I'm just trying to write a data frame to HDFS. Once I'm sure that works, I will proceed to saving a data frame with the problematic schema. Here's the repository in which I'm doing the investigation: https://github.com/eugenzyx/SparkOrc (Spark 2.0.2, Scala 2.11.8)
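For reference, a minimal sketch of what such a reproduction could look like under the reported setup (Spark 2.0.2, Scala 2.11.8). The object name and output path below are illustrative placeholders, not taken from the linked repository:

    // Minimal reproduction sketch: an all-null column is typed as NullType,
    // which the Hive-based ORC writer renders as 'null' in its type string
    // and which Hive's TypeInfoParser then rejects.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.lit

    object OrcNullRepro {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("OrcNullRepro")
          .enableHiveSupport() // format("orc") in Spark 2.0.x requires Hive support
          .getOrCreate()
        import spark.implicits._

        val df = Seq(("a", 1), ("b", 1), ("c", 1), ("d", 1))
          .toDF("col1", "col2")
          .withColumn("col3", lit(null)) // col3 gets schema type NullType

        // Expected to fail with IllegalArgumentException:
        // "type expected ... but 'null' is found"
        df.write.format("orc").mode("overwrite").save("hdfs:///tmp/orc-null-repro")
      }
    }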
> Save Data frame with Null column-- exception
> --------------------------------------------
>
>                 Key: SPARK-18789
>                 URL: https://issues.apache.org/jira/browse/SPARK-18789
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.0.2
>            Reporter: Harish
>
> I am trying to save to HDFS a DataFrame in which one column is entirely NULL (no data):
>
> col1  col2  col3
> a     1     null
> b     1     null
> c     1     null
> d     1     null
>
> Code: df.write.format("orc").save(path, mode='overwrite')
> Error:
> java.lang.IllegalArgumentException: Error: type expected at the position 49 of 'string:string:string:double:string:double:string:null' but 'null' is found.
>     at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
>     at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
>     at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
>     at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
>     at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
>     at org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104)
>     at org.apache.spark.sql.hive.orc.OrcSerializer.<init>(OrcFileFormat.scala:182)
>     at org.apache.spark.sql.hive.orc.OrcOutputWriter.<init>(OrcFileFormat.scala:225)
>     at org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94)
>     at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
>     at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>     at org.apache.spark.scheduler.Task.run(Task.scala:86)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> 16/12/08 19:41:49 ERROR TaskSetManager: Task 17 in stage 512.0 failed 4 times; aborting job
> 16/12/08 19:41:49 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in stage 512.0 failed 4 times, most recent failure: Lost task 17.3 in stage 512.0 (TID 37290, 10.63.136.108): java.lang.IllegalArgumentException: Error: type expected at the position 49 of 'string:string:string:double:string:double:string:null' but 'null' is found.
>     at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
>     at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
>     at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
>     at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
>     at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
>     at org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104)
>     at org.apache.spark.sql.hive.orc.OrcSerializer.<init>(OrcFileFormat.scala:182)
>     at org.apache.spark.sql.hive.orc.OrcOutputWriter.<init>(OrcFileFormat.scala:225)
>     at org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94)
>     at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
>     at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>     at org.apache.spark.scheduler.Task.run(Task.scala:86)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
>     at scala.Option.foreach(Option.scala:257)
>     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1890)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1903)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1923)
>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:143)
>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
>     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
>     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>     at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>     at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>     at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>     at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:510)
>     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
>     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>     at py4j.Gateway.invoke(Gateway.java:280)
>     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>     at py4j.GatewayConnection.run(GatewayConnection.java:214)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.IllegalArgumentException: Error: type expected at the position 49 of 'string:string:string:double:string:double:string:null' but 'null' is found.
>     at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
>     at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
>     at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
>     at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
>     at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
>     at org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104)
>     at org.apache.spark.sql.hive.orc.OrcSerializer.<init>(OrcFileFormat.scala:182)
>     at org.apache.spark.sql.hive.orc.OrcOutputWriter.<init>(OrcFileFormat.scala:225)
>     at org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94)
>     at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
>     at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>     at org.apache.spark.scheduler.Task.run(Task.scala:86)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     ... 1 more
> 16/12/08 19:41:49 WARN TaskSetManager: Lost task 12.3 in stage 512.0 (TID 37297, 10.63.136.108): TaskKilled (killed intentionally)
> 16/12/08 19:41:49 WARN TaskSetManager: Lost task 11.3 in stage 512.0 (TID 37295, 10.63.136.108): TaskKilled (killed intentionally)
> 16/12/08 19:41:49 WARN TaskSetManager: Lost task 16.3 in stage 512.0 (TID 37296, 10.63.136.108): TaskKilled (killed intentionally)
> 16/12/08 19:41:49 WARN TaskSetManager: Lost task 8.3 in stage 512.0 (TID 37299, 10.63.136.108): TaskKilled (killed intentionally)
> 16/12/08 19:41:49 WARN TaskSetManager: Lost task 0.3 in stage 512.0 (TID 37298, 10.63.136.108): TaskKilled (killed intentionally)
> 16/12/08 19:41:49 ERROR DefaultWriterContainer: Job job_201612081941_0000 aborted.
> Traceback (most recent call last):
>   File "scripts/abc.py", line 134, in <module>
>     <<code>>
>   File "/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 547, in save
>   File "/spark-2.0.2-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
>   File "/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
>   File "/spark-2.0.2-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o1857.save.
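Until the ORC writer handles all-null (NullType) columns, a possible workaround is to cast such columns to a concrete type before writing, so the Hive type string contains no 'null' entry. A hedged sketch in Scala ("df" and "path" stand in for the reporter's data frame and output path; this has not been verified against the job above):

    // Workaround sketch (assumption, untested here): give the all-null column
    // an explicit type so the ORC schema type string becomes parseable.
    import org.apache.spark.sql.functions.col

    val fixed = df.withColumn("col3", col("col3").cast("string"))
    fixed.write.format("orc").mode("overwrite").save(path)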