[
https://issues.apache.org/jira/browse/SPARK-18789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-18789.
----------------------------------
Resolution: Cannot Reproduce
I can't reproduce this in the current master with the code below:
{code}
from pyspark.sql import Row
from pyspark.sql.types import *
df = spark.createDataFrame(
    sc.parallelize(range(1, 100)).map(lambda i: Row(single=i, square=i ** 2)))
df.write.format("orc").save("hdfs://localhost:9000/tmp/squares", mode='overwrite')
{code}
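For reference, the report below is about a column that is entirely null, and the 'null' at the end of the type string in the error suggests a NullType column reached Hive's ORC serde, which has no such type. A minimal sketch of that scenario (not the original reproduction; the path and column names are only illustrative) would be:
{code}
from pyspark.sql import Row
from pyspark.sql.functions import lit

# lit(None) without a cast gives col3 the NullType that shows up as 'null'
# in the Hive type string in the stack trace below.
df = spark.createDataFrame(
    sc.parallelize(["a", "b", "c", "d"]).map(lambda x: Row(col1=x, col2=1)))
df = df.withColumn("col3", lit(None))
df.write.format("orc").save("hdfs://localhost:9000/tmp/null_col", mode='overwrite')
{code}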
Please reopen this if you still face this problem in the current master or a
later version.
It would be great if someone could identify the JIRA that fixed this and
backport it if applicable.
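In the meantime, for anyone still on 2.0.x, a common workaround (not verified against this exact report) is to cast the all-null column to a concrete type before writing, so the ORC serde never sees NullType:
{code}
from pyspark.sql.functions import col

# 'df' and 'path' as in the report below; casting to string (or any other
# concrete type) replaces NullType in the schema before the ORC writer is
# initialized.
df = df.withColumn("col3", col("col3").cast("string"))
df.write.format("orc").save(path, mode='overwrite')
{code}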
> Save Data frame with Null column-- exception
> --------------------------------------------
>
> Key: SPARK-18789
> URL: https://issues.apache.org/jira/browse/SPARK-18789
> Project: Spark
> Issue Type: Bug
> Affects Versions: 2.0.2
> Reporter: Harish
>
> I am trying to save a DataFrame to HDFS in which one column is entirely NULL (no data).
> col1 col2 col3
> a 1 null
> b 1 null
> c 1 null
> d 1 null
> Code: df.write.format("orc").save(path, mode='overwrite')
> Error:
> java.lang.IllegalArgumentException: Error: type expected at the position 49
> of 'string:string:string:double:string:double:string:null' but 'null' is
> found.
> at
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
> at
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
> at
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
> at
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
> at
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
> at
> org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104)
> at
> org.apache.spark.sql.hive.orc.OrcSerializer.<init>(OrcFileFormat.scala:182)
> at
> org.apache.spark.sql.hive.orc.OrcOutputWriter.<init>(OrcFileFormat.scala:225)
> at
> org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94)
> at
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
> at
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 16/12/08 19:41:49 ERROR TaskSetManager: Task 17 in stage 512.0 failed 4
> times; aborting job
> 16/12/08 19:41:49 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in
> stage 512.0 failed 4 times, most recent failure: Lost task 17.3 in stage
> 512.0 (TID 37290, 10.63.136.108): java.lang.IllegalArgumentException: Error:
> type expected at the position 49 of
> 'string:string:string:double:string:double:string:null' but 'null' is found.
> at
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
> at
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
> at
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
> at
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
> at
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
> at
> org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104)
> at
> org.apache.spark.sql.hive.orc.OrcSerializer.<init>(OrcFileFormat.scala:182)
> at
> org.apache.spark.sql.hive.orc.OrcOutputWriter.<init>(OrcFileFormat.scala:225)
> at
> org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94)
> at
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
> at
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
> at
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
> at scala.Option.foreach(Option.scala:257)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1890)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1903)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1923)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:143)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
> at
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> at
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
> at
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
> at
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:510)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:214)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.IllegalArgumentException: Error: type expected at the
> position 49 of 'string:string:string:double:string:double:string:null' but
> 'null' is found.
> at
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
> at
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
> at
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
> at
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
> at
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
> at
> org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104)
> at
> org.apache.spark.sql.hive.orc.OrcSerializer.<init>(OrcFileFormat.scala:182)
> at
> org.apache.spark.sql.hive.orc.OrcOutputWriter.<init>(OrcFileFormat.scala:225)
> at
> org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94)
> at
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
> at
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> ... 1 more
> 16/12/08 19:41:49 WARN TaskSetManager: Lost task 12.3 in stage 512.0 (TID
> 37297, 10.63.136.108): TaskKilled (killed intentionally)
> 16/12/08 19:41:49 WARN TaskSetManager: Lost task 11.3 in stage 512.0 (TID
> 37295, 10.63.136.108): TaskKilled (killed intentionally)
> 16/12/08 19:41:49 WARN TaskSetManager: Lost task 16.3 in stage 512.0 (TID
> 37296, 10.63.136.108): TaskKilled (killed intentionally)
> 16/12/08 19:41:49 WARN TaskSetManager: Lost task 8.3 in stage 512.0 (TID
> 37299, 10.63.136.108): TaskKilled (killed intentionally)
> 16/12/08 19:41:49 WARN TaskSetManager: Lost task 0.3 in stage 512.0 (TID
> 37298, 10.63.136.108): TaskKilled (killed intentionally)
> 16/12/08 19:41:49 ERROR DefaultWriterContainer: Job job_201612081941_0000
> aborted.
> Traceback (most recent call last):
> File "scripts/abc.py", line 134, in <module>
> <<code>>
> File
> "/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
> line 547, in save
> File
> "/spark-2.0.2-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py",
> line 1133, in __call__
> File
> "/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py",
> line 63, in deco
> File
> "/spark-2.0.2-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
> line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o1857.save.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]