[
https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735710#comment-14735710
]
Zhan Zhang commented on SPARK-10304:
------------------------------------
Did more investigation. Currently, all files are included during partition
discovery (_common_metadata, _metadata, etc.), for example:
/Users/zzhang/repo/spark/sql/core/target/tmp/spark-cd6c0332-c6ed-4ef5-8061-7681b895e07a/_common_metadata
/Users/zzhang/repo/spark/sql/core/target/tmp/spark-cd6c0332-c6ed-4ef5-8061-7681b895e07a/_metadata
/Users/zzhang/repo/spark/sql/core/target/tmp/spark-cd6c0332-c6ed-4ef5-8061-7681b895e07a/id=71/part-r-00001-39ef2d6e-2832-4757-ac02-0a938eb83b7d.gz.parquet
At the framework level, partitions will be retrieved from both
/Users/zzhang/repo/spark/sql/core/target/tmp/spark-cd6c0332-c6ed-4ef5-8061-7681b895e07a/
and
/Users/zzhang/repo/spark/sql/core/target/tmp/spark-cd6c0332-c6ed-4ef5-8061-7681b895e07a/id=71/
In this case, the framework cannot differentiate between valid and invalid
directories.
[~lian cheng] Can we filter out all unnecessary files, e.g., _metadata and
_common_metadata, during partition discovery by removing all files starting
with . or underscore? I don't see how such files are useful for partition
discovery, but I may be missing something. Otherwise, it seems hard to
validate the directory structure.
cc [~yhuai]
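A minimal sketch of the proposed filter (the object and method names below are hypothetical illustrations, not Spark's actual partition-discovery API): skip any path whose base name starts with '.' or '_' before inferring partitions.

```scala
// Hypothetical sketch of the filter proposed above; not Spark's real API.
object PartitionDiscoveryFilter {
  // A path should be ignored during partition discovery when its base
  // name starts with '_' or '.' (e.g. _metadata, _common_metadata).
  def isDataPath(path: String): Boolean = {
    val name = path.split('/').last
    !name.startsWith("_") && !name.startsWith(".")
  }

  def main(args: Array[String]): Unit = {
    val files = Seq(
      "/tmp/spark-cd6c0332/_common_metadata",
      "/tmp/spark-cd6c0332/_metadata",
      "/tmp/spark-cd6c0332/id=71/part-r-00001.gz.parquet"
    )
    // Only the data file under id=71 survives, so the summary files at
    // the table root no longer confuse directory validation.
    files.filter(isDataPath).foreach(println)
  }
}
```

With such a filter in place, only paths like {{id=71/part-r-00001-...gz.parquet}} would feed partition inference, and the table root could then be rejected as invalid when it contains no partition columns.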
> Partition discovery does not throw an exception if the dir structure is
> invalid
> -------------------------------------------------------------------------------
>
> Key: SPARK-10304
> URL: https://issues.apache.org/jira/browse/SPARK-10304
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Yin Huai
> Assignee: Zhan Zhang
> Priority: Critical
>
> I have a dir structure like {{/path/table1/partition_column=1/}}. When I try
> to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if
> it is stored as ORC, I get the following NPE. But if it is Parquet, we can
> even return rows. We should complain to users about the dir structure
> because {{table1}} does not match our expected format.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 (TID 3504, 10.0.195.227): java.lang.NullPointerException
> at org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466)
> at org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224)
> at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
> at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261)
> at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256)
> at scala.Option.map(Option.scala:145)
> at org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256)
> at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318)
> at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316)
> at org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)