[ 
https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735710#comment-14735710
 ] 

Zhan Zhang commented on SPARK-10304:
------------------------------------

Did more investigation. Currently all files are included (_common_metadata, 
etc), for example:
/Users/zzhang/repo/spark/sql/core/target/tmp/spark-cd6c0332-c6ed-4ef5-8061-7681b895e07a/_common_metadata
/Users/zzhang/repo/spark/sql/core/target/tmp/spark-cd6c0332-c6ed-4ef5-8061-7681b895e07a/_metadata
/Users/zzhang/repo/spark/sql/core/target/tmp/spark-cd6c0332-c6ed-4ef5-8061-7681b895e07a/id=71/part-r-00001-39ef2d6e-2832-4757-ac02-0a938eb83b7d.gz.parquet

On the framework level, the partition will be retrieved from both 
/Users/zzhang/repo/spark/sql/core/target/tmp/spark-cd6c0332-c6ed-4ef5-8061-7681b895e07a/
and 
/Users/zzhang/repo/spark/sql/core/target/tmp/spark-cd6c0332-c6ed-4ef5-8061-7681b895e07a/id=71/

In this case, the framework cannot differentiate valid and invalid directory.
[~lian cheng] Can we filter all unnecessary files, e.g., _metadata, 
_common_metadata when doing the partition discovery by removing all files start 
with . or underscore? I didn't see such files are useful for partition 
discovery, but I may miss something. Otherwise, it seems to be hard to check 
the validation of the directory.
cc [~yhuai]


> Partition discovery does not throw an exception if the dir structure is 
> invalid
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-10304
>                 URL: https://issues.apache.org/jira/browse/SPARK-10304
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>            Assignee: Zhan Zhang
>            Priority: Critical
>
> I have a dir structure like {{/path/table1/partition_column=1/}}. When I try 
> to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if 
> it is stored as ORC, there will be the following NPE. But, if it is Parquet, 
> we even can return rows. We should complain to users about the dir struct 
> because {{table1}} does not meet our format.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
> stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 
> (TID 3504, 10.0.195.227): java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466)
>       at 
> org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224)
>       at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>       at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>       at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>       at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>       at scala.collection.immutable.List.foreach(List.scala:318)
>       at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>       at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>       at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261)
>       at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256)
>       at scala.Option.map(Option.scala:145)
>       at 
> org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256)
>       at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318)
>       at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316)
>       at 
> org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>       at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>       at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to