[jira] [Commented] (SPARK-40600) Support recursiveFileLookup for partitioned datasource

Zhen Wang (Jira) Thu, 29 Sep 2022 18:16:04 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-40600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611315#comment-17611315
 ]


Zhen Wang commented on SPARK-40600:
-----------------------------------

There is a related implementation in [https://github.com/lyft/spark/pull/40].

However, in my testing I found the following issues:

1. Querying a non-partitioned table that contains a .staging subdirectory returns no data.

2. It makes some SQL parsing very slow:
{code:java}
org.apache.hadoop.fs.Path.equals(Path.java:400)
scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:137)
scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
scala.collection.LinearSeqOptimized.contains(LinearSeqOptimized.scala:105)
scala.collection.LinearSeqOptimized.contains$(LinearSeqOptimized.scala:102)
scala.collection.immutable.Stream.contains(Stream.scala:204)
org.apache.spark.sql.execution.datasources.InMemoryFileIndex.getRootPathsLeafDir(InMemoryFileIndex.scala:112)
org.apache.spark.sql.execution.datasources.InMemoryFileIndex.$anonfun$refresh0$2(InMemoryFileIndex.scala:102)
org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$Lambda$3720/369851600.apply(Unknown Source)
scala.collection.TraversableLike$grouper$1$.apply(TraversableLike.scala:465)
scala.collection.TraversableLike$grouper$1$.apply(TraversableLike.scala:455)
scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
scala.collection.TraversableLike.groupBy(TraversableLike.scala:524)
scala.collection.TraversableLike.groupBy$(TraversableLike.scala:454)
scala.collection.mutable.ArrayOps$ofRef.groupBy(ArrayOps.scala:198)
org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:102)
org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(InMemoryFileIndex.scala:69)
org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:90)
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:98)
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:78)
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
org.apache.spark.sql.catalyst.trees.TreeNode$$Lambda$1229/1024786495.apply(Unknown Source)
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317) 
{code}
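
The top frames of the trace show a linear `contains` on a Scala `Stream` of `Path` objects, called from `getRootPathsLeafDir` once per element inside `groupBy`. The sketch below is a Python analogy of that access pattern, not the code from the PR: the function names and the `/warehouse/t/dt=N` paths are made up for illustration. It counts equality checks for a sequential scan per lookup and compares that with one hash probe per lookup, which is one possible way the cost could be reduced.

```python
def count_linear_lookups(leaf_dirs, root_paths):
    """Membership test by scanning a sequence, as a linear `contains` does.

    Returns (number of leaf dirs found among the roots, equality checks made).
    """
    hits = 0
    comparisons = 0
    for leaf in leaf_dirs:
        for root in root_paths:
            comparisons += 1
            if leaf == root:
                hits += 1
                break
    return hits, comparisons


def count_hashed_lookups(leaf_dirs, root_paths):
    """Same membership test against a hash set: one probe per leaf dir."""
    root_set = set(root_paths)
    hits = sum(1 for leaf in leaf_dirs if leaf in root_set)
    return hits, len(leaf_dirs)


if __name__ == "__main__":
    # Hypothetical partition paths; a pruned partitioned table may pass many of these.
    roots = ["/warehouse/t/dt=%d" % i for i in range(1000)]
    print(count_linear_lookups(roots, roots))  # (1000, 500500)
    print(count_hashed_lookups(roots, roots))  # (1000, 1000)
```

With 1000 root paths the scan performs about 500 times as many comparisons as the hashed variant, and the gap grows with the number of paths. Whether this fully explains the slowdown reported above is an assumption; the trace only shows where the time was being spent when it was captured.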
 

 

 

> Support recursiveFileLookup for partitioned datasource
> ------------------------------------------------------
>
>                 Key: SPARK-40600
>                 URL: https://issues.apache.org/jira/browse/SPARK-40600
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.1
>         Environment: Spark: 3.1.1
> Hive: 3.1.2
>            Reporter: Zhen Wang
>            Priority: Major
>
> When I use the Hive Tez engine to execute a UNION statement and insert the 
> result into a partitioned table, Hive may generate HIVE_UNION_SUBDIR 
> subdirectories. When I then read this partitioned table with Spark SQL, the 
> data under HIVE_UNION_SUBDIR is not read.
> For a non-partitioned table, I can read the table's subdirectories by 
> setting recursiveFileLookup to true, but for a partitioned table it seems 
> impossible to set recursiveFileLookup to true.
> So I would like Spark to support recursiveFileLookup for partitioned tables.
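
To make the requested behaviour concrete, here is a minimal Python sketch of listing the data files of a single partition directory with and without recursion. It is not Spark's implementation. The directory names `HIVE_UNION_SUBDIR_1` and `HIVE_UNION_SUBDIR_2`, the file name `000000_0`, the partition `dt=2022-09-29`, and the rule that names starting with `.` or `_` are hidden are all assumptions chosen for illustration.

```python
import os
import tempfile


def is_hidden(name):
    # Assumed simplified rule: names starting with "." or "_" are skipped.
    return name.startswith(".") or name.startswith("_")


def list_partition_files(partition_dir, recursive):
    """Return data files under partition_dir, relative to it, sorted."""
    found = []
    for current, dirs, files in os.walk(partition_dir):
        if recursive:
            # Prune hidden directories in place so os.walk skips them.
            dirs[:] = [d for d in dirs if not is_hidden(d)]
        else:
            # Do not descend at all: only direct children count.
            dirs[:] = []
        for name in files:
            if not is_hidden(name):
                rel = os.path.relpath(os.path.join(current, name), partition_dir)
                found.append(rel.replace(os.sep, "/"))
    return sorted(found)


def build_demo_partition(table_root):
    """Create a hypothetical partition with union subdirectories and a staging dir."""
    part = os.path.join(table_root, "dt=2022-09-29")
    for sub in ("HIVE_UNION_SUBDIR_1", "HIVE_UNION_SUBDIR_2", ".staging"):
        os.makedirs(os.path.join(part, sub))
        with open(os.path.join(part, sub, "000000_0"), "w") as fh:
            fh.write("row\n")
    return part


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        part = build_demo_partition(root)
        print(list_partition_files(part, recursive=False))
        print(list_partition_files(part, recursive=True))
```

In this sketch the non-recursive listing finds no files, which mirrors the symptom described above, while the recursive listing finds the files in both union subdirectories and still skips `.staging`.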



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
