[ https://issues.apache.org/jira/browse/SPARK-45879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Joseph Evans updated SPARK-45879:
----------------------------------------
    Affects Version/s: 3.4.1
                       3.2.3

> Number check for InputFileBlockSources is missing for V2 source (BatchScan)?
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-45879
>                 URL: https://issues.apache.org/jira/browse/SPARK-45879
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.3, 3.4.1, 3.5.0
>         Environment: I tried on Spark 3.2.3 and Spark 3.4.1; both reproduce this issue.
>            Reporter: Liangcai li
>            Priority: Major
>
> When a query joins two tables and calls the "input_file_name()" function, it fails
> with an *AnalysisException* if the v1 data source (FileSourceScan) is used. That is
> the expected behavior.
>
> But if we switch to the v2 data source (BatchScan), the expected exception is gone
> and the join succeeds.
> Is this number check for InputFileBlockSources missing for the V2 data source, or
> is it by design?
>
> Repro steps:
> {code:java}
> scala> spark.range(100).withColumn("const1", lit("from_t1")).write.parquet("/data/tmp/t1")
>
> scala> spark.range(100).withColumn("const2", lit("from_t2")).write.parquet("/data/tmp/t2")
>
> scala> spark.conf.set("spark.sql.sources.useV1SourceList", "parquet")
>
> scala> spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), "id", "inner").selectExpr("*", "input_file_name()").show(5, false)
> org.apache.spark.sql.AnalysisException: 'input_file_name' does not support more than one sources.; line 1 pos 0;
> Project [id#376L, const1#377, const2#381, input_file_name() AS input_file_name()#389]
> +- Project [id#376L, const1#377, const2#381]
>    +- Join Inner, (id#376L = id#380L)
>       :- Relation [id#376L,const1#377] parquet
>       +- Relation [id#380L,const2#381] parquet
>
>   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
>   at org.apache.spark.sql.execution.datasources.PreReadCheck$.org$apache$spark$sql$execution$datasources$PreReadCheck$$checkNumInputFileBlockSources(rules.scala:476)
>   at org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2(rules.scala:472)
>   at org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2$adapted(rules.scala:472)
>   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>
> scala> spark.conf.set("spark.sql.sources.useV1SourceList", "")
>
> scala> spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), "id", "inner").selectExpr("*", "input_file_name()").show(5, false)
> +---+-------+-------+---------------------------------------------------------------------------------------+
> |id |const1 |const2 |input_file_name()                                                                      |
> +---+-------+-------+---------------------------------------------------------------------------------------+
> |91 |from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
> |92 |from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
> |93 |from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
> |94 |from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
> |95 |from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
> +---+-------+-------+---------------------------------------------------------------------------------------+
> only showing top 5 rows{code}
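>
> For reference, below is a simplified sketch (not the verbatim Spark source) of the
> shape of the counting logic behind PreReadCheck.checkNumInputFileBlockSources from
> the stack trace above, plus a hypothetical extra case for V2 relations. Assuming the
> real rule only counts V1 relations (LogicalRelation over HadoopFsRelation), a
> DataSourceV2Relation would fall through to the generic LeafNode case, contribute a
> count of 0, and the multi-source check would never fire for BatchScan. The function
> name countInputFileBlockSources and the DataSourceV2Relation case are illustrative
> assumptions, not the actual Spark code or an agreed fix.
> {code:scala}
> // Paste into spark-shell. Simplified sketch of the PreReadCheck counting
> // logic; NOT the verbatim Spark source. The DataSourceV2Relation case is a
> // hypothetical fix, not current Spark behavior.
> import org.apache.spark.sql.catalyst.catalog.HiveTableRelation
> import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Union}
> import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}
> import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
>
> def countInputFileBlockSources(plan: LogicalPlan): Int = plan match {
>   case _: HiveTableRelation => 1
>   // V1 file scan: counted as one input-file source.
>   case r: LogicalRelation if r.relation.isInstanceOf[HadoopFsRelation] => 1
>   // Hypothetical fix: also count V2 relations so BatchScan joins fail the check.
>   case _: DataSourceV2Relation => 1
>   // Without the case above, a V2 relation matches here and contributes 0,
>   // so the multi-source check silently passes for BatchScan.
>   case _: LeafNode => 0
>   // The branches of a UNION never feed input_file_name() concurrently,
>   // so a union counts as at most one source.
>   case u: Union =>
>     if (u.children.map(countInputFileBlockSources).sum >= 1) 1 else 0
>   case other =>
>     val n = other.children.map(countInputFileBlockSources).sum
>     if (n > 1) {
>       // The real check raises an AnalysisException via failAnalysis here.
>       throw new IllegalStateException(
>         "'input_file_name' does not support more than one sources.")
>     }
>     n
> }
> {code}
> With the hypothetical DataSourceV2Relation case included, running this on the
> analyzed plan of the V2 join above would count two sources and fail, matching the
> V1 behavior.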