[jira] [Comment Edited] (SPARK-50005) "Throws exception if outputPath tries to overwrite inputpath." But some scenes are not recognized.

loong (Jira) Thu, 17 Oct 2024 01:25:02 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-50005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17890353#comment-17890353
 ]


loong edited comment on SPARK-50005 at 10/17/24 8:23 AM:
---------------------------------------------------------

SparkSQL throws an exception if the output path of a write would overwrite one 
of its input paths. The check is done by the validation method 
'verifyNotReadPath()' in ddl.scala. SparkSQL can identify simple scenarios such as:
{code:sql}
insert overwrite table output_t select * from output_t;
{code}
However, SparkSQL cannot identify more complex scenarios where the read of the 
output table is hidden in a subquery within the filter conditions, such as:
{code:sql}
insert overwrite table output_t select * from input_t ta where not exists (select tb.id from output_t tb where tb.id = ta.id);
insert overwrite table output_t select * from input_t ta where ta.id in (select id from output_t);
insert overwrite table output_t select * from input_t ta where ta.id < (select max(tb.id) from output_t tb where tb.id = ta.id);
{code}
In the scenarios above, the validation does not reject the statement. Instead, 
SparkSQL fails with 'java.io.FileNotFoundException: File does not exist', 
which can be confusing.


{panel:title=error message}
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://xxx/xxx/xxx/xxx/xxx/xxxxxx.parquet
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:148)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:205)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:122)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:448)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1382)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:454)
        ... 3 more
{panel}
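
One plausible explanation for the gap, offered here as an assumption and not as a description of Spark's actual code, is that the validation collects input paths only from the main plan tree and never descends into plans nested inside filter expressions. The toy model below sketches that idea in Python. The `Plan` class, the `input_paths` and `verify_not_read_path` functions, and the `/warehouse/...` paths are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Plan:
    """Toy logical-plan node (hypothetical, not a Spark class)."""
    name: str
    path: Optional[str] = None  # set only for table scans
    children: List["Plan"] = field(default_factory=list)
    # Plans nested inside this node's expressions (EXISTS, IN, scalar subquery).
    subqueries: List["Plan"] = field(default_factory=list)


def input_paths(plan: Plan, include_subqueries: bool) -> Set[str]:
    """Collect scan paths, optionally descending into expression subqueries."""
    paths = {plan.path} if plan.path is not None else set()
    nested = plan.children + (plan.subqueries if include_subqueries else [])
    for node in nested:
        paths |= input_paths(node, include_subqueries)
    return paths


def verify_not_read_path(plan: Plan, output_path: str, include_subqueries: bool) -> None:
    """Reject the write if the output path is among the collected input paths."""
    if output_path in input_paths(plan, include_subqueries):
        raise ValueError(f"output path is also an input path: {output_path}")


# Hypothetical table locations.
OUTPUT = "/warehouse/output_t"
INPUT = "/warehouse/input_t"

# insert overwrite table output_t select * from output_t
simple = Plan("Project", children=[Plan("Scan output_t", path=OUTPUT)])

# insert overwrite table output_t select * from input_t ta
#   where ta.id in (select id from output_t)
hidden = Plan(
    "Filter",
    children=[Plan("Scan input_t", path=INPUT)],
    subqueries=[Plan("Scan output_t", path=OUTPUT)],
)
```

In this model, `verify_not_read_path(simple, OUTPUT, False)` raises, while `verify_not_read_path(hidden, OUTPUT, False)` passes silently because the scan of `output_t` lives only in the filter's subquery. Passing `True` for `include_subqueries` makes the second case raise as well, which is the behavior this issue asks for.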





> "Throws exception if outputPath tries to overwrite inputpath." But some 
> scenes are not recognized.
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-50005
>                 URL: https://issues.apache.org/jira/browse/SPARK-50005
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.8, 3.5.3
>            Reporter: loong
>            Priority: Major
>



