[
https://issues.apache.org/jira/browse/SPARK-50005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17890353#comment-17890353
]
loong edited comment on SPARK-50005 at 10/17/24 8:23 AM:
---------------------------------------------------------
Spark SQL throws an exception if the output path of an overwrite is also one of
the query's input paths. The check is done by the 'verifyNotReadPath()' method
in ddl.scala. Spark SQL detects the simple case, such as:
{code:sql}
insert overwrite table output_t select * from output_t;
{code}
However, Spark SQL does not detect more complex cases where the read of the
output table is hidden in a subquery inside a filter condition, such as:
{code:sql}
insert overwrite table output_t
select * from input_t ta
where not exists (select tb.id from output_t tb where tb.id = ta.id);

insert overwrite table output_t
select * from input_t ta
where ta.id in (select id from output_t);

insert overwrite table output_t
select * from input_t ta
where ta.id < (select max(tb.id) from output_t tb where tb.id = ta.id);
{code}
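To illustrate why a check can catch the simple statement but miss these ones,
here is a minimal sketch in Python. It is a toy model, not Spark's code: the
Node class, the read_paths() and verify_not_read_path() helpers, and the
/warehouse/... paths are all hypothetical. It assumes the check only walks the
ordinary children of the plan and never looks at plans nested inside filter
expressions.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    """Toy logical-plan node (hypothetical, not a Spark class)."""
    name: str
    paths: List[str] = field(default_factory=list)          # files read by a scan node
    children: List["Node"] = field(default_factory=list)    # ordinary child plans
    subqueries: List["Node"] = field(default_factory=list)  # plans nested in expressions


def read_paths(plan: Node, include_subqueries: bool) -> Set[str]:
    """Collect every path the plan reads, optionally descending into subqueries."""
    found = set(plan.paths)
    nested = plan.children + (plan.subqueries if include_subqueries else [])
    for node in nested:
        found |= read_paths(node, include_subqueries)
    return found


def verify_not_read_path(plan: Node, output_path: str, include_subqueries: bool) -> None:
    """Raise if the query reads from the path it is about to overwrite."""
    if output_path in read_paths(plan, include_subqueries):
        raise ValueError(
            f"Cannot overwrite a path that is also being read from: {output_path}"
        )


# Toy plan for: select * from input_t ta where ta.id in (select id from output_t)
scan_output = Node("Scan output_t", paths=["/warehouse/output_t"])
scan_input = Node("Scan input_t", paths=["/warehouse/input_t"])
plan = Node("Filter", children=[scan_input], subqueries=[scan_output])

# A walk over ordinary children only does not see output_t, so nothing is raised.
verify_not_read_path(plan, "/warehouse/output_t", include_subqueries=False)

# A walk that also descends into subqueries finds the conflict.
try:
    verify_not_read_path(plan, "/warehouse/output_t", include_subqueries=True)
    detected = False
except ValueError:
    detected = True
```

In this model the only difference between the two calls is whether the
traversal visits the subquery plans, which is the gap this issue describes.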
In the scenarios above, Spark SQL instead fails at run time with
'java.io.FileNotFoundException: File does not exist', which is confusing
because it does not point to the real cause.
{panel:title=error message}
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://xxx/xxx/xxx/xxx/xxx/xxxxxx.parquet
It is possible the underlying files have been updated. You can explicitly
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in
SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:148)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:205)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:122)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:448)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1382)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:454)
... 3 more
{panel}
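A likely explanation for this particular error, stated here as an assumption
rather than something confirmed in this comment, is that the scan of output_t
is lazy: its file list is fixed at planning time, the overwrite clears the
output directory, and only afterwards do the tasks try to open the files. The
Python sketch below reproduces that ordering with plain local files. The
lazy_scan() helper and the part-00000.txt file name are hypothetical and have
nothing to do with Spark's API.

```python
import os
import shutil
import tempfile


def lazy_scan(paths):
    """Yield file contents one at a time; nothing is opened until iteration starts."""
    for path in paths:
        with open(path) as handle:
            yield handle.read()


# Create a stand-in for the table directory with one data file.
table_dir = tempfile.mkdtemp()
data_file = os.path.join(table_dir, "part-00000.txt")
with open(data_file, "w") as handle:
    handle.write("1\n2\n3\n")

# "Planning": the file list is captured, but no file has been read yet.
rows = lazy_scan([data_file])

# "Overwrite": the old contents of the output directory are removed first.
shutil.rmtree(table_dir)

# "Execution": the deferred read now fails because the file is gone.
try:
    next(rows)
    error = None
except FileNotFoundError as exc:
    error = exc
```

The failure surfaces at read time and names only the missing file, which
matches the shape of the stack trace in the panel above and explains why the
message gives no hint that the statement overwrote its own input.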
> "Throws exception if outputPath tries to overwrite inputpath." But some
> scenes are not recognized.
> --------------------------------------------------------------------------------------------------
>
> Key: SPARK-50005
> URL: https://issues.apache.org/jira/browse/SPARK-50005
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.8, 3.5.3
> Reporter: loong
> Priority: Major
>