Noritaka Sekiyama created SPARK-46981:
-----------------------------------------

             Summary: Driver OOM happens in query planning phase with empty tables
                 Key: SPARK-46981
                 URL: https://issues.apache.org/jira/browse/SPARK-46981
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.5.0
         Environment: * OSS Spark 3.5.0
 * Amazon EMR Spark 3.3.0 (EMR release label 6.9.0)
 * AWS Glue Spark 3.3.0 (Glue version 4.0)
            Reporter: Noritaka Sekiyama
         Attachments: create_sanitized_tables.py

We have observed that driver OOM happens in the query planning phase with empty tables when running specific query patterns.
h2. Issue details

If we run the query with the where condition {{pt>='20231004' and pt<='20231004'}}, the query fails in the planning phase due to driver OOM, more specifically {{java.lang.OutOfMemoryError: GC overhead limit exceeded}}.

If we change the where condition from {{pt>='20231004' and pt<='20231004'}} to {{pt='20231004' or pt='20231005'}}, the query runs without any error.
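For illustration, the two predicate forms side by side. The table and column names below are hypothetical placeholders; the actual table definitions come from the attached script.

```sql
-- Range form: fails in the planning phase with
-- java.lang.OutOfMemoryError: GC overhead limit exceeded
SELECT *
FROM sample_db.sample_table
WHERE pt >= '20231004' AND pt <= '20231004';

-- Equality/OR form over the same partitions: runs without error
SELECT *
FROM sample_db.sample_table
WHERE pt = '20231004' OR pt = '20231005';
```

Note that both predicates select the same (or overlapping) partition values; only the comparison operators differ.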

 

This issue occurs even with an empty table, and it happens before any actual data is loaded. This suggests the problem is on the Catalyst side.
h2. Reproduction steps

Attaching a script and a query that reproduce the issue:
 * create_sanitized_tables.py: Script to create table definitions
 * test_and_twodays_simplified.sql: Query to reproduce the issue
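As a rough sketch only: the {{Union.output}} and {{Window.output}} frames in the stack trace suggest a query shape like the following, with a Hive table partitioned by a string column {{pt}} and a union of window-function subqueries. All names here are hypothetical; the attachments contain the real definitions.

```sql
-- Hypothetical partitioned table, analogous to what
-- create_sanitized_tables.py might define
CREATE TABLE sample_db.sample_table (id BIGINT, value STRING)
PARTITIONED BY (pt STRING);

-- Hypothetical query shape: a union of windowed subqueries with the
-- range predicate that triggers the planning-phase OOM
SELECT id, ROW_NUMBER() OVER (PARTITION BY id ORDER BY value) AS rn
FROM sample_db.sample_table
WHERE pt >= '20231004' AND pt <= '20231004'
UNION ALL
SELECT id, ROW_NUMBER() OVER (PARTITION BY id ORDER BY value) AS rn
FROM sample_db.sample_table
WHERE pt >= '20231004' AND pt <= '20231004';
```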

Here's the typical stacktrace:

{{java.lang.OutOfMemoryError: GC overhead limit exceeded
    at scala.collection.immutable.Vector.iterator(Vector.scala:100)
    at scala.collection.immutable.Vector.iterator(Vector.scala:69)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at scala.collection.generic.GenericTraversableTemplate.transpose(GenericTraversableTemplate.scala:219)
    at scala.collection.generic.GenericTraversableTemplate.transpose$(GenericTraversableTemplate.scala:211)
    at scala.collection.AbstractTraversable.transpose(Traversable.scala:108)
    at org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:461)
    at org.apache.spark.sql.catalyst.plans.logical.Window.output(basicLogicalOperators.scala:1205)
    at org.apache.spark.sql.catalyst.planning.PhysicalOperation$.$anonfun$unapply$2(patterns.scala:119)
    at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$$Lambda$1874/539825188.apply(Unknown Source)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.catalyst.planning.PhysicalOperation$.unapply(patterns.scala:119)
    at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:307)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2114/1104718965.apply(Unknown Source)
    at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
    at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:70)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2117/2079515765.apply(Unknown Source)
    at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)
    at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)
    at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:192)
    at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1431)}}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
