Nicholas Chammas created SPARK-18367:
----------------------------------------

             Summary: limit() makes the lame walk again
                 Key: SPARK-18367
                 URL: https://issues.apache.org/jira/browse/SPARK-18367
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.1, 2.1.0
         Environment: Python 3.5, Java 8
            Reporter: Nicholas Chammas


I have a complex DataFrame query that fails to run normally but succeeds if I 
add a dummy {{limit()}} upstream in the query tree.
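
To illustrate the shape of the workaround, here is a hypothetical sketch (not the actual query; {{spark}}, the paths, and the join key are placeholders):

{code}
# Hypothetical sketch of the workaround pattern, not the actual failing query.
# `spark` is an active SparkSession; paths and column names are placeholders.
left = spark.read.orc("/path/to/left")
right = spark.read.orc("/path/to/right")

# Fails with "Too many open files in system" during the shuffle:
# result = left.join(right, on="id").groupBy("id").count()

# Succeeds with a no-op limit() inserted upstream; 1000000 is well above the
# row count of the input, so the result is unchanged:
result = left.limit(1000000).join(right, on="id").groupBy("id").count()
result.collect()
{code}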

The failure presents itself like this:

{code}
ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /private/var/folders/f5/t48vxz555b51mr3g6jjhxv400000gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
java.io.FileNotFoundException: /private/var/folders/f5/t48vxz555b51mr3g6jjhxv400000gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc (Too many open files in system)
{code}

My {{ulimit -n}} is already set to 10,000, and I can't raise it much higher on 
macOS. However, I don't think the file-handle limit itself is the root issue, 
because if I add a dummy {{limit()}} early in the query tree -- dummy in the 
sense that it does not actually reduce the number of rows queried -- the same 
query succeeds.
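
As a side note, the effective file-descriptor limits can be checked from the Python driver process itself (a minimal standard-library sketch; the exact numbers will vary by machine):

{code}
import resource

# Soft and hard limits on open file descriptors for the current process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-file limit: soft={}, hard={}".format(soft, hard))

# The soft limit can be raised up to the hard limit without privileges; on
# macOS, kern.maxfilesperproc effectively bounds what setrlimit will accept.
resource.setrlimit(resource.RLIMIT_NOFILE, (10000, hard))
{code}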

I've diffed the physical query plans to see what this {{limit()}} is actually 
doing. The only difference is that the plan with the limit wraps each ORC scan 
in a {{LocalLimit}}, an {{Exchange SinglePartition}}, and a {{GlobalLimit}}:

{code}
diff plan-with-limit.txt plan-without-limit.txt
24,28c24
<    :                          :     :     +- *GlobalLimit 1000000
<    :                          :     :        +- Exchange SinglePartition
<    :                          :     :           +- *LocalLimit 1000000
<    :                          :     :              +- *Project [...]
<    :                          :     :                 +- *Scan orc [...] Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
---
>    :                          :     :     +- *Scan orc [...] Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
49,53c45
<                               :     :     +- *GlobalLimit 1000000
<                               :     :        +- Exchange SinglePartition
<                               :     :           +- *LocalLimit 1000000
<                               :     :              +- *Project [...]
<                               :     :                 +- *Scan orc [...] Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
---
>                               :     :     +- *Scan orc [] Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
{code}
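
For reference, the plans above were captured from {{DataFrame.explain()}}, which prints to stdout; here is a minimal sketch of how such a diff can be produced ({{capture_plan}} and the two query variables are just for illustration):

{code}
import contextlib
import io

def capture_plan(df):
    # df.explain() prints the physical plan to stdout; capture it as a string.
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain()
    return buf.getvalue()

# query_with_limit / query_without_limit are the two variants of the query.
with open("plan-with-limit.txt", "w") as f:
    f.write(capture_plan(query_with_limit))
with open("plan-without-limit.txt", "w") as f:
    f.write(capture_plan(query_without_limit))
# Then compare: diff plan-with-limit.txt plan-without-limit.txt
{code}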

Does this give any clues as to why this {{limit()}} helps? Again, the 1000000 
limit visible in the plan is much higher than the cardinality of the dataset 
I'm reading, so it has no effect on the output. The full query plans are 
attached to this ticket.

Unfortunately, I don't have a minimal reproduction for this issue yet, but I 
can work towards one given some pointers.


