[ https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15652881#comment-15652881 ]

Nicholas Chammas commented on SPARK-18367:
------------------------------------------

To provide some context: the code base I'm struggling with, which triggers the 
behavior above, is a rewrite of a legacy application. The legacy application 
mostly uses RDDs and runs on 1.6. The new and shiny rewrite is DataFrame-only 
and runs on 2.0.

I ran the legacy app on the same dataset and saw Spark use fewer than 2K files. 
So, whether due to the different APIs, the different Spark version, or some 
combination thereof, there is a large difference in the number of files Spark 
opens between the legacy app and the rewrite I'm currently struggling with.

[~hvanhovell] - Can you give me a general sense of whether the open file count 
is more likely just my fault (poor API usage, for example) or a sign that Spark 
is doing something wrong? Is there anything you think I should look into?

> limit() makes the lame walk again
> ---------------------------------
>
>                 Key: SPARK-18367
>                 URL: https://issues.apache.org/jira/browse/SPARK-18367
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.1, 2.1.0
>         Environment: Python 3.5, Java 8
>            Reporter: Nicholas Chammas
>         Attachments: plan-with-limit.txt, plan-without-limit.txt
>
>
> I have a complex DataFrame query that fails to run normally but succeeds if I 
> add a dummy {{limit()}} upstream in the query tree.
> The failure presents itself like this:
> {code}
> ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /private/var/folders/f5/t48vxz555b51mr3g6jjhxv400000gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
> java.io.FileNotFoundException: /private/var/folders/f5/t48vxz555b51mr3g6jjhxv400000gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc (Too many open files in system)
> {code}
> My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on 
> macOS. However, I don't think that's the issue, since if I add a dummy 
> {{limit()}} early in the query tree -- dummy in that it does not actually 
> reduce the number of rows queried -- then the same query works.
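> To make "dummy" concrete, here is a minimal sketch of the workaround; the 
> input path is a placeholder for the real ORC dataset:
> {code}
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.getOrCreate()
>
> # Placeholder path; the real query reads an ORC dataset.
> df = spark.read.orc("file:/path/to/data")
>
> # Dummy limit: 1000000 far exceeds the dataset's cardinality, so no
> # rows are actually dropped -- yet with this in place the query succeeds.
> df = df.limit(1000000)
>
> # ...the rest of the complex query builds on df as before.
> {code}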
> I've diffed the physical query plans to see what this {{limit()}} is actually 
> doing, and the diff is as follows:
> {code}
> diff plan-with-limit.txt plan-without-limit.txt
> 24,28c24
> <    :                          :     :     +- *GlobalLimit 1000000
> <    :                          :     :        +- Exchange SinglePartition
> <    :                          :     :           +- *LocalLimit 1000000
> <    :                          :     :              +- *Project [...]
> <    :                          :     :                 +- *Scan orc [...] Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
> ---
> >    :                          :     :     +- *Scan orc [...] Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
> 49,53c45
> <                               :     :     +- *GlobalLimit 1000000
> <                               :     :        +- Exchange SinglePartition
> <                               :     :           +- *LocalLimit 1000000
> <                               :     :              +- *Project [...]
> <                               :     :                 +- *Scan orc [...] Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
> ---
> >                               :     :     +- *Scan orc [] Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
> {code}
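> One way to capture plans like the attached ones for diffing is to redirect 
> {{explain()}} output to a file; a minimal sketch, where {{df_with_limit}} and 
> {{df_without_limit}} are placeholders for the two variants of the query:
> {code}
> import contextlib
>
> # df.explain() prints the plan to stdout, so capture it to a file.
> with open("plan-with-limit.txt", "w") as f, contextlib.redirect_stdout(f):
>     df_with_limit.explain(True)  # True = extended (logical + physical) plan
>
> with open("plan-without-limit.txt", "w") as f, contextlib.redirect_stdout(f):
>     df_without_limit.explain(True)
> {code}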
> Does this give any clues as to why this {{limit()}} is helping? Again, the 
> 1000000 limit you can see in the plan is much higher than the cardinality of 
> the dataset I'm reading, so in theory it has no impact on the output. You 
> can see the full query plans attached to this ticket.
> Unfortunately, I don't have a minimal reproduction for this issue, but I can 
> work towards one given some clues.
> I'm seeing this behavior on 2.0.1 and on master at commit 
> {{26e1c53aceee37e3687a372ff6c6f05463fd8a94}}.


