[
https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15652968#comment-15652968
]
Nicholas Chammas commented on SPARK-18367:
------------------------------------------
Even if I cut the number of records I'm processing with this new code base down
to literally 10-20 records, I'm still seeing Spark open north of 6,000 files.
This smells like a resource leak or a bug.
I will try to boil this down to a minimal repro that I can share.
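For reference, a rough sketch of one way to sanity-check that open-file count
(the helper below is illustrative and assumes the {{psutil}} package; it just
sums {{num_fds()}} across local Spark JVM processes):
{code}
import psutil

def spark_open_files():
    # Sum the file descriptors held by any local process whose command line
    # mentions org.apache.spark; processes we can't inspect are skipped.
    # num_fds() is only available on Unix-like systems (macOS/Linux).
    total = 0
    for proc in psutil.process_iter():
        try:
            if any("org.apache.spark" in arg for arg in proc.cmdline()):
                total += proc.num_fds()
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            continue
    return total

print(spark_open_files())
{code}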
> limit() makes the lame walk again
> ---------------------------------
>
> Key: SPARK-18367
> URL: https://issues.apache.org/jira/browse/SPARK-18367
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.1, 2.1.0
> Environment: Python 3.5, Java 8
> Reporter: Nicholas Chammas
> Attachments: plan-with-limit.txt, plan-without-limit.txt
>
>
> I have a complex DataFrame query that fails to run normally but succeeds if I
> add a dummy {{limit()}} upstream in the query tree.
> The failure presents itself like this:
> {code}
> ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial
> writes to file
> /private/var/folders/f5/t48vxz555b51mr3g6jjhxv400000gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
> java.io.FileNotFoundException:
> /private/var/folders/f5/t48vxz555b51mr3g6jjhxv400000gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
> (Too many open files in system)
> {code}
> My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on
> macOS. However, I don't think that's the issue, since if I add a dummy
> {{limit()}} early in the query tree (dummy in the sense that it does not actually
> reduce the number of rows queried), then the same query works.
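> To illustrate what I mean by a dummy {{limit()}}, here is a rough sketch (the
> names below are placeholders, not my actual query):
> {code}
> # Placeholder sketch: same ORC read, but with a no-op limit() inserted
> # right after the scan. 1000000 is far above the dataset's row count,
> # so the output is unchanged; only the physical plan differs.
> df = spark.read.orc(orc_path).limit(1000000)
> result = df.join(other_df, "id")
> result.write.parquet(output_path)
> {code}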
> I've diffed the physical query plans to see what this {{limit()}} is actually
> doing, and the diff is as follows:
> {code}
> diff plan-with-limit.txt plan-without-limit.txt
> 24,28c24
> < : : : +- *GlobalLimit 1000000
> < : : : +- Exchange SinglePartition
> < : : : +- *LocalLimit 1000000
> < : : : +- *Project [...]
> < : : : +- *Scan orc [...] Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
> ---
> > : : : +- *Scan orc [...] Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
> 49,53c45
> < : : +- *GlobalLimit 1000000
> < : : +- Exchange SinglePartition
> < : : +- *LocalLimit 1000000
> < : : +- *Project [...]
> < : : +- *Scan orc [...] Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
> ---
> > : : +- *Scan orc [] Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
> {code}
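> (For reference, a sketch of one way to capture these plans to text files for
> diffing; the helper below is illustrative, not the exact code I ran. PySpark's
> {{explain()}} prints to stdout, so the output has to be redirected.)
> {code}
> import contextlib
> import io
>
> def dump_plan(df, filename):
>     # explain(True) prints the logical and physical plans to stdout;
>     # capture that output and write it to a file so two runs can be
>     # compared with a plain diff.
>     buf = io.StringIO()
>     with contextlib.redirect_stdout(buf):
>         df.explain(True)
>     with open(filename, "w") as f:
>         f.write(buf.getvalue())
> {code}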
> Does this give any clues as to why this {{limit()}} is helping? Again, the
> 1000000 limit you can see in the plan is much higher than the cardinality of
> the dataset I'm reading, so there is no theoretical impact on the output. You
> can see the full query plans attached to this ticket.
> Unfortunately, I don't have a minimal reproduction for this issue, but I can
> work towards one given some clues about where to look.
> I'm seeing this behavior on 2.0.1 and on master at commit
> {{26e1c53aceee37e3687a372ff6c6f05463fd8a94}}.