Steve Loughran created SPARK-17159:
--------------------------------------
Summary: Improve FileInputDStream.findNewFiles list performance
Key: SPARK-17159
URL: https://issues.apache.org/jira/browse/SPARK-17159
Project: Spark
Issue Type: Improvement
Components: Streaming
Affects Versions: 2.0.0
Environment: spark against object stores
Reporter: Steve Loughran
Priority: Minor
{{FileInputDStream.findNewFiles()}} does a globStatus() with a filter that
calls getFileStatus() on every file, then takes the output and calls
listStatus() on it.
This is going to suffer on object stores, where directory listing and
getFileStatus() calls are expensive. It's clear this is already a problem, as
the method has code to detect timeouts in the window and warn of problems.
It should be possible to make this faster.
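As a rough illustration of where the cost comes from, here is a small Python model, not Spark or Hadoop code. Everything in it (CountingStore, the two find_new_files_* functions, the SAMPLE paths) is hypothetical, and it assumes a glob costs one remote call. It compares a filter that re-fetches the status of every path against one that reuses the status objects the listing already returned.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class FileStatus:
    path: str
    is_dir: bool
    mod_time: int


class CountingStore:
    """Hypothetical in-memory stand-in for an object store; counts remote calls."""

    def __init__(self, entries):
        self._entries = {e.path: e for e in entries}
        self.calls = 0

    def get_file_status(self, path):
        self.calls += 1
        return self._entries[path]

    def list_status(self, directory):
        self.calls += 1
        prefix = directory.rstrip("/") + "/"
        # Direct children only: nothing deeper than one level below the directory.
        return [e for p, e in self._entries.items()
                if p.startswith(prefix) and "/" not in p[len(prefix):]]

    def glob_status(self, pattern):
        self.calls += 1  # assumption: modelled as a single remote call
        depth = pattern.count("/")
        return [e for p, e in self._entries.items()
                if p.count("/") == depth and fnmatch(p, pattern)]


def find_new_files_naive(store, pattern, threshold):
    # The filter only sees paths, so it asks the store again for every one.
    dirs = [s.path for s in store.glob_status(pattern)
            if store.get_file_status(s.path).is_dir]
    new_files = []
    for d in dirs:
        for s in store.list_status(d):
            status = store.get_file_status(s.path)
            if not status.is_dir and status.mod_time > threshold:
                new_files.append(status.path)
    return sorted(new_files)


def find_new_files_lean(store, pattern, threshold):
    # Reuse the FileStatus objects that the glob and the listings already returned.
    dirs = [s.path for s in store.glob_status(pattern) if s.is_dir]
    return sorted(s.path for d in dirs for s in store.list_status(d)
                  if not s.is_dir and s.mod_time > threshold)


SAMPLE = [
    FileStatus("/in/a", True, 0),
    FileStatus("/in/b", True, 0),
    FileStatus("/in/readme", False, 5),
    FileStatus("/in/a/1", False, 10),
    FileStatus("/in/a/2", False, 30),
    FileStatus("/in/b/3", False, 40),
]

if __name__ == "__main__":
    for finder in (find_new_files_naive, find_new_files_lean):
        store = CountingStore(SAMPLE)
        print(finder.__name__, finder(store, "/in/*", 20), "calls:", store.calls)
```

In this model the naive variant adds one extra call per path the filter inspects, so the gap grows with the number of files in the monitored directories. Whether the same restructuring applies cleanly to findNewFiles() would need checking against the actual code.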