GitHub user windpiger reopened a pull request:
https://github.com/apache/spark/pull/17081
[SPARK-18726][SQL][FOLLOW-UP]resolveRelation for FileFormat DataSource
don't need to listFiles twice
## What changes were proposed in this pull request?
Currently, when we call `resolveRelation` for a `FileFormat DataSource`
without a user-provided schema, `listFiles` is executed twice in
`InMemoryFileIndex`.
This PR adds a `FileStatusCache` to `DataSource`, which avoids calling
`listFiles` twice.
However, there is a bug in `InMemoryFileIndex`; see
[SPARK-19748](https://github.com/apache/spark/pull/17079) and
[SPARK-19761](https://github.com/apache/spark/pull/17093).
This PR should therefore be merged after SPARK-19748 and SPARK-19761.
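The idea of sharing one listing result between two consumers can be sketched in plain Scala. This is only an illustration, not the code in this PR: `FileEntry`, `CountingLister`, `ListingCache`, and `ListingCacheDemo` are hypothetical names and do not correspond to Spark's `FileStatusCache` or `InMemoryFileIndex` APIs. It also assumes that the two listings come from schema inference and from building the relation's file index.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for illustration only; these are NOT Spark's classes.
final case class FileEntry(path: String, sizeInBytes: Long)

// Simulates an expensive file system listing and counts how often it runs.
final class CountingLister(files: Map[String, Seq[FileEntry]]) {
  var listCalls: Int = 0
  def list(dir: String): Seq[FileEntry] = {
    listCalls += 1
    files.getOrElse(dir, Seq.empty)
  }
}

// A minimal cache keyed by directory path, meant to live for one resolution.
final class ListingCache {
  private val cached = mutable.Map.empty[String, Seq[FileEntry]]
  // Runs doList only when the directory has not been listed before.
  def getOrList(dir: String)(doList: String => Seq[FileEntry]): Seq[FileEntry] =
    cached.getOrElseUpdate(dir, doList(dir))
}

object ListingCacheDemo {
  // Assumed shape of the problem: two steps each need the same listing.
  // Returns how many times the underlying lister was actually invoked.
  def resolve(dir: String, lister: CountingLister, cache: Option[ListingCache]): Int = {
    def listing(): Seq[FileEntry] = cache match {
      case Some(c) => c.getOrList(dir)(lister.list)
      case None    => lister.list(dir)
    }
    val forSchemaInference = listing() // first consumer
    val forFileIndex = listing()       // second consumer
    require(forSchemaInference == forFileIndex)
    lister.listCalls
  }

  def main(args: Array[String]): Unit = {
    val files = Map("/data/t" -> Seq(FileEntry("/data/t/part-0", 10L)))
    println(resolve("/data/t", new CountingLister(files), None))                   // prints 2
    println(resolve("/data/t", new CountingLister(files), Some(new ListingCache))) // prints 1
  }
}
```

Keeping the cache local to a single `resolve` call in this sketch means its entries cannot outlive the resolution that created them.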
## How was this patch tested?
A unit test was added.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/windpiger/spark resolveDataSourceScanFilesTwice
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17081.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17081
----
commit 0082b7633e8f84fe5cafa0362cd45cce4cfee459
Author: windpiger <[email protected]>
Date: 2017-02-27T08:04:30Z
[SPAKR-18726][SQL]resolveRelation for FileFormate DataSource don't need to
listFiles twice
commit 6b5454ad0104459565febb520fa22ef30bdb8368
Author: windpiger <[email protected]>
Date: 2017-02-27T08:39:45Z
add test case
commit f1da0a4cf457f4efb6128beca3c08ccf95ef37a0
Author: windpiger <[email protected]>
Date: 2017-02-27T23:59:34Z
fix a style
commit f79f12c552ee1721295c347744fc5f92f048c74b
Author: windpiger <[email protected]>
Date: 2017-03-01T22:49:13Z
Merge branch 'master' into resolveDataSourceScanFilesTwice
commit a8c1deab0fc8e59863bf4a3d3b551f77fbebbc6d
Author: windpiger <[email protected]>
Date: 2017-03-02T01:50:30Z
fix test failed
commit 60fa03757d223f833e2fa161326a48a9015d4c6c
Author: windpiger <[email protected]>
Date: 2017-03-02T04:49:08Z
add a lazy
commit 9a73947efea334ba0cfc5b5508003807a93ff806
Author: windpiger <[email protected]>
Date: 2017-03-02T06:49:44Z
fix code style
commit 850094cd3b77f6ecf33caf88532920e73de976f4
Author: windpiger <[email protected]>
Date: 2017-03-02T06:54:38Z
Merge branch 'master' of github.com:apache/spark into
resolveDataSourceScanFilesTwice
commit c39eb26da38f9d92e3871814be446c8d911be890
Author: windpiger <[email protected]>
Date: 2017-03-02T11:03:18Z
make filestatuscache local var
commit f3332cb870ae2be9383969de07a07c8761230e8b
Author: windpiger <[email protected]>
Date: 2017-03-02T11:04:55Z
modify a test case
commit 9cadd4168041fd859cc1e4b8396e5ed514129bff
Author: windpiger <[email protected]>
Date: 2017-03-02T11:05:24Z
modify a test case
commit 28c8158a7c9d7acdbf2a07ef66ace46c1215979f
Author: windpiger <[email protected]>
Date: 2017-03-02T11:06:40Z
modify a test case
commit 92618b3ad67c899e681a9923ad9abc5a7f2c7897
Author: windpiger <[email protected]>
Date: 2017-03-02T11:07:10Z
remove an empty line
----