GitHub user tdas opened a pull request:
https://github.com/apache/spark/pull/12591
[SPARK-14832][SQL][STREAMING]
## What changes were proposed in this pull request?
When creating a file stream using sqlContext.read.stream(), existing files
are scanned twice to find the schema:
- Once, when creating a DataSource + StreamingRelation in the
DataFrameReader.stream()
- Again, when creating streaming Source from the DataSource, in
DataSource.createSource()
Instead, the schema should be generated only once, at the time of creating
the dataframe, and when the streaming source is created, it should just reuse
that schema.
The solution proposed in this PR is to add a lazy field in DataSource that
caches the schema. The streaming Source created by the DataSource can then
reuse the schema.
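
The caching idea can be illustrated with a minimal sketch. The classes below (`Schema`, `FileScanner`, `SketchSource`, `SketchDataSource`) are hypothetical, simplified stand-ins invented for illustration and are not Spark's actual classes or signatures; the only point shown is that a Scala `lazy val` runs the expensive inference at most once, however many sources are created afterwards.

```scala
// Hypothetical, simplified stand-ins; not Spark's actual classes.
final case class Schema(fieldNames: Seq[String])

// Counts how often the (expensive) file scan runs.
final class FileScanner {
  var scanCount: Int = 0

  def inferSchema(): Schema = {
    scanCount += 1
    // Assumption for the sketch: every scan yields a single column named "value".
    Schema(Seq("value"))
  }
}

// A source that simply carries the schema it was given.
final class SketchSource(val schema: Schema)

final class SketchDataSource(scanner: FileScanner) {
  // Evaluated on first access only; later reads reuse the cached value.
  lazy val sourceSchema: Schema = scanner.inferSchema()

  // A fresh source is created on each call, but no rescan is triggered.
  def createSource(): SketchSource = new SketchSource(sourceSchema)
}
```

In this sketch, calling `createSource()` twice on the same `SketchDataSource` leaves `scanCount` at 1, because both sources read the same cached `sourceSchema`.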
In addition to the bug fix, this PR refactors the tests to fix the following
problem.
Current StreamTest allows testing of a streaming Dataset generated by
explicitly wrapping a source. This is different from the actual production code
path where the source object is dynamically created through a DataSource object
every time a query is started. So all the fault-tolerance testing in
FileSourceSuite and FileSourceStressSuite is not really testing the actual code
path as they are just reusing the FileStreamSource object. This PR fixes
StreamTest and the FileSource***Suite to test this correctly. Instead of
maintaining a mapping of source --> expected offset in StreamTest (which
requires reuse of source object), it now maintains a mapping of source index
--> offset, so that it is independent of the source object.
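
Why keying by index helps can be shown with a small sketch. `RestartableSource` and `OffsetTracker` are hypothetical names made up for this illustration and do not reflect the real StreamTest code; the sketch only assumes that a source recreated on restart is a new object with reference equality.

```scala
// Hypothetical stand-in for a streaming source; a plain class, so two
// instances are never equal even if they describe the same input.
final class RestartableSource(val name: String)

// Expected offsets keyed by the source's position in the query,
// which stays stable when the source object is recreated.
final class OffsetTracker {
  private var expected = Map.empty[Int, Long]

  def record(sourceIndex: Int, offset: Long): Unit =
    expected = expected.updated(sourceIndex, offset)

  def expectedOffset(sourceIndex: Int): Option[Long] =
    expected.get(sourceIndex)
}
```

A `Map[RestartableSource, Long]` would lose its entry as soon as the source is rebuilt for a restarted query, whereas `expectedOffset(0)` still resolves because the index does not depend on object identity.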
## How was this patch tested?
Refactored unit tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tdas/spark SPARK-14832
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12591.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12591
----
commit f9a1524ff16e5a92ad0da56c0d2fab36044e2641
Author: Tathagata Das <[email protected]>
Date: 2016-04-21T01:57:03Z
Added unit test
commit e9c6d607ca8de4bd89f78ede4839ca54c0be6b1c
Author: Tathagata Das <[email protected]>
Date: 2016-04-21T20:21:04Z
Refactored file source to avoid multiple schema inferences
commit d7efacb959667b260d110ab58251014f29fc0b4b
Author: Tathagata Das <[email protected]>
Date: 2016-04-22T00:33:06Z
Refactored test code for better testing
----