GitHub user andrewor14 opened a pull request:
https://github.com/apache/spark/pull/2187
[SPARK-3277] Fix external spilling with LZ4 assertion error
The bulk of this PR consists of tests and documentation; the actual fix
is a one-line change (see `BlockObjectWriter.scala`). We
currently do not run the `External*` test suites with different compression
codecs; doing so would have caught the bug reported in SPARK-3277. This PR
extends the existing code to test spilling using all compression codecs known
to Spark, including `LZ4`.
**The actual bug**
In `DiskBlockObjectWriter`, we only report the shuffle bytes written before
we close the streams. With `LZ4`, the bytes written reported in our metrics were 0
because `flush()` was not taking effect for some reason. In general,
compression codecs may write additional bytes to the file when we call
`close()`, so we must also capture those bytes in our shuffle write metrics.
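To illustrate the general point (this is not the Spark code), here is a minimal sketch that uses the JDK's `GZIPOutputStream` as a stand-in for `LZ4`. The class name `SpillMetricsSketch` and the method `measure` are hypothetical. It compares the number of bytes that have reached the underlying sink after `flush()` with the number after `close()`; with the default constructor, `flush()` does not force the compressor to emit its buffered data, and the trailer is only written on close.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: GZIP from the JDK stands in for LZ4, and an
// in-memory sink stands in for the spill file.
public class SpillMetricsSketch {

    /** Returns {bytes in the sink after flush(), bytes in the sink after close()}. */
    static long[] measure(byte[] payload) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            GZIPOutputStream out = new GZIPOutputStream(sink);

            out.write(payload);
            out.flush();                   // may leave data buffered inside the compressor
            long afterFlush = sink.size(); // what a "measure before close" metric would see

            out.close();                   // emits remaining compressed data and the trailer
            long afterClose = sink.size(); // the size that should be reported

            return new long[] {afterFlush, afterClose};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = "some shuffle records".getBytes(StandardCharsets.UTF_8);
        long[] sizes = measure(payload);
        System.out.println("bytes after flush(): " + sizes[0]);
        System.out.println("bytes after close(): " + sizes[1]);
    }
}
```

A metric taken only before `close()` under-reports in this sketch, which is why the write metrics need to include the bytes that appear once the streams are closed.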
Thanks @mridulm and @pwendell for help with debugging.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/andrewor14/spark fix-lz4-spilling
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2187.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2187
----
commit 1bfa7438f8cb5ef00842e67f343dd02eb9679c8a
Author: Andrew Or
Date: 2014-08-28T20:15:32Z
Add more information to assert for better debugging
commit b264a84bcd848609f26ba80f23391c439369180e
Author: Andrew Or
Date: 2014-08-28T20:59:07Z
ExternalAppendOnlyMapSuite code style fixes (minor)
commit 4bbcf68cad63d6f787a445e2eb724fcd6e6acb7c
Author: Andrew Or
Date: 2014-08-28T21:38:25Z
Update tests to actually test all compression codecs
commit 089593f6d388980694f031641091b58e1d8dcfc7
Author: Andrew Or
Date: 2014-08-28T21:57:59Z
Actually fix SPARK-3277 (tests still fail)
commit a1ad53620d209837cc456e5789c7b73d7e1b8b80
Author: Andrew Or
Date: 2014-08-28T22:42:11Z
Fix tests
We need to stop the SparkContexts before creating a new one.
Otherwise the tests get into bad states.
commit 92e251bae0f354cfe8350497d2ca0bb2bdd8028b
Author: Patrick Wendell
Date: 2014-08-28T20:54:01Z
Better documentation for BlockObjectWriter.
commit 6b2e7d155457043b967e03743c0d22556d818a3e
Author: Andrew Or
Date: 2014-08-28T22:45:26Z
Fix compilation error
commit 1c4624ed0d351375d4ff3bcb6384a27fe2b98fd5
Author: Andrew Or
Date: 2014-08-28T22:45:48Z
Merge branch 'master' of github.com:apache/spark into fix-lz4-spilling
----