GitHub user ericl opened a pull request:
https://github.com/apache/spark/pull/13537
[SPARK-15794] Should truncate toString() of very wide schemas
## What changes were proposed in this pull request?
With very wide tables (e.g. thousands of fields), the toString() output is
unreadable and often causes OOMs due to inefficient string processing. This
patch truncates all struct and operator field lists at a user-configurable
threshold to limit the performance and readability impact.
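As a rough illustration of the truncation idea, here is a minimal sketch in Python. It is not the code in this patch: the helper name `truncated_string`, its parameters, and the wording of the placeholder are all assumptions made for the example.

```python
def truncated_string(fields, max_fields, start="[", sep=", ", end="]"):
    """Join `fields`, emitting at most `max_fields` entries.

    Hypothetical helper for illustration only; the name, signature and
    placeholder text are assumptions, not taken from the patch.
    """
    fields = list(fields)
    if len(fields) <= max_fields:
        # Short enough: render every field.
        return start + sep.join(str(f) for f in fields) + end
    # Too wide: keep the first max_fields - 1 entries and use the last
    # slot for a placeholder that says how many fields were dropped.
    kept = fields[: max(max_fields - 1, 0)]
    omitted = len(fields) - len(kept)
    parts = [str(f) for f in kept] + [f"... {omitted} more fields"]
    return start + sep.join(parts) + end


if __name__ == "__main__":
    print(truncated_string(["a", "b", "c"], max_fields=5))
    print(truncated_string(range(2000), max_fields=3))
```

With a helper of this shape, the size of the rendered string is bounded by the threshold rather than by the width of the schema.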
It would also be nice to optimize string generation to avoid this sort of
O(n^2) slowdown entirely (i.e. use StringBuilder everywhere, including in
expressions), but that is probably too large a change for 2.0 at this point.
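To make the O(n^2) claim concrete, the following is a back-of-the-envelope cost model in Python. It assumes that each immutable-string concatenation copies the whole accumulated string, and that a builder copies each field roughly once (ignoring buffer resizes). The function names and the field length of 10 characters are illustrative assumptions, not measurements from this patch.

```python
def chars_copied_naive(num_fields, field_len):
    """Characters copied when building a string with repeated `s = s + field`.

    Assumes every concatenation copies the accumulated string plus the
    new field, which is how immutable string concatenation behaves.
    """
    total = 0
    current_len = 0
    for _ in range(num_fields):
        current_len += field_len
        total += current_len  # the whole new string is written out
    return total


def chars_copied_builder(num_fields, field_len):
    """Characters copied when appending to a builder (resizes ignored)."""
    return num_fields * field_len


if __name__ == "__main__":
    for n in (5, 25, 100, 2000):
        print(n, chars_copied_naive(n, 10), chars_copied_builder(n, 10))
```

Under this model, 2000 fields of 10 characters cost about 20 million character copies with naive concatenation versus 20 thousand with a builder, which is why capping the number of rendered fields helps even without a full rewrite.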
## How was this patch tested?
Added a microbenchmark that covers this case particularly well. I also ran
the microbenchmark while varying the truncation threshold.
```
numFields = 5
wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem)             2336 / 2558          0.0       23364.4       0.1X

numFields = 25
wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem)             4237 / 4465          0.0       42367.9       0.1X

numFields = 100
wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem)           10458 / 11223          0.0      104582.0       0.0X

numFields = Infinity
wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
[info] java.lang.OutOfMemoryError: Java heap space
```
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ericl/spark truncated-string
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13537.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13537
----
commit d16e0f3e22287a7f3779ed24239d84179602e30a
Author: Eric Liang <[email protected]>
Date: 2016-06-07T00:56:06Z
truncate strings
commit f4f4368d3550b864c6286ce04770990b41c6741c
Author: Eric Liang <[email protected]>
Date: 2016-06-07T01:37:13Z
Mon Jun 6 18:37:13 PDT 2016
commit 17f98d76aec40bc7c6b8c46925d4013f9bccd639
Author: Eric Liang <[email protected]>
Date: 2016-06-07T01:43:24Z
Mon Jun 6 18:43:24 PDT 2016
----