GitHub user ericl opened a pull request:

    https://github.com/apache/spark/pull/13537

    [SPARK-15794] Should truncate toString() of very wide schemas

    ## What changes were proposed in this pull request?
    
    With very wide tables, e.g. thousands of fields, the toString() output is 
unreadable and often causes OOMs due to inefficient string processing. This 
change truncates all struct and operator field lists to a user-configurable 
threshold to limit the performance and readability impact.
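
As a rough illustration of the truncation idea (not the code in this patch), the sketch below joins at most `maxFields` entries of a field list and summarizes the rest. The class name, method name, parameter names, and the wording of the placeholder are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not the patch's actual API.
final class TruncatedFieldList {
    // Renders fields as start + f1 + sep + f2 ... + end. When the list has more
    // than maxFields entries, only the first (maxFields - 1) are kept and the
    // remainder is replaced by a single "... N more fields" placeholder.
    static String truncatedString(List<String> fields, String start, String sep,
                                  String end, int maxFields) {
        if (fields.size() <= maxFields) {
            return start + String.join(sep, fields) + end;
        }
        int kept = Math.max(0, maxFields - 1);
        int dropped = fields.size() - kept;
        StringBuilder sb = new StringBuilder(start);
        for (int i = 0; i < kept; i++) {
            sb.append(fields.get(i)).append(sep);
        }
        sb.append("... ").append(dropped).append(" more fields").append(end);
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("a", "b", "c", "d");
        System.out.println(truncatedString(fields, "[", ", ", "]", 3));
    }
}
```

With a threshold like this, the cost of rendering a schema is bounded by the threshold rather than by the number of fields.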
    
    It would also be nice to optimize string generation to avoid this sort of 
O(n^2) slowdown entirely (i.e. use StringBuilder everywhere, including in 
expressions), but this is probably too large a change for 2.0 at this point.
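
To make the O(n^2) remark concrete, here is a minimal sketch of the general pattern, assuming a simple comma-joined field list. It is not taken from the Spark code base and the class and method names are made up. Repeated `+=` on a `String` copies everything accumulated so far on each iteration, while a single `StringBuilder` appends into one growable buffer.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: contrasts two ways of building one long string.
final class FieldListJoin {
    // Quadratic in total length: every += allocates a new String and copies
    // the whole accumulated prefix.
    static String joinByConcat(List<String> fields) {
        String out = "";
        for (String f : fields) {
            out += f + ",";
        }
        return out;
    }

    // Amortized linear: appends go into a single resizable buffer.
    static String joinByBuilder(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (String f : fields) {
            sb.append(f).append(',');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("a", "b", "c");
        // Both variants produce the same text; only the cost differs.
        System.out.println(joinByConcat(fields).equals(joinByBuilder(fields)));
    }
}
```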
    
    ## How was this patch tested?
    
    Added a microbenchmark that covers this case particularly well. I also ran 
the microbenchmark while varying the truncation threshold.
    
    ```
    numFields = 5
    wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    2000 wide x 50 rows (write in-mem)            2336 / 2558          0.0       23364.4       0.1X
    
    numFields = 25
    wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    2000 wide x 50 rows (write in-mem)            4237 / 4465          0.0       42367.9       0.1X
    
    numFields = 100
    wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    2000 wide x 50 rows (write in-mem)          10458 / 11223          0.0      104582.0       0.0X
    
    numFields = Infinity
    wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    [info]   java.lang.OutOfMemoryError: Java heap space
    ```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ericl/spark truncated-string

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13537.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13537
    
----
commit d16e0f3e22287a7f3779ed24239d84179602e30a
Author: Eric Liang <[email protected]>
Date:   2016-06-07T00:56:06Z

    truncate strings

commit f4f4368d3550b864c6286ce04770990b41c6741c
Author: Eric Liang <[email protected]>
Date:   2016-06-07T01:37:13Z

    Mon Jun  6 18:37:13 PDT 2016

commit 17f98d76aec40bc7c6b8c46925d4013f9bccd639
Author: Eric Liang <[email protected]>
Date:   2016-06-07T01:43:24Z

    Mon Jun  6 18:43:24 PDT 2016

----


