Patrick Wendell created SPARK-4073:
--------------------------------------

             Summary: Parquet+Snappy can cause significant off-heap memory usage
                 Key: SPARK-4073
                 URL: https://issues.apache.org/jira/browse/SPARK-4073
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.1.0
            Reporter: Patrick Wendell
            Priority: Critical


The Parquet snappy codec allocates off-heap (direct) buffers for decompression 
[1]. In one case the observed size of these buffers was large enough to add 
several GB to the overall virtual memory usage of the Spark executor process. I 
don't understand our use of Snappy well enough to say how much data we would 
_expect_ to be present in these buffers at any given time, but I can say a few 
things.

1. The dataset had individual rows that were fairly large, e.g. megabytes.
2. Direct buffers are not cleaned up until GC events, and overall there was not 
much heap contention, so they may simply not have been getting cleaned (see the 
sketch below).

I opened PARQUET-118 to ask for an option to use on-heap buffers for 
decompression. In the meantime, we could consider changing the default codec 
back to gzip, or we could do nothing (it isn't clear how many other users will 
hit this).
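
If we don't change the default, a per-application workaround is to switch the 
codec Spark uses when writing Parquet files. A minimal sketch, assuming the SQL 
conf key spark.sql.parquet.compression.codec (which accepts values such as 
"snappy" and "gzip") and an existing SparkContext named sc; note this only 
affects files written after the setting is changed:

{code:scala}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Write new Parquet files with gzip instead of snappy, so readers of those
// files do not go through the snappy decompressor's direct-buffer path.
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
{code}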

[1] 
https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/codec/SnappyDecompressor.java#L28


