[jira] [Comment Edited] (PARQUET-118) Provide option to use on-heap buffers for Snappy compression/decompression
[ https://issues.apache.org/jira/browse/PARQUET-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928131#comment-16928131 ] Mitesh edited comment on PARQUET-118 at 9/12/19 1:43 AM:

Any update on this? I am hitting this with Spark when a column is a deeply nested struct. The entire column appears to be buffered at once, so I have to set {{-XX:MaxDirectMemorySize}} to a very large value (largest column size * rows per partition * partitions processed by a single JVM). It would be great to have a config to force on-heap buffer usage, even at a latency cost. Netty provides this via the {{-Dio.netty.noUnsafe}} flag, and I think that was a wise decision on their part. cc [~nongli]

> Provide option to use on-heap buffers for Snappy compression/decompression
> --
>
> Key: PARQUET-118
> URL: https://issues.apache.org/jira/browse/PARQUET-118
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.6.0
> Reporter: Patrick Wendell
> Priority: Major
>
> The current code uses direct off-heap buffers for decompression. If many
> decompressors are instantiated across multiple threads, and/or the objects
> being decompressed are large, this can lead to a huge amount of off-heap
> allocation by the JVM. This can be exacerbated if, overall, there is no heap
> contention, since no GC will be performed to reclaim the space used by these
> buffers.
> It would be nice if there was a flag we could use to simply allocate on-heap
> buffers here:
> https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/codec/SnappyDecompressor.java#L28
> We ran into an issue today where these buffers totaled a very large amount of
> storage and caused our Java processes (running within containers) to be
> terminated by the kernel OOM-killer.

-- This message was sent by Atlassian Jira (v8.3.2#803003)
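The distinction the issue turns on can be shown with plain NIO: heap buffers are backed by a byte[] that ordinary GC reclaims, while direct buffers live outside the heap and count against {{-XX:MaxDirectMemorySize}}. A minimal sketch of the requested switch (the {{parquet.snappy.use-heap-buffers}} property name is hypothetical; parquet-mr exposes no such flag in the version discussed here):

```java
import java.nio.ByteBuffer;

public class BufferChoice {
    // Hypothetical system property, shown only to illustrate the requested config.
    static final boolean USE_HEAP =
        Boolean.getBoolean("parquet.snappy.use-heap-buffers");

    static ByteBuffer allocate(int size) {
        // Heap buffers are reclaimed by normal GC; direct buffers are not,
        // and are bounded only by -XX:MaxDirectMemorySize.
        return USE_HEAP ? ByteBuffer.allocate(size)
                        : ByteBuffer.allocateDirect(size);
    }

    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.allocate(1024);
        ByteBuffer direct = ByteBuffer.allocateDirect(1024);
        System.out.println(heap.isDirect());   // false
        System.out.println(direct.isDirect()); // true
    }
}
```

With no heap pressure, a full GC that would free unreachable direct buffers may never run, which is exactly the OOM-killer scenario the reporter describes.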
[jira] [Commented] (PARQUET-118) Provide option to use on-heap buffers for Snappy compression/decompression
[ https://issues.apache.org/jira/browse/PARQUET-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928131#comment-16928131 ] Mitesh commented on PARQUET-118:

Any update on this? I am hitting this with Spark when a column is a deeply nested struct. The entire column appears to be buffered at once, so I have to set {{-XX:MaxDirectMemorySize}} to a very large value (largest column size * rows per partition * partitions processed by a single JVM). It would be great to have a config to force on-heap buffer usage, even at a latency cost. Netty provides this via the {{-Dio.netty.noUnsafe}} flag, and I think that was a wise decision on their part.

-- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (PARQUET-1178) Parquet modular encryption
[ https://issues.apache.org/jira/browse/PARQUET-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-1178:

Fix Version/s: format-2.7.0

> Parquet modular encryption
> --
>
> Key: PARQUET-1178
> URL: https://issues.apache.org/jira/browse/PARQUET-1178
> Project: Parquet
> Issue Type: New Feature
> Reporter: Gidon Gershinsky
> Assignee: Gidon Gershinsky
> Priority: Major
> Fix For: format-2.7.0
>
> A mechanism for modular encryption and decryption of Parquet files. Allows
> data to be kept fully encrypted in storage while enabling efficient analytics
> on it, via reader-side extraction / authentication / decryption of the data
> subsets required by columnar projection and predicate push-down.
> Enables fine-grained access control to column data by encrypting different
> columns with different keys.
> Supports a number of encryption algorithms, to account for different security
> and performance requirements.

-- This message was sent by Atlassian Jira (v8.3.2#803003)
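The "different columns, different keys" idea in PARQUET-1178 can be illustrated with plain JDK crypto. This is a conceptual sketch only, not the parquet-mr encryption API: the column names and key map are made up, and standard AES-GCM (one of the authenticated modes the feature supports) stands in for the actual module-level encryption.

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Conceptual sketch: each sensitive column gets its own AES key, so a reader
// holding only the "salary" key cannot decrypt the "ssn" column, while
// unencrypted columns remain readable by everyone.
public class ColumnKeyDemo {
    public static void main(String[] args) throws Exception {
        KeyGenerator gen = KeyGenerator.getInstance("AES");
        gen.init(128);
        Map<String, SecretKey> columnKeys = Map.of(
            "ssn", gen.generateKey(),
            "salary", gen.generateKey());

        // AES-GCM both encrypts and authenticates the column data,
        // matching the authentication aspect described in the issue.
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, columnKeys.get("ssn"),
               new GCMParameterSpec(128, iv));
        byte[] ct = c.doFinal("123-45-6789".getBytes(StandardCharsets.UTF_8));

        // GCM appends a 16-byte authentication tag to the ciphertext.
        System.out.println("ciphertext bytes: " + ct.length);
    }
}
```

Because each column is an independently encrypted module, a reader doing columnar projection only needs to fetch (and hold keys for) the columns it actually reads.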