[jira] [Comment Edited] (PARQUET-118) Provide option to use on-heap buffers for Snappy compression/decompression

2019-09-11 Mitesh (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928131#comment-16928131
 ] 

Mitesh edited comment on PARQUET-118 at 9/12/19 1:43 AM:
-----------------------------------------------------------------

Any update on this? I am hitting this with Spark when a column is a very deeply 
nested struct. It seems the entire column gets buffered at once, so I have to 
set {{-XX:MaxDirectMemorySize}} to a very large number (biggest size of a column 
value * num rows in a partition * num partitions processed by a single JVM).

It would be great to have a config to force on-heap buffer usage, even at the 
cost of some latency. Netty provides this via the {{-Dio.netty.noUnsafe}} flag, 
and I think that was a wise decision on their part.

cc [~nongli]
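
For illustration, a back-of-the-envelope sketch of that sizing arithmetic in Java; all inputs are hypothetical placeholders, not measured values:

{code:java}
// Rough bound for -XX:MaxDirectMemorySize, following the formula above:
// biggest column value * rows per partition * partitions per JVM.
public class DirectMemorySizing {
    public static void main(String[] args) {
        long biggestColumnValueBytes = 4 * 1024;   // hypothetical: one deeply nested struct value
        long rowsPerPartition = 1_000_000;         // hypothetical partition size
        long partitionsPerJvm = 8;                 // hypothetical concurrent tasks in one executor

        long bytes = biggestColumnValueBytes * rowsPerPartition * partitionsPerJvm;
        // Roughly 30 GiB for these inputs, which is why the flag ends up "very large".
        System.out.printf("-XX:MaxDirectMemorySize=%dm%n", bytes / (1024 * 1024));
    }
}
{code}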


was (Author: masterddt):
Any update on this? I am hitting this with Spark when a column is a very deeply 
nested struct. It seems the entire column gets buffered at once, so I have to 
set {{-XX:MaxDirectMemorySize}} to a very large number (biggest size of a column 
value * num rows in a partition * num partitions processed by a single JVM).

It would be great to have a config to force on-heap buffer usage, even at the 
cost of some latency. Netty provides this via the {{-Dio.netty.noUnsafe}} flag, 
and I think that was a wise decision on their part.

> Provide option to use on-heap buffers for Snappy compression/decompression
> ---------------------------------------------------------------------------
>
> Key: PARQUET-118
> URL: https://issues.apache.org/jira/browse/PARQUET-118
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.6.0
>Reporter: Patrick Wendell
>Priority: Major
>
> The current code uses direct off-heap buffers for decompression. If many 
> decompressors are instantiated across multiple threads, and/or the objects 
> being decompressed are large, this can lead to a huge amount of off-heap 
> allocation by the JVM. This can be exacerbated if, overall, there is no heap 
> contention, since no GC will be performed to reclaim the space used by these 
> buffers.
> It would be nice if there were a flag we could use to simply allocate on-heap 
> buffers here:
> https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/codec/SnappyDecompressor.java#L28
> We ran into an issue today where these buffers totaled a very large amount of 
> memory and caused our Java processes (running within containers) to be 
> terminated by the kernel OOM-killer.
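
A minimal sketch of the requested behavior, assuming a hypothetical property name (parquet-mr 1.6.0 exposes no such option):

{code:java}
import java.nio.ByteBuffer;

public class BufferChoice {
    // "parquet.snappy.use.heap.buffers" is a made-up name for illustration only.
    static ByteBuffer allocate(int size) {
        boolean useHeap = Boolean.getBoolean("parquet.snappy.use.heap.buffers");
        return useHeap
            ? ByteBuffer.allocate(size)        // on-heap: reclaimed by ordinary GC
            : ByteBuffer.allocateDirect(size); // off-heap: freed only once the wrapper object is collected
    }
}
{code}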



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (PARQUET-1178) Parquet modular encryption

2019-09-11 Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-1178:
--------------------------------------
Fix Version/s: format-2.7.0

> Parquet modular encryption
> --------------------------
>
> Key: PARQUET-1178
> URL: https://issues.apache.org/jira/browse/PARQUET-1178
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: format-2.7.0
>
>
> A mechanism for modular encryption and decryption of Parquet files. It allows 
> data to be kept fully encrypted in storage while enabling efficient analytics 
> on that data, via reader-side extraction / authentication / decryption of only 
> the data subsets required by columnar projection and predicate push-down.
> Enables fine-grained access control to column data by encrypting different 
> columns with different keys.
> Supports a number of encryption algorithms, to account for different security 
> and performance requirements.
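
For a sense of the mechanism, a conceptual sketch of per-column AES-GCM encryption (AES-GCM being one of the supported algorithms); this uses only the JDK and is not the parquet-mr API:

{code:java}
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class ColumnEncryptionSketch {
    // Encrypt one column chunk with its own key, so access can be granted per column.
    static byte[] encryptColumnChunk(byte[] plaintext, byte[] columnKey) throws Exception {
        byte[] iv = new byte[12];                       // fresh nonce per chunk
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE,
                    new SecretKeySpec(columnKey, "AES"),
                    new GCMParameterSpec(128, iv));     // 128-bit authentication tag
        byte[] ciphertext = cipher.doFinal(plaintext);  // ciphertext plus tag
        byte[] out = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, out, 0, iv.length);     // prepend IV so the chunk decrypts independently
        System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
        return out;
    }
}
{code}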



--
This message was sent by Atlassian Jira
(v8.3.2#803003)