[
https://issues.apache.org/jira/browse/PARQUET-156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261258#comment-14261258
]
Ryan Blue commented on PARQUET-156:
-----------------------------------
I'm interested in what you find out on tuning, Manish. For this issue, I think
Brock is right that PARQUET-108 covers the "automatic memory management to
avoid OOM". Maybe you could update this to "document recommendations for block
size and page size given an expected number of writers"? I'm not sure that is
very valuable though. Really, you want your block size above a certain minimum,
which means you ideally wouldn't use the memory manager at all. While a manager
keeps you from hitting OOM, it will degrade read performance if it makes the
Parquet row group size (the Parquet block size) too small.
Here are some general ideas to follow:
1. Row group size should always be smaller than HDFS block size.
2. A multiple of the row group size should be approximately (maybe a little
less than) the HDFS block size. For example, 2 row groups might fit in a single
HDFS block.
3. Remember that the row group size is an indicator of the memory footprint of
each open file, for both reading and writing. The read footprint will ideally be
smaller, but only by a constant factor determined by the columns you project.
4. Keep the expected number of open writers (M in your case) times the expected
per-writer consumption below your memory threshold (lower than the total heap),
and avoid letting the memory manager do this for you; see the sketch after this
list.
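To make idea 4 concrete, here is a minimal sizing sketch in Java. The memory
budget (R), writer count (M), HDFS block size, and the 0.7 headroom factor are
illustrative assumptions, not Parquet defaults; the point is only to show how the
per-writer share of the budget and the HDFS block size together bound the row
group size.

public class RowGroupSizing {

    // Memory you are willing to dedicate to open Parquet writers (R in the
    // report below), kept below the total heap. Example value only.
    static final long MEMORY_BUDGET_BYTES = 4L * 1024 * 1024 * 1024; // 4 GB

    // Expected number of concurrently open writers (M in the report below).
    static final int NUM_WRITERS = 16;

    // HDFS block size of the target files.
    static final long HDFS_BLOCK_BYTES = 128L * 1024 * 1024;

    // Fraction of the budget actually handed to writers; the rest is headroom
    // for buffers that temporarily exceed the target row group size.
    static final double HEADROOM = 0.7;

    public static void main(String[] args) {
        // Each open writer buffers roughly one row group in memory, so the
        // per-writer share of the budget bounds the row group size (idea 4).
        long perWriter = (long) (MEMORY_BUDGET_BYTES * HEADROOM) / NUM_WRITERS;

        // Ideas 1 and 2: stay at or below the HDFS block size, and pick a size
        // such that a whole number of row groups fills one HDFS block.
        long candidate = Math.min(perWriter, HDFS_BLOCK_BYTES);
        long groupsPerBlock = (HDFS_BLOCK_BYTES + candidate - 1) / candidate; // ceiling
        long rowGroupSize = HDFS_BLOCK_BYTES / groupsPerBlock;

        System.out.println("row groups per HDFS block: " + groupsPerBlock);
        System.out.println("suggested row group (Parquet block) size: "
            + rowGroupSize + " bytes");
    }
}

With these example numbers (16 writers, 4 GB budget) the sketch lands on one
128 MB row group per HDFS block; with 64 writers the per-writer share shrinks
and it falls back to three row groups of roughly 43 MB each per block.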
> parquet should have automatic memory management to avoid out of memory error.
> ------------------------------------------------------------------------------
>
> Key: PARQUET-156
> URL: https://issues.apache.org/jira/browse/PARQUET-156
> Project: Parquet
> Issue Type: Improvement
> Reporter: Manish Agarwal
>
> I sent a mail to the dev list, but I seem to have a problem with email on the
> dev list, so I am opening a bug here.
> I am on a multithreaded system where there are M threads, each thread creating
> an independent Parquet writer and writing its own independent files to HDFS.
> I have a finite amount of RAM, say R.
> When I created the Parquet writers using the default block and page sizes, I
> got heap errors (out of memory) on my setup. So I reduced the block size and
> page size to very low values, and my system stopped giving these out-of-memory
> errors and started writing the files correctly. I am able to read these files
> correctly as well.
> I should not have to lower these sizes myself; Parquet should automatically
> make sure I do not get these errors.
> But in case I do have to keep track of the memory, my question is as follows.
> Keeping these values very low is not a recommended practice, as I would lose
> performance. I am particularly concerned about write performance. What formula
> do you recommend I use to find the correct blockSize and pageSize to pass to
> the Parquet writer constructor for the right WRITE performance? That is, how
> can I decide the right blockSize and pageSize for a Parquet writer, given that
> I have M threads and total available RAM R? I don't understand the need for
> dictionaryPageSize; if I need to worry about that as well, kindly let me know,
> but I have kept the enableDictionary flag as false.
> I am using the constructor below.
> public ParquetWriter(
>     Path file,
>     WriteSupport<T> writeSupport,
>     CompressionCodecName compressionCodecName,
>     int blockSize,
>     int pageSize,
>     int dictionaryPageSize,
>     boolean enableDictionary,
>     boolean validating,
>     WriterVersion writerVersion,
>     Configuration conf) throws IOException {
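For reference, a call to this constructor with explicit sizes might look like the
sketch below. The example schema, output path, and size values are placeholders
for illustration, not recommendations, and the import package names are from the
later org.apache.parquet releases; 1.6.x-era releases used the parquet.* prefix
instead.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ParquetProperties.WriterVersion;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriterExample {
    public static void main(String[] args) throws Exception {
        // Example schema and output path; replace with your own.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message example { required int64 id; optional binary name (UTF8); }");
        Configuration conf = new Configuration();
        GroupWriteSupport.setSchema(schema, conf);

        int blockSize = 64 * 1024 * 1024;  // row group size chosen per the guidelines above
        int pageSize  = 1024 * 1024;       // 1 MB, the usual default page size

        ParquetWriter<Group> writer = new ParquetWriter<Group>(
            new Path("hdfs:///tmp/example.parquet"),
            new GroupWriteSupport(),
            CompressionCodecName.SNAPPY,
            blockSize,
            pageSize,
            pageSize,                      // dictionaryPageSize (unused when dictionary is off)
            false,                         // enableDictionary, as in the report above
            false,                         // validating
            WriterVersion.PARQUET_1_0,
            conf);
        // ... write Group records here, then close() to flush the last row group.
        writer.close();
    }
}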
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)