[ https://issues.apache.org/jira/browse/PARQUET-156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Manish Agarwal updated PARQUET-156:
-----------------------------------
Description:
I sent a mail to the dev list, but I seem to have a problem with email delivery there, so I am opening a bug here.
I am on a multithreaded system with M threads, each thread creating an independent Parquet writer and writing its own independent files on HDFS. I have a finite amount of RAM, say R.
When I created the Parquet writers with the default block and page sizes, I got heap errors (out of memory) on my setup. I then reduced the block size and page size to very low values, and the system stopped giving me these out-of-memory errors and started writing the files correctly. I am able to read these files back correctly as well.
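My working assumption about why the defaults blow the heap (I have not verified this against the Parquet internals, so please correct me if it is wrong) is that each open writer buffers roughly one row group, i.e. about blockSize bytes, before it flushes to HDFS, so the total buffer space grows with the number of writers. A rough back-of-the-envelope sketch, with a hypothetical thread count:

// Rough estimate only; the 32 threads are a made-up example and the 128 MB
// value is the parquet-mr default block (row group) size.
int  m                = 32;                    // M, hypothetical number of writer threads
long defaultBlockSize = 128L * 1024 * 1024;    // default blockSize
long roughBufferNeed  = m * defaultBlockSize;  // about 4 GB of heap just for write buffers
System.out.println((roughBufferNeed >> 20) + " MB of write buffers");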
I should not have to lower these sizes myself; Parquet should automatically make sure I do not get these errors.
But in case I do have to keep track of the memory myself, my question is as follows. Keeping these values very low is not a recommended practice, since I would lose performance, and I am particularly concerned about write performance. What formula do you recommend for choosing the blockSize and pageSize passed to the Parquet constructor to get the right WRITE performance? That is, how can I decide the right blockSize and pageSize for a Parquet writer, given that I have M threads and total RAM R? I also don't understand what dictionaryPageSize is needed for; if I have to worry about it as well, please let me know, although I have kept the enableDictionary flag set to false.
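To make the question concrete, the kind of rule I am currently guessing at is sketched below; the 0.5 headroom factor and the 1 MB page size are my own assumptions, not anything I found documented, so please confirm or correct them:

// Hypothetical sizing rule -- R is the JVM heap, M is the number of concurrent writers.
long   r        = Runtime.getRuntime().maxMemory(); // R, total heap in bytes
int    m        = 32;                                // M, hypothetical thread count
double headroom = 0.5;                               // leave half the heap for everything else

// Assume each writer buffers about one row group (blockSize) at a time.
int blockSize = (int) Math.min(Integer.MAX_VALUE, (long) (r * headroom) / m);

// Page size is a subdivision of a column chunk inside the row group; I am assuming
// the 1 MB default is fine and does not need to scale with M the way blockSize does.
int pageSize = 1024 * 1024;

// dictionaryPageSize should not matter here since enableDictionary is false,
// but I pass the 1 MB default anyway.
int dictionaryPageSize = 1024 * 1024;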
I am using the constructor below.
public ParquetWriter(
    Path file,
    WriteSupport<T> writeSupport,
    CompressionCodecName compressionCodecName,
    int blockSize,
    int pageSize,
    int dictionaryPageSize,
    boolean enableDictionary,
    boolean validating,
    WriterVersion writerVersion,
    Configuration conf) throws IOException {
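For completeness, this is roughly how I construct one writer per thread with those values. The package names are from the parquet-mr 1.x build I am on, SNAPPY and PARQUET_1_0 are just what I happen to use, and MyRecord / MyWriteSupport are placeholders for my own record type and WriteSupport implementation:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import parquet.column.ParquetProperties.WriterVersion;
import parquet.hadoop.ParquetWriter;
import parquet.hadoop.metadata.CompressionCodecName;

public class WriterFactory {
    // One writer per thread, sized from R and M as sketched above.
    public static ParquetWriter<MyRecord> create(Path file, int blockSize, int pageSize,
                                                 int dictionaryPageSize) throws IOException {
        return new ParquetWriter<MyRecord>(
            file,                         // each thread writes its own independent HDFS file
            new MyWriteSupport(),         // placeholder for my WriteSupport<MyRecord>
            CompressionCodecName.SNAPPY,
            blockSize,                    // sized from R and M instead of the 128 MB default
            pageSize,
            dictionaryPageSize,           // should be irrelevant since dictionary encoding is off
            false,                        // enableDictionary
            false,                        // validating
            WriterVersion.PARQUET_1_0,
            new Configuration());
    }
}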
> parquet should have automatic memory management to avoid out of memory error.
> ------------------------------------------------------------------------------
>
> Key: PARQUET-156
> URL: https://issues.apache.org/jira/browse/PARQUET-156
> Project: Parquet
> Issue Type: Improvement
> Reporter: Manish Agarwal
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)