[ https://issues.apache.org/jira/browse/PARQUET-156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manish Agarwal updated PARQUET-156:
-----------------------------------
    Description: 
I sent a mail to the dev list, but I seem to have a problem with email on the
dev list, so I am opening a bug here.

I am on a multithreaded system with M threads, each thread creating an
independent Parquet writer and writing its own independent files to HDFS. I
have a finite amount of RAM, say R.

When I created the Parquet writers using the default block and page sizes, I
got heap errors (out of memory) on my setup. So I reduced the block size and
page size to very low values, and my system stopped giving me these
out-of-memory errors and started writing the files correctly. I am able to
read these files correctly as well.

I should not have to lower these sizes by hand; Parquet should automatically
make sure I do not get these errors.

But in case I do have to keep track of the memory myself, my question is as
follows.

Keeping these values very low is not a recommended practice, as I would lose
performance; I am particularly concerned about write performance. What formula
do you recommend for finding the correct blockSize and pageSize to pass to the
Parquet writer constructor to get the right WRITE performance? That is, how
can I decide the right blockSize and pageSize for a Parquet writer, given that
I have M threads and total available RAM R? I also don't understand what
dictionaryPageSize is for; I have kept the enableDictionary flag false, but if
I need to worry about it as well, kindly let me know.

I am using the below constructor:

public ParquetWriter(
    Path file,
    WriteSupport<T> writeSupport,
    CompressionCodecName compressionCodecName,
    int blockSize,
    int pageSize,
    int dictionaryPageSize,
    boolean enableDictionary,
    boolean validating,
    WriterVersion writerVersion,
    Configuration conf) throws IOException {
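For concreteness, below is a rough sketch of the kind of sizing rule I have in
mind. Everything in it beyond the constructor signature above is my own
assumption, not something from the Parquet documentation: the
SizedWriterFactory and blockSizeFor names, the 2x safety factor, the idea that
each writer buffers roughly one row group (about blockSize bytes) at a time,
and the pre-rename parquet.* package names.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import parquet.column.ParquetProperties.WriterVersion;
import parquet.hadoop.ParquetWriter;
import parquet.hadoop.api.WriteSupport;
import parquet.hadoop.metadata.CompressionCodecName;

public class SizedWriterFactory<T> {

  // Rough sizing rule (my assumption): each of the M writers buffers up to
  // ~blockSize bytes of row-group data, so total writer memory is roughly
  // M * blockSize and must stay under R.
  static int blockSizeFor(long totalRamBytes, int numThreads) {
    // Hypothetical 2x safety factor: leave half of R for everything else.
    long perWriter = (totalRamBytes / 2) / numThreads;
    // Clamp between one default page (1 MB) and the 128 MB default block.
    return (int) Math.max(1L << 20, Math.min(perWriter, 128L << 20));
  }

  ParquetWriter<T> create(Path file, WriteSupport<T> writeSupport,
                          long totalRamBytes, int numThreads,
                          Configuration conf) throws IOException {
    int blockSize = blockSizeFor(totalRamBytes, numThreads);
    int pageSize = 1 << 20;            // 1 MB, the library default
    int dictionaryPageSize = pageSize; // unused here: dictionary encoding is off
    return new ParquetWriter<T>(
        file,
        writeSupport,
        CompressionCodecName.SNAPPY,   // codec choice is arbitrary for this sketch
        blockSize,
        pageSize,
        dictionaryPageSize,
        false,                         // enableDictionary: kept false, as in my setup
        false,                         // validating
        WriterVersion.PARQUET_1_0,
        conf);
  }
}

With, say, M = 16 threads and R = 4 GB, this rule would give each writer the
128 MB default block; with R = 1 GB it would drop to 32 MB per writer.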


> parquet should have automatic memory management to avoid out of memory error. 
> ------------------------------------------------------------------------------
>
>                 Key: PARQUET-156
>                 URL: https://issues.apache.org/jira/browse/PARQUET-156
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Manish Agarwal



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
