ccl125 opened a new issue, #3102:
URL: https://github.com/apache/parquet-java/issues/3102
### Describe the usage question you have. Please include as many useful
details as possible.
In my project, I am using the following code to write Parquet files to the
server:
```java
ParquetWriter<Group> parquetWriter = ExampleParquetWriter.builder(new Path(filePath))
    .withConf(new Configuration())
    .withType(messageType)
    .build();
```
Each Parquet file contains 30,000 columns. This code is executed by multiple
threads simultaneously, which results in increased GC time. Upon analyzing
memory usage, I found that the main memory consumers lie along the following
chain:

`InternalParquetRecordWriter -> ColumnWriterV1 -> FallbackValuesWriter -> PlainDoubleDictionaryValuesWriter -> IntList`
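For context on why that chain dominates the heap: `PlainDoubleDictionaryValuesWriter` keeps a dictionary of distinct double values plus one integer index per written value in an `IntList`, and that index buffer only shrinks when a page is flushed. With 30,000 columns per writer, every writer holds 30,000 such buffers. The class below is a hypothetical, simplified sketch of that bookkeeping (not Parquet's actual implementation), just to illustrate where the memory goes:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the bookkeeping inside a dictionary values writer:
// a dictionary of distinct values plus one int index per value written.
class DictionaryColumnSketch {
    private final Map<Double, Integer> dictionary = new LinkedHashMap<>();
    private final List<Integer> indices = new ArrayList<>(); // analogue of IntList

    void writeDouble(double v) {
        Integer id = dictionary.get(v);
        if (id == null) {
            id = dictionary.size();       // assign the next dictionary id
            dictionary.put(v, id);
        }
        indices.add(id); // grows by one entry per value until the page is flushed
    }

    int dictionarySize() { return dictionary.size(); }
    int bufferedValueCount() { return indices.size(); }
}
```

Multiply one such index buffer by 30,000 columns and by the number of concurrent writer threads, and the retained heap reported under `IntList` follows directly.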
Each thread writes to a file with the same table schema (header), differing
only in the filePath.
I initially suspected that the memory usage was caused by the file buffer
not being flushed in time. To address this, I tried configuring the writer with
the following parameters:
```java
parquetWriter = ExampleParquetWriter.builder(new Path(filePath))
    .withConf(new Configuration())
    .withType(messageType)
    .withMinRowCountForPageSizeCheck(SpringContextUtils.getApplicationContext()
        .getBean(EtlTaskProperties.class).getMinRowCountForPageSizeCheck())
    .withMaxRowCountForPageSizeCheck(SpringContextUtils.getApplicationContext()
        .getBean(EtlTaskProperties.class).getMaxRowCountForPageSizeCheck())
    .withRowGroupSize(SpringContextUtils.getApplicationContext()
        .getBean(EtlTaskProperties.class).getRowGroupSize())
    .build();
```
However, these adjustments did not solve the issue. The program still
experiences long GC pauses and excessive memory usage.
### Expected Behavior
Efficient Parquet file writing with reduced GC time and optimized memory
usage when multiple threads write files simultaneously.
### Observed Behavior
- Increased GC time and excessive memory usage.
- Memory analysis indicates that the `IntList` under
  `PlainDoubleDictionaryValuesWriter` is the primary consumer of memory.
### Request
What are the recommended strategies to mitigate excessive memory usage in
this scenario?
Is there a way to share table schema-related objects across threads, or are
there other optimizations to reduce memory overhead?
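On the schema-sharing question: `MessageType` is immutable, so a single parsed instance can be shared by all writer threads instead of re-parsing the 30,000-column schema per file. A minimal sketch, assuming the schema is available as a string (`schemaString` is a placeholder for your own schema text):

```java
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

// Parse the 30,000-column schema once and reuse it for every writer.
static final MessageType SHARED_SCHEMA =
    MessageTypeParser.parseMessageType(schemaString);

// Each thread then passes the shared instance:
//   ExampleParquetWriter.builder(new Path(filePath))
//       .withType(SHARED_SCHEMA)
//       ...
```

Note that sharing the schema only saves parsing and duplicate schema objects; the per-writer value buffers (the dominant cost here) are per-file state and cannot be shared across threads.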
Please let me know if additional information is needed!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]