LouisClt opened a new issue, #1240: URL: https://github.com/apache/orc/issues/1240
Hello,

Using the Arrow adapter, I noticed that the memory (RAM) footprint of an export (writing an ORC file) is very large per field. For instance, exporting a table with 10000 fields can take up to 30 GB, even with only 10 records. Even 100 fields can take 100 MB+ (see the back-of-the-envelope arithmetic below).

The issue seems to come from here: https://github.com/apache/orc/blob/432a7aade9ea8d3cd705d315da21c2c859bce9ef/c%2B%2B/src/ColumnWriter.cc#L59

When we create a writer with `createWriter` (https://github.com/apache/orc/blob/432a7aade9ea8d3cd705d315da21c2c859bce9ef/c%2B%2B/src/Writer.cc#L681-L684), a stream (compressor) is created for each field. Since each one allocates a buffer of 1 * 1024 * 1024 bytes, we pay at least 1 MB of additional memory per field.

Is there a reason the `BufferedOutputStream` initial capacity is that high? I worked around the problem by lowering it to 1 KB (it did not change performance much in my testing, but that may depend on the use case). Could a global (or static) variable be introduced to make this hard-coded parameter configurable (see the sketch at the end of this issue)?

Thanks
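For illustration, here is a minimal sketch of the arithmetic behind the figures above, assuming one 1 MiB `BufferedOutputStream` per field as the lower bound (in practice an ORC column may carry several streams, which would multiply this floor accordingly):

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Lower bound on writer memory: one compressor buffer per field,
  // each pre-allocating 1 * 1024 * 1024 bytes (the hard-coded capacity).
  const uint64_t capacityPerField = 1 * 1024 * 1024;
  for (uint64_t fields : {100u, 10000u}) {
    std::cout << fields << " fields -> at least "
              << (fields * capacityPerField) / (1024 * 1024)
              << " MiB of buffers before any data is written\n";
  }
  return 0;
}
```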
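And a rough sketch of what a configurable capacity could look like; the option name `setInitialBufferCapacity` and its plumbing are purely hypothetical, not an existing ORC API:

```cpp
#include <cstdint>

// Hypothetical extension of orc::WriterOptions (names invented for
// illustration): expose the initial compressor buffer capacity instead of
// hard-coding 1 * 1024 * 1024 in ColumnWriter.cc.
class WriterOptions {
 public:
  WriterOptions& setInitialBufferCapacity(uint64_t capacity) {
    initialBufferCapacity_ = capacity;
    return *this;
  }
  uint64_t getInitialBufferCapacity() const { return initialBufferCapacity_; }

 private:
  uint64_t initialBufferCapacity_ = 1024 * 1024;  // keep today's default
};

// The per-column stream factory would then read the option, e.g.:
//   createStream(..., options.getInitialBufferCapacity(), ...);
// instead of the current hard-coded constant.
```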
