LouisClt opened a new issue, #1240: URL: https://github.com/apache/orc/issues/1240
Hello,

Using the Arrow adapter, I noticed that the memory (RAM) footprint of an export (writing an ORC file) is very large per field. For instance, exporting a table with 10000 fields can take up to 30 GB, even with only 10 records. Even 100 fields can take 100 MB+ (see the back-of-the-envelope arithmetic below).

The issue seems to come from here: https://github.com/apache/orc/blob/432a7aade9ea8d3cd705d315da21c2c859bce9ef/c%2B%2B/src/ColumnWriter.cc#L59

When we create a writer with `createWriter` (https://github.com/apache/orc/blob/432a7aade9ea8d3cd705d315da21c2c859bce9ef/c%2B%2B/src/Writer.cc#L681-L684), a stream (compressor) is created for each field. Since each one allocates a buffer of 1 * 1024 * 1024 bytes, we pay at least 1 MB of additional memory per field.

Is there a reason the `BufferedOutputStream` initial capacity is that high? I worked around the problem by lowering it to 1 KB (it did not change performance much in my testing, but that may depend on the use case). Could a global (or static) variable be introduced to make this hard-coded parameter configurable (see the sketch at the end of this issue)?

Thanks
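For illustration, here is a minimal sketch of the arithmetic behind the figures above, assuming one 1 MiB `BufferedOutputStream` per field as the lower bound (in practice an ORC column may carry several streams, which would multiply this floor accordingly):

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Lower bound on writer memory: one compressor buffer per field,
  // each pre-allocating 1 * 1024 * 1024 bytes (the hard-coded capacity).
  const uint64_t capacityPerField = 1 * 1024 * 1024;
  for (uint64_t fields : {100u, 10000u}) {
    std::cout << fields << " fields -> at least "
              << (fields * capacityPerField) / (1024 * 1024)
              << " MiB of buffers before any data is written\n";
  }
  return 0;
}
```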
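And a rough sketch of what a configurable capacity could look like; the option name `setInitialBufferCapacity` and its plumbing are purely hypothetical, not an existing ORC API:

```cpp
#include <cstdint>

// Hypothetical extension of orc::WriterOptions (names invented for
// illustration): expose the initial compressor buffer capacity instead of
// hard-coding 1 * 1024 * 1024 in ColumnWriter.cc.
class WriterOptions {
 public:
  WriterOptions& setInitialBufferCapacity(uint64_t capacity) {
    initialBufferCapacity_ = capacity;
    return *this;
  }
  uint64_t getInitialBufferCapacity() const { return initialBufferCapacity_; }

 private:
  uint64_t initialBufferCapacity_ = 1024 * 1024;  // keep today's default
};

// The per-column stream factory would then read the option, e.g.:
//   createStream(..., options.getInitialBufferCapacity(), ...);
// instead of the current hard-coded constant.
```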
