asfimport commented on issue #398:
URL: https://github.com/apache/parquet-format/issues/398#issuecomment-2184154068

   [Adam 
Binford](https://issues.apache.org/jira/browse/PARQUET-2160?#comment-17575830):
   > Which parquet version are you using? There are some fix patches (<https://github.com/apache/parquet-mr/pull/903> and <https://github.com/apache/parquet-mr/pull/889>) released in 1.12.3.
   Yeah, this is in Spark 3.3.0, so Parquet 1.12.2. It looks like <https://github.com/apache/parquet-mr/pull/889> made it into 1.12.2, so the buffer pool is the main remaining difference. I tried both dropping in 1.12.3 and enabling the buffer pool in 1.12.2, and both still exhibit the same issue. The reason I can generate so much off-heap usage (> 1 GB in a few seconds) is that I have a very wide table (1k+ columns), mostly strings (not sure if that makes a difference), so it's probably creating a _lot_ of `ZstdInputStream`s when reading all of the columns. Selecting only some of the columns isn't as noticeable, but memory still slowly grows over time.
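   
   For anyone following along, the shape of the problem (as I understand it) is roughly the sketch below. Only zstd-jni's `ZstdInputStream` is a real class here; the surrounding names are made up for illustration, and this is not parquet-mr's actual code path:
   
   ```java
   import com.github.luben.zstd.ZstdInputStream;
   
   import java.io.ByteArrayInputStream;
   import java.io.IOException;
   import java.io.InputStream;
   
   // Illustration only: each column chunk gets its own ZstdInputStream, and each
   // instance holds a native (off-heap) decompression context that is released
   // only on close() or, much later, by finalization. With 1k+ mostly-string
   // columns, thousands of these can pile up before any are finalized.
   public class ZstdPerColumnSketch {
       static byte[] decompressChunk(byte[] compressedChunk) throws IOException {
           InputStream in = new ZstdInputStream(new ByteArrayInputStream(compressedChunk));
           // If this stream is handed off and kept open until GC instead of being
           // closed as soon as the page is consumed, its native buffers linger.
           return in.readAllBytes();
       }
   }
   ```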
   
   I compiled the suggested fix myself and tested it, and it did in fact completely fix my problem. What previously generated GBs of off-heap memory that never got cleaned up (and dozens of GB of virtual memory) now consistently stays around ~100 MB. Looking at `BytesInput`, I also agree that no extra copy of the actual data is made by using `BytesInput.copy`, because either way the data will be loaded into a single `byte[]` at some point, just a little earlier with the copy method. The only overhead is creating the additional `BytesInput` Java object.
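   
   To make that last point concrete, here's a hedged sketch of the idea behind the fix. It assumes parquet-mr's `BytesInput.from(InputStream, int)` and `BytesInput.copy(BytesInput)`; the wrapper class and method around them are invented for illustration, and this is not the actual patch:
   
   ```java
   import org.apache.parquet.bytes.BytesInput;
   
   import java.io.IOException;
   import java.io.InputStream;
   
   // Sketch: force the decompressed bytes into a byte[] right away via
   // BytesInput.copy, so the ZstdInputStream underneath can be closed
   // immediately instead of lingering with its off-heap context. The bytes end
   // up in a single byte[] either way, just earlier here; the only extra cost
   // is the wrapper BytesInput object.
   public class EagerDecompressSketch {
       static BytesInput readAndClose(InputStream zstdStream, int uncompressedSize) throws IOException {
           try (InputStream in = zstdStream) {
               // BytesInput.from(in, n) is lazy; copy() materializes it before close().
               return BytesInput.copy(BytesInput.from(in, uncompressedSize));
           }
       }
   }
   ```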

