Hi folks,

We (Twitter) are currently on a much older version of Parquet (SHA b1ea059 <https://github.com/apache/parquet-mr/commit/b1ea059a66c7d6d6bb4cb53d2005a9b7bb599ada>). We wanted to update our internal version to the current Parquet head, and we've started to see some performance issues. The update includes the changes to use ByteBuffer in the read & write paths <https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8> and the fix for those changes not working on Hadoop / S3 <https://github.com/apache/parquet-mr/pull/346>.
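As a starting point for digging in, here's a rough sketch of the kind of JMH micro-benchmark I have in mind to isolate one suspected cost: per-value reads through a heap ByteBuffer vs. raw byte[] arithmetic. This is hypothetical code, not from the Parquet tree (class name, sizes, and setup are all made up), just to illustrate the comparison:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.concurrent.ThreadLocalRandom;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

// Hypothetical micro-benchmark, not from parquet-mr. It isolates one
// suspected cost of the ByteBuffer change: decoding ints through a heap
// ByteBuffer vs. hand-rolled byte[] arithmetic over the same bytes.
@State(Scope.Thread)
public class ByteBufferReadBench {
  private static final int SIZE = 1 << 20; // 1 MiB of test data

  private byte[] array;
  private ByteBuffer buffer;

  @Setup
  public void setup() {
    array = new byte[SIZE];
    ThreadLocalRandom.current().nextBytes(array);
    // Heap buffer wrapping the same backing array; Parquet data is
    // little-endian, so match that so both paths decode identical values.
    buffer = ByteBuffer.wrap(array).order(ByteOrder.LITTLE_ENDIAN);
  }

  @Benchmark
  public long readViaArray() {
    long sum = 0;
    for (int i = 0; i + 4 <= SIZE; i += 4) {
      // Little-endian int assembled by hand, as a byte[]-based path might do.
      sum += (array[i] & 0xff)
          | (array[i + 1] & 0xff) << 8
          | (array[i + 2] & 0xff) << 16
          | (array[i + 3] & 0xff) << 24;
    }
    return sum;
  }

  @Benchmark
  public long readViaByteBuffer() {
    long sum = 0;
    for (int i = 0; i + 4 <= SIZE; i += 4) {
      sum += buffer.getInt(i); // absolute get; doesn't mutate position
    }
    return sum;
  }
}

With the usual JMH Maven setup this would run via something like `java -jar target/benchmarks.jar ByteBufferReadBench`. Obviously a micro-benchmark like this won't capture the Hadoop / S3 interaction, but it should at least tell us whether the per-value decode path is part of the story.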
With these changes, our Hadoop Parquet read jobs perform 4-6% worse and our write jobs 10-20% worse (measured by the MB_Millis metric). When we tested a branch with these changes reverted, reads performed roughly at par and writes improved by 3-10%.

I'm curious about the context behind these changes, so I was hoping someone might know / be able to chime in:
1) Was any performance testing / benchmarking done on these changes? Has anyone else encountered these performance issues?
2) Are there any specific flags or settings we need to set for these changes? (I didn't see any on a quick read through the code, but I might have missed something.)
3) Was there a specific feature these changes were supposed to enable? If not, would folks be open to backing them out of head?

I'm planning to dig into what specifically in these changes is causing the slowdown and will update this thread with my findings. We might also end up forking Parquet internally to unblock ourselves, but we'd like to stay on master if we can.

Thanks,
- Piyush
