Hi folks,

We (Twitter) are currently on a much older version of Parquet - SHA b1ea059
<https://github.com/apache/parquet-mr/commit/b1ea059a66c7d6d6bb4cb53d2005a9b7bb599ada>.
While updating our internal version to the current Parquet head, we've
started to see some performance regressions. The diff includes the changes
to use ByteBuffer in the read & write paths
<https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8>
and the fix for those changes not working on Hadoop / S3
<https://github.com/apache/parquet-mr/pull/346>.
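
To make the concern concrete: my working theory (an assumption on my part,
not something I've confirmed in the parquet-mr code) is that on streams
which don't implement Hadoop's ByteBufferReadable - which S3 streams
historically didn't - a ByteBuffer read path has to fall back to staging
bytes through a temporary byte[]. A minimal sketch of that fallback, where
the class and method names are hypothetical and only the Hadoop types are
real:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import org.apache.hadoop.fs.ByteBufferReadable;
    import org.apache.hadoop.fs.FSDataInputStream;

    // Hypothetical helper, not the actual parquet-mr code:
    // FSDataInputStream.read(ByteBuffer) only works when the wrapped
    // stream implements ByteBufferReadable, so a portable ByteBuffer
    // read path needs a byte[] fallback.
    class ByteBufferReadSketch {
      static int readIntoBuffer(FSDataInputStream in, ByteBuffer buf)
          throws IOException {
        if (in.getWrappedStream() instanceof ByteBufferReadable) {
          return in.read(buf);                    // direct read, no staging
        }
        byte[] tmp = new byte[buf.remaining()];   // fallback: extra allocation
        int n = in.read(tmp, 0, tmp.length);      // read into heap array...
        if (n > 0) {
          buf.put(tmp, 0, n);                     // ...then copy into the buffer
        }
        return n;
      }
    }

If the S3 fix works along these lines, an extra allocation and copy per
read could plausibly account for regressions of this size.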

With these changes, our Hadoop Parquet read jobs perform 4-6% worse and our
Hadoop Parquet write jobs 10-20% worse (both measured by the MB_Millis
metric). When we tested a branch with these changes reverted, reads were
roughly at par with our old version and writes were only slightly slower
than it (i.e. a 3-10% improvement over head).

I'm curious about the context for these changes, so I was hoping someone
might know / be able to chime in:
1) Was there any performance testing / benchmarking done on these changes?
Has anyone else encountered these performance issues?
2) Are there any specific flags or settings we need to set for these
changes? (I didn't see any on a quick read-through of the code, but I might
have missed something.)
3) Was there a specific feature these changes were meant to enable? I'm
wondering if backing them out of head is an option folks would be open to.

I'm planning to dig into what specifically in these changes is causing the
slowdown, and I'll update this thread with my findings. We might also end
up forking Parquet internally to unblock ourselves, but we'd like to stay
on master if we can.
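
For the digging itself, this is roughly the JMH harness I'm planning to
start from, just to compare the two copy patterns in isolation (the class
name, buffer size, and benchmark shapes are all placeholders, not anything
from parquet-mr):

    import java.nio.ByteBuffer;
    import org.openjdk.jmh.annotations.*;

    // Placeholder microbenchmark: a direct byte[] copy vs. the
    // byte[] -> ByteBuffer staging pattern from the fallback sketch above.
    @State(Scope.Thread)
    public class CopyPathBenchmark {
      @Param({"1048576"})     // 1 MiB chunks; arbitrary placeholder size
      int size;

      byte[] src;
      byte[] dst;
      ByteBuffer buf;

      @Setup
      public void setup() {
        src = new byte[size];
        dst = new byte[size];
        buf = ByteBuffer.allocate(size);
      }

      @Benchmark
      public byte[] arrayCopy() {
        System.arraycopy(src, 0, dst, 0, size);  // old-style byte[] path
        return dst;                              // return to defeat DCE
      }

      @Benchmark
      public ByteBuffer stagedByteBufferCopy() {
        buf.clear();
        buf.put(src, 0, size);                   // extra copy in fallback path
        return buf;
      }
    }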

Thanks,
-- 
- Piyush
