Hi Piyush,

I don't think there was a performance eval done before we merged the ByteBuffer code. Given how long it took to get merged, I think it's reasonable to land it and then assess and fix performance before we release. It sounds like you've done that preliminary assessment, so thanks! It's a bummer that it isn't faster, but we do want to keep these changes in the long run to be able to use the zero-copy API and enable off-heap buffers.
I suspect that the problem is buffer management. There are a few places where we end up making copies if things don't line up properly. For example, when the getBytes methods are called on Binary objects, if the beginning isn't at the start of a single byte[], then we copy. The good news is that we have the Binary interface, so we should be able to easily instrument where copies are happening and log them to track this down. I think that's a promising start for finding the cause of this slowdown. A rough sketch of what that instrumentation could look like is below the quoted mail.

rb

On Mon, Jun 13, 2016 at 1:47 PM, Piyush Narang <[email protected]> wrote:

> Hi folks,
>
> We (Twitter) are currently on a much older version of Parquet, SHA b1ea059
> <https://github.com/apache/parquet-mr/commit/b1ea059a66c7d6d6bb4cb53d2005a9b7bb599ada>.
> We wanted to update our internal version to the current Parquet head, and
> we've started to see some performance issues. Head includes the changes to
> use ByteBuffer in the read and write paths
> <https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8>
> and the fix for those changes not working on Hadoop / S3
> <https://github.com/apache/parquet-mr/pull/346>.
>
> With these changes, our Hadoop Parquet read jobs perform 4-6% worse
> (looking at the MB_Millis metric) and our Hadoop Parquet write jobs
> perform 10-20% worse (MB_Millis). When we tested a branch with these
> changes reverted, reads performed pretty much at par and writes were only
> slightly slower (the revert improves them by 3-10%).
>
> I'm hoping someone with context on these changes can chime in:
> 1) Was any performance testing / benchmarking done on these changes? Has
> anyone else encountered these performance issues?
> 2) Are there any specific flags we need to set for these changes? (I
> didn't see any on a quick look through the code, but I might have missed
> something.)
> 3) Was there a specific feature these changes were supposed to enable? Is
> backing them out of head an option that folks would be open to?
>
> I'm planning to dig into what specifically in these changes is causing the
> slowdown, and I'll update this thread with my findings. We might also end
> up forking Parquet internally to unblock ourselves, but we'd like to stay
> on master if we can.
>
> Thanks,
> --
> - Piyush

--
Ryan Blue
Software Engineer
Netflix
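
To make the instrumentation idea above concrete, here is a minimal, self-contained sketch of the kind of copy counting Ryan describes. It deliberately avoids parquet-mr internals: the CopyTracker class name and the alignment check are illustrative assumptions, not the actual Binary implementation, which may differ.

import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, not part of parquet-mr: counts how often a
// requested byte range can be served zero-copy versus must be copied.
public class CopyTracker {
    private static final AtomicLong zeroCopy = new AtomicLong();
    private static final AtomicLong copied = new AtomicLong();

    /**
     * Returns the bytes for [position, position + length) of buf.
     * If the buffer is heap-backed and the range is exactly its single
     * backing array, we can hand out the array directly (zero-copy);
     * otherwise (direct buffers, sliced views, misaligned starts) we
     * must allocate and copy -- the case suspected of causing the
     * slowdown.
     */
    public static byte[] getBytes(ByteBuffer buf, int position, int length) {
        if (buf.hasArray() && buf.arrayOffset() == 0 && position == 0
                && length == buf.array().length) {
            zeroCopy.incrementAndGet();
            return buf.array(); // whole backing array, no copy needed
        }
        copied.incrementAndGet();
        byte[] out = new byte[length];
        ByteBuffer dup = buf.duplicate(); // don't disturb buf's position
        dup.position(position);
        dup.get(out, 0, length);
        return out;
    }

    public static void report() {
        System.err.printf("zero-copy: %d, copied: %d%n",
                zeroCopy.get(), copied.get());
    }
}

Wiring counters like these into the Binary implementations (or into a wrapper at the column read/write path) should show fairly quickly whether the misaligned-copy path dominates the regression.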
