hey Ryan,

Thanks for getting back. Too bad we didn't run some perf tests prior to merging those changes in. It was quite a massive change, so it would've been nice to have some benchmarks run as part of the merge process. Going forward it might be a good idea to hook up the parquet benchmarks (maybe also some realistic Hadoop workload) to run either on every commit or daily, so that we're able to catch issues like this quickly.
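To make that a bit more concrete, something along these lines is what I had in mind: a small JMH-style read benchmark we could wire into a nightly job (or fold into the existing parquet-benchmarks module, if that's the right home for it). This is just a rough sketch on my part; the class name, test file path, and CI wiring are all made up, not anything that exists today.

```java
// Rough sketch only: a JMH benchmark that reads a pre-generated Parquet file,
// meant to run nightly / per-commit so regressions like this show up early.
// The class name, file path, and harness wiring are hypothetical.
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.infra.Blackhole;

public class ParquetReadBench {

  // Assumes a representative file was generated by an earlier setup step.
  private static final Path TEST_FILE = new Path("target/tests/bench-data.parquet");

  @Benchmark
  @BenchmarkMode(Mode.AverageTime)
  public void readWholeFile(Blackhole bh) throws Exception {
    ParquetReader<Group> reader =
        ParquetReader.builder(new GroupReadSupport(), TEST_FILE).build();
    try {
      Group record;
      while ((record = reader.read()) != null) {
        bh.consume(record); // keep the JIT from eliminating the read loop
      }
    } finally {
      reader.close();
    }
  }
}
```

A matching write benchmark would give us the other half of the picture, and comparing numbers across commits would have flagged this regression before the merge.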
Do we have a sense of how much effort the zero-copy API & off-heap buffers work is, and whether anyone's actively working on it? I guess I'm wondering: if the payoffs from this change are only going to be available much further out in time, does it make sense to leave head in a state where performance is degraded? If we're able to find some simple fixes on top of the ByteBuffer changes that address the performance issues, then that's great. Otherwise it leaves Parquet in a situation where running off the tip of head isn't an attractive option for folks (and, the way I understand it, performance is one of the driving reasons people use Parquet in the first place).

Either way, I'll try and instrument the Binary calls (I've put a rough sketch of the kind of copy counter I have in mind at the end of this mail) and look through the PR a bit more to see if I can find some likely causes.

Thanks,

On Mon, Jun 13, 2016 at 3:34 PM, Ryan Blue <[email protected]> wrote:

> Hi Piyush,
>
> I don't think there was a performance eval done before we merged the ByteBuffer code. Given how long it has taken to get in, I think it's reasonable to get it in and then assess and fix performance before we release. Sounds like you guys have done that preliminary assessment, so thanks! Bummer that it's not faster, but we do want to be on these changes in the long run to be able to use the zero-copy API and enable off-heap buffers.
>
> I suspect that the problem is buffer management. There are a few places where we end up making copies if things don't line up properly. For example, when the getBytes methods are called on Binary objects, if the beginning isn't at the start of a single byte[], then it will copy. The good news is that we have the Binary interface so we should be able to easily instrument where copies are happening and log to track this down. I think that's a promising start for finding out what's causing this slow-down.
>
> rb
>
> On Mon, Jun 13, 2016 at 1:47 PM, Piyush Narang <[email protected]> wrote:
>
> > hi folks,
> >
> > We (Twitter) are currently on a much older version of Parquet - SHA: b1ea059 <https://github.com/apache/parquet-mr/commit/b1ea059a66c7d6d6bb4cb53d2005a9b7bb599ada>. We wanted to update our internal version to the current Parquet head and we've started to see some performance issues. This includes the changes to use ByteBuffer in read & write paths <https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8> and the fix for those changes not working on Hadoop / S3 <https://github.com/apache/parquet-mr/pull/346>.
> >
> > With these changes we've seen our Hadoop Parquet read jobs perform 4-6% worse (looking at the MB_Millis metric) and 10-20% worse (MB_Millis) for Hadoop Parquet write jobs. When we tested out a branch with these changes reverted, we noticed that reads perform pretty much at par and writes are just slightly slower (improve by 3-10%).
> >
> > Curious about the context on these changes, so I was hoping someone might know / be able to chime in:
> > 1) Was there any performance testing / benchmarking done on these changes? Has anyone else encountered these performance issues?
> > 2) Any idea if there are any specific flags / such we need to set for these changes? (didn't see any based on a quick look through of the code but I might have missed something).
> > 3) Was there a specific feature that these changes were supposed to enable? Wondering if backing them out of head is an option that folks would be open to?
> >
> > I'm planning to try and dig into what specifically in these changes is causing the slow down. Will update this thread based on my findings. We might also end up forking Parquet internally to unblock ourselves but we'd like to stay on master if we can..
> >
> > Thanks,
> > --
> > - Piyush
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
- Piyush
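P.S. For the Binary instrumentation I mentioned above, this is roughly the shape I'm thinking of. It's only a sketch, not an existing parquet-mr API: a process-wide counter that a locally patched Binary would bump wherever getBytes()/toByteBuffer() has to materialize a fresh byte[], so the job logs show how many defensive copies we're actually paying for.

```java
// Sketch only (hypothetical hook, not existing parquet-mr code): a counter that
// the Binary implementations could call from their copying branches in a patched build.
import java.util.concurrent.atomic.AtomicLong;

public final class BinaryCopyStats {
  private static final AtomicLong COPIES = new AtomicLong();
  private static final AtomicLong COPIED_BYTES = new AtomicLong();

  private BinaryCopyStats() {}

  // Call this from the places in Binary.getBytes() / toByteBuffer() that allocate
  // and copy into a new byte[] because the value doesn't line up with a single
  // backing array.
  public static void recordCopy(int length) {
    COPIES.incrementAndGet();
    COPIED_BYTES.addAndGet(length);
  }

  public static String summary() {
    return "Binary copies=" + COPIES.get() + ", bytes copied=" + COPIED_BYTES.get();
  }
}
```

We could log summary() periodically or from a task-cleanup hook, and compare the counts between head and our reverted branch on the same job to see whether the extra copies line up with the MB_Millis regression.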
