I would also like to keep the discussion around potentially (temporarily) reverting the byte buffer changes open.
I understand that this enables us to use the zero-copy API and off-heap buffers, but
realistically, is anyone using either of those, and are they seeing performance benefits
from them? Parquet is supposed to be a high-performance file format, and this change has
led to a 10% drop in performance. I know it's been difficult to get this merged into
master, but ultimately I feel that we should be rigorous about the performance regression
here.

I agree that the best thing to do would be to fix the byte buffer codepath to match the
previous performance. Piyush has been taking a pretty in-depth look at that, and so far
there aren't any obvious things that stand out. Even fixing some places with unnecessary
copies hasn't made much of a difference, as far as I know. We can keep looking into that,
but if nobody can get to the bottom of it, I think we should consider reverting it until
it's in a state that passes some performance benchmarks we can come up with. If the whole
point of the byte buffers work is to open up performance optimizations, then the result
of merging it shouldn't be performance degradation.

I think a good counterpoint would be that someone is using the zero-copy API and that
it's working really well. What do you think?

Thanks,
Alex

On Mon, Jun 13, 2016 at 6:22 PM, Piyush Narang <[email protected]> wrote:

> hey Ryan,
>
> Thanks for getting back. Too bad we didn't run some perf tests prior to merging those
> changes in. It was quite a massive change, so it would've been nice to have some
> benchmarks done as part of the merge process. I guess going forward it might be a nice
> idea to hook up the parquet benchmarks (maybe also some realistic Hadoop workload) to
> run either on every commit or daily, so that we're able to catch such issues quickly.
>
> Do we have a sense of how much effort the zero-copy API & off-heap buffers work is,
> and whether anyone's actively working on it? I guess I'm wondering: if the payoffs from
> this change are only going to be available much further out in time, does it make sense
> to leave head in a state where performance is degraded? If we're able to find some
> simple fixes on top of the ByteBuffer changes that fix the performance issues, then
> that's great. Otherwise it leaves Parquet in a situation where using the tip of head
> isn't an attractive option for folks (and performance, the way I understand it, is one
> of the driving reasons for people to use Parquet in the first place).
>
> Either way, I'll try to instrument the Binary calls and look through the PR a bit more
> to see if I can find some likely causes.
>
> Thanks,
>
> On Mon, Jun 13, 2016 at 3:34 PM, Ryan Blue <[email protected]> wrote:
>
> > Hi Piyush,
> >
> > I don't think there was a performance eval done before we merged the ByteBuffer
> > code. Given how long it has taken to get in, I think it's reasonable to get it in and
> > then assess and fix performance before we release. Sounds like you guys have done
> > that preliminary assessment, so thanks! Bummer that it's not faster, but we do want
> > to be on these changes in the long run to be able to use the zero-copy API and enable
> > off-heap buffers.
> >
> > I suspect that the problem is buffer management. There are a few places where we end
> > up making copies if things don't line up properly. For example, when the getBytes
> > methods are called on Binary objects, if the beginning isn't at the start of a single
> > byte[], then it will copy. The good news is that we have the Binary interface, so we
> > should be able to easily instrument where copies are happening and log to track this
> > down. I think that's a promising start for finding out what's causing this slow-down.
> >
> > rb
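Here's a rough Java sketch of the kind of copy instrumentation Ryan describes above; the
CopyCounter helper and where it would be called from inside Binary are hypothetical, not
existing parquet-mr code:

    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical helper for tracking defensive copies; not part of parquet-mr.
    public final class CopyCounter {
      private static final AtomicLong COPIES = new AtomicLong();
      private static final AtomicLong BYTES_COPIED = new AtomicLong();

      private CopyCounter() {}

      // Intended to be called from each place in Binary (or its ByteBuffer-backed
      // implementations) where a fresh byte[] is allocated and filled, e.g. getBytes().
      public static void recordCopy(int length) {
        long copies = COPIES.incrementAndGet();
        long bytes = BYTES_COPIED.addAndGet(length);
        // Log only occasionally so hot read/write paths aren't flooded with output.
        if (copies % 100_000 == 0) {
          System.err.printf("Binary copies so far: %d (%d bytes)%n", copies, bytes);
          // A stack trace here shows which codepath is triggering the copies.
          new Exception("copy site").printStackTrace();
        }
      }
    }

Running a small read and write job with something like that in place should show roughly
how many copies each codepath makes, which seems like the data point we need before
deciding between fixing the ByteBuffer path and reverting it.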
> > On Mon, Jun 13, 2016 at 1:47 PM, Piyush Narang <[email protected]> wrote:
> >
> > > hi folks,
> > >
> > > We (Twitter) are currently on a much older version of Parquet - SHA: b1ea059
> > > <https://github.com/apache/parquet-mr/commit/b1ea059a66c7d6d6bb4cb53d2005a9b7bb599ada>.
> > > We wanted to update our internal version to the current Parquet head and we've
> > > started to see some performance issues. This includes the changes to use ByteBuffer
> > > in read & write paths
> > > <https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8>
> > > and the fix for those changes not working on Hadoop / S3
> > > <https://github.com/apache/parquet-mr/pull/346>.
> > >
> > > With these changes we've seen our Hadoop Parquet read jobs perform 4-6% worse
> > > (looking at the MB_Millis metric) and our Hadoop Parquet write jobs perform 10-20%
> > > worse (MB_Millis). When we tested out a branch with these changes reverted, we
> > > noticed that reads perform pretty much at par and writes are just slightly slower
> > > (improve by 3-10%).
> > >
> > > Curious about the context on these changes, so I was hoping someone might know /
> > > be able to chime in:
> > > 1) Was there any performance testing / benchmarking done on these changes? Has
> > > anyone else encountered these performance issues?
> > > 2) Any idea if there are any specific flags or settings we need to use for these
> > > changes? (I didn't see any based on a quick look through the code, but I might have
> > > missed something.)
> > > 3) Was there a specific feature that these changes were supposed to enable?
> > > Wondering if backing them out of head is an option that folks would be open to?
> > >
> > > I'm planning to try and dig into what specifically in these changes is causing the
> > > slowdown. Will update this thread based on my findings. We might also end up
> > > forking Parquet internally to unblock ourselves, but we'd like to stay on master if
> > > we can.
> > >
> > > Thanks,
> > > --
> > > - Piyush
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
> --
> - Piyush

--
Alex Levenson
@THISWILLWORK
