Hi Piyush,

I don't think there was a performance eval done before we merged the ByteBuffer code. Given how long it took to get merged, I think it's reasonable to land it and then assess and fix performance before we release. It sounds like you've done that preliminary assessment, so thanks! It's a bummer that it isn't faster, but we do want to keep these changes in the long run to be able to use the zero-copy API and enable off-heap buffers.
I suspect that the problem is buffer management. There are a few places where we end up making copies if things don't line up properly. For example, when the getBytes methods are called on Binary objects, if the beginning isn't at the start of a single byte[], then we copy. The good news is that we have the Binary interface, so we should be able to easily instrument where copies are happening and log them to track this down. I think that's a promising start for finding the cause of this slowdown. A rough sketch of what that instrumentation could look like is below the quoted mail.

rb

On Mon, Jun 13, 2016 at 1:47 PM, Piyush Narang <[email protected]> wrote:

> Hi folks,
>
> We (Twitter) are currently on a much older version of Parquet, SHA b1ea059
> <https://github.com/apache/parquet-mr/commit/b1ea059a66c7d6d6bb4cb53d2005a9b7bb599ada>.
> We wanted to update our internal version to the current Parquet head, and
> we've started to see some performance issues. Head includes the changes to
> use ByteBuffer in the read and write paths
> <https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8>
> and the fix for those changes not working on Hadoop / S3
> <https://github.com/apache/parquet-mr/pull/346>.
>
> With these changes, our Hadoop Parquet read jobs perform 4-6% worse
> (looking at the MB_Millis metric) and our Hadoop Parquet write jobs
> perform 10-20% worse (MB_Millis). When we tested a branch with these
> changes reverted, reads performed pretty much at par and writes were only
> slightly slower (the revert improves them by 3-10%).
>
> I'm hoping someone with context on these changes can chime in:
> 1) Was any performance testing / benchmarking done on these changes? Has
> anyone else encountered these performance issues?
> 2) Are there any specific flags we need to set for these changes? (I
> didn't see any on a quick look through the code, but I might have missed
> something.)
> 3) Was there a specific feature these changes were supposed to enable? Is
> backing them out of head an option that folks would be open to?
>
> I'm planning to dig into what specifically in these changes is causing the
> slowdown, and I'll update this thread with my findings. We might also end
> up forking Parquet internally to unblock ourselves, but we'd like to stay
> on master if we can.
>
> Thanks,
> --
> - Piyush

--
Ryan Blue
Software Engineer
Netflix
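
To make the instrumentation idea above concrete, here is a minimal, self-contained sketch of the kind of copy counting Ryan describes. It deliberately avoids parquet-mr internals: the CopyTracker class name and the alignment check are illustrative assumptions, not the actual Binary implementation, which may differ.

import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, not part of parquet-mr: counts how often a
// requested byte range can be served zero-copy versus must be copied.
public class CopyTracker {
    private static final AtomicLong zeroCopy = new AtomicLong();
    private static final AtomicLong copied = new AtomicLong();

    /**
     * Returns the bytes for [position, position + length) of buf.
     * If the buffer is heap-backed and the range is exactly its single
     * backing array, we can hand out the array directly (zero-copy);
     * otherwise (direct buffers, sliced views, misaligned starts) we
     * must allocate and copy -- the case suspected of causing the
     * slowdown.
     */
    public static byte[] getBytes(ByteBuffer buf, int position, int length) {
        if (buf.hasArray() && buf.arrayOffset() == 0 && position == 0
                && length == buf.array().length) {
            zeroCopy.incrementAndGet();
            return buf.array(); // whole backing array, no copy needed
        }
        copied.incrementAndGet();
        byte[] out = new byte[length];
        ByteBuffer dup = buf.duplicate(); // don't disturb buf's position
        dup.position(position);
        dup.get(out, 0, length);
        return out;
    }

    public static void report() {
        System.err.printf("zero-copy: %d, copied: %d%n",
                zeroCopy.get(), copied.get());
    }
}

Wiring counters like these into the Binary implementations (or into a wrapper at the column read/write path) should show fairly quickly whether the misaligned-copy path dominates the regression.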
