Thanks Dongjoon, please see my responses below.

> Is the benchmark a part of the Apache ORC bench module?
> 
> Or, could you share the detail of the benchmark and how to reproduce it in
> the community?

The test we ran was on actual data and is not part of the bench module. It 
will take some time for me to add it as a bench test, which I will do 
(hopefully soon). In the meantime, I was hoping to hear whether others have 
seen anything similar and what stripe size values they are using.
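For context, here is the rough arithmetic behind the numbers from my original message below, as an illustrative sketch. The 4:1 compression ratio is my assumption, used only to relate ORC's in-memory limit to Parquet's serialized limit:

```python
# Illustrative arithmetic only, based on the numbers reported in this thread.
default_stripe_mb = 64     # current ORC default (limit on in-memory size)
tuned_stripe_mb = 512      # value that matched Parquet in our benchmark
parquet_rowgroup_mb = 128  # Parquet default (limit on serialized size)

# Assumption: roughly 4:1 compression, so a 512MB in-memory stripe
# serializes to about the same size as a 128MB Parquet row group.
assumed_compression_ratio = tuned_stripe_mb / parquet_rowgroup_mb
print(assumed_compression_ratio)  # 4.0

# With the 64MB default, each 512MB-equivalent chunk of data is split
# into this many stripes:
magnification = tuned_stripe_mb // default_stripe_mb
print(magnification)  # 8

# Consistent with what we observed: ~28K ORC stripes vs ~3.5K row groups.
observed_ratio = 28_000 / 3_500
print(observed_ratio)  # 8.0
```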

> BTW, it's one of the user configurations. We can change it at Apache ORC
> 2.0 or add simple documentation.

Most users just pick the defaults, and in this case the feedback we received 
was that Parquet performs much better than ORC. Based on the feedback we 
receive, we can decide whether we should increase the default value. 
Documentation also helps as guidance until we are able to change the default.
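On the documentation side, until the default changes the docs could point users at the existing knob. As I recall, the relevant ORC writer/table property is `orc.stripe.size` (value in bytes); please double-check the key name against the configuration docs before documenting. For example, to get back to 256MB:

```properties
# Hedged sketch: key name and byte-valued unit are from my memory of the
# ORC configuration docs; verify before relying on this.
orc.stripe.size=268435456
```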

Regards,
Pavan


> On Mar 16, 2023, at 10:29 AM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> 
> Thank you for raising this issue, Pavan.
> 
> Is the benchmark a part of the Apache ORC bench module?
> 
> Or, could you share the detail of the benchmark and how to reproduce it in
> the community?
> 
> BTW, it's one of the user configurations. We can change it at Apache ORC
> 2.0 or add simple documentation.
> 
> Bests.
> Dongjoon.
> 
> 
> On Thu, Mar 16, 2023 at 10:22 AM Pavan Lanka <pla...@apple.com.invalid>
> wrote:
> 
>> Hi,
>> 
>> I wanted to call out one observation we have seen when performing some
>> benchmarks on ORC.
>> I remember there was a time when the default stripe size was 256MB; now
>> the default is 64MB.
>> 
>> We see a big penalty from staying with the default stripe size of 64MB,
>> especially when compared with Parquet's default row group size of 128MB.
>> The limit is also enforced differently: ORC limits the in-memory size of
>> the stripe, while Parquet limits the serialized size (except for the
>> currently active pages, which are measured in memory). This magnifies the
>> difference further.
>> 
>> For the same data, written with default configurations, ORC generates 28K
>> stripes versus Parquet's 3.5K row groups. This is also reflected in the
>> performance difference: the Parquet operation (a filtered read of the
>> data) took approximately half the CPU seconds compared to ORC.
>> 
>> Once we adjust the ORC stripe size to 512MB, which roughly matches the
>> 128MB row group size in Parquet (a roughly equivalent serialized size for
>> the atomic unit in both formats), we see the following:
>> * the number of stripes and the number of row groups are roughly equivalent
>> * the CPU seconds used for the operation are roughly the same
>> 
>> With that background, I wanted to ask about the rationale for having the
>> default stripe size at 64MB. I see from the commit history that this was
>> inherited from Hive.
>> It would be great to get some context around 64MB as the default, and to
>> discuss whether we would be better off with a higher value.
>> 
>> Please share your thoughts.
>> 
>> Thanks,
>> Pavan
