Good suggestion Ryan. Added dev@iceberg now.

Dev: Please see the early vectorized Iceberg performance results a couple of
emails down. This is WIP.

thanks,
Anjali.

On Thu, Aug 8, 2019 at 10:39 AM Ryan Blue <rb...@netflix.com> wrote:

> Hi everyone,
>
> Is it possible to copy the Iceberg dev list when sending these emails?
> There are other people in the community that are interested, like Palantir.
> If there isn't anything sensitive then let's try to be more inclusive.
> Thanks!
>
> rb
>
> On Wed, Aug 7, 2019 at 10:34 PM Anjali Norwood <anorw...@netflix.com>
> wrote:
>
>> Hi Gautam, Padma,
>> We wanted to update you before Gautam takes off for vacation.
>>
>> Samarth and I profiled the code and found the following:
>> Profiling the IcebergSourceFlatParquetDataReadBenchmark (10 files, 10M
>> rows, a single long column) using VisualVM shows two places where CPU time
>> can be optimized:
>> 1) Iterator abstractions (triple iterators, page iterators etc) seem to
>> take up quite a bit of time. Not using these iterators or making them
>> 'batched' iterators and moving the reading of the data close to the file
>> should help ameliorate this problem.
>> 2) Current code goes back and forth between definition levels and value
>> reads through the levels of iterators. Quite a bit of CPU time is spent
>> here. Reading a batch of primitive values at once after consulting the
>> definition level should help improve performance.
>>
>> So, we prototyped the code to walk over the definition levels and read
>> corresponding values in batches (read values till we hit a null, then read
>> nulls till we hit values and so on) and made the iterators batched
>> iterators. Here are the results:
>>
>> Benchmark                                                               Mode  Cnt   Score   Error  Units
>> IcebergSourceFlatParquetDataReadBenchmark.readFileSourceNonVectorized     ss    5  10.247 ± 0.202   s/op
>> IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized        ss    5   3.747 ± 0.206   s/op
>> IcebergSourceFlatParquetDataReadBenchmark.readIceberg                     ss    5  11.286 ± 0.457   s/op
>> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized100k       ss    5   6.088 ± 0.324   s/op
>> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized10k        ss    5   5.875 ± 0.378   s/op
>> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized1k         ss    5   6.029 ± 0.387   s/op
>> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized5k         ss    5   6.106 ± 0.497   s/op
>>
>>
>> Moreover, as I mentioned to Gautam on chat, we prototyped reading the
>> string column as a byte array without decoding the UTF-8 bytes (the above
>> changes were not made at the time), and we saw a significant performance
>> improvement there (21.18 s before vs. 13.031 s with the change). When used
>> along with batched iterators, these numbers should get better.
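
[Editor's note: a hypothetical sketch of the string idea above, not the
prototype's actual code. Parquet stores strings as UTF-8 bytes, so a reader
can hand back the raw bytes and defer `String` construction; equality and
similar operations can work on the bytes directly (cf. Spark's UTF8String).]

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative wrapper: keep string column values as the raw UTF-8 bytes
// read from the file, and decode to java.lang.String only on demand.
public class RawUtf8Value {
    private final byte[] bytes;

    public RawUtf8Value(byte[] bytes) {
        this.bytes = bytes;
    }

    // Equality on the raw bytes; no decode needed.
    public boolean bytesEqual(RawUtf8Value other) {
        return Arrays.equals(bytes, other.bytes);
    }

    // Decode only when a java.lang.String is actually required.
    public String decode() {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        RawUtf8Value a = new RawUtf8Value("iceberg".getBytes(StandardCharsets.UTF_8));
        RawUtf8Value b = new RawUtf8Value("iceberg".getBytes(StandardCharsets.UTF_8));
        System.out.println(a.bytesEqual(b)); // prints true
    }
}
```

Skipping the eager decode avoids a per-value allocation and charset
conversion, which is consistent with the improvement reported above.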
>>
>> Note that we haven't tightened/profiled the new code yet (we will start
>> on that next). Just wanted to share some early positive results.
>>
>> regards,
>> Anjali.
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
