Thanks Anjali and Samarth, 
   These look good! Great progress.  Can you push your changes to the 
vectorized-read branch please?

Sent from my iPhone

> On Aug 8, 2019, at 11:56 AM, Anjali Norwood <anorw...@netflix.com> wrote:
> 
> Good suggestion Ryan. Added dev@iceberg now.
> 
> Dev: Please see the early vectorized Iceberg read performance results a 
> couple of emails down. This is a WIP.
> 
> thanks, 
> Anjali.
> 
>> On Thu, Aug 8, 2019 at 10:39 AM Ryan Blue <rb...@netflix.com> wrote:
>> Hi everyone,
>> 
>> Is it possible to copy the Iceberg dev list when sending these emails? There 
>> are other people in the community that are interested, like Palantir. If 
>> there isn't anything sensitive then let's try to be more inclusive. Thanks!
>> 
>> rb
>> 
>>> On Wed, Aug 7, 2019 at 10:34 PM Anjali Norwood <anorw...@netflix.com> wrote:
>>> Hi Gautam, Padma,
>>> We wanted to update you before Gautam takes off for vacation. 
>>> 
>>> Samarth and I profiled the code and found the following:
>>> Profiling the IcebergSourceFlatParquetDataReadBenchmark (10 files, 10M 
>>> rows, a single long column) using visualVM shows two places where CPU time 
>>> can be optimized:
>>> 1) Iterator abstractions (triple iterators, page iterators, etc.) take up 
>>> quite a bit of time. Not using these iterators, or making them 'batched' 
>>> iterators and moving the data reads closer to the file, should help 
>>> ameliorate this problem.
>>> 2) The current code goes back and forth between definition-level reads and 
>>> value reads through the layers of iterators, and quite a bit of CPU time 
>>> is spent there. Reading a batch of primitive values at once after 
>>> consulting the definition levels should improve performance.
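The "batched iterator" idea can be sketched roughly as below. This is a hypothetical illustration, not the actual Iceberg iterator code: instead of one virtual `next()` call per value, each call fills a caller-supplied array, amortizing per-element iterator overhead.

```java
// Hypothetical sketch of a batched iterator (names are illustrative,
// not Iceberg's real classes).
interface BatchedLongIterator {
    // Fills 'batch' with up to batch.length values; returns how many were read.
    int nextBatch(long[] batch);
}

public class BatchedIteratorDemo {
    // Wraps a plain array as a batched iterator.
    static BatchedLongIterator fromArray(long[] src) {
        return new BatchedLongIterator() {
            int pos = 0;

            public int nextBatch(long[] batch) {
                int n = Math.min(batch.length, src.length - pos);
                // One bulk copy instead of n virtual-dispatch next() calls.
                System.arraycopy(src, pos, batch, 0, n);
                pos += n;
                return n;
            }
        };
    }

    public static void main(String[] args) {
        BatchedLongIterator it = fromArray(new long[] {1, 2, 3, 4, 5});
        long[] buf = new long[2];
        long sum = 0;
        int n;
        while ((n = it.nextBatch(buf)) > 0) {
            for (int i = 0; i < n; i++) sum += buf[i];
        }
        System.out.println(sum); // prints 15
    }
}
```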
>>> 
>>> So, we prototyped code that walks the definition levels and reads the 
>>> corresponding values in batches (read values until we hit a null, then 
>>> read nulls until we hit values, and so on) and made the iterators batched 
>>> iterators. Here are the results:
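The run-based reading over definition levels might look something like the sketch below. This is an assumption-laden illustration (flat schema, max definition level marks non-null, values stored densely with nulls stripped), not the actual prototype code:

```java
import java.util.Arrays;

// Hypothetical sketch: scan definition levels and copy a run of non-null
// values at once, then a run of nulls, instead of branching per value.
public class BatchedDefLevelRead {
    // defLevels[i] == maxDefLevel means row i is non-null; encodedValues
    // holds only the non-null values, in order.
    static Long[] readBatch(int[] defLevels, long[] encodedValues, int maxDefLevel) {
        Long[] out = new Long[defLevels.length];
        int valueIdx = 0; // position in the dense value stream
        int i = 0;
        while (i < defLevels.length) {
            if (defLevels[i] == maxDefLevel) {
                // Run of non-null values: copy in one tight loop.
                int start = i;
                while (i < defLevels.length && defLevels[i] == maxDefLevel) i++;
                for (int j = start; j < i; j++) {
                    out[j] = encodedValues[valueIdx++];
                }
            } else {
                // Run of nulls: fill in one tight loop.
                while (i < defLevels.length && defLevels[i] != maxDefLevel) {
                    out[i++] = null;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] def = {1, 1, 0, 0, 1};
        long[] vals = {10L, 20L, 30L};
        System.out.println(Arrays.toString(readBatch(def, vals, 1)));
        // prints [10, 20, null, null, 30]
    }
}
```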
>>> 
>>> Benchmark                                                               Mode  Cnt   Score   Error  Units
>>> IcebergSourceFlatParquetDataReadBenchmark.readFileSourceNonVectorized     ss    5  10.247 ± 0.202   s/op
>>> IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized        ss    5   3.747 ± 0.206   s/op
>>> IcebergSourceFlatParquetDataReadBenchmark.readIceberg                     ss    5  11.286 ± 0.457   s/op
>>> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized100k       ss    5   6.088 ± 0.324   s/op
>>> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized10k        ss    5   5.875 ± 0.378   s/op
>>> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized1k         ss    5   6.029 ± 0.387   s/op
>>> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized5k         ss    5   6.106 ± 0.497   s/op
>>> 
>>> Moreover, as I mentioned to Gautam on chat, we prototyped reading the 
>>> string column as a byte array without decoding it into UTF-8 (the above 
>>> changes were not made at the time) and saw a significant improvement 
>>> there (21.18 secs before vs. 13.031 secs with the change). When combined 
>>> with batched iterators, these numbers should get better.
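The raw-bytes idea can be illustrated as below. This is a simplified sketch with hypothetical names, not the prototype itself: the point is that the consumer (e.g. an engine type that wraps UTF-8 bytes directly) can take the column's bytes as-is, skipping the per-value charset decode and `String` allocation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch contrasting the two read paths for a string column.
public class RawStringRead {
    // Decoding path: allocates and charset-decodes a String per value.
    static String decoded(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Raw path: hand the bytes straight through; decode lazily only if the
    // consumer actually needs a java.lang.String.
    static byte[] raw(byte[] raw) {
        return raw; // no charset decode, no extra allocation
    }

    public static void main(String[] args) {
        byte[] cell = "hello".getBytes(StandardCharsets.UTF_8);
        // Same logical value either way; the raw path skips the decode cost.
        System.out.println(decoded(cell).equals(new String(raw(cell), StandardCharsets.UTF_8)));
        // prints true
    }
}
```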
>>> 
>>> Note that we haven't tightened/profiled the new code yet (we will start on 
>>> that next). Just wanted to share some early positive results. 
>>> 
>>> regards, 
>>> Anjali.
>>> 
>> 
>> 
>> -- 
>> Ryan Blue
>> Software Engineer
>> Netflix
