Good suggestion, Ryan. Added dev@iceberg now. Dev: Please see the early vectorized Iceberg performance results a couple of emails down. This is WIP.
thanks,
Anjali.

On Thu, Aug 8, 2019 at 10:39 AM Ryan Blue <rb...@netflix.com> wrote:

> Hi everyone,
>
> Is it possible to copy the Iceberg dev list when sending these emails?
> There are other people in the community who are interested, like Palantir.
> If there isn't anything sensitive, then let's try to be more inclusive.
> Thanks!
>
> rb
>
> On Wed, Aug 7, 2019 at 10:34 PM Anjali Norwood <anorw...@netflix.com> wrote:
>
>> Hi Gautam, Padma,
>> We wanted to update you before Gautam takes off for vacation.
>>
>> Samarth and I profiled the code and found the following. Profiling
>> IcebergSourceFlatParquetDataReadBenchmark (10 files, 10M rows, a single
>> long column) with VisualVM shows two places where CPU time can be
>> optimized:
>> 1) Iterator abstractions (triple iterators, page iterators, etc.) take up
>> quite a bit of time. Not using these iterators, or making them batched
>> iterators and moving the reading of the data closer to the file, should
>> help ameliorate this problem.
>> 2) The current code goes back and forth between definition-level reads and
>> value reads through the levels of iterators, and quite a bit of CPU time
>> is spent there. Reading a batch of primitive values at once after
>> consulting the definition level should improve performance.
>>
>> So we prototyped code that walks the definition levels and reads the
>> corresponding values in batches (read values until we hit a null, then
>> read nulls until we hit values, and so on) and made the iterators batched
>> iterators. Here are the results:
>>
>> Benchmark                                                               Mode  Cnt   Score   Error  Units
>> IcebergSourceFlatParquetDataReadBenchmark.readFileSourceNonVectorized    ss     5  10.247 ± 0.202   s/op
>> IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized       ss     5   3.747 ± 0.206   s/op
>> IcebergSourceFlatParquetDataReadBenchmark.readIceberg                    ss     5  11.286 ± 0.457   s/op
>> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized100k      ss     5   6.088 ± 0.324   s/op
>> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized10k       ss     5   5.875 ± 0.378   s/op
>> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized1k        ss     5   6.029 ± 0.387   s/op
>> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized5k        ss     5   6.106 ± 0.497   s/op
>>
>> Moreover, as I mentioned to Gautam on chat, we earlier prototyped reading
>> the string column as a byte array without decoding it into UTF-8 (the
>> changes above were not in place at the time) and saw a significant
>> improvement there as well (21.18 s before vs. 13.031 s with the change).
>> Used together with the batched iterators, these numbers should get better.
>>
>> Note that we haven't tightened or profiled the new code yet (we will start
>> on that next); we just wanted to share some early positive results.
>>
>> regards,
>> Anjali.
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
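
For readers following the thread, the batched definition-level walk described above can be pictured roughly as in the sketch below. This is not the prototype code: the arrays stand in for an already-decoded Parquet page, and all class, method, and parameter names here are hypothetical. The point is only to show the technique of copying whole runs of non-null values (or runs of nulls) per step, instead of consulting the definition level once per value through several iterator layers.

public class BatchedDefLevelReadSketch {

  /**
   * Fills {@code values}/{@code isNull} for {@code batchSize} rows.
   * {@code defLevels[i] == maxDefLevel} means row i has a value; anything lower means null.
   * Returns the number of non-null values consumed from {@code packedValues}.
   */
  static int readBatch(int[] defLevels, int offset, int batchSize, int maxDefLevel,
                       long[] packedValues, int valueOffset,
                       long[] values, boolean[] isNull) {
    int row = 0;
    int valuesRead = 0;
    while (row < batchSize) {
      boolean nonNull = defLevels[offset + row] == maxDefLevel;
      // Find the length of the current run of non-null (or null) rows.
      int runEnd = row + 1;
      while (runEnd < batchSize
          && (defLevels[offset + runEnd] == maxDefLevel) == nonNull) {
        runEnd++;
      }
      int runLen = runEnd - row;
      if (nonNull) {
        // Copy the whole run of values in one step instead of one value per iterator hop.
        System.arraycopy(packedValues, valueOffset + valuesRead, values, row, runLen);
        java.util.Arrays.fill(isNull, row, runEnd, false);
        valuesRead += runLen;
      } else {
        java.util.Arrays.fill(isNull, row, runEnd, true);
      }
      row = runEnd;
    }
    return valuesRead;
  }

  public static void main(String[] args) {
    int[] defLevels = {1, 1, 0, 0, 1, 1, 1, 0};   // 1 = value present, 0 = null
    long[] packed = {10, 20, 30, 40, 50};          // only non-null values are stored
    long[] out = new long[defLevels.length];
    boolean[] nulls = new boolean[defLevels.length];
    int read = readBatch(defLevels, 0, defLevels.length, 1, packed, 0, out, nulls);
    System.out.println("non-null values consumed = " + read); // prints 5
  }
}

The string-column experiment mentioned in the thread is in the same spirit: keep the raw Parquet binary bytes as-is and defer UTF-8 decoding, rather than materializing a decoded string per value during the scan.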