Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

Russell Spitzer Wed, 22 Jun 2022 08:25:25 -0700

+1

On Wed, Jun 22, 2022 at 9:34 AM Piotr Findeisen <[email protected]>
wrote:


> Hi Ajantha,
>
> Thank you for spending the time to look into this.
>
> re a: I think I remember Ryan saying Parquet isn't good for bigger pieces
> of data, and some stats sketches or indices can be bigger than others.
> Also, the Parquet row logical / columnar storage format doesn't give as
> much benefit for what's closer to key-value storage.
>
> re b:
> this is still tbd --
> eg https://github.com/apache/iceberg/pull/4945
> https://github.com/apache/iceberg/pull/5021
>
> re c, e:
> for partition-level, it's not decided yet how it will be handled
>
> re d:
> yes, ANALYZE can be a separate operation, see
> https://github.com/trinodb/trino/pull/12317 for a POC
>
> Best regards,
> PF
>
>
>
> On Tue, Jun 21, 2022 at 8:52 AM Ajantha Bhat <[email protected]>
> wrote:
>
>> Thank you Piotr for all of the work you’ve put into this.
>>
>> I just checked the spec. I have a few newbie questions.
>>
>> a. Instead of using an existing columnar format like Parquet (one file
>> for one type of stats) to store indexes, is there a reason why we have
>> developed our own format? Have any benchmarks been taken comparing Puffin
>> with other formats?
>>
>> b. How these Puffin files are linked to Iceberg's metadata files is still
>> a missing link for me. As the Puffin spec says, these stats are table level
>> (updated per snapshot). So, do we need an Iceberg spec change to store the
>> file names of these Puffin files so that remove_orphan_files will not
>> clean them up accidentally? (also needed for expire_snapshots)
>>
>> c. NDVs are column-level stats. So, I expect the latest Puffin file of
>> that snapshot will have one row of stats for each column. But if we are
>> to implement a secondary index or table-level partition stats, there can
>> be many rows (millions) in Puffin, depending on the dataset.
>> So, for every commit, do we need to read the previous snapshot's Puffin
>> file and write back a new file with updated stats? (The file might become
>> very large as the data grows.) I think it will affect the commit time. Any
>> thoughts on this?
>>
>> d. Slightly related to the above point, do we plan to support collecting
>> the stats asynchronously, like "ANALYZE table", and then modifying the table
>> metadata with the stats file names? (might need an Iceberg commit to write
>> new table metadata)
>>
>> e. Even though table-level partition stats are available from the _partition
>> metadata table (along with filter push down support), computing the metadata
>> table per query will be expensive.
>> Hence, we are looking forward to storing them in the Puffin format. But
>> I'm not sure about storing them as a single file with millions of rows.
>> I would like to collaborate and discuss more on this.
>>
>> Thanks,
>> Ajantha
>>
>> On Mon, Jun 13, 2022 at 2:45 AM Miao Wang <[email protected]>
>> wrote:
>>
>>> +1 on the format! It looks great!
>>>
>>>
>>>
>>> Thanks for materializing the initial design idea.
>>>
>>>
>>>
>>> Miao
>>>
>>> *From: *Kyle Bendickson <[email protected]>
>>> *Date: *Sunday, June 12, 2022 at 1:55 PM
>>> *To: *[email protected] <[email protected]>
>>> *Subject: *Re: [VOTE] Adopt Puffin format as a file format for
>>> statistics and indexes
>>>
>>>
>>>
>>>
>>> +1 [non-binding]
>>>
>>>
>>>
>>> Thank you Piotr for all of the work you’ve put into this.
>>>
>>>
>>>
>>> This should greatly benefit not only Iceberg on Trino, but hopefully can
>>> be used in many novel ways due to its well thought out generic design and
>>> incorporation of the ability to extend with new sketches.
>>>
>>>
>>>
>>> Looking forward to the improvements this will bring.
>>>
>>>
>>>
>>> - Kyle
>>>
>>>
>>>
>>> On Fri, Jun 10, 2022 at 1:47 PM Alexander Jo <[email protected]>
>>> wrote:
>>>
>>> +1, let's do it!
>>>
>>>
>>>
>>> On Fri, Jun 10, 2022 at 2:47 PM John Zhuge <[email protected]> wrote:
>>>
>>> +1  Looking forward to the features it enables.
>>>
>>>
>>>
>>> On Fri, Jun 10, 2022 at 10:11 AM Yufei Gu <[email protected]> wrote:
>>>
>>> +1. Looking forward to the partition stats.
>>>
>>> Best,
>>>
>>>
>>>
>>> Yufei
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Jun 9, 2022 at 6:32 PM Daniel Weeks <[email protected]> wrote:
>>>
>>> +1 as well.  Excited about the progress here.
>>>
>>>
>>>
>>> -Dan
>>>
>>> On Thu, Jun 9, 2022, 6:25 PM Junjie Chen <[email protected]>
>>> wrote:
>>>
>>> +1, really nice! Indexes are coming!
>>>
>>>
>>>
>>> On Fri, Jun 10, 2022 at 8:04 AM Szehon Ho <[email protected]>
>>> wrote:
>>>
>>> +1, it's an exciting step for Iceberg, look forward to all the new
>>> statistics and secondary indices it will allow.
>>>
>>>
>>>
>>> Had a few questions about what the reference to Puffin file(s) will be in
>>> the Iceberg spec, but that's orthogonal to the Puffin file format itself.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Szehon
>>>
>>>
>>>
>>> On Thu, Jun 9, 2022 at 3:32 PM Ryan Blue <[email protected]> wrote:
>>>
>>> +1 from me!
>>>
>>>
>>>
>>> There may also be people that haven't followed the design discussions
>>> and we can start a DISCUSS thread if needed. But if everyone is comfortable
>>> with the design and implementation, I think it's ready for a vote as well.
>>>
>>>
>>>
>>> Huge thanks to Piotr for getting this ready! I think the format is going
>>> to be really useful for both stats and indexes in Iceberg.
>>>
>>>
>>>
>>> On Thu, Jun 9, 2022 at 3:35 AM Piotr Findeisen <[email protected]>
>>> wrote:
>>>
>>> Hi Everyone,
>>>
>>> I propose that we adopt Puffin file format as a file format for
>>> statistics and indexes in Iceberg tables.
>>>
>>>
>>>
>>> Puffin file format specification:
>>>
>>> https://github.com/apache/iceberg/blob/master/format/puffin-spec.md
>>>
>>> (previous discussions: https://github.com/apache/iceberg/pull/4944,
>>> https://github.com/apache/iceberg-docs/pull/69)
>>>
>>>
>>>
>>> Intended use:
>>>
>>> * statistics in Iceberg tables (see
>>> https://github.com/apache/iceberg/pull/4945
>>> and associated proposed implementation
>>> https://github.com/apache/iceberg/pull/4741)
>>>
>>> * in the future: storage for secondary indexes
>>>
>>>
>>>
>>> Puffin file reader and writer implementation:
>>>
>>> https://github.com/apache/iceberg/pull/4537
>>>
>>>
>>>
>>> Thanks,
>>>
>>> PF
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Ryan Blue
>>>
>>> Tabular
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Best Regards
>>>
>>>
>>>
>>>
>>> --
>>>
>>> John Zhuge
>>>
>>>
