+1

On Wed, Jun 22, 2022 at 9:34 AM Piotr Findeisen <pi...@starburstdata.com> wrote:
> Hi Ajantha,
>
> Thank you for spending the time to look into this.
>
> re a: I think I remember Ryan saying Parquet isn't good for bigger pieces
> of data, and some stats sketches or indices can be bigger than others.
> Also, the Parquet row logical / columnar storage format doesn't give as
> much benefit for what's closer to key-value storage.
>
> re b:
> this is still tbd --
> eg https://github.com/apache/iceberg/pull/4945
> https://github.com/apache/iceberg/pull/5021
>
> re c, e:
> for partition-level, it's not decided yet how it will be handled
>
> re d:
> yes, ANALYZE can be a separate operation, see
> https://github.com/trinodb/trino/pull/12317 for POC
>
> Best regards,
> PF
>
> On Tue, Jun 21, 2022 at 8:52 AM Ajantha Bhat <ajanthab...@gmail.com>
> wrote:
>
>> Thank you Piotr for all of the work you’ve put into this.
>>
>> I just checked the spec. I have a few newbie questions.
>>
>> a. Instead of using an existing columnar format like Parquet (one file
>> for one type of stats) to store indexes, is there any reason why we
>> developed our own format, and were any benchmarks taken of Puffin vs
>> other formats?
>>
>> b. How these Puffin files are linked to Iceberg's metadata files is still
>> a missing link for me. As the Puffin spec says, these stats are table level
>> (updated per snapshot). So, do we need an Iceberg spec change to store the
>> file names of these Puffin files so that remove_orphan_files will not
>> clean them up accidentally? (also needed for expire_snapshots)
>>
>> c. NDVs are column level stats. So, I expect the latest Puffin file of
>> that snapshot will have one row of stats representing stats for each
>> column. But if we are to implement secondary index or table level partition
>> stats, there can be many rows (millions) in Puffin based on the dataset.
>> So, for every commit, do we need to read the previous snapshot's Puffin
>> file and write back a new file with updated stats? (the file might be very
>> large when data grows?).
>> I think it will affect the commit time. Any
>> thoughts on this?
>>
>> d. Slightly related to the above point, do we plan to asynchronously
>> support collecting the stats like "ANALYZE table" and modify the table
>> metadata with the stats file names? (might need an Iceberg commit to write
>> new table metadata)
>>
>> e. Even though table level partition stats are available from the _partition
>> metadata table (along with filter push down support), computing the metadata
>> table per query will be expensive.
>> Hence, we are looking forward to storing them in the Puffin format. But
>> I'm not sure about storing it as a single file with millions of rows.
>> I would like to collaborate and discuss more on this.
>>
>> Thanks,
>> Ajantha
>>
>> On Mon, Jun 13, 2022 at 2:45 AM Miao Wang <miw...@adobe.com.invalid>
>> wrote:
>>
>>> +1 on the format! It looks great!
>>>
>>> Thanks for materializing the initial design idea.
>>>
>>> Miao
>>>
>>> *From: *Kyle Bendickson <kjbendick...@gmail.com>
>>> *Date: *Sunday, June 12, 2022 at 1:55 PM
>>> *To: *dev@iceberg.apache.org <dev@iceberg.apache.org>
>>> *Subject: *Re: [VOTE] Adopt Puffin format as a file format for
>>> statistics and indexes
>>>
>>> +1 [non-binding]
>>>
>>> Thank you Piotr for all of the work you’ve put into this.
>>>
>>> This should greatly benefit not only Iceberg on Trino, but hopefully can
>>> be used in many novel ways due to its well thought out generic design and
>>> incorporation of the ability to extend with new sketches.
>>>
>>> Looking forward to the improvements this will bring.
>>>
>>> - Kyle
>>>
>>> On Fri, Jun 10, 2022 at 1:47 PM Alexander Jo <alex...@starburstdata.com>
>>> wrote:
>>>
>>> +1, let's do it!
>>>
>>> On Fri, Jun 10, 2022 at 2:47 PM John Zhuge <jzh...@apache.org> wrote:
>>>
>>> +1 Looking forward to the features it enables.
>>>
>>> On Fri, Jun 10, 2022 at 10:11 AM Yufei Gu <flyrain...@gmail.com> wrote:
>>>
>>> +1. Looking forward to the partition stats.
>>>
>>> Best,
>>>
>>> Yufei
>>>
>>> On Thu, Jun 9, 2022 at 6:32 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>
>>> +1 as well. Excited about the progress here.
>>>
>>> -Dan
>>>
>>> On Thu, Jun 9, 2022, 6:25 PM Junjie Chen <chenjunjied...@gmail.com>
>>> wrote:
>>>
>>> +1, really nice! Indexes are coming!
>>>
>>> On Fri, Jun 10, 2022 at 8:04 AM Szehon Ho <szehon.apa...@gmail.com>
>>> wrote:
>>>
>>> +1, it's an exciting step for Iceberg, look forward to all the new
>>> statistics and secondary indices it will allow.
>>>
>>> Had a few questions of what the reference to Puffin file(s) will be in
>>> the Iceberg spec, but it's orthogonal to Puffin file format itself.
>>>
>>> Thanks,
>>>
>>> Szehon
>>>
>>> On Thu, Jun 9, 2022 at 3:32 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>> +1 from me!
>>>
>>> There may also be people that haven't followed the design discussions
>>> and we can start a DISCUSS thread if needed. But if everyone is comfortable
>>> with the design and implementation, I think it's ready for a vote as well.
>>>
>>> Huge thanks to Piotr for getting this ready! I think the format is going
>>> to be really useful for both stats and indexes in Iceberg.
>>>
>>> On Thu, Jun 9, 2022 at 3:35 AM Piotr Findeisen <pi...@starburstdata.com>
>>> wrote:
>>>
>>> Hi Everyone,
>>>
>>> I propose that we adopt Puffin file format as a file format for
>>> statistics and indexes in Iceberg tables.
>>>
>>> Puffin file format specification:
>>>
>>> https://github.com/apache/iceberg/blob/master/format/puffin-spec.md
>>>
>>> (previous discussions: https://github.com/apache/iceberg/pull/4944
>>> , https://github.com/apache/iceberg-docs/pull/69
>>> )
>>>
>>> Intended use:
>>>
>>> * statistics in Iceberg tables (see
>>> https://github.com/apache/iceberg/pull/4945
>>> and associated proposed implementation
>>> https://github.com/apache/iceberg/pull/4741
>>> )
>>>
>>> * in the future: storage for secondary indexes
>>>
>>> Puffin file reader and writer implementation:
>>>
>>> https://github.com/apache/iceberg/pull/4537
>>>
>>> Thanks,
>>>
>>> PF
>>>
>>> --
>>>
>>> Ryan Blue
>>>
>>> Tabular
>>>
>>> --
>>>
>>> Best Regards
>>>
>>> --
>>>
>>> John Zhuge
>>>