Re: Continuing the Secondary Index Discussion

2022-01-25 Thread Jack Ye
Thanks for the fast responses!

Based on the conversations above, it sounds like we have the following
consensus:

1. asynchronous index creation is preferred, although synchronous index
creation remains possible.
2. a mechanism for tracking file changes is needed. Unfortunately, the
sequence number cannot be used, because compaction can rewrite files
with a lower sequence number. Another monotonically increasing watermark
for files has to be introduced for index change detection and
invalidation.
3. index creation and maintenance procedures should be pluggable by
different engines. This should not be an issue, because Iceberg has been
designing action interfaces for various table maintenance procedures, so
what Zaicheng describes should be the natural direction once the work is
started.
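To make point 2 concrete, here is a rough Python sketch (all names are hypothetical, not Iceberg API) of how a monotonically increasing file watermark could drive invalidation, in a situation where compaction assigns rewritten files a lower sequence number but still advances the watermark:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataFile:
    path: str
    sequence_number: int   # may decrease after a compaction rewrite
    change_watermark: int  # hypothetical monotonically increasing marker

def files_not_covered(files, index_watermark):
    """Return files written or rewritten after the index was last built.

    The index covers any file whose change watermark is at or below the
    watermark recorded when the index was built; newer files must be
    scanned without the index.
    """
    return [f for f in files if f.change_watermark > index_watermark]

files = [
    DataFile("a.parquet", sequence_number=5, change_watermark=5),
    # compacted rewrite: lower sequence number, but a higher watermark,
    # so the change is still detected
    DataFile("b.parquet", sequence_number=3, change_watermark=7),
]
stale = files_not_covered(files, index_watermark=6)
```

Note that filtering on `sequence_number` alone would miss `b.parquet`, which is exactly the problem with compaction described above.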

Regarding index level, I also think the partition-level index is more
important, but it seems we have to build the file level first as the
foundation. This leads to the index storage part. I am not talking about
using Parquet to store it; I am asking about what Miao is describing. I
don't think we have consensus yet on exactly where to store index
information. My memory is that there are a few options:
1. file-level index stored as a binary field in the manifest, partition-level
index stored as a binary field in the manifest list. This would only work
for small indexes such as bitmaps (or bloom filters, to a certain extent)
2. some sort of binary file to store the index data, with the index metadata
(e.g. index type) and a pointer to the binary index data file kept as in 1
(I think this is what Miao is describing)
3. some sort of index spec to store index metadata and data independently,
similar to what we are proposing today for views
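As a rough illustration of option 2 (a sketch only; all names are hypothetical, not Iceberg API), the inline metadata entry would carry only the index type plus a pointer to a separate binary index file, while tiny indexes could stay inline as in option 1:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class IndexRef:
    """Small metadata entry stored inline (e.g. in a manifest)."""
    column_id: int
    index_type: str                    # "bloom", "bitmap", ...
    index_file: Optional[str] = None   # pointer to binary index data (options 2/3)
    inline_bytes: Optional[bytes] = None  # small indexes only (option 1)

def resolve(ref: IndexRef) -> str:
    # Small bitmap/bloom payloads can live inline; anything larger is
    # fetched from the referenced index data file.
    if ref.inline_bytes is not None:
        return "inline"
    if ref.index_file is not None:
        return "external"
    raise ValueError("index reference has no payload")
```

The trade-off in the list above then reduces to which field is populated: option 1 uses only `inline_bytes`, options 2 and 3 use only `index_file`.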

Another aspect of index storage is the index file location in cases 2 and
3. The original doc proposes a specific file path structure, but this is
somewhat at odds with the Iceberg principle of not assuming file paths so
that any storage works. We also need more clarity on that topic.

Best,
Jack Ye


On Tue, Jan 25, 2022 at 7:02 PM Zaicheng Wang wrote:

Re: Continuing the Secondary Index Discussion

2022-01-25 Thread Zaicheng Wang
Thanks for starting the thread. This is Zaicheng from ByteDance.

Initially we are planning to add an index feature to our internal Trino,
and we feel Iceberg could be the best place for holding/building the index
data.
We are very interested in having and contributing to this feature. (Pretty
new to the community, but here are my 2 cents.)

Echoing what Miao mentioned in 4): I feel Iceberg could provide interfaces
for creating/updating/deleting an index, and each engine can decide how to
invoke these methods (in a distributed or single-threaded manner, async or
sync).
Take our use case as an example: we plan to add new DDL syntax, "create
index id_1 on table col_1 using bloom" / "update index id_1 on table col_1",
and our SQL engine will create distributed index creation/update operators.
Each operator will invoke the index-related methods provided by Iceberg.
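The engine-driven flow described above could be sketched as follows (a Python stand-in for what would presumably be a Java interface in Iceberg; every name here is hypothetical):

```python
from abc import ABC, abstractmethod

class IndexWriter(ABC):
    """Hypothetical per-split index interface an engine operator would invoke."""

    @abstractmethod
    def add_file(self, data_file_path: str) -> None: ...

    @abstractmethod
    def commit(self) -> bytes: ...

class BloomIndexWriter(IndexWriter):
    """Toy stand-in: records which files contributed to the index."""
    def __init__(self, column: str):
        self.column = column
        self.files = []

    def add_file(self, data_file_path):
        self.files.append(data_file_path)

    def commit(self):
        # a real implementation would serialize the bloom filter bits;
        # this sketch just records the contributing files
        return ",".join(self.files).encode("utf-8")

# each distributed operator invokes the same interface on its split of files
writer = BloomIndexWriter("col_1")
for path in ["part-0.parquet", "part-1.parquet"]:
    writer.add_file(path)
payload = writer.commit()
```

The point is that the engine owns parallelism and scheduling (distributed operators, sync or async), while Iceberg owns only the interface and the resulting payload.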

Storage): Does the index data have to be a file? I wonder whether we should
design the index data storage interface in such a way that people can plug
in different index storage (file storage, a centralized index storage
service) later on.
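That pluggability question could be captured by an abstract store, sketched here with hypothetical names; one implementation might write files, another might call a centralized index service:

```python
from abc import ABC, abstractmethod

class IndexStore(ABC):
    """Hypothetical pluggable storage for index payloads."""

    @abstractmethod
    def put(self, key: str, payload: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryIndexStore(IndexStore):
    """Stand-in for a file-based or service-backed implementation."""
    def __init__(self):
        self._data = {}

    def put(self, key, payload):
        self._data[key] = payload

    def get(self, key):
        return self._data[key]

# usage: the caller addresses indexes by a logical key, not a file path
store = InMemoryIndexStore()
store.put("part=1/bloom", b"\x01\x02")
```

Addressing indexes by logical key rather than file path would also sidestep the file-path-structure concern raised elsewhere in the thread.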

Thanks,
Zaicheng


On Wed, Jan 26, 2022 at 10:22, Miao Wang wrote:
Re: Continuing the Secondary Index Discussion

2022-01-25 Thread Miao Wang
Thanks Jack for resuming the discussion. Zaicheng from ByteDance created a 
Slack channel for the index work. I suggested that he add Anton and you to 
the channel.

I still remember some conclusions from previous discussions.

1). Index type support: We planned to support a skipping index first. Iceberg 
metadata exposes hints about whether the tracked data files have an index, 
which reduces index reading overhead. The index file can be applied when 
generating the scan task.

2). As Ryan mentioned, the sequence number will be used to indicate whether an 
index is valid. The sequence number can link data evolution with index 
evolution.

3). Storage: We planned to have a simple file format that includes the column 
name/ID, index type (string), index content length, and binary content. It is 
not necessary to use Parquet to store the index. The initial thought was one 
data file mapping to one index file. It can be merged to one partition mapping 
to one index file. As Ryan said, the file-level implementation could be a 
stepping stone for the partition-level implementation.
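A minimal sketch of that simple format (an illustration only, not a spec; the exact field widths and byte order here are assumptions): column ID, index type string, content length, then the binary content:

```python
import struct

def write_index_entry(column_id: int, index_type: str, content: bytes) -> bytes:
    """Serialize one index entry: column id | type | content length | content."""
    type_bytes = index_type.encode("utf-8")
    # layout: column_id (i32) | type length (i16) | type bytes
    #         | content length (i32) | content bytes, all big-endian
    return (struct.pack(">ih", column_id, len(type_bytes))
            + type_bytes
            + struct.pack(">i", len(content))
            + content)

def read_index_entry(buf: bytes):
    """Inverse of write_index_entry."""
    column_id, type_len = struct.unpack_from(">ih", buf, 0)
    offset = 6
    index_type = buf[offset:offset + type_len].decode("utf-8")
    offset += type_len
    (content_len,) = struct.unpack_from(">i", buf, offset)
    offset += 4
    content = buf[offset:offset + content_len]
    return column_id, index_type, content

entry = write_index_entry(3, "bloom", b"\xde\xad")
```

Because entries are length-prefixed, a single file could hold one entry per data file (1:1 mapping) or be concatenated into a per-partition file, matching the merge path described above.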

4). How to build the index: We want to keep the index reading and writing 
interface within Iceberg and leave the actual building logic engine-specific 
(i.e., we can use different compute to build the index without changing 
anything inside Iceberg).

Misc:
Huaxin implemented an index support API for DSv2 in the Spark 3.x code base.
Design doc: 
https://docs.google.com/document/d/1qnq1X08Zb4NjCm4Nl_XYjAofwUgXUB03WDLM61B3a_8/edit
The PR should have been merged.
An engineer from IBM did a partial PoC and provided a private doc. I will ask 
if he can make it public.

We can continue the discussion and break down the big tasks into tickets.

Thanks!

Miao
From: Ryan Blue
Date: Tuesday, January 25, 2022 at 5:08 PM
To: Iceberg Dev List
Subject: Re: Continuing the Secondary Index Discussion

Re: Continuing the Secondary Index Discussion

2022-01-25 Thread Ryan Blue
Thanks for raising this for discussion, Jack! It would be great to start
adding more indexes.

> Scope of native index support

The way I think about it, the biggest challenge here is how to know when
you can use an index. For example, if you have a partition index that is up
to date as of snapshot 13764091836784, but the current snapshot is
97613097151667, then you basically have no idea what files are covered or
not and can't use it. On the other hand, if you know that the index was up
to date as of sequence number 11 and you're reading sequence number 12,
then you just have to read any data file that was written at sequence
number 12.
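Ryan's sequence-number example can be sketched as a planning check (names are hypothetical, not Iceberg API): an index valid as of sequence number 11 still prunes everything except files written at later sequence numbers:

```python
def plan_scan(data_files, index_valid_as_of, read_sequence_number):
    """Split files into those the index covers and those it does not.

    data_files: iterable of (path, sequence_number) pairs.
    Files at or below the index's sequence number can be pruned with the
    index; newer files (up to the read point) must be read unconditionally.
    """
    indexed, unindexed = [], []
    for path, seq in data_files:
        if seq > read_sequence_number:
            continue  # not visible at this read point
        if seq <= index_valid_as_of:
            indexed.append(path)    # index applies: may be pruned
        else:
            unindexed.append(path)  # written after the index: always read
    return indexed, unindexed

files = [("f1", 10), ("f2", 11), ("f3", 12)]
indexed, unindexed = plan_scan(files, index_valid_as_of=11,
                               read_sequence_number=12)
```

With the snapshot-ID version of the same question, there is no such ordering to exploit, which is the asymmetry the paragraph above describes.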

The problem of where you can use an index makes me think that it is best to
maintain index metadata within Iceberg. An alternative is to try to always
keep the index up-to-date, but I don't think that's necessarily possible --
you'd have to support index updates in every writer that touches table
data. You would have to spend the time updating indexes at write time, but
there are competing priorities like making data available. So I think you
want asynchronous index updates and that leads to integration with the
table format.

> Index levels

I think that partition-level indexes are better for job planning (eliminate
whole partitions!) but file-level are still useful for skipping files at
the task level. I would probably focus on partition-level, but I'm not
strongly opinionated here. File-level is probably a stepping stone to
partition-level, given that we would be able to track index data in the
same format.

> Index storage

Do you mean putting indexes in Parquet, or using Parquet for indexes? I
think that bloom filters would probably exceed the amount of data we'd want
to put into a Parquet binary column, probably at the file level and almost
certainly at the partition level, since the size depends on the number of
distinct values and the primary use is for identifiers.
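Ryan's sizing point can be checked with the standard bloom filter formula m = -n ln(p) / (ln 2)^2; for identifier-like columns with many distinct values the filter quickly outgrows anything one would comfortably embed in a binary metadata column:

```python
import math

def bloom_filter_bytes(n_distinct: int, false_positive_rate: float) -> int:
    """Optimal bloom filter size in bytes for n distinct values."""
    bits = -n_distinct * math.log(false_positive_rate) / (math.log(2) ** 2)
    return math.ceil(bits / 8)

# 100M distinct identifiers at a 1% false-positive rate is on the order
# of 120 MB, far beyond a reasonable inline manifest field; a few
# thousand distinct values needs only ~1 KB.
partition_scale = bloom_filter_bytes(100_000_000, 0.01)
file_scale = bloom_filter_bytes(1_000, 0.01)
```

This is why small bitmaps might fit inline but bloom filters at partition level almost certainly need a separate index file.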

> Indexing process

Synchronous is nice, but as I said above, I think we have to support async
because it is too complicated to update every writer that touches a table
and you may not want to pay the price at write time.

> Index validation

I think this is pretty much what I talked about for question 1. I think
that we have a good plan around using sequence numbers, if we want to do
this.

Ryan

On Tue, Jan 25, 2022 at 3:23 PM Jack Ye  wrote:

> Hi everyone,
>
> Based on the conversation in the last community sync and the Iceberg Slack
> channel, it seems like multiple parties have interest in continuing the
> effort related to the secondary index in Iceberg, so I would like to
> restart the thread to continue the discussion.
>
> So far most people refer to the document authored by Miao Wang
> 
> which has a lot of useful information about the design and implementation.
> However, the document is also quite old (over a year now) and a lot has
> changed in Iceberg since then. I think the document leaves the following
> open topics that we need to continue to address:
>
> 1. *scope of native index support*: what type of index should Iceberg
> support natively, how should developers allocate effort between adding
> support of Iceberg native index compared to developing Iceberg support for
> holistic indexing projects such as HyperSpace
> .
>
> 2. *index levels*: we have talked about partition level indexing and file
> level indexing. More clarity is needed for these index levels and the level
> of interest and support needed for those different indexing levels.
>
> 3. *index storage*: we had unsettled debates around making index
> separated files or embedding it as a part of existing Iceberg file
> structure. We need to come up with certain criteria such as index size,
> easiness to generate during write, etc. to settle the discussion.
>
> 4. *Indexing process*: as stated in Miao's document, indexes could be
> created during the data writing process synchronously, or built
> asynchronously through an index service. Discussion is needed for the focus
> of the Iceberg index functionalities.
>
> 5. *index invalidation*: depends on the scope and level, certain indexes
> need to be invalidated during operations like RewriteFiles. Clarity is
> needed in this domain, including if we need another sequence number to
> track such invalidation.
>
> I suggest we iterate a bit on this list of open questions, and then we can
> have a meeting to discuss those aspects, and produce an updated document
> addressing those aspects to provide a clear path forward for developers
> interested in adding features in this domain.
>
> Any thoughts?
>
> Best,
> Jack Ye
>
>

-- 
Ryan Blue
Tabular


Re: [VOTE] Release Apache Iceberg 0.13.0 RC1

2022-01-25 Thread Kyle Bendickson
Thank you, Jack!

Quick announcement when testing: *the runtime jars / artifacts for Spark &
Flink have changed naming format *to include the corresponding Spark /
Flink version. The Spark jars also have the Scala version appended at the
end.

*Spark:*
You can test the 0.13.0-rc1, fetching it from the staging maven repository,
with the following command line flags for Spark 3.2: `--packages
'org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.0' --repositories
https://repository.apache.org/content/repositories/orgapacheiceberg-1079/`

For other Spark versions than 3.2, use the artifactIds below (in place of
`iceberg-spark-runtime-3.2_2.12` above).

*iceberg-spark-runtime artifact names as of 0.13.0:*
Spark 3.0: `iceberg-spark3-runtime:0.13.0`
Spark 3.1: `iceberg-spark-runtime-3.1_2.12:0.13.0`
Spark 3.2: `iceberg-spark-runtime-3.2_2.12:0.13.0`

The complete package name now depends on your Spark version;
`iceberg-spark3-runtime` should only be used for Spark 3.0.

*Flink:*
*iceberg-flink-runtime artifact names as of 0.13.0:*
1.12: iceberg-flink-runtime-1.12
1.13: iceberg-flink-runtime-1.13
1.14: iceberg-flink-runtime-1.14

Thank you and happy testing!
- Kyle



On Tue, Jan 25, 2022 at 9:09 AM Jack Ye  wrote:



Continuing the Secondary Index Discussion

2022-01-25 Thread Jack Ye
Hi everyone,

Based on the conversation in the last community sync and the Iceberg Slack
channel, it seems like multiple parties have interest in continuing the
effort related to the secondary index in Iceberg, so I would like to
restart the thread to continue the discussion.

So far most people refer to the document authored by Miao Wang, which has
a lot of useful information about the design and implementation. However,
the document is also quite old (over a year now), and a lot has changed in
Iceberg since then. I think the document leaves the following open topics
that we need to continue to address:

1. *scope of native index support*: what types of index Iceberg should
support natively, and how developers should allocate effort between adding
Iceberg-native index support and developing Iceberg support for holistic
indexing projects such as HyperSpace.

2. *index levels*: we have talked about partition level indexing and file
level indexing. More clarity is needed for these index levels and the level
of interest and support needed for those different indexing levels.

3. *index storage*: we had unsettled debates around making indexes separate
files or embedding them as part of the existing Iceberg file structure. We
need to come up with criteria such as index size, ease of generation during
writes, etc. to settle the discussion.

4. *indexing process*: as stated in Miao's document, indexes could be
created synchronously during the data writing process, or built
asynchronously through an index service. Discussion is needed on the focus
of the Iceberg index functionality.

5. *index invalidation*: depending on the scope and level, certain indexes
need to be invalidated during operations like RewriteFiles. Clarity is
needed in this domain, including whether we need another sequence number to
track such invalidation.

I suggest we iterate a bit on this list of open questions, and then we can
have a meeting to discuss those aspects, and produce an updated document
addressing those aspects to provide a clear path forward for developers
interested in adding features in this domain.

Any thoughts?

Best,
Jack Ye


[VOTE] Release Apache Iceberg 0.13.0 RC1

2022-01-25 Thread Jack Ye
Hi Everyone,

I propose that we release the following RC as the official Apache Iceberg
0.13.0 release.

The commit ID is ca8bb7d0821f35bbcfa79a39841be8fb630ac3e5
* This corresponds to the tag: apache-iceberg-0.13.0-rc1
* https://github.com/apache/iceberg/commits/apache-iceberg-0.13.0-rc1
*
https://github.com/apache/iceberg/tree/ca8bb7d0821f35bbcfa79a39841be8fb630ac3e5

The release tarball, signature, and checksums are here:
* https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.13.0-rc1

You can find the KEYS file here:
* https://dist.apache.org/repos/dist/dev/iceberg/KEYS

Convenience binary artifacts are staged on Nexus. The Maven repository URL
is:
* https://repository.apache.org/content/repositories/orgapacheiceberg-1079/

Please download, verify, and test.
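For the verification step, a minimal sketch of checking a downloaded tarball against its published SHA-512 checksum (file names in the comments are assumptions; GPG signature verification against the KEYS file is a separate step):

```python
import hashlib

def sha512_matches(data: bytes, expected_hex: str) -> bool:
    """Compare a file's SHA-512 digest against the published checksum."""
    return hashlib.sha512(data).hexdigest() == expected_hex.strip().lower()

# usage (paths assumed):
# data = open("apache-iceberg-0.13.0.tar.gz", "rb").read()
# expected = open("apache-iceberg-0.13.0.tar.gz.sha512").read().split()[0]
# sha512_matches(data, expected)
```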

Please vote in the next 72 hours.

[ ] +1 Release this as Apache Iceberg 0.13.0
[ ] +0
[ ] -1 Do not release this because...