Re: Secondary Indexes - Pluggable File Filter interface for Apache Iceberg

Miao Wang Wed, 03 Mar 2021 20:31:10 -0800

It works for me.

On a quick read, I have a few concerns about storing index data in a 
consolidated fashion.


1) Maintaining the consolidated storage may be more complex.
2) It may make collecting index data while writing data files (i.e., online 
index building) more complex (e.g., we need to handle multiple writers 
writing to the same consolidated index file in parallel).
3) We need an auxiliary structure in the index file to quickly locate the 
relevant index entries given some key (e.g., a data file name).

However, I do think consolidated storage is a meaningful optimization on 
disk. If we design a splittable and mergeable index file format properly, the 
consolidated approach and one index file per data file (1:1 index file) are 
not mutually exclusive. The 1:1 index file can then be the building block for 
larger consolidated index files and for indexes at other levels, such as a 
partition-level index.
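
To make the "building block" idea concrete, here is a minimal sketch of how 
1:1 index files could be merged into a consolidated index and split back out. 
Everything in it is an assumption for illustration: the names FileIndex, 
ConsolidatedIndex, consolidate, merge, split, and rollup are hypothetical and 
are not Iceberg APIs, and min/max column ranges stand in for whatever index 
data the real format would carry.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Tuple

# Hypothetical types for illustration only; none of these are Iceberg APIs.
MinMax = Tuple[int, int]


@dataclass(frozen=True)
class FileIndex:
    """1:1 index: min/max per column for a single immutable data file."""
    data_file: str
    column_ranges: Dict[str, MinMax]


@dataclass
class ConsolidatedIndex:
    """Many 1:1 indexes stored together, keyed by data file name.

    The dict plays the role of the auxiliary lookup structure from
    concern 3: given a data file name, find its index entry directly.
    """
    entries: Dict[str, FileIndex] = field(default_factory=dict)

    def merge(self, other: "ConsolidatedIndex") -> "ConsolidatedIndex":
        # Merging is a union of entries; neither input is modified.
        merged = dict(self.entries)
        merged.update(other.entries)
        return ConsolidatedIndex(merged)

    def split(self, data_file: str) -> FileIndex:
        # Splitting recovers the original 1:1 index unchanged.
        return self.entries[data_file]

    def rollup(self) -> Dict[str, MinMax]:
        # Higher-level (e.g., partition-level) summary derived from entries.
        summary: Dict[str, MinMax] = {}
        for entry in self.entries.values():
            for column, (low, high) in entry.column_ranges.items():
                if column in summary:
                    cur_low, cur_high = summary[column]
                    summary[column] = (min(cur_low, low), max(cur_high, high))
                else:
                    summary[column] = (low, high)
        return summary


def consolidate(indexes: Iterable[FileIndex]) -> ConsolidatedIndex:
    """Build a consolidated index from individual 1:1 indexes."""
    return ConsolidatedIndex({idx.data_file: idx for idx in indexes})
```

Under these assumptions, each writer could produce its own 1:1 index (or a 
small consolidated one) and a later step could call merge, which sidesteps 
parallel writes to a single shared index file.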

One of our team members has made a pass through the design and shared some 
thoughts with me. I will complete my own pass.

Thanks!

Miao


From: Ryan Blue <rb...@netflix.com.INVALID>
Date: Wednesday, March 3, 2021 at 6:08 PM
To: OpenInx <open...@gmail.com>
Cc: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: Secondary Indexes - Pluggable File Filter interface for Apache 
Iceberg

Great, thank you for planning to join! I definitely want to get your input on 
this as well.

On Wed, Mar 3, 2021 at 6:06 PM OpenInx <open...@gmail.com> wrote:
It will be 1:00 AM (China Standard Time) on 18 March, which works for those of 
us in Asia. I'd love to attend this discussion. Thanks.

On Thu, Mar 4, 2021 at 9:50 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
Thanks for putting this together, Guy! I just did a pass over the doc and it 
looks like a really reasonable proposal for being able to inject custom file 
filter implementations.

One of the main things we need to think about is how to store and track the 
index data. There's a comment in the doc about storing them in a "consolidated 
fashion" and I'd like to hear more about what you're thinking there. The 
index-per-file approach that Adobe is working on is a good way to track index 
data: each index is tied to an immutable data file, so it has a clear 
lifecycle. The drawback is that we end up with a lot of index files -- one per 
data file.

Let's set up a time to go talk through the options. Would 9AM PST (17:00 UTC) 
on 17 March work for everyone? I'm thinking in the morning so everyone from IBM 
can attend. We can do a second discussion at a time that works more for people 
in Asia later on as well.

If that day works, then I'll send out an invite.

On Fri, Feb 19, 2021 at 8:49 AM Guy Khazma <guyk...@gmail.com> wrote:
Hi All,

Following up on our discussion at Wednesday's sync, attached is a proposal to 
enhance Iceberg with a pluggable interface for data skipping indexes, to 
enable the use of existing indexes in job planning.

https://docs.google.com/document/d/11o3T7XQVITY_5F9Vbri9lF9oJjDZKjHIso7K8tEaFfY/edit?usp=sharing

We will be glad to get your feedback.

Thanks,
Guy


--
Ryan Blue
Software Engineer
Netflix


