Hi there,
Please find the PoC pull request here:
https://github.com/apache/iceberg/pull/3177

Thanks,
Qinhua Yan

From: Qinhua Yan
Sent: Monday, August 30, 2021 10:22 AM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: RE: Enhanced JdbcCatalog with Namespace Management

I’m really glad to see so much interest! ☺

I can’t wait to create the PR, but honestly we have only just gotten our first 
working Catalog, and I wanted to learn what people think first. I’ll post in 
this thread when the PR is ready.

Before that, let me try to answer some of the questions.


·       We created a separate SQL table for namespaces to make lookups and 
configuration easier and faster.

o   Each namespace has a location URI that points to an S3 bucket, which is the 
default location of tables created under that namespace.

o   It is possible for a namespace to have multiple backends.

·       Our users are very familiar with namespace-level permission management. 
Most namespaces are restricted (both read and write) to only certain groups of 
people.

o   Users with access to the Catalog DB are catalog owners. They are not 
necessarily, and are often not, the data owners. For example, right now the 
major client of our Catalog is a data-access service. It is the owner of the 
Catalog but not the Tables.

o   Technically, our Catalog supports rename-across-namespaces in the same way 
that it supports rename-within-namespace. However, for us this is very rare as 
explained above.

o   For the same reason, our users are familiar with URI identifiers like 
“iceberg://mynamespace/mytable”. What I meant by isolating the logical table 
name from the physical location is that users will never see or need to 
understand “iceberg://mynamespace/mytable-uuid16”. I understand that this 
feature is common to existing Catalog impls.

·       SQL compatibility

o   (This almost looks like an ad, but) Jooq<https://www.jooq.org/> is a 
type-safe, generic SQL database connector that supports multiple SQL dialects. 
In other words, the same Java code implementing a SQL query can work with 
different types of database.

o   That said, I totally understand your concern about bringing in an additional 
dependency, and we are OK with switching to a plain JDBC driver if you want to 
keep the core code independent.

o   Agreed that SQL database initialization can be added to the docs.

·       Features that are not ready yet but that we want to work on next (and 
that will involve the Catalog) include:

o   Undelete,

o   and, using logic very similar to undelete, registering a pre-existing table 
with the Catalog.


Thanks,
Qinhua Yan

From: Jack Ye <yezhao...@gmail.com<mailto:yezhao...@gmail.com>>
Sent: Sunday, August 29, 2021 1:01 PM
To: Iceberg Dev List <dev@iceberg.apache.org<mailto:dev@iceberg.apache.org>>
Subject: Re: Enhanced JdbcCatalog with Namespace Management

Hi Qinhua,

+1 for what Ryan says, it would be great to have a PR to analyze the features 
you list in detail. I have the following questions and comments:

> Namespace management and configuration
This is something we decided to implement in a second iteration, once people 
have a need for it. If you want to add namespace support, it would be great to 
add it directly to the existing JdbcCatalog. I can review in more detail when 
you have the PR out.

> Each namespace can be backed by a different S3 bucket. This allows fine 
> grained access control at the namespace level.
This is a common feature across all catalog implementations that support 
namespaces. Typically a LocationUri is stored as a namespace property, which can 
be used to override the default table location in that namespace.
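
To illustrate the mechanism described above, here is a minimal in-memory sketch. 
It does not use Iceberg's actual API: the class NamespaceLocationSketch, the 
property key "location", and the bucket names are all hypothetical, and a real 
catalog may resolve locations differently. It only shows the lookup order: a 
namespace-level location property, if set, overrides a default derived from the 
warehouse location.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a catalog's namespace store; not Iceberg's API.
class NamespaceLocationSketch {
    // Assumed property key; a real catalog may use a different name.
    static final String LOCATION_PROP = "location";

    private final String warehouseLocation;
    private final Map<String, Map<String, String>> namespaceProps = new HashMap<>();

    NamespaceLocationSketch(String warehouseLocation) {
        this.warehouseLocation = stripTrailingSlash(warehouseLocation);
    }

    void createNamespace(String namespace, Map<String, String> props) {
        namespaceProps.put(namespace, new HashMap<>(props));
    }

    // The namespace "location" property wins; otherwise fall back to a
    // prefix under the shared warehouse location.
    String defaultTableLocation(String namespace, String table) {
        Map<String, String> props = namespaceProps.get(namespace);
        if (props == null) {
            throw new IllegalArgumentException("Unknown namespace: " + namespace);
        }
        String base = props.get(LOCATION_PROP);
        if (base == null) {
            base = warehouseLocation + "/" + namespace;
        }
        return stripTrailingSlash(base) + "/" + table;
    }

    private static String stripTrailingSlash(String s) {
        return s.endsWith("/") ? s.substring(0, s.length() - 1) : s;
    }

    public static void main(String[] args) {
        NamespaceLocationSketch catalog = new NamespaceLocationSketch("s3://warehouse/");
        catalog.createNamespace("finance",
            Collections.singletonMap(LOCATION_PROP, "s3://finance-bucket/iceberg"));
        catalog.createNamespace("scratch", Collections.<String, String>emptyMap());
        System.out.println(catalog.defaultTableLocation("finance", "trades"));
        System.out.println(catalog.defaultTableLocation("scratch", "tmp"));
    }
}
```

In this sketch a namespace without the property simply becomes a prefix under 
the shared warehouse location, so it does not need a bucket of its own.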

> At namespace creation time, users can choose either 1) use a pre-existing 
> bucket; 2) let the Catalog create a new bucket.
I think we need to make this more generic, likely with the mechanism I 
described that is used for all other catalog implementations. A bucket is a 
resource with a maximum cap, so this would not scale well for multi-tenant use 
cases where users have an unbounded number of logical namespaces.

> Isolate logical TableIdentifiers from physical S3 locations.
> Support rename table within the same namespace without touching S3.
Iceberg tables have UUIDs, so I think these are achievable directly based on the 
Iceberg catalog and storage design. We should also be able to rename tables 
across namespaces. Could you describe in more detail what feature is added here?
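
As a rough illustration of why a rename need not touch storage, here is a 
minimal sketch under stated assumptions. It is not the JdbcCatalog 
implementation: the class RenameSketch, its methods, and the example paths are 
hypothetical. It models the catalog as a mapping from (namespace, table) to a 
metadata file location, so a rename within or across namespaces only changes 
the key.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical catalog index: (namespace, table) -> current metadata file location.
class RenameSketch {
    private final Map<List<String>, String> metadataLocations = new HashMap<>();

    void register(String namespace, String table, String metadataLocation) {
        metadataLocations.put(Arrays.asList(namespace, table), metadataLocation);
    }

    // Returns null when the identifier is not registered.
    String metadataLocation(String namespace, String table) {
        return metadataLocations.get(Arrays.asList(namespace, table));
    }

    // Rename only rewrites the catalog key; the stored location is carried
    // over unchanged, so no files are moved.
    void rename(String fromNs, String fromTable, String toNs, String toTable) {
        List<String> from = Arrays.asList(fromNs, fromTable);
        List<String> to = Arrays.asList(toNs, toTable);
        if (!metadataLocations.containsKey(from)) {
            throw new IllegalArgumentException("No such table: " + from);
        }
        if (metadataLocations.containsKey(to)) {
            throw new IllegalStateException("Table already exists: " + to);
        }
        metadataLocations.put(to, metadataLocations.remove(from));
    }

    public static void main(String[] args) {
        RenameSketch catalog = new RenameSketch();
        catalog.register("ns1", "events",
            "s3://bucket/ns1/events-uuid/metadata/v1.metadata.json");
        catalog.rename("ns1", "events", "ns2", "events_archive");
        System.out.println(catalog.metadataLocation("ns2", "events_archive"));
    }
}
```

In a SQL-backed catalog the same idea would presumably be a single row update 
on the table that stores identifiers, but that is an assumption about the 
schema rather than something stated in this thread.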

> Support various kinds of databases
Is there anything that the current implementation does not support? I don't think it 
uses any SQL dialect that is incompatible across databases, but maybe I 
overlooked something here.

> Use Jooq <https://www.jooq.org/> to connect to the database and to ensure SQL 
> semantics.
I think we should evaluate the differences between this library and the current 
implementation. I checked that the license of the library is fine. Could you 
describe a bit why this is a better choice than the existing implementation?
We typically do not want to have too many third-party dependencies. Given that 
JdbcCatalog is in the core library, we should be very careful when adding new 
dependencies.

> Provide database initialization scripts for Postgres.
I think infrastructure setup like database initialization is out of scope for 
the Iceberg library. We can add this as a part of the JDBC documentation.

Best,
Jack Ye






On Fri, Aug 27, 2021 at 12:54 PM Ryan Blue 
<b...@tabular.io<mailto:b...@tabular.io>> wrote:
Qinhua, thanks for sharing this. It sounds great to add more features to the 
JDBC catalog.

Could you share a link to the implementation or a PR? I have lots more 
questions like how you implemented namespaces, but those can probably be 
answered by looking at the code if you're able to share it.

Thanks!

Ryan

On Fri, Aug 27, 2021 at 11:48 AM Qinhua Yan 
<qinhua....@twosigma.com<mailto:qinhua....@twosigma.com>> wrote:

Hi there,


We’d like to share our JdbcCatalog impl with the community and welcome any 
discussion.

We are aware of the existing JdbcCatalog 
impl<https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/jdbc/JdbcCatalog.java>;
 however, it has some feature gaps and doesn’t work for our use case. 
Therefore, we implemented a SQL-database backed Catalog with the following 
enhancements.

1.       Namespace management and configuration

•       Each namespace can be backed by a different S3 bucket. This allows fine 
grained access control at the namespace level.

•       At namespace creation time, users can choose either 1) use a 
pre-existing bucket; 2) let the Catalog create a new bucket.

•       Isolate logical TableIdentifiers from physical S3 locations.

•       Support rename table within the same namespace without touching S3.

2.       Support various kinds of databases

•       Use Jooq <https://www.jooq.org/> to connect to the database and to 
ensure SQL semantics.

•       Easy to support different kinds of SQL without touching the core 
Catalog code.

•       Provide database initialization scripts for Postgres.



This Catalog implementation can be easily extended to support some advanced 
features such as undelete tables and namespace-backed-by-multiple-backends.



Any comments and discussion are welcome!


Thank you!
Qinhua Yan


--
Ryan Blue
Tabular
