Today in PyIceberg, we have support for identifier parsing in public APIs
belonging to two different classes:


   - Catalog class: load_table, purge_table, drop_table
   - Table class: scan


These APIs currently have optional support for the identifier that the
instance itself belongs to.

For example, the catalog class APIs support:


*catalog = load_catalog(“animals”,
**properties)catalog.load_table(“cats.whiskers”)*

But it also supports:

*catalog.load_table(“animals.cats.whiskers”)*

Which is redundant, because the catalog.name is already “animals”.

Similarly, row_filter in the Table scan API supports:


*table = catalog.load_table(“cats.whiskers”)table.scan(row_filter=”n_legs
== 4”)*

But we also support


*table.scan(row_filter=”whiskers.n_legs == 4”)*
Which is also redundant, because the table name is already “whiskers” (or
cats.whiskers)

While it sounds like a good change, I’d still like to open this thread to
discuss the possibility of removing this optional support for the
instance-level identifier as it will result in a backward incompatible API
behavior change.

The benefits of this change are as follows:

   - As observed above, specifying instance-level identifier in these APIs
   is redundant
   - This optional support adds a lot of complexity to the code base and
   leads to issues like: #742
   <https://github.com/apache/iceberg-python/issues/742> It would be really
   great to clean this up before as we prepare for a 1.0.0 later this year
   - The optional support opens up the possibility of resulting in
   correctness issues if there exists a name in the level below as the
   instance-level identifier.
      - For example, if in the above catalog, we have a table namespace
      named “animals.lower” catalog.load(“animals.lower.cats”) can be construed
      as table name “cats” in the namespace “animals.lower” but it will be
      interpreted as table name “cats” in the namespace “lower” which is
      erroneous.
      - We would see a similar issue with tables and field names as well.
      Field name parsing is already complicated because we have to represent
      nested fields as flat representations. So it would be great to remove one
      unnecessary level of complication here


I'd love to hear from the community on their thoughts on this topic. If
there are any folks in the community using the optional feature, it would
be especially great to hear from you as well, on what this change will mean
for your applications.

Related PR: #963 <https://github.com/apache/iceberg-python/pull/963>

Sung

Reply via email to