Today in PyIceberg, we have support for identifier parsing in public APIs belonging to two different classes:
- Catalog class: load_table, purge_table, drop_table - Table class: scan These APIs currently have optional support for the identifier that the instance itself belongs to. For example, the catalog class APIs support: *catalog = load_catalog(“animals”, **properties)catalog.load_table(“cats.whiskers”)* But it also supports: *catalog.load_table(“animals.cats.whiskers”)* Which is redundant, because the catalog.name is already “animals”. Similarly, row_filter in the Table scan API supports: *table = catalog.load_table(“cats.whiskers”)table.scan(row_filter=”n_legs == 4”)* But we also support *table.scan(row_filter=”whiskers.n_legs == 4”)* Which is also redundant, because the table name is already “whiskers” (or cats.whiskers) While it sounds like a good change, I’d still like to open this thread to discuss the possibility of removing this optional support for the instance-level identifier as it will result in a backward incompatible API behavior change. The benefits of this change are as follows: - As observed above, specifying instance-level identifier in these APIs is redundant - This optional support adds a lot of complexity to the code base and leads to issues like: #742 <https://github.com/apache/iceberg-python/issues/742> It would be really great to clean this up before as we prepare for a 1.0.0 later this year - The optional support opens up the possibility of resulting in correctness issues if there exists a name in the level below as the instance-level identifier. - For example, if in the above catalog, we have a table namespace named “animals.lower” catalog.load(“animals.lower.cats”) can be construed as table name “cats” in the namespace “animals.lower” but it will be interpreted as table name “cats” in the namespace “lower” which is erroneous. - We would see a similar issue with tables and field names as well. Field name parsing is already complicated because we have to represent nested fields as flat representations. So it would be great to remove one unnecessary level of complication here I'd love to hear from the community on their thoughts on this topic. If there are any folks in the community using the optional feature, it would be especially great to hear from you as well, on what this change will mean for your applications. Related PR: #963 <https://github.com/apache/iceberg-python/pull/963> Sung