rdblue commented on pull request #3188:
URL: https://github.com/apache/iceberg/pull/3188#issuecomment-938223622
@jacques-n, thanks for the example. I was going to ask for one to clarify
about SQL resolution but you added one first!
> What is expected behavior in the case of an Iceberg view?
This is a case I hadn't considered much because I thought that table
resolution would not change based on the existence of tables.
Resolution in Trino never hits this case because every reference resolves to exactly 3 parts:
* If there are 3 parts, they are (catalog, schema, table)
* If there are 2 parts, they are (schema, table) and the current catalog is
used
* If there is 1 part, it is the table and the current catalog and schema are
used
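
To make the Trino-style rules above concrete, here is a minimal sketch. It is not Trino's code; `QualifiedName` and `resolve_three_part` are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Sequence


class QualifiedName(NamedTuple):
    """Hypothetical fully qualified (catalog, schema, table) reference."""
    catalog: str
    schema: str
    table: str


def resolve_three_part(
    parts: Sequence[str], current_catalog: str, current_schema: str
) -> QualifiedName:
    """Fill in missing leading parts from the session; never look at existing tables."""
    if len(parts) == 3:
        return QualifiedName(parts[0], parts[1], parts[2])
    if len(parts) == 2:
        # (schema, table): the current catalog is used
        return QualifiedName(current_catalog, parts[0], parts[1])
    if len(parts) == 1:
        # table only: the current catalog and schema are used
        return QualifiedName(current_catalog, current_schema, parts[0])
    raise ValueError(f"expected 1 to 3 identifier parts, got {len(parts)}")
```

The result depends only on the number of parts and the session defaults, which is why the ambiguity never comes up.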
We also specifically avoided this ambiguity in Spark when building the
multi-part identifier support. We created rules so that there is only ever one
way to resolve a table reference:
1. If the name is a single identifier, use it as the table name, with the
current catalog and current namespace from the session.
(Otherwise, the name is multiple parts)
2. If the first part is a catalog, use that catalog. Use the last identifier
part as the table name and any remaining parts as the namespace.
3. If the first part is not a catalog, use the current catalog from the
session. Use the last identifier part as the table name and the remaining parts
(including the first that was not a catalog) as the namespace.
For a fixed set of catalogs, these rules provide an unambiguous table
reference. There is some ambiguity if the set of catalogs changes, but that's a
rare event because it is something done primarily by admins rather than
individual users (like creating tables). We decided against the idea of falling
back to trying the other branch if using rule 2 or rule 3 failed to load a
table.
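
The three Spark rules above can be sketched roughly as follows. This is an illustration rather than Spark's implementation; `TableRef`, `resolve_multipart`, and the `catalogs` set are hypothetical names standing in for the session's configured catalogs.

```python
from typing import AbstractSet, NamedTuple, Sequence, Tuple


class TableRef(NamedTuple):
    """Hypothetical resolved reference: catalog, namespace parts, table name."""
    catalog: str
    namespace: Tuple[str, ...]
    table: str


def resolve_multipart(
    parts: Sequence[str],
    catalogs: AbstractSet[str],
    current_catalog: str,
    current_namespace: Sequence[str],
) -> TableRef:
    if not parts:
        raise ValueError("identifier must have at least one part")
    # Rule 1: a single identifier is the table name in the session's
    # current catalog and namespace.
    if len(parts) == 1:
        return TableRef(current_catalog, tuple(current_namespace), parts[0])
    # Rule 2: the first part names a catalog; the last part is the table
    # and everything in between is the namespace.
    if parts[0] in catalogs:
        return TableRef(parts[0], tuple(parts[1:-1]), parts[-1])
    # Rule 3: the first part is not a catalog, so it belongs to the
    # namespace and the session's current catalog is used.
    return TableRef(current_catalog, tuple(parts[:-1]), parts[-1])
```

Note that there is no fallback: the function never tries to load a table and never retries the other branch. The only input that can change the answer for a given name and session is the `catalogs` set. For example, `prod.db.t` resolves to catalog `prod` with namespace `db` while `prod` is a configured catalog, and to the current catalog with namespace `prod.db` if it is not.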
For the view spec, we have tried to keep enough information that resolution
is always done the same way. That's why the current catalog and namespace are
stored as part of the view definition. We could also store the set of catalogs
defined, but if there are some SQL systems that don't have an unambiguous
resolution (meaning not based on the existence of _tables_) then it might not
matter and we should decide how to handle it.
The argument against requiring a resolved plan is that it may not be
possible to produce one. I think Spark in particular used to store resolved
plans but gave up on being able to produce them at all. They're also much
longer, so they tended to break views even sooner by being truncated in a
database column. And if we have a Substrait plan, we probably don't need the
resolved SQL.
I think my preference is to allow storing multiple optional representations.
We know that we will need the original SQL for SQL-configured views. And we
know that storing a standard query plan (like Substrait) is preferred in the
long term over even a resolved SQL string because of dialect problems. Since
even the SQL may be optional for views created directly as portable plans, I
think it makes sense to make all of these representations possible and optional.
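
As a rough sketch of what "multiple optional representations" could look like as a data structure: the class and field names below are hypothetical and are not taken from the view spec. The check that at least one representation is present, and the ranking of resolved SQL ahead of original SQL, are assumptions made for the example. Only the preference for a standard plan over resolved SQL comes from the reasoning above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class ViewDefinition:
    """Hypothetical view metadata where every representation is optional."""
    # Session context stored so that resolution is always done the same way.
    current_catalog: str
    current_namespace: Tuple[str, ...]
    original_sql: Optional[str] = None   # SQL as written by the user
    resolved_sql: Optional[str] = None   # SQL with fully resolved identifiers
    plan: Optional[bytes] = None         # serialized standard plan, e.g. Substrait

    def __post_init__(self) -> None:
        # Assumption for this sketch: a view with no representation is useless.
        if self.original_sql is None and self.resolved_sql is None and self.plan is None:
            raise ValueError("a view needs at least one representation")

    def preferred(self) -> Tuple[str, Union[str, bytes]]:
        """Return the most portable representation that is present."""
        if self.plan is not None:
            return ("plan", self.plan)
        if self.resolved_sql is not None:
            return ("resolved_sql", self.resolved_sql)
        assert self.original_sql is not None  # guaranteed by __post_init__
        return ("original_sql", self.original_sql)
```

A view created from SQL would populate `original_sql`, while a view created directly as a portable plan could leave both SQL fields empty.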
This is looking more like @jackye1995's original suggestion to be able to
store different SQL for a given dialect. Maybe we should reconsider that.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]