[GitHub] [iceberg] rdblue commented on pull request #3188: Add common view format spec

GitBox Thu, 07 Oct 2021 16:41:30 -0700


rdblue commented on pull request #3188:
URL: https://github.com/apache/iceberg/pull/3188#issuecomment-938223622



   @jacques-n, thanks for the example. I was going to ask for one to clarify 
SQL resolution, but you added one first!
   
   > What is expected behavior in the case of an Iceberg view?
   
   This is a case I hadn't considered much because I thought that table 
resolution would not change based on the existence of tables.
   
   Resolution in Trino never hits this case because a table reference always 
resolves to exactly 3 parts:
   * If there are 3 parts, they are (catalog, schema, table)
   * If there are 2 parts, they are (schema, table) and the current catalog is 
used
   * If there is 1 part, it is the table and the current catalog and schema are 
used
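   
   As a minimal sketch of the three cases above (not Trino's actual code; `TableRef` and `resolve_three_part` are hypothetical names used only for illustration), resolution depends on the number of parts and the session defaults, never on which tables exist:

```python
from typing import NamedTuple, Sequence


class TableRef(NamedTuple):
    """Hypothetical fully qualified reference: (catalog, schema, table)."""
    catalog: str
    schema: str
    table: str


def resolve_three_part(
    parts: Sequence[str], current_catalog: str, current_schema: str
) -> TableRef:
    """Fill in missing leading parts from the session defaults.

    The result depends only on len(parts) and the session, not on
    whether any table or schema with those names exists.
    """
    if len(parts) == 3:
        return TableRef(parts[0], parts[1], parts[2])
    if len(parts) == 2:
        return TableRef(current_catalog, parts[0], parts[1])
    if len(parts) == 1:
        return TableRef(current_catalog, current_schema, parts[0])
    raise ValueError(f"expected 1 to 3 name parts, got {len(parts)}")
```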
   
   We also specifically avoided this ambiguity in Spark when building the 
multi-part identifier support. We created rules so that there is only ever one 
way to resolve a table reference:
   1. If the name is a single identifier, use it as the table name with the 
current catalog and current namespace from the session.
       (Otherwise, the name has multiple parts.)
   2. If the first part is a catalog, use that catalog. Use the last identifier 
part as the table name and any remaining parts as the namespace.
   3. If the first part is not a catalog, use the current catalog from the 
session. Use the last identifier part as the table name and the remaining parts 
(including the first that was not a catalog) as the namespace.
   
   For a fixed set of catalogs, these rules provide an unambiguous table 
reference. There is some ambiguity if the set of catalogs changes, but that's a 
rare event because changing catalogs is done primarily by admins, whereas 
individual users do things like creating tables. We decided against the idea of 
falling back to trying the other branch if using rule 2 or rule 3 failed to 
load a table.
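   
   The three Spark rules above can be sketched as follows. This is an illustration under stated assumptions, not Spark's implementation: `ResolvedTable` and `resolve_multipart` are hypothetical names, and the set of catalogs is passed in explicitly to show that it is the only outside input apart from the session.

```python
from typing import AbstractSet, NamedTuple, Sequence, Tuple


class ResolvedTable(NamedTuple):
    """Hypothetical result type: catalog, multi-level namespace, table."""
    catalog: str
    namespace: Tuple[str, ...]
    table: str


def resolve_multipart(
    parts: Sequence[str],
    catalogs: AbstractSet[str],
    current_catalog: str,
    current_namespace: Sequence[str],
) -> ResolvedTable:
    if not parts:
        raise ValueError("empty identifier")
    # Rule 1: a single identifier uses the session catalog and namespace.
    if len(parts) == 1:
        return ResolvedTable(current_catalog, tuple(current_namespace), parts[0])
    # Rule 2: the first part names a catalog, so the middle parts are the namespace.
    if parts[0] in catalogs:
        return ResolvedTable(parts[0], tuple(parts[1:-1]), parts[-1])
    # Rule 3: otherwise everything before the table name is the namespace.
    # There is deliberately no fallback to the other rule if loading fails.
    return ResolvedTable(current_catalog, tuple(parts[:-1]), parts[-1])


session = dict(current_catalog="prod", current_namespace=("db",))
# Same reference, different catalog sets: the only source of ambiguity.
before = resolve_multipart(["a", "b", "t"], {"prod"}, **session)
after = resolve_multipart(["a", "b", "t"], {"prod", "a"}, **session)
```

   Here `before` resolves to catalog `prod` with namespace `("a", "b")`, while `after` resolves to catalog `a` with namespace `("b",)`, because a catalog named `a` was added in between.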
   
   For the view spec, we have tried to keep enough information that resolution 
is always done the same way. That's why the current catalog and namespace are 
stored as part of the view definition. We could also store the set of catalogs 
defined, but if some SQL systems don't have unambiguous resolution (meaning 
resolution that is not based on the existence of _tables_), then storing the 
catalogs might not matter and we should decide how to handle that case.
   
   The argument against requiring a resolved plan is that it may not be 
possible to produce one. I think Spark in particular used to store resolved 
plans but gave up on being able to produce them at all. They're also much 
longer, so they tended to break views even sooner by being truncated in a 
database column. And if we have a Substrait plan, we probably don't need the 
resolved SQL.
   
   I think my preference is to allow storing multiple optional representations. 
We know that we will need the original SQL for SQL-configured views. And we 
know that storing a standard query plan (like Substrait) is preferred in the 
long term over even a resolved SQL string because of dialect problems. Since 
even the SQL may be optional for views created directly as portable plans, I 
think it makes sense to make all of these representations possible and optional.
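   
   To make the idea concrete, here is a rough sketch of a view version where every representation is optional. None of these names come from the spec draft: `SqlRepresentation`, `ViewVersion`, and their fields are hypothetical, and requiring at least one representation is my assumption rather than something we have agreed on.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SqlRepresentation:
    """Hypothetical SQL text plus the session context needed to resolve it."""
    sql: str
    dialect: str
    default_catalog: str
    default_namespace: Tuple[str, ...]


@dataclass(frozen=True)
class ViewVersion:
    """Hypothetical container: each representation may be absent."""
    original_sql: Optional[SqlRepresentation] = None
    resolved_sql: Optional[SqlRepresentation] = None
    plan: Optional[bytes] = None  # e.g. a serialized Substrait plan

    def __post_init__(self) -> None:
        # Assumption: a version with no representation at all is unusable.
        if not self.available():
            raise ValueError("a view version needs at least one representation")

    def available(self) -> List[str]:
        """Names of the representations that are present, in a fixed order."""
        names = ("original_sql", "resolved_sql", "plan")
        return [name for name in names if getattr(self, name) is not None]


# A SQL-configured view keeps the original text and its resolution context.
sql_view = ViewVersion(
    original_sql=SqlRepresentation("SELECT * FROM t", "spark", "prod", ("db",))
)
# A view created directly as a portable plan has no SQL at all.
plan_view = ViewVersion(plan=b"\x00")
```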
   
   This is looking more like @jackye1995's original suggestion to be able to 
store different SQL for a given dialect. Maybe we should reconsider that.




