singhpk234 commented on PR #2048:
URL: https://github.com/apache/polaris/pull/2048#issuecomment-3084708025

   @snazy @adutra thank you for sharing your feedback; I want to walk you through my thought process:
   
   Why Iceberg Expressions and not SQL
   Iceberg expressions are portable, dialect-agnostic, first-class citizens of the Iceberg world, and IMHO a must for interop.
   Note that almost all engines already have this pipeline:
   **_Engine-specific SQL -> Engine Expression -> Iceberg Expression -> Iceberg SDK (manifest filtering)_**
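   To illustrate why that pipeline makes the stored form interoperable, here is a minimal sketch in plain Java. The `Predicate` record and the `lower` method are hypothetical stand-ins for a real Iceberg `Expression` (e.g. `org.apache.iceberg.expressions.Expressions.equal(...)`) and an engine's planner; they are not an existing API.

```java
import java.util.regex.Pattern;

// Illustrative sketch of the pipeline described above:
// engine-specific SQL -> engine expression -> dialect-agnostic predicate.
// Predicate stands in for an Iceberg Expression; lower() stands in for an
// engine frontend. Both are toy code, not real Iceberg or engine APIs.
public class ExpressionPipeline {
    // Minimal stand-in for a dialect-agnostic predicate: op(column, literal).
    record Predicate(String op, String column, String literal) {}

    // Toy "engine frontend": lowers a trivial `col = 'value'` SQL fragment
    // into the neutral form. Real engines do this with their own planners.
    static Predicate lower(String sqlFragment) {
        String[] parts = sqlFragment.split(Pattern.quote("="), 2);
        String column = parts[0].trim();
        String literal = parts[1].trim().replaceAll("^'|'$", "");
        return new Predicate("eq", column, literal);
    }

    public static void main(String[] args) {
        // Two different dialects lower to the same neutral predicate; that
        // shared form is what makes storing it in a policy interoperable.
        Predicate p = lower("region = 'EU'");
        System.out.println(p); // Predicate[op=eq, column=region, literal=EU]
    }
}
```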
   
   We did explore dialect-agnostic SQL such as SQL-92 for most of this, but we kept coming back to dialect-specific requirements: if the policy were SQL, engines would want to store their dialect-specific constructs directly in the policy, and hence in persistence, making the policy workable only for engines that speak the definer's dialect. What would the behaviour be if an engine doesn't understand the dialect?
   
   What I see is that we can expand Iceberg expressions to contain UDF references, and via UDFs you can model all your dialect-specific logic; the bottom line is that the policy definition itself only operates on Iceberg expressions.
   If we want dialect-specific behaviour we can still push it all into the UDF. Here is one common example:
   SHA-256 in the Spark dialect is `sha2`. Why not model this as a `hash` UDF that resolves to `sha2` in Spark and `sha256` in Trino?
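   A minimal sketch of that idea: a single dialect-agnostic `hash` UDF reference in the policy, translated to each engine's native function at plan time. The function names `sha2` (Spark) and `sha256` (Trino) are real; the translation table and `translate` helper are hypothetical, not an existing Iceberg or Polaris API, and argument types/casts are glossed over.

```java
import java.util.Map;

// Hypothetical sketch: one neutral "hash" UDF in the policy, mapped to each
// engine's native SHA-256 invocation. Spark exposes SHA-2 as sha2(col, bits);
// Trino exposes sha256(col). The mapping/translate shape is illustrative only.
public class UdfDialectMapping {
    // engine dialect -> native invocation template for the neutral hash UDF
    static final Map<String, String> SHA256_BY_DIALECT = Map.of(
        "spark", "sha2(%s, 256)",
        "trino", "sha256(%s)"
    );

    // Render the dialect-specific call for a given column reference.
    static String translate(String dialect, String column) {
        String template = SHA256_BY_DIALECT.get(dialect);
        if (template == null) {
            throw new IllegalArgumentException("No mapping for dialect: " + dialect);
        }
        return String.format(template, column);
    }

    public static void main(String[] args) {
        System.out.println(translate("spark", "user_id")); // sha2(user_id, 256)
        System.out.println(translate("trino", "user_id")); // sha256(user_id)
    }
}
```

   The point is that the policy stores only the neutral UDF reference; each engine owns its end of the mapping.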
   
   Also, heads up: Iceberg expressions are going to be expanded soon for the **_constraints_** work for v4 that Anton is driving (an uber-level idea was discussed in some of the syncs). The idea is essentially to store table constraints in Iceberg metadata so they can be retrieved and enforced by the calling engine, with the storage still being Iceberg expressions. So Iceberg expressions are what they plan to use for interoperability.
   
   Note: I checked in the last catalog community sync, and Iceberg expressions with UDFs seem like the right direction.
   
   I understand UDFs are not there yet and will take some time; meanwhile, using Iceberg expressions and storing RLS there seems like a good step IMHO.
   I know at least one cloud provider it will help for sure: 
https://docs.aws.amazon.com/lake-formation/latest/dg/partiql-support.html
   
   
   Why not Substrait?
   
   I think if we want Substrait, we can make UDFs return Substrait directly, but in the Iceberg community at least there is no **consensus** on the IR; similar discussions have been brought up in the community in the past for Iceberg views. I would request that we drive that in Iceberg first and then incorporate it into policies. Also, please note that engines like Snowflake / Redshift ... don't support Substrait; unless it's established as a standard IR in Iceberg, IMHO we should not take a dependency on it.
   
   
   > Spark datasets/frames are not SQL, but FGAC with those APIs is a different 
topic
   
   Yes, non-SQL requires a more thorough discussion IMHO; for example, how do we model SQL written in one language so that it can be parsed elsewhere?
   
   I hope I was able to explain my thought process here! I really appreciate you taking a look.
   
   
   
   
   
   
   

