Yeah, I think Iceberg and Hive are the only ones trying to make life
difficult. I think we should cover that as well, but in changes to the
Iceberg Spec. Hive can just stay how it is ...

On Mon, May 19, 2025 at 2:59 PM Dmitri Bourlatchkov <di...@apache.org>
wrote:

> For context: my location concerns are rooted in Nessie's experience, where
> we often get problem reports related to files being outside the declared
> Iceberg metadata location.
>
> Example:
>
> https://github.com/projectnessie/nessie/issues/10817#issuecomment-2887329227
>
> I'm ok going with a single location for generic tables, but I think Polaris
> needs to have a stricter spec for that (define where files should and
> should not go) because Polaris owns this spec. Polaris ought to define what
> complies with the spec and what does not. Having a proper spec is essential
> to ensure a mutual understanding among all parties dealing with Generic
> Tables.
>
> OpenAPI YAML comments are not sufficient, IMHO. I'd prefer to have a
> dedicated doc page to define expectations and compliance.
>
> Thanks,
> Dmitri.
>
>
> On Mon, May 19, 2025 at 2:17 PM Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
> > The only multi-location table formats I'm currently aware of are Hive
> > (partitions can live wherever) and Iceberg.
> >
> > I think for Delta, Hudi, LanceDB, Paimon and file-based tables, they all
> > have to live in the root location. I'm not sure of any other "file" based
> > tables where this would be an issue, but I'd love to know if someone else
> > has ideas. I think with the rise in credential vending, splitting things
> > amongst multiple prefixes is becoming less common. I don't oppose doing an
> > array of locations, but it may be enough to just leave this as an extension
> > for later. (Support location or locations)
> >
> > On Wed, May 7, 2025 at 8:52 PM yun zou <yunzou.colost...@gmail.com> wrote:
> >
> > > Hi Dmitri,
> > >
> > > If it's not "all", it is not strong enough for a spec, IMHO. If some
> > > tables have multiple base locations, how is Polaris going to deal with
> > > them?
> > >
> > > Sorry, when I said "most of them", it was because I haven't tested all of
> > > them (I only tested Delta and CSV before).
> > > However, if Unity Catalog only takes one location, I think that is
> > > strong enough proof that one location is enough today.
> > >
> > > It is also more natural to start with one location, and if use cases that
> > > require multiple locations come up later, we can move on to a V2 spec to
> > > support multiple table locations.
> > >
> > > We're making a specification for Polaris. I do not think it is sufficient
> > > to say we'll do the same as other (unspecified ATM) catalogs.
> > > If we want to migrate users from other catalog services to Polaris
> > > (through federation), then Polaris will need to provide corresponding
> > > capabilities. For example, the Unity Catalog storage location is a URI
> > > representation; when entities are federated from Unity Catalog, we will
> > > need to be able to handle the URI location.
> > > If URI representation is a common standard that has been accepted by other
> > > catalog services like Unity Catalog and Gravitino, Polaris should be
> > > compatible with that; otherwise it might cause problems for users when
> > > they migrate from one to another.
> > >
> > > What will Polaris Server do with this location?
> > > For generic tables, Polaris will provide credential vending for this
> > > location in the near future. I don't see us providing anything else in the
> > > short or mid term, since we still want to promote native support for
> > > Iceberg.
> > > Or do you have anything special in mind that you think we should support?
> > >
> > > If Polaris has to define it in a spec, it will be hard to change in the
> > > future.
> > > Regardless of whether it is explicitly in the spec definition or a
> > > reserved property key, as long as it is explicitly documented, it will be
> > > hard to change in the future. From that perspective, those two approaches
> > > seem the same to me.
> > >
> > > Table location is critical information that is required on the engine side
> > > to read and write the tables, and it should be explicitly defined to
> > > provide better sharing across engines. For example, the Delta table
> > > location is passed in the table properties with a property key of either
> > > "location" or "path", depending on how the table is created. Now, if
> > > another engine wants to read the Delta table, it will need to understand
> > > those keys, which are controlled by Spark today. If Spark changes them one
> > > day, all sharing will stop working.
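
A minimal sketch of the lookup problem described above, assuming a hypothetical table record with an optional explicit `location` field and a free-form `properties` map. The property keys "location" and "path" come from the discussion; the `GenericTable` class and `resolve_location` function are illustrative names only and are not part of Polaris or Spark.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Property keys mentioned in the thread for Delta tables created through Spark.
# Treating them as an ordered fallback list is an assumption of this sketch.
LEGACY_LOCATION_KEYS = ("location", "path")


@dataclass
class GenericTable:
    """Hypothetical catalog record; not the actual Polaris model."""
    name: str
    format: str
    location: Optional[str] = None
    properties: Dict[str, str] = field(default_factory=dict)


def resolve_location(table: GenericTable) -> Optional[str]:
    """Prefer the explicit field; fall back to engine-specific property keys."""
    if table.location:
        return table.location
    for key in LEGACY_LOCATION_KEYS:
        value = table.properties.get(key)
        if value:
            return value
    return None
```

With an explicit field, a reader only needs the first branch; the fallback loop is the part that every engine would otherwise have to keep in sync with whatever keys the writer used.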
> > >
> > > As to whether we want to put it as an explicit field or a reserved key, I
> > > think for a field common to various table formats, it makes more sense to
> > > have it as an explicit field. For properties that are specific to a
> > > particular table format, it is more appropriate to just have a reserved
> > > key.
> > >
> > > If Polaris takes control of the location, I think we have to be more
> > > careful and at least try to make it future-proof.
> > >
> > > I don't think Polaris is taking control of the location; the location is
> > > still controlled by the engine and users today, like table names.
> > > Polaris is a catalog service: it records the generic table entity and
> > > returns the information back to the user on query.
> > > It might be able to do some validation on the location (like checking for
> > > special characters), but it doesn't decide which location the table will
> > > use. I personally don't think it is a bad idea to let the catalog service
> > > also take control of generating the table location, but I think that will
> > > require a lot of work.
> > >
> > > Best Regards,
> > > Yun
> > >
> > > On Wed, May 7, 2025 at 5:22 PM Dmitri Bourlatchkov <di...@apache.org>
> > > wrote:
> > >
> > > > No worries about the name. It is a possible alternative spelling :)
> > > >
> > > > > On Wed, May 7, 2025 at 8:04 PM yun zou <yunzou.colost...@gmail.com> wrote:
> > > >
> > > > > Hi Dmitri,
> > > > >
> > > > > > Sorry, I accidentally typed your name wrong in the previous reply!
> > > > > > Apologies for this!
> > > > > >
> > > > > > For the S3 issue, I think we will need to deal with it regardless.
> > > > > > Especially with the federation work going on, we will eventually need
> > > > > > to handle all those entities coming from different catalogs, and the
> > > > > > URI format seems to be the standard format used by various catalog
> > > > > > services.
> > > > >
> > > > > Best Regards,
> > > > > Yun
> > > > >
> > > > > > On Wed, May 7, 2025 at 4:55 PM yun zou <yunzou.colost...@gmail.com> wrote:
> > > > >
> > > > > > Hi Dimitri and Eric,
> > > > > >
> > > > > > Thanks a lot for the feedback!
> > > > > >
> > > > > > > For the questions:
> > > > > > > - Is it one value or many?
> > > > > > > It will be one value, similar to the location in Iceberg and the
> > > > > > > storage_location in Unity Catalog.
> > > > > > >
> > > > > > > Regarding the point about having new data in new locations and
> > > > > > > keeping old data in old locations, do we support that for Iceberg
> > > > > > > today?
> > > > > > > Most Spark tables seem to have only one location. Also, I
> > > > > > > think it is better to start restricted first, and then extend it to
> > > > > > > allow multiple locations when the use case arises.
> > > > > >
> > > > > > > Ref:
> > > > > > > Iceberg location:
> > > > > > > https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L3451
> > > > > > > Storage location in Unity Catalog:
> > > > > > > https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L3451
> > > > > >
> > > > > > > - Is it a URI?
> > > > > > > Yes, it will be a URI, which seems to be the standard across catalog
> > > > > > > implementations.
> > > > > > > Regarding the point about s3 vs. s3a, I assume that is a common
> > > > > > > problem that every catalog implementation needs to address, and we
> > > > > > > will stay the same on this part. At least from the load table point
> > > > > > > of view, the Spark engine knows how to deal with such cases.
> > > > > >
> > > > > > > - Does it point to any particular file?
> > > > > > > No, it doesn't point to a particular file. It is the base table
> > > > > > > location.
> > > > > > >
> > > > > > > - Is it a common prefix of all files within a table?
> > > > > > > It is supposed to be the base table location, which theoretically
> > > > > > > should be the common prefix of all files within a table, I believe.
> > > > > >
> > > > > > > - What happens when a value does not match these expectations?
> > > > > > > Whether it is one value or many is already restricted by the spec.
> > > > > > > For the URI format, I think we can do a format check and fail if it
> > > > > > > does not pass.
> > > > > > > Other than that, we will not do any other special checks; we rely on
> > > > > > > the client to provide the correct value, otherwise other engines
> > > > > > > will not be able to read the table successfully.
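
The answers above can be sketched roughly as follows, assuming Python's standard `urllib.parse` is an acceptable stand-in for whatever URI parsing a server would actually use. Both function names are hypothetical. The second helper is not a check anyone in this thread proposes for the server; it only illustrates what "common prefix of all files within a table" would mean for a single base location.

```python
from urllib.parse import urlparse


def check_location_format(location: str) -> str:
    """Reject values that are not absolute URIs; return the value otherwise.

    This only checks the shape of the string. It does not contact storage.
    """
    parsed = urlparse(location)
    if not parsed.scheme:
        raise ValueError(f"location has no URI scheme: {location!r}")
    if not (parsed.netloc or parsed.path):
        raise ValueError(f"location has no authority or path: {location!r}")
    return location


def is_under_location(base: str, file_uri: str) -> bool:
    """Return True if file_uri sits below the base table location."""
    # Compare on a trailing-slash boundary so that "s3://b/tbl" does not
    # match files under "s3://b/tbl2/".
    prefix = base.rstrip("/") + "/"
    return file_uri.startswith(prefix)
```

Note that a plain string comparison like this treats `s3://` and `s3a://` as different locations, which is the scheme question raised elsewhere in this thread.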
> > > > > >
> > > > > > > For the location keyword, as Eric has pointed out, we can
> > > > > > > potentially have a reserved key in the properties. However, location
> > > > > > > is a common enough key among various table formats that it is worth
> > > > > > > a dedicated key to help store and load the information in a more
> > > > > > > straightforward way. For things that are specific to one or two
> > > > > > > formats, I think it makes more sense to use a reserved property key.
> > > > > >
> > > > > > > As a reference, in Iceberg, the CreateTable request and
> > > > > > > TableMetadata do have an explicit location key in the spec.
> > > > > > > write.data.path and write.metadata.path are passed as properties
> > > > > > > today.
> > > > > >
> > > > > > Best Regards,
> > > > > > Yun
> > > > > >
> > > > > >
> > > > > > > On Wed, May 7, 2025 at 3:54 PM Dmitri Bourlatchkov <di...@apache.org>
> > > > > > > wrote:
> > > > > >
> > > > > > >> Another point: I'm pretty sure sooner or later users will want to
> > > > > > >> move their data to some other location. As an option, users may want
> > > > > > >> to write new files into another location but keep old files in place.
> > > > > > >>
> > > > > > >> Also: if the location is a URI, how do we deal with s3 vs. s3a, for
> > > > > > >> example?
> > > > > > >>
> > > > > > >> In Iceberg it is quite common for different engines to use different
> > > > > > >> access tools, which often leads to different URI schemes.
> > > > > >>
> > > > > >> Cheers,
> > > > > >> Dmitri.
> > > > > >>
> > > > > > >> On Wed, May 7, 2025 at 6:46 PM Eric Maynard <eric.w.mayn...@gmail.com>
> > > > > > >> wrote:
> > > > > >>
> > > > > > >> > All good questions, Dmitri. I’m especially interested in the first
> > > > > > >> > one, as from what I understand Iceberg tables can have metadata and
> > > > > > >> > data at two different paths that we need to vend credentials for.
> > > > > > >> >
> > > > > > >> > For Iceberg tables, we just use special properties to track these
> > > > > > >> > locations. I wonder if we couldn’t do the same for generic tables.
> > > > > >> >
> > > > > > >> > On Wed, May 7, 2025 at 3:42 PM Dmitri Bourlatchkov <di...@apache.org>
> > > > > > >> > wrote:
> > > > > >> >
> > > > > >> > > Hi Yun,
> > > > > >> > >
> > > > > > >> > > Please clarify the meaning of the value of the new location
> > > > > > >> > > attribute.
> > > > > > >> > >
> > > > > > >> > > - Is it one value or many?
> > > > > > >> > > - Is it a URI?
> > > > > > >> > > - Does it point to any particular file?
> > > > > > >> > > - Is it a common prefix of all files within a table?
> > > > > > >> > > - What happens when a value does not match these expectations?
> > > > > >> > >
> > > > > >> > > Thanks,
> > > > > >> > > Dmitri.
> > > > > >> > >
> > > > > >> > > On 2025/05/07 21:50:19 yun zou wrote:
> > > > > >> > > > Hi folks,
> > > > > >> > > >
> > > > > > >> > > > I would like to propose adding an optional `location` field to
> > > > > > >> > > > the CreateGenericTable request and LoadGenericTable response.
> > > > > > >> > > >
> > > > > > >> > > > The `location` is the location of the table, which is common to
> > > > > > >> > > > most table formats including Iceberg, Delta, Hudi, CSV, Parquet,
> > > > > > >> > > > etc. The location information is critical for loading the table
> > > > > > >> > > > on the engine side; having a dedicated keyword could help improve
> > > > > > >> > > > the robustness of cross-engine sharing, instead of relying on the
> > > > > > >> > > > properties passed by the client side.
> > > > > > >> > > >
> > > > > > >> > > > Furthermore, this information is also required to provide
> > > > > > >> > > > credential vending capabilities later.
> > > > > >> > > >
> > > > > >> > > > Here is the PR for adding the spec:
> > > > > >> > > > https://github.com/apache/polaris/pull/1543
> > > > > >> > > >
> > > > > >> > > > Looking forward to your reply and feedback!
> > > > > >> > > >
> > > > > >> > > > Best Regards,
> > > > > >> > > > Yun
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>
