GitHub user yuqi1129 added a comment to the discussion: [DISCUSS] Long-term 
architecture for Lance support in Gravitino

@FANNG1

Quick observation + question before we go deeper into the three-backend split:

For all three backends (path / rest / gravitino), the ultimately reliable 
metadata is whatever you read from the dataset's location. Lance is 
path-self-describing — unlike Iceberg, it does not need an external metadata 
pointer to locate the schema; opening the path gives you schema, versions, 
fragments. And since applications can (and routinely do) reach the storage 
directly — Python notebooks doing `lance.dataset("s3://...")`, ETL jobs writing 
fragments, training pipelines opening shards — they will bypass Lance REST 
entirely. Any catalog-held metadata is at best a cache or projection of what 
the path already says. The location-resolved view is the universal fallback / 
reconciliation source in every backend.
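
To make the "cache or projection" point concrete, here is a minimal sketch of a catalog entry that treats the path as the source of truth on every read. All names (`Snapshot`, `CatalogEntry`, `PathReader`, `resolve`) and the example URI are hypothetical and not part of Gravitino or Lance; the reader callable just stands in for whatever actually opens the dataset location.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Snapshot:
    """What a reader learns by opening the dataset location (simplified)."""
    version: int
    schema: Tuple[str, ...]  # column names only, for brevity


# Hypothetical: stands in for opening the path, e.g. via lance.dataset(uri).
PathReader = Callable[[str], Snapshot]


class CatalogEntry:
    """Catalog-held metadata, modeled as a cache of what the path says."""

    def __init__(self, location: str, cached: Snapshot) -> None:
        self.location = location
        self.cached = cached

    def resolve(self, read_path: PathReader) -> Tuple[Snapshot, bool]:
        """Return (snapshot, drifted); the path wins whenever it disagrees."""
        actual = read_path(self.location)
        drifted = actual != self.cached
        if drifted:
            # An out-of-band writer moved the dataset; refresh the cache.
            self.cached = actual
        return actual, drifted


# Usage with an in-memory stand-in for storage (assumed URI, illustration only).
storage = {"s3://bucket/ds.lance": Snapshot(3, ("id", "vector"))}
entry = CatalogEntry("s3://bucket/ds.lance", Snapshot(2, ("id",)))
snapshot, drifted = entry.resolve(storage.__getitem__)
```

In this sketch the catalog never serves its cached view without checking the location first. That is only one possible policy; a backend could instead refresh lazily or on a schedule, and which of these each backend does is exactly the open question.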

The Gravitino Lance REST service as it stands today already runs into this in 
several concrete ways:

1. **No way to manage existing Lance tables.** For a dataset already living at 
some path, today we can only register the path; there is no story for adopting 
the dataset as a first-class catalog table with its versions, schema, and 
history surfaced through the catalog.
2. **Multi-writer / out-of-band write drift.** If a user writes directly to the 
dataset path (which Lance encourages), the Lance REST service's metadata view 
is not refreshed in time, so reads going through REST diverge from what the 
path actually holds.
3. **Surrounding gaps.** Authentication / authorization, governance (lineage, 
tags, owners), engine integration (Spark / Ray / Daft), and general usability 
all still need to be defined per backend.

Given the above, could you spell out the concrete primary use case for each 
backend: specifically, what each one gives users that the other two do not, and 
how each handles the "an app wrote directly to the path and bypassed me" case? 
It would be good to have one paragraph per backend in the proposal that names 
the target user, the differentiator, and the bypass story.


GitHub link: 
https://github.com/apache/gravitino/discussions/11295#discussioncomment-17102174
