GitHub user yuqi1129 added a comment to the discussion: [DISCUSS] Long-term
architecture for Lance support in Gravitino
@FANNG1
Quick observation + question before we go deeper into the three-backend split:
For all three backends (path / rest / gravitino), the ultimately reliable
metadata is whatever you read from the dataset's location. Lance is
path-self-describing: unlike Iceberg, it does not need an external metadata
pointer to locate the schema; opening the path gives you the schema, versions,
and fragments. And since applications can (and routinely do) reach the storage
directly (Python notebooks doing `lance.dataset("s3://...")`, ETL jobs writing
fragments, training pipelines opening shards), they will bypass Lance REST
entirely. Any catalog-held metadata is therefore at best a cache or projection
of what the path already says. The location-resolved view is the universal
fallback / reconciliation source in every backend.
The Gravitino Lance REST service as it stands today already runs into this in
several concrete ways:
1. **No way to manage existing Lance tables.** For a dataset already living at
some path, today we can only register the path; there is no story for adopting
the dataset as a first-class catalog table with its versions, schema, and
history surfaced through the catalog.
2. **Multi-writer / out-of-band write drift.** If a user writes directly to the
dataset path (which Lance encourages), the Lance REST service's metadata view
is not refreshed in time, so reads going through REST diverge from what the
path actually holds.
3. **Surrounding gaps.** Authentication / authorization, governance (lineage,
tags, owners), engine integration (Spark / Ray / Daft), and general usability
all still need to be defined per backend.
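To make item 2 concrete, here is a minimal sketch of the "catalog as cache,
path as source of truth" idea. It is not Gravitino or Lance code: `TableView`,
`CatalogCache`, `read_from_path`, and the `example://` location are all
hypothetical names invented for illustration, and the storage is simulated
with a dict so the snippet runs on its own.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TableView:
    """Snapshot of a dataset's metadata (hypothetical, simplified shape)."""
    location: str
    version: int
    schema: tuple  # column names only, for brevity


class CatalogCache:
    """Hypothetical catalog that treats its stored metadata as a cache.

    `read_from_path` stands in for whatever actually opens the dataset
    location; it is injected so the sketch needs no real storage.
    """

    def __init__(self, read_from_path: Callable[[str], TableView]):
        self._read_from_path = read_from_path
        self._cached: dict[str, TableView] = {}

    def register(self, name: str, location: str) -> TableView:
        # Adopt whatever the path says at registration time.
        view = self._read_from_path(location)
        self._cached[name] = view
        return view

    def load(self, name: str) -> tuple[TableView, bool]:
        """Return (view, drifted); the path wins when it disagrees."""
        cached = self._cached[name]
        actual = self._read_from_path(cached.location)
        drifted = actual != cached
        if drifted:
            self._cached[name] = actual  # reconcile the cache
        return actual, drifted


# Simulated storage: location -> metadata currently at the path.
LOCATION = "example://demo/table"  # hypothetical location
storage = {LOCATION: TableView(LOCATION, 1, ("id",))}

catalog = CatalogCache(lambda loc: storage[loc])
catalog.register("t", LOCATION)

# An application writes directly to the path, bypassing the catalog.
storage[LOCATION] = TableView(LOCATION, 2, ("id", "vec"))

view, drifted = catalog.load("t")
```

Re-resolving the path on every load is the simplest possible bypass story; a
real backend would have to decide how often it can afford to do that.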
Given the above, could you spell out the concrete primary use case for each
backend: specifically, what each one gives users that the other two do not, and
how each handles the "an app wrote directly to the path and bypassed me" case?
It would be good to have one paragraph per backend in the proposal that names
the target user, the differentiator, and the bypass story.
GitHub link:
https://github.com/apache/gravitino/discussions/11295#discussioncomment-17102174