kosiew commented on issue #1142:
URL:
https://github.com/apache/datafusion-python/issues/1142#issuecomment-2969114420
## Context & Problem Statement
- **Current state**
  - **DataFusion core repo:** uses `catalog/schema/table`
  - **DataFusion Python repo:** uses `catalog/database/table`
  - Other usages: `catalog/namespace/table`
- **Problem**
  - The inconsistent terminology confuses users and contributors, especially as the API matures and more complex catalog operations become common.
  - A consistent 3-level hierarchical naming scheme is needed across all interfaces and repositories.
- **Why does this matter?**
  - Clear, consistent naming improves user understanding and reduces errors.
  - Aligning the semantics improves interoperability and documentation clarity.
  - Many users presently might have only one schema/database/namespace, so now is a good time to change, before adoption widens.
---
## Key Terms and Their Semantic Meanings
| Term | Typical Meaning in Databases |
|------|------------------------------|
| **Catalog** | The highest-level grouping; often corresponds to a data source or cluster. |
| **Schema** | A logical grouping/container within a catalog, often corresponding to a namespace for tables (e.g., a Postgres schema). |
| **Database** | Sometimes synonymous with catalog (e.g., MySQL: the database is a catalog), other times a separate level. |
| **Namespace** | A more generic term for a logical scope that contains tables; can be interchangeable with schema or database, depending on the system. |
| **Table** | The actual table or data object. |
---
## Exploration of the Three Naming Variants
### 1. `catalog/schema/table`
- **Pros**
  - Matches the DataFusion core repo convention, supporting consistency with the core project.
  - Matches widespread SQL usage, e.g., Oracle/Postgres, where schema = namespace under catalog.
  - Clear semantic distinction: catalog as the source, schema as the logical grouping.
- **Cons**
  - The term `schema` might confuse users coming from MySQL or other systems where schema = database.
  - In some systems, "database" is the term used instead.
### 2. `catalog/database/table`
- **Pros**
  - Familiar to users from MySQL, BigQuery, and others that treat "database" as the middle layer.
  - More intuitive for newcomers who think in terms of databases rather than schemas.
- **Cons**
  - Conflicts with DataFusion core (which prefers "schema").
  - "Database" and "catalog" meanings overlap in different systems, risking ambiguity.
### 3. `catalog/namespace/table`
- **Pros**
  - Namespace is generic and can adapt to any system (equivalent to schema or database).
  - Avoids confusion by not tying to concrete DBMS terminology.
  - Aligns with abstraction in distributed systems and catalog APIs.
- **Cons**
  - Less immediately familiar to SQL users.
  - Could add cognitive overhead if users expect more standard terms.
---
## Diagram: How these map onto a conceptual hierarchy
```plaintext
+---------------------------+
| Catalog |
| +-----------------------+ |
| | Schema / Database / | |
| | Namespace | |
| | +-------------------+ | |
| | | Table | | |
| | +-------------------+ | |
| +-----------------------+ |
+---------------------------+
```
- The middle layer is the point of ambiguity: schema / database / namespace.
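
To make the three levels concrete, here is a minimal Python sketch of a fully qualified table reference. `TableRef` and its `parse` method are hypothetical names used only for illustration; they are not part of DataFusion or datafusion-python, and the naive split on `.` assumes identifiers are unquoted and contain no dots.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRef:
    """Hypothetical three-level table reference (not a DataFusion type)."""

    catalog: str
    schema: str  # the ambiguous middle layer: schema / database / namespace
    table: str

    @classmethod
    def parse(cls, qualified_name: str) -> "TableRef":
        # Assumes unquoted identifiers without embedded dots.
        parts = qualified_name.split(".")
        if len(parts) != 3:
            raise ValueError(
                f"expected catalog.schema.table, got {qualified_name!r}"
            )
        return cls(*parts)


ref = TableRef.parse("my_catalog.public.orders")
print(ref.schema)  # public
```

Whatever the middle field ends up being called, the structure stays the same; only the name of that one attribute is in question.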
---
## Recommendations
### Align with DataFusion Core
- Since the **core repo uses `catalog/schema/table`** and DataFusion core is the source of truth, **standardizing on `catalog/schema/table`** is recommended to reduce cognitive dissonance.
### Provide Alias or Flexibility in Python Bindings
- For the Python API and other language bindings, consider exposing aliases or conversion helpers so users can think in terms of databases or namespaces if desired.
- Documentation should clearly state what "schema" means in DataFusion parlance.
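
One possible shape for such an alias is sketched below. The `Catalog` class here is a hypothetical stand-in, not the real datafusion-python class, and the method names and warning text are assumptions; the point is only that the legacy name can delegate to the preferred one while signalling the preferred spelling.

```python
import warnings


class Catalog:
    """Hypothetical stand-in for a catalog object (illustration only)."""

    def __init__(self, schemas: dict):
        # Maps schema name -> {table name -> table object}.
        self._schemas = schemas

    def schema(self, name: str) -> dict:
        """Preferred accessor, matching `catalog/schema/table`."""
        return self._schemas[name]

    def database(self, name: str) -> dict:
        """Alias for users who think in terms of databases."""
        warnings.warn(
            "database() is an alias for schema(); prefer schema()",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.schema(name)


catalog = Catalog({"public": {"orders": object()}})
# Both spellings return the same underlying object.
assert catalog.database("public") is catalog.schema("public")
```

Whether the alias should warn at all, or simply coexist silently, is a design choice the project would need to make.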
### Consider Long-Term Evolution
- If the ecosystem grows to support multiple systems with conflicting
terminologies, *consider introducing a configurable abstraction layer* to map
terms more flexibly. For now, keep it simple.
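
If such a mapping layer were ever needed, it could start as small as a lookup table. The following is a sketch under that assumption; `MIDDLE_LEVEL_ALIASES` and `canonical_level` are hypothetical names, not existing DataFusion APIs.

```python
# Hypothetical mapping from user-facing terms to the canonical core term.
MIDDLE_LEVEL_ALIASES = {
    "schema": "schema",
    "database": "schema",
    "namespace": "schema",
}


def canonical_level(term: str) -> str:
    """Normalize a hierarchy term to catalog / schema / table."""
    key = term.strip().lower()
    if key in ("catalog", "table"):
        return key
    try:
        return MIDDLE_LEVEL_ALIASES[key]
    except KeyError:
        raise ValueError(f"unknown hierarchy term: {term!r}") from None


print([canonical_level(t) for t in ("catalog", "Database", "namespace", "table")])
# ['catalog', 'schema', 'schema', 'table']
```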
---
## Summary of User Impact
| User Scenario | Impact of Change to `catalog/schema/table` |
|---------------|--------------------------------------------|
| Single-schema users | Minimal to no impact; mostly transparent |
| Multi-schema advanced users | Gains clarity and consistency |
| Users from MySQL-style systems | Need to adapt terminology slightly, but this is common in cross-platform tools |
| Documentation and tooling | Greater consistency and clarity |
---