hqlalala commented on PR #7975:
URL: https://github.com/apache/paimon/pull/7975#issuecomment-4560197882
Thanks for the thorough review! Addressing each point:
### 1. Overlap with Python CLI
The two CLIs serve different user segments, so this is not duplication. The
Python CLI targets data scientists working in `pip` environments; the Java CLI
targets DevOps/SREs on production clusters where a JDK is always available but
Python often isn't.
The feature sets are also not 1:1 — the Java CLI includes ops-oriented
commands (expire-snapshots, orphan-clean, tag management, rollback) that the
Python CLI doesn't have. This mirrors how Flink, Spark, and Hive each maintain
their own CLI despite overlapping SQL capabilities.
### 2. Calcite dependency
Calcite 1.37.0 itself adds ~8MB, which is the cost of a full SQL planner with
JOIN, subquery, window function, and UNION support. This parallels the Python
CLI's use of DataFusion as its SQL engine.
Final jar size: **70MB** (core only). S3/OSS cloud storage plugins are
**not bundled**; they are loaded as optional plugins via the install script
(`./install.sh --with oss`), the same model as the Python CLI's
`pip install paimon[oss]`.
Calcite runs embedded in the CLI process with its own classloader. It
never touches the user's Hadoop/Hive classpaths, so there are no version
compatibility concerns with the cluster environment.
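To illustrate the isolation idea in general terms, here is a minimal JDK-only sketch of a classloader that cannot see the application classpath. It is not the PR's actual code: the class name `IsolatedLoaderSketch` and both helper methods are hypothetical, and the real CLI may wire its loader differently.

```java
import java.net.URL;
import java.net.URLClassLoader;

public class IsolatedLoaderSketch {

    /** Builds a loader that sees only the given jars plus JDK platform classes. */
    static URLClassLoader isolatedLoader(URL[] jars) {
        // Assumption: Java 9+. The parent is the platform loader rather than the
        // application loader, so nothing on the application classpath
        // (for example a cluster's Hadoop or Hive jars) is visible through it.
        return new URLClassLoader(jars, ClassLoader.getPlatformClassLoader());
    }

    /** Returns true if the loader can resolve the named class. */
    static boolean canLoad(ClassLoader loader, String className) {
        try {
            // initialize=false: only resolve the class, do not run static init.
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // An empty jar list stands in for the planner jars in this sketch.
        try (URLClassLoader loader = isolatedLoader(new URL[0])) {
            // JDK classes remain reachable through the platform parent.
            System.out.println(canLoad(loader, "java.lang.String"));
            // Application-classpath classes (this class included) are not.
            System.out.println(canLoad(loader, IsolatedLoaderSketch.class.getName()));
        }
    }
}
```

Run as-is, this should print `true` and then `false`: the isolated loader resolves JDK classes but nothing from the launching application's classpath. That is the property that keeps an embedded library from colliding with whatever versions the cluster ships.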
### 3. Module structure
`paimon-python/` is already a top-level module in the main repo and part
of releases. `paimon-cli` follows the same pattern. It has zero dependency on
Flink/Spark and won't affect their release cycles.
I'm open to adding `<maven.deploy.skip>true</maven.deploy.skip>` initially
so it's built and tested in CI but not published to Maven Central until it
stabilizes.
### 4. SQL engine completeness
The CLI is a standalone diagnostic/management tool, not a Flink SQL or
Spark SQL replacement. The `sql` subcommand explicitly states it uses Calcite's
SQL dialect (configured with `"lex": "MYSQL"` mode, same as Python CLI's
DataFusion default).
Calcite covers standard SQL, which is sufficient for the CLI's use case:
ad-hoc queries for table inspection and debugging. Users who need
engine-specific syntax (e.g., Flink's temporal joins, Spark's TRANSFORM) should
use those engines directly.
### 5. Write path
`WriteCommand` uses the standard Paimon Java API:
`table.newBatchWriteBuilder()` → `BatchTableWrite` → `prepareCommit()` →
`commit()`. This goes through the same `FileStoreCommit` with snapshot-based
atomic commit that Flink and Spark use internally.
The CLI is designed for single-user batch imports (CSV/JSON → table), not
concurrent streaming writes — same scope as Python CLI's `paimon table import`
command, which also uses `write_builder.new_commit().commit()`.
### 6. PIP
I wasn't aware this required a PIP — `paimon-python/` (including its CLI,
DataFusion SQL engine, and write path) was added to the main repo without one.
If the community feels a PIP is needed for this module, I'm happy to draft one
and start a discussion.