Re: [PR] [cli] paimon cli [paimon]

via GitHub Wed, 27 May 2026 19:00:05 -0700


hqlalala commented on PR #7975:
URL: https://github.com/apache/paimon/pull/7975#issuecomment-4560197882


   Thanks for the thorough review! Addressing each point:
   
     ### 1. Overlap with Python CLI
   
     The two CLIs serve different user segments, so this isn't duplication. The 
Python CLI targets data scientists working in `pip` environments; the Java CLI 
targets DevOps/SREs on production clusters where a JDK is always available but 
Python often isn't.
   
     The feature sets are also not 1:1 — the Java CLI includes ops-oriented 
commands (expire-snapshots, orphan-clean, tag management, rollback) that the  
Python CLI doesn't have. This mirrors how Flink, Spark, and Hive each maintain 
their own CLI despite overlapping SQL capabilities.
   
     ### 2. Calcite dependency
   
     Calcite 1.37.0 itself adds ~8MB, which is the cost of a full SQL planner 
with JOIN, subquery, window-function, and UNION support. This parallels the 
Python CLI's use of DataFusion as its SQL engine.
   
     Final jar size: **70MB** (core only). S3/OSS cloud storage plugins are 
**not bundled**. They're loaded as optional plugins via the install script 
(`./install.sh --with oss`), the same model as the Python CLI's 
`pip install paimon[oss]`.
   
     Calcite runs embedded in the CLI process with its own classloader. It 
never touches the user's Hadoop/Hive classpaths, so there are no version  
compatibility concerns with the cluster environment.
   
     ### 3. Module structure
   
     `paimon-python/` is already a top-level module in the main repo and part 
of releases. `paimon-cli` follows the same pattern. It has zero dependency on  
Flink/Spark and won't affect their release cycles.
   
     I'm open to adding `<maven.deploy.skip>true</maven.deploy.skip>` initially 
so it's built and tested in CI but not published to Maven Central until it  
stabilizes.
   
     ### 4. SQL engine completeness
   
     The CLI is a standalone diagnostic/management tool, not a Flink SQL or 
Spark SQL replacement. The `sql` subcommand explicitly states it uses Calcite's 
SQL dialect (configured with `"lex": "MYSQL"` mode, same as Python CLI's 
DataFusion default).
   
     Calcite covers standard SQL, which is sufficient for the CLI's use case: 
ad-hoc queries for table inspection and debugging. Users who need 
engine-specific syntax (e.g., Flink's temporal joins, Spark's TRANSFORM) should 
use those engines directly.
   
     ### 5. Write path
   
     `WriteCommand` uses the standard Paimon Java API: 
`table.newBatchWriteBuilder()` → `BatchTableWrite` → `prepareCommit()` → 
`commit()`. This goes through the same `FileStoreCommit` with snapshot-based 
atomic commit that Flink and Spark use internally.
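   
     As a rough illustration of that stage-then-commit flow, here is a 
self-contained toy model. It does not use Paimon's classes: `Table`, `Writer`, 
and `CommitMessage` below are hypothetical stand-ins, and the in-memory 
snapshot list is only an assumption made to keep the sketch runnable. It shows 
that rows are staged first and become visible only when a single commit step 
publishes a new snapshot.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of a stage-then-commit write. None of these types are Paimon classes. */
public class TwoPhaseWriteSketch {

    /** Hypothetical stand-in for a commit message: the rows staged by one writer. */
    static final class CommitMessage {
        final List<String> rows;

        CommitMessage(List<String> rows) {
            this.rows = rows;
        }
    }

    /** Hypothetical table whose readers only ever see fully committed snapshots. */
    static final class Table {
        private List<List<String>> snapshots = new ArrayList<>();

        /** Publishes all messages together as one new snapshot and returns its id. */
        synchronized long commit(List<CommitMessage> messages) {
            List<String> next = new ArrayList<>(latest());
            for (CommitMessage m : messages) {
                next.addAll(m.rows);
            }
            List<List<String>> updated = new ArrayList<>(snapshots);
            updated.add(Collections.unmodifiableList(next));
            snapshots = updated; // one reference swap makes the new snapshot visible
            return snapshots.size();
        }

        /** Returns the rows of the latest committed snapshot (empty if none). */
        synchronized List<String> latest() {
            if (snapshots.isEmpty()) {
                return Collections.<String>emptyList();
            }
            return snapshots.get(snapshots.size() - 1);
        }
    }

    /** Hypothetical batch writer: buffers rows until prepareCommit() hands them off. */
    static final class Writer {
        private final List<String> staged = new ArrayList<>();

        void write(String row) {
            staged.add(row);
        }

        /** Turns the staged rows into commit messages and clears the buffer. */
        List<CommitMessage> prepareCommit() {
            List<CommitMessage> out = new ArrayList<>();
            out.add(new CommitMessage(new ArrayList<>(staged)));
            staged.clear();
            return out;
        }
    }

    public static void main(String[] args) {
        Table table = new Table();
        Writer writer = new Writer();
        writer.write("1,alice");
        writer.write("2,bob");
        // Rows are only staged at this point, so the table still looks empty.
        System.out.println("before commit: " + table.latest());
        long snapshotId = table.commit(writer.prepareCommit());
        System.out.println("snapshot " + snapshotId + ": " + table.latest());
    }
}
```

     In this model a failed or abandoned import leaves the latest snapshot 
untouched, because nothing is published until `commit` runs. That is the 
property the single-user batch import scope relies on.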
   
     The CLI is designed for single-user batch imports (CSV/JSON → table), not 
concurrent streaming writes. This is the same scope as the Python CLI's 
`paimon table import` command, which also uses 
`write_builder.new_commit().commit()`.
   
     ### 6. PIP
   
     I wasn't aware this required a PIP; `paimon-python/` (including its CLI, 
DataFusion SQL engine, and write path) was added to the main repo without one. 
If the community feels a PIP is needed for this module, I'm happy to draft one 
and start a discussion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
