alamb opened a new issue, #11979: URL: https://github.com/apache/datafusion/issues/11979
# TLDR

* Keep `datafusion-cli` in the apache/datafusion repo
* Make a new repo with a new CLI called `dfdb` (or `datafusion-cli++` or `dfcli`) that is purposely designed for running queries against a wide variety of pre-integrated sources

# Problem Statement

As of today, `datafusion-cli` ([docs](https://datafusion.apache.org/user-guide/cli/index.html)) serves two roles:

1. A debugging / testing tool for the DataFusion query engine developers
2. A CLI tool for actually doing useful processing of files (locally and remotely using object store), similar to the `duckdb` CLI tool

It is really sweet to have a CLI that lets you query a directory of parquet files

```sql
DataFusion CLI v41.0.0
>
> select "WatchID", "EventDate", "URL" from hits_partitioned limit 10;
+---------------------+-----------+------------------------------------------------------------------------------------------------------+
| WatchID             | EventDate | URL                                                                                                  |
+---------------------+-----------+------------------------------------------------------------------------------------------------------+
| 6904841588848398438 | 15895     | 687474703a2f2f736d6573686172696b692e72752f6d616e756661637475726572363437                             |
...
| 7551542980199423249 | 15895     | 687474703a2f2f736d6573686172696b692e72752f6d616e756661637475726572363437                             |
+---------------------+-----------+------------------------------------------------------------------------------------------------------+
10 row(s) fetched.
Elapsed 0.059 seconds.
```

However, similarly to the [discussion we have had with `datafusion-python`](https://github.com/apache/datafusion-python/issues/440), this dual role leads to a tension between keeping the core lean and easier to embed (e.g. fewer dependencies) and making a better CLI experience.

## Examples of Friction

I have recently seen some PRs that are basically integrations that would make datafusion-cli a better end user tool, but bring more dependencies and complexity to DataFusion. For example:

1. Hugging Face: https://github.com/apache/datafusion/pull/10792 from @xinlifoobar
2. FlightSQL: https://github.com/apache/datafusion/pull/11938 from @ccciudatu

I realize I have been partly responsible for this mess and for that I apologize.

# Proposal

I propose resolving this conflict by creating a new repository for the "CLI tool people actually use".

We would keep `datafusion-cli` as it is: a relatively small, thin wrapper around the core engine. I don't think we should remove features, but we also wouldn't add them (other than what was added to the engine by default).

We would add many new features / capabilities to this `dfdb` tool.

## Examples of new features

There are several obvious examples of integrations that would be super useful for users of a CLI tool but not appropriate for the datafusion repo (due to circular dependencies, for example):

* apache iceberg: https://github.com/apache/iceberg-rust
* delta-rs: https://github.com/delta-io/delta-rs
* hudi: https://github.com/apache/hudi-rs

@philippemnoel actually refers to the [lack of built in Apache Iceberg support in his blog](https://blog.paradedb.com/pages/iceberg_lakehouse) about switching to using duckdb. This is sad given that all the code to use datafusion and delta exists; there just isn't a pre-integrated binary that shows how to hook it up and is easy to get up and running with.

## Other cool features

There are many other cool features I have dreamed about adding to a CLI that might be more appropriate in a separate repo. Some ideas to inspire:

1. Local catalog support (imagine if you could store your `CREATE EXTERNAL TABLE` definitions in a file somewhere, `.open <filename>` style)
2. Local parquet metadata cache (imagine being able to cache the parquet metadata for 100s of files in object store in some sort of persistent format so future queries are fast)
3. SQL auto completion
4. etc.

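
To make the local catalog idea a bit more concrete, here is a rough sketch of what such a catalog file might hold and how a tool could read it back at startup. This is only an illustration under stated assumptions, not an existing `datafusion-cli` feature: the file contents, the `events` table, the `s3://my-bucket/...` location, and the `load_catalog` helper are all hypothetical. Only the `CREATE EXTERNAL TABLE ... STORED AS PARQUET LOCATION ...` statement form is DataFusion's own SQL.

```python
import re

# Hypothetical catalog file contents: plain SQL, statements separated by `;`.
CATALOG_SQL = """
-- tables registered at startup
CREATE EXTERNAL TABLE hits_partitioned
STORED AS PARQUET
LOCATION 'hits_partitioned/';

CREATE EXTERNAL TABLE events
STORED AS PARQUET
LOCATION 's3://my-bucket/events/';
"""

# Captures the table name that follows CREATE EXTERNAL TABLE.
_TABLE_NAME = re.compile(
    r"CREATE\s+EXTERNAL\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\w+)",
    re.IGNORECASE,
)


def load_catalog(text):
    """Return {table_name: statement} for each CREATE EXTERNAL TABLE in text."""
    # Strip `--` line comments, then split on `;`.
    # Naive on purpose: a `;` or `--` inside a string literal would break it.
    no_comments = "\n".join(line.split("--", 1)[0] for line in text.splitlines())
    tables = {}
    for raw in no_comments.split(";"):
        stmt = " ".join(raw.split())  # collapse whitespace onto one line
        if not stmt:
            continue
        match = _TABLE_NAME.match(stmt)
        if match:
            tables[match.group(1)] = stmt + ";"
    return tables


catalog = load_catalog(CATALOG_SQL)
for name, stmt in catalog.items():
    # A real tool would hand each statement to the query engine here.
    print(name, "->", stmt)
```

A real implementation would more likely use a proper SQL parser than a regex and a string split, but the point is that the "catalog" could simply be a file of statements replayed on open.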
# Open questions

## Should the new tool be in the [`datafusion-contrib`](https://github.com/datafusion-contrib) organization or the `apache` organization?

The tradeoffs are that `datafusion-contrib` could move faster / has less governance overhead, but would also lose the apache community.

I personally suggest we start with this tool in the `datafusion-contrib` organization, and if there is interest we can discuss bringing it back to the apache organization.