[I] Proposal: Create `dfdb`, a new CLI different than `datafusion-cli` with pre-built integrations [datafusion]

via GitHub Wed, 14 Aug 2024 04:51:30 -0700


alamb opened a new issue, #11979:
URL: https://github.com/apache/datafusion/issues/11979


   # TLDR
   * Keep `datafusion-cli` in the apache/datafusion repo
   * Make a new repo with a new CLI called `dfdb` (or `datafusion-cli++` or
`dfcli`) which is purposely designed for running queries against a wide variety
of pre-integrated sources
   
   # Problem Statement
   
   As of today, `datafusion-cli` 
([docs](https://datafusion.apache.org/user-guide/cli/index.html)) serves two 
roles:
   1. A debugging / testing tool for the DataFusion query engine developers
   2. A CLI tool for actually doing useful processing of files (locally and
remotely using object store), similar to the `duckdb` CLI tool
   
   It is really sweet to have a CLI that lets you query a directory of parquet
files:
   
   ```sql
   DataFusion CLI v41.0.0
   >
   > select "WatchID", "EventDate", "URL" from hits_partitioned limit 10;
   +---------------------+-----------+------------------------------------------------------------------------------------------------------+
   | WatchID             | EventDate | URL                                                                                                  |
   +---------------------+-----------+------------------------------------------------------------------------------------------------------+
   | 6904841588848398438 | 15895     | 687474703a2f2f736d6573686172696b692e72752f6d616e756661637475726572363437                             |
   ...
   | 7551542980199423249 | 15895     | 687474703a2f2f736d6573686172696b692e72752f6d616e756661637475726572363437                             |
   +---------------------+-----------+------------------------------------------------------------------------------------------------------+
   10 row(s) fetched.
   Elapsed 0.059 seconds.
   ```
   
   However, similar to the [discussion we have had about
`datafusion-python`](https://github.com/apache/datafusion-python/issues/440),
this dual role creates a tension between keeping the core lean and easy to
embed (e.g. fewer dependencies) and making a better CLI experience
   
   ## Examples of Friction
   
   I have recently seen some PRs that are basically integrations that would
make datafusion-cli a better end user tool, but bring more dependencies and
complexity to DataFusion. For example:
   1. Hugging Face: https://github.com/apache/datafusion/pull/10792 from
@xinlifoobar
   2. FlightSQL: https://github.com/apache/datafusion/pull/11938 from @ccciudatu
   
   I realize I have been partly responsible for this mess and for that I 
apologize. 
   
   # Proposal
   I propose resolving this conflict by creating a new repository for the "CLI
tool people actually use"
   
   We would keep `datafusion-cli` as it is: a relatively small, thin
wrapper around the core engine. I don't think we should remove features, but we
also wouldn't add them (other than what is added to the engine by default)
   
   We would add many new features / capabilities to this `dfdb` tool
   
   
   ## Examples of new features
   There are several obvious examples of integrations that would be super 
useful for users of a CLI tool but not appropriate for the datafusion repo (due 
to circular dependencies, for example):
   * apache iceberg: https://github.com/apache/iceberg-rust
   * delta-rs: https://github.com/delta-io/delta-rs
   * hudi: https://github.com/apache/hudi-rs
   
   
   @philippemnoel actually refers to the [lack of built in Apache Iceberg
support in his blog](https://blog.paradedb.com/pages/iceberg_lakehouse) about
switching to using duckdb. This is sad given all the code to use datafusion and
delta exists; there just isn't a pre-integrated binary that shows how to hook
it up and makes it easy to get up and running
   
   ## Other cool features 
   There are many other cool features I have dreamed about adding to a CLI that 
might be more appropriate in a separate repo. Some ideas to inspire:
   
   1. Local catalog support (imagine if you could store your `CREATE EXTERNAL
TABLE` definitions in a file somewhere, `.open <filename>` style)
   2. Local parquet metadata cache (imagine being able to cache the parquet 
metadata for 100s of files in object store in some sort of persistence format 
so future queries were fast)
   3. SQL auto completion
   4. etc. 
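
To make the local catalog idea (item 1) a bit more concrete, here is a minimal sketch of what loading such a file could look like. This is only an illustration and not an existing `datafusion-cli` or `dfdb` feature: the file name `catalog.sql`, the helper `load_catalog`, the `events` table, and the "one statement per `;`" format are all assumptions made for the example. The idea is simply that the CLI could read saved `CREATE EXTERNAL TABLE` statements at startup and replay them against the engine.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical catalog format (an assumption, not a DataFusion feature):
# a plain SQL file holding only CREATE EXTERNAL TABLE statements,
# each terminated by ';'.
TABLE_NAME = re.compile(
    r"CREATE\s+EXTERNAL\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\w+)",
    re.IGNORECASE,
)


def load_catalog(path):
    """Return {table_name: statement} for every definition in the file."""
    tables = {}
    text = Path(path).read_text()
    # Naive split: assumes no ';' appears inside a quoted location string.
    for raw in text.split(";"):
        # Drop '--' comment lines, then collapse all whitespace runs.
        lines = [ln for ln in raw.splitlines() if not ln.strip().startswith("--")]
        stmt = " ".join(" ".join(lines).split())
        if not stmt:
            continue
        match = TABLE_NAME.match(stmt)
        if match is None:
            raise ValueError(f"not a CREATE EXTERNAL TABLE statement: {stmt!r}")
        tables[match.group(1)] = stmt + ";"
    return tables


# Example catalog contents. The locations and the `events` table are made up.
CATALOG_TEXT = """\
-- saved table definitions
CREATE EXTERNAL TABLE hits_partitioned
STORED AS PARQUET
LOCATION 'hits_partitioned/';

CREATE EXTERNAL TABLE events
STORED AS CSV
LOCATION 'events.csv';
"""

with tempfile.TemporaryDirectory() as tmp:
    catalog_path = Path(tmp) / "catalog.sql"
    catalog_path.write_text(CATALOG_TEXT)
    catalog = load_catalog(catalog_path)

for name, stmt in catalog.items():
    # A real tool would execute each statement against the engine here.
    print(name, "->", stmt)
```

A plain SQL file keeps the catalog human-editable and means no new serialization format is needed; a real implementation would likely use a proper SQL parser instead of splitting on `;`.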
   
   
   # Open questions
   ## Should the new tool be in the
[`datafusion-contrib`](https://github.com/datafusion-contrib) organization or
the `apache` organization?
   
   The tradeoff is that `datafusion-contrib` could move faster with less
governance overhead, but the tool would also lose the apache community
   
   I personally suggest we start with this tool in the `datafusion-contrib`
organization, and if there is interest we can discuss bringing it back to the
apache organization.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

