Jimexist edited a comment on issue #1096: URL: https://github.com/apache/arrow-datafusion/issues/1096#issuecomment-939418770
Thanks for raising this issue and question @rupurt. First off, I don't believe there's any centralized view on what datafusion-cli is or isn't, and there's no process yet to determine that. However, I can share some of my thinking here.

## What datafusion-cli is not

Our ecosystem is full of different tools for manipulating data, each filling its own niche. In my opinion, datafusion-cli should not try to be yet another general-purpose data manipulation tool, especially since datafusion itself is intended to be an embeddable component for other tools (e.g. [cube](https://cube.dev/), [ballista](https://github.com/ballista-compute/ballista), [roapi](https://roapi.github.io/docs/)); spreading it thin would only add confusion and fragment our tech investment. Specifically, datafusion-cli is not:

1. a general-purpose, Python-enabled tool to query data; for that you have the [Python binding](https://pypi.org/project/datafusion) for datafusion itself, or [polars](https://github.com/pola-rs/polars) for its speed and pandas compatibility, both leveraging Arrow underneath
2. a command-line tool to manipulate small tabular or structured data; for that you have [xsv](https://github.com/BurntSushi/xsv), [jq](https://github.com/stedolan/jq), or [rq](https://github.com/dflemstr/rq), depending on the file format
3. a client to an HTTP- or GraphQL-enabled server backend; for that you can use [roapi](https://roapi.github.io/docs/) or similar tools, and in many cases (when the data size is large) Spark or Presto is just fine

Also, regarding 3: please note that (AFAIK) datafusion and datafusion-cli themselves do not deal with distributed computing; data sharding is something built on top of them. They can only do in-memory, uniformly accessible data manipulation.

## What datafusion-cli can be

Given the above, I believe the place where datafusion-cli can shine is:

1. when the data is too large for simple tools like `jq` or `xsv` to handle within a reasonable amount of time, yet still small enough to fit into memory ([EC2 machines offer up to 12 TB](https://aws.amazon.com/about-aws/whats-new/2021/05/four-ec2-high-memory-instances-with-up-to-12tb-memory-available-with-on-demand-and-savings-plan-purchase-options/)), assuming you do care about speed
2. when there's no need to keep a long-running server or adopt a full Spark or Presto cluster, given their high maintenance cost

In many ETL workflows you run into exactly this need for a "glue" script: something that can be run from a simple shell script, triggered manually, via Airflow, from crontab, etc. It runs to completion without any setup, takes in one or more Parquet files and an SQL script, and spits out target data as JSON, CSV, or maybe Parquet (a rough sketch of such a script is shown below). It can also be a REPL, an interactive shell to quickly verify commands and look into data.
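To make the "glue" script idea concrete, here is a minimal sketch. The file names, table name, and columns are made up for illustration, and the `-f`/`--format` flags are assumptions about the options in your installed datafusion-cli version, so check `datafusion-cli --help` before relying on them.

```bash
# Sketch of a one-shot ETL "glue" script (hypothetical names throughout).
# Write the SQL script: register a Parquet file as a table, then query it.
cat > query.sql <<'SQL'
CREATE EXTERNAL TABLE events STORED AS PARQUET LOCATION 'events.parquet';
SELECT event_type, COUNT(*) AS cnt FROM events GROUP BY event_type;
SQL

# Run it to completion and capture the result as CSV.
# Assumes -f (run SQL from a file) and --format (output format) are supported
# by the installed datafusion-cli; verify with `datafusion-cli --help`.
datafusion-cli -f query.sql --format csv > event_counts.csv
```

Such a script needs no server, no cluster, and no scheduler beyond whatever already invokes it, which is exactly the niche described above.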
## Considerations on API

I think there are two aspects of API for this CLI:

1. the command-line API (flags, options, naming, etc.)
2. the query syntax itself

For 2, it's clear that Postgres-compatible SQL is the choice here, but for 1 I don't think we are necessarily bound to the peculiarities of `psql` itself, because:

1. much of `psql` is specific to Postgres and to its client-server architecture
2. `psql` is mostly used interactively; when it comes to shell automation, a whole family of other scripts comes into play (e.g. `pg_ctl`, `pg_dumpall`, etc.), and here they would all map onto the single datafusion-cli

Having said that, I do think commands like `\copy` are already familiar to users and nice to have, but I admit that in many cases I am still confused by the different flags and behaviors when such a command is used interactively versus as a CLI option. We need to be more consistent here, just not necessarily consistent with `psql` itself.

## What comes next

Although I haven't been working in this area recently, the rough "roadmap" I have in mind would be:

1. build a fully abstracted layer of REPL parsing so that queries and commands are separated and handled correctly (currently it's kind of a hack)
2. widen usage by publishing to apt, brew, and possibly the NuGet registry so that more people can start using it
3. maybe adopt a shorter name, like `dfcli`?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
