Jimexist edited a comment on issue #1096:
URL: 
https://github.com/apache/arrow-datafusion/issues/1096#issuecomment-939418770


   Thanks for raising this issue and question @rupurt.
   
   First off, I don't believe there's any centralized view on what 
datafusion-cli is or isn't, and there's no process yet to determine that. 
However, I can share some of my thoughts here.
   
   ## What datafusion-cli is not
   
   Our ecosystem is full of different tools for manipulating data, each filling 
its own niche. In my opinion, datafusion-cli should not try to be yet another 
general-purpose data-manipulation tool, especially since datafusion itself is 
intended to be an embeddable component for other tools 
(e.g. [cube](https://cube.dev/), 
[ballista](https://github.com/ballista-compute/ballista), 
[roapi](https://roapi.github.io/docs/)). Staying focused avoids confusion and 
reduces fragmented tech investment.
   
   Specifically, datafusion-cli is not:
   1. a general-purpose, Python-enabled tool to query data; for that you have 
the [Python binding](https://pypi.org/project/datafusion) for datafusion itself, 
or [polars](https://github.com/pola-rs/polars) for its speed and pandas 
compatibility, both building on Arrow underneath
   2. a command-line tool to manipulate small, tabular or structured data; for 
that you have [xsv](https://github.com/BurntSushi/xsv), 
[jq](https://github.com/stedolan/jq), or [rq](https://github.com/dflemstr/rq), 
depending on the file format you're working with
   3. a client to an HTTP- or GraphQL-enabled server backend; for that there is 
[roapi](https://roapi.github.io/docs/) or similar tools, and in many cases 
Spark or Presto is just fine (when the data size is large)
   
   Also, for 3., please note that (AFAIK) datafusion and datafusion-cli do not 
concern themselves with distributed computing, i.e. data sharding is something 
built on top of them; by themselves they only handle in-memory, uniformly 
accessible data.
   
   ## What datafusion-cli can be
   
   Given the above, I believe the places where datafusion-cli can shine are:
   1. when the data is too large for simple tools like `jq` or `xsv` to handle 
in a reasonable amount of time, but still small enough to fit into memory 
([EC2 machines offer up to 
12TB](https://aws.amazon.com/about-aws/whats-new/2021/05/four-ec2-high-memory-instances-with-up-to-12tb-memory-available-with-on-demand-and-savings-plan-purchase-options/)),
 assuming you care about speed
   2. when there's no need to keep a long-running server or adopt a full Spark 
or Presto cluster, given their high maintenance cost
   
   I think in many ETL scenarios you do encounter the need for a “glue” script 
that can run from a simple shell command, triggered manually, via Airflow, 
crontab, etc. It runs to completion without any setup, takes in one or more 
Parquet files plus an SQL script, and spits out target data in JSON, CSV, or 
maybe Parquet.
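   As a rough sketch of that glue-script pattern (the exact flags shown here 
are assumptions about the CLI surface at the time, not a spec; check 
`datafusion-cli --help` for the real options, and note that the table name, 
file names, and columns are all hypothetical):

   ```shell
   #!/bin/sh
   # Hypothetical glue script: run a SQL file against a local Parquet file
   # and emit CSV, with no server or cluster to set up.

   cat > transform.sql <<'SQL'
   -- Register the input Parquet file as a table (DataFusion SQL).
   CREATE EXTERNAL TABLE events STORED AS PARQUET LOCATION 'events.parquet';

   -- The actual transformation.
   SELECT user_id, count(*) AS n_events
   FROM events
   GROUP BY user_id;
   SQL

   # -f runs the script non-interactively; --format picks the output shape.
   datafusion-cli -f transform.sql --format csv > result.csv
   ```

   The whole thing can then be wired into Airflow or crontab as a single 
command, with no long-running service behind it.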
   
   It can also be a REPL, an interactive shell to quickly verify commands and 
look into data.
   
   ## Considerations on API
   
   I think there are two aspects of API to this CLI:
   1. the command line API (flags, options, naming, etc.)
   2. the query syntax itself
   
   For 2., it's clear that Postgres-compatible SQL is the choice here, but for 
1. I don't think we are necessarily bound to the peculiarities of `psql` itself, 
because:
   1. much of it is specific to Postgres and to the fact that it's a 
client-server architecture
   2. `psql` is mostly used interactively, and for shell automation Postgres 
users turn to a whole family of other tools, e.g. `pg_ctl`, `pg_dumpall`, etc., 
whereas datafusion-cli maps to all of these roles at once
   
   Having said that, I do think commands like `\copy` are already familiar to 
users and nice to have, but I admit that in many cases I am still confused 
about the different flags and behaviors when that command is used interactively 
versus as a CLI option. We need to be internally consistent here, not 
necessarily consistent with `psql` itself.
   
   ## What comes next
   
   Although I haven't worked in this area recently, a rough roadmap I have in 
mind would be:
   1. build a properly abstracted REPL parsing layer so that queries and 
commands are separated and handled correctly; currently it's kind of a hack 
(e.g. SIGINT isn't properly handled)
   2. hook up the stats subsystem and have the CLI print out more stats for 
query debugging, etc.
   3. better error handling for interactive use _and_ shell scripting usage
   4. widen adoption by publishing to apt, brew, and possibly the NuGet 
registry so that people can start using it more
   5. maybe adopt a shorter name, like `dfcli`?

