Jimexist edited a comment on issue #1096:
URL: 
https://github.com/apache/arrow-datafusion/issues/1096#issuecomment-939418770


   Thanks for raising this issue and question @rupurt.
   
   First off, I don't believe there's any centralized view on what 
datafusion-cli is or isn't, and there's no process yet to determine that. 
However, I can share some of my thoughts here.
   
   ## What datafusion-cli is not
   
   Our ecosystem is full of different tools for manipulating data, each filling 
its own niche. In my opinion, datafusion-cli should not try to be yet another 
general-purpose data-manipulation tool, especially since datafusion itself is 
intended to be an embeddable component for other tools 
(e.g. [cube](https://cube.dev/), 
[ballista](https://github.com/ballista-compute/ballista), 
[roapi](https://roapi.github.io/docs/)). Staying focused avoids confusion and 
reduces fragmented tech investment.
   
   Specifically, datafusion-cli is not:
   1. a general-purpose, Python-enabled tool to query data; for that you have 
the [Python binding](https://pypi.org/project/datafusion) for datafusion itself, 
or [polars](https://github.com/pola-rs/polars) for its speed and pandas 
compatibility, both building on Arrow underneath
   2. a command-line tool to manipulate small, tabular or structured data; for 
that you have [xsv](https://github.com/BurntSushi/xsv), 
[jq](https://github.com/stedolan/jq), or [rq](https://github.com/dflemstr/rq), 
depending on the file format you're working with
   3. a client to an HTTP- or GraphQL-enabled server backend; for that there is 
[roapi](https://roapi.github.io/docs/) or similar tools, and in many cases 
Spark or Presto is just fine (when the data size is large)
   
   Also, for 3., please note that (AFAIK) datafusion and datafusion-cli do not 
concern themselves with distributed computing, i.e. data sharding is something 
built on top of them; by themselves they only handle in-memory, uniformly 
accessible data.
   
   ## What datafusion-cli can be
   
   Given the above, I believe the places where datafusion-cli can shine are:
   1. when the data is too large for simple tools like `jq` or `xsv` to handle 
in a reasonable amount of time, but still small enough to fit into memory 
([EC2 machines offer up to 
12TB](https://aws.amazon.com/about-aws/whats-new/2021/05/four-ec2-high-memory-instances-with-up-to-12tb-memory-available-with-on-demand-and-savings-plan-purchase-options/)),
 assuming you care about speed
   2. when there's no need to keep a long-running server or adopt a full Spark 
or Presto cluster, given their high maintenance cost
   
   I think in many ETL scenarios you do encounter the need for a “glue” script 
that can run from a simple shell command, triggered manually, via Airflow, 
crontab, etc. It runs to completion without any setup, takes in one or more 
Parquet files plus an SQL script, and spits out target data in JSON, CSV, or 
maybe Parquet.
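   As a rough sketch of that glue-script pattern (the exact flags shown here 
are assumptions about the CLI surface at the time, not a spec; check 
`datafusion-cli --help` for the real options, and note that the table name, 
file names, and columns are all hypothetical):

   ```shell
   #!/bin/sh
   # Hypothetical glue script: run a SQL file against a local Parquet file
   # and emit CSV, with no server or cluster to set up.

   cat > transform.sql <<'SQL'
   -- Register the input Parquet file as a table (DataFusion SQL).
   CREATE EXTERNAL TABLE events STORED AS PARQUET LOCATION 'events.parquet';

   -- The actual transformation.
   SELECT user_id, count(*) AS n_events
   FROM events
   GROUP BY user_id;
   SQL

   # -f runs the script non-interactively; --format picks the output shape.
   datafusion-cli -f transform.sql --format csv > result.csv
   ```

   The whole thing can then be wired into Airflow or crontab as a single 
command, with no long-running service behind it.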
   
   It can also be a REPL, an interactive shell to quickly verify commands and 
look into data.
   
   ## Considerations on API
   
   I think there are two aspects of API to this CLI:
   1. the command line API (flags, options, naming, etc.)
   2. the query syntax itself
   
   For 2., it's clear that Postgres-compatible SQL is the choice here, but for 
1. I don't think we are necessarily bound to the peculiarities of `psql` itself, 
because:
   1. much of it is specific to Postgres and to the fact that it's a 
client-server architecture
   2. `psql` is mostly used interactively, and for shell automation Postgres 
users turn to a whole family of other tools, e.g. `pg_ctl`, `pg_dumpall`, etc., 
whereas datafusion-cli maps to all of these roles at once
   
   Having said that, I do think commands like `\copy` are already familiar to 
users and nice to have, but I admit that in many cases I am still confused 
about the different flags and behaviors when that command is used interactively 
versus as a CLI option. We need to be internally consistent here, not 
necessarily consistent with `psql` itself.
   
   ## What comes next
   
   Although I haven't worked in this area recently, a rough roadmap I have in 
mind would be:
   1. build a properly abstracted REPL parsing layer so that queries and 
commands are separated and handled correctly; currently it's kind of a hack 
(e.g. SIGINT isn't properly handled)
   2. hook up the stats subsystem and have the CLI print out more stats for 
query debugging, etc.
   3. better error handling for interactive use _and_ shell scripting usage
   4. widen adoption by publishing to apt, brew, and possibly the NuGet 
registry so that people can start using it more
   5. maybe adopt a shorter name, like `dfcli`?

