andrewthad opened a new issue, #5594:
URL: https://github.com/apache/arrow-datafusion/issues/5594
This is not necessarily a feature request. Rather, it's a request for either
a feature to be added or for improved documentation clarifying that the feature
is not available. To my understanding, DataFusion (at the CLI at least) cannot
read from Apache Arrow files. There are two different kinds of Arrow files: the
`.arrow` file (which has a footer with metadata about block positions) and the
`.arrows` "streaming" file (which lacks the footer). I've tried out several
`CREATE EXTERNAL TABLE` invocations:
CREATE EXTERNAL TABLE foo stored as ARROW LOCATION foo.arrow
CREATE EXTERNAL TABLE foo stored as ARROWS LOCATION foo.arrow
CREATE EXTERNAL TABLE foo stored as FEATHER LOCATION foo.arrow
They all give an "Unable to find factory for ..." error. After looking
through more of the documentation for a while and paying attention to what
wasn't explicitly said, I concluded that Arrow files are not supported as a
form of input. I
think that, if this is the case, it should be mentioned explicitly in the
documentation. DataFusion's documentation is misleading because it does not
make clear that Arrow is an *internal* implementation detail, not an
external-facing way to communicate with a producer of data. From the readme
on GitHub:
> Easy to Connect: Being part of the Apache Arrow ecosystem (Arrow, Parquet
> and Flight), DataFusion works well with the rest of the big data ecosystem
I read this as meaning that DataFusion can consume data in the Arrow format.
Flight is a tool built specifically for moving Arrow-formatted data across a
network, so it's hard to interpret this as meaning anything else.
Perhaps this was a goal at some point, or maybe it's possible to do this, but
it's undocumented.
Here are three mutually exclusive possibilities for improving this situation:
* Document that arrow files are not supported.
* Document that arrow files are supported (maybe they are and I couldn't
figure it out!)
* Support arrow files as a source of data
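
For anyone unsure which of the two kinds of Arrow file they have, here is a rough sketch of a check. It relies on the Arrow IPC file format framing its contents with the magic bytes `ARROW1` at both the start and the end, which the streaming format does not do. The function name `looks_like_arrow_file` is made up for illustration, and this is only a heuristic: it does not validate the footer or any of the record batches.

```python
import os

# Magic bytes that frame the footer-carrying Arrow IPC file format.
ARROW_MAGIC = b"ARROW1"


def looks_like_arrow_file(path):
    """Return True if `path` starts and ends with the Arrow file magic.

    Hypothetical helper: a streaming (`.arrows`) file has no such framing,
    so it should return False here. Nothing else about the file is checked.
    """
    size = os.path.getsize(path)
    if size < 2 * len(ARROW_MAGIC):
        return False
    with open(path, "rb") as f:
        head = f.read(len(ARROW_MAGIC))
        # Jump to the last len(ARROW_MAGIC) bytes of the file.
        f.seek(-len(ARROW_MAGIC), os.SEEK_END)
        tail = f.read(len(ARROW_MAGIC))
    return head == ARROW_MAGIC and tail == ARROW_MAGIC
```

Calling `looks_like_arrow_file("foo.arrow")` would at least confirm that the file passed to `LOCATION` is in the footer-carrying format rather than the streaming one.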