andrewthad opened a new issue, #5594:
URL: https://github.com/apache/arrow-datafusion/issues/5594

   This is not necessarily a feature request. Rather, it's a request either 
for a feature to be added or for improved documentation clarifying that the 
feature is not available. To my understanding, DataFusion (at the CLI at least) 
cannot read from Apache Arrow files. There are two different kinds of Arrow 
files: the `.arrow` file (which has a footer with metadata about block 
positions) and the `.arrows` "streaming" file (which lacks the footer). I've 
tried out several `CREATE EXTERNAL TABLE` invocations:
   
       CREATE EXTERNAL TABLE foo stored as ARROW LOCATION foo.arrow
       CREATE EXTERNAL TABLE foo stored as ARROWS LOCATION foo.arrow
       CREATE EXTERNAL TABLE foo stored as FEATHER LOCATION foo.arrow
   
   They all give an "Unable to find factory for ..." error. After looking 
through more of the documentation for a while and paying attention to what 
wasn't explicitly said, I concluded that Arrow files are not supported as a 
form of input. I think that, if this is the case, it should be mentioned 
explicitly in the documentation. DataFusion's documentation is misleading on 
this point: it does not make clear that Arrow is an *internal* implementation 
detail, not an external-facing way to communicate with a producer of data. 
From the readme on GitHub:
   
   > Easy to Connect: Being part of the Apache Arrow ecosystem (Arrow, Parquet 
and Flight), DataFusion works well with the rest of the big data ecosystem
   
   I read this as meaning that DataFusion can consume data in the Arrow 
format. Flight is a tool specifically for shuffling Arrow-formatted data 
around on a network, so it's hard to interpret this as meaning anything else. 
Perhaps this was a goal at some point, or maybe it's possible to do this but 
it's undocumented.
   
   Here are three mutually exclusive possibilities for improving this situation:
   
   * Document that Arrow files are not supported.
   * Document that Arrow files are supported (maybe they are and I couldn't 
figure it out!).
   * Support Arrow files as a source of data.
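
For anyone unsure which of the two kinds of Arrow file they have, here is a minimal sketch of how to tell them apart. It relies on the Arrow IPC file format starting and ending with the magic string `ARROW1`, which the streaming format does not do. The function name `arrow_ipc_kind` is hypothetical (it is not part of DataFusion or any Arrow library), and the check is only a heuristic: it does not validate the contents.

```python
ARROW_MAGIC = b"ARROW1"


def arrow_ipc_kind(data: bytes) -> str:
    """Guess whether `data` is an Arrow IPC file or something else.

    Hypothetical helper for illustration only: it inspects the magic
    bytes and does not parse or validate the Arrow payload.
    """
    # The IPC file format is framed by the magic string at both ends;
    # the trailing copy sits right after the footer and its length.
    long_enough = len(data) >= 2 * len(ARROW_MAGIC)
    if long_enough and data.startswith(ARROW_MAGIC) and data.endswith(ARROW_MAGIC):
        return "file"
    # The streaming format has no magic string and no footer, so it
    # cannot be positively identified by this check alone.
    return "stream-or-unknown"
```

To use it on a real file, pass the file's bytes, for example `arrow_ipc_kind(open("foo.arrow", "rb").read())`. A result of `"stream-or-unknown"` only means the footer framing is absent, not that the data is a valid stream.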

