alamb opened a new issue, #4850: URL: https://github.com/apache/arrow-datafusion/issues/4850
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.** Similarly to https://github.com/apache/arrow-datafusion/issues/4580, I think systems built with datafusion would like to allow their users to quickly explore a parquet file with minimal typing Today have to type a verbose `CREATE EXTERNAL TABLE`... command It is critical that this feature can be enabled/disabled so that DataFusion can provide read only access (rather than access to the file system as that would be a security hole) I am marking this as a good first issue because I think all the code needed exists and there is a solution sketch below -- it should be a matter of coding that doesn't require huge existing knowledge of the datafusion codebase and would be a good exercise in getting familiar ```sql ❯ CREATE EXTERNAL TABLE all_types_tiny_pages STORED AS PARQUET LOCATION './parquet-testing/data/alltypes_tiny_pages.parquet'; ❯ select * from all_types_tiny_pages limit 10; +-----+----------+-------------+--------------+---------+------------+-----------+--------------------+-----------------+------------+-------------------------+------+-------+ | id | bool_col | tinyint_col | smallint_col | int_col | bigint_col | float_col | double_col | date_string_col | string_col | timestamp_col | year | month | +-----+----------+-------------+--------------+---------+------------+-----------+--------------------+-----------------+------------+-------------------------+------+-------+ | 122 | true | 2 | 2 | 2 | 20 | 2.2 | 20.2 | 01/13/09 | 2 | 2009-01-13T01:02:05.410 | 2009 | 1 | | 123 | false | 3 | 3 | 3 | 30 | 3.3 | 30.299999999999997 | 01/13/09 | 3 | 2009-01-13T01:03:05.430 | 2009 | 1 | | 124 | true | 4 | 4 | 4 | 40 | 4.4 | 40.4 | 01/13/09 | 4 | 2009-01-13T01:04:05.460 | 2009 | 1 | | 125 | false | 5 | 5 | 5 | 50 | 5.5 | 50.5 | 01/13/09 | 5 | 2009-01-13T01:05:05.500 | 2009 | 1 | | 126 | true | 6 | 6 | 6 | 60 | 6.6 | 60.599999999999994 | 01/13/09 | 6 | 2009-01-13T01:06:05.550 | 2009 | 1 | | 127 | false | 7 | 7 | 7 | 70 | 7.7 | 70.7 | 01/13/09 | 7 | 2009-01-13T01:07:05.610 | 2009 | 1 | | 128 | true | 8 | 8 | 8 | 80 | 8.8 | 80.8 | 01/13/09 | 8 | 2009-01-13T01:08:05.680 | 2009 | 1 | | 129 | false | 9 | 9 | 9 | 90 | 9.9 | 90.89999999999999 | 01/13/09 | 9 | 2009-01-13T01:09:05.760 | 2009 | 1 | | 130 | true | 0 | 0 | 0 | 0 | 0 | 0 | 01/14/09 | 0 | 2009-01-14T01:10:05.850 | 2009 | 1 | | 131 | false | 1 | 1 | 1 | 10 | 1.1 | 10.1 | 01/14/09 | 1 | 2009-01-14T01:11:05.850 | 2009 | 1 | +-----+----------+-------------+--------------+---------+------------+-----------+--------------------+-----------------+------------+-------------------------+------+-------+ 10 rows in set. Query took 0.039 seconds. ``` @unconsolable added this ability into `datafusion-cli` as part of https://github.com/apache/arrow-datafusion/pull/4838 (❤️ ) ```sql ❯ select * from '/Users/alamb/.influxdb_iox//1/8/1/6/e24b6549-f76c-4fc0-a4f4-152ed60eb4e3.parquet'; +---------+---------------------+------+---------+----------+---------+---------------------+-------+---------+---------+ | blocked | host | idle | running | sleeping | stopped | time | total | unknown | zombies | +---------+---------------------+------+---------+----------+---------+---------------------+-------+---------+---------+ | 0 | MacBook-Pro-8.local | 0 | 2 | 697 | 0 | 2022-07-18T21:05:10 | 700 | 0 | 1 | | 0 | MacBook-Pro-8.local | 0 | 2 | 696 | 0 | 2022-07-18T21:05:20 | 699 | 0 | 1 | +---------+---------------------+------+---------+----------+---------+---------------------+-------+---------+---------+ 2 rows in set. Query took 0.079 seconds. ``` **Describe the solution you'd like** I would like to be able to select directly from files (parquet, or other) from *any* datafusion session context, controlled by a setting. For example ```sql ❯ select * from 'foo.parquet'; Plan("table 'foo.parquet' not found") ❯ -- turn on the setting ❯ set datafusion.catalog.files_as_tables = true; 0 rows in set. Query took 0.000 seconds. ❯ select * from 'foo.parquet'; +---------+---------------------+------+---------+----------+---------+---------------------+-------+---------+---------+ | blocked | host | idle | running | sleeping | stopped | time | total | unknown | zombies | +---------+---------------------+------+---------+----------+---------+---------------------+-------+---------+---------+ | 0 | MacBook-Pro-8.local | 0 | 2 | 697 | 0 | 2022-07-18T21:05:10 | 700 | 0 | 1 | | 0 | MacBook-Pro-8.local | 0 | 2 | 696 | 0 | 2022-07-18T21:05:20 | 699 | 0 | 1 | +---------+---------------------+------+---------+----------+---------+---------------------+-------+---------+---------+ 2 rows in set. Query took 0.079 seconds. ``` **Suggested Solution Sketch** 1. Add a new config setting `files_as_tables` similar to `information_schema`: https://github.com/apache/arrow-datafusion/blob/f9b72f4230687b884a92f79d21762578d3d56281/datafusion/common/src/config.rs#L167-L169 2. Add code to make a `ListingTable` in `resolve_table_ref`: https://github.com/apache/arrow-datafusion/blob/f9b72f4230687b884a92f79d21762578d3d56281/datafusion/core/src/execution/context.rs#L1551-L1560 (follow the model in https://github.com/apache/arrow-datafusion/pull/4838/files#diff-6353c2268d4d11abf8c1b8804a263db74a3b765a7302fc61caea3924256b52c7R142-R155) 2. Move implementation from datafusion-cli; remove provider added in https://github.com/apache/arrow-datafusion/pull/4838 and use new setting instead https://github.com/apache/arrow-datafusion/blob/f9b72f4230687b884a92f79d21762578d3d56281/datafusion-cli/src/main.rs#L100 Add slt tests, similar to existing ones (should be able to refer to existing .parquet / .csv files in testing directories): https://github.com/apache/arrow-datafusion/blob/f9b72f4230687b884a92f79d21762578d3d56281/datafusion/core/tests/sqllogictests/test_files/information_schema.slt#L46 **Describe alternatives you've considered** A clear and concise description of any alternative solutions or features you've considered. **Additional context** Here is how information schema works, for reference. ``` ❯ set datafusion.catalog.information_schema = false; 0 rows in set. Query took 0.002 seconds. ❯ show tables; Plan("SHOW TABLES is not supported unless information_schema is enabled") ❯ set datafusion.catalog.information_schema = true; 0 rows in set. Query took 0.000 seconds. ❯ show tables; +---------------+--------------------+-------------+------------+ | table_catalog | table_schema | table_name | table_type | +---------------+--------------------+-------------+------------+ | datafusion | information_schema | tables | VIEW | | datafusion | information_schema | views | VIEW | | datafusion | information_schema | columns | VIEW | | datafusion | information_schema | df_settings | VIEW | +---------------+--------------------+-------------+------------+ 4 rows in set. Query took 0.005 seconds. ❯ ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
