alamb opened a new issue, #4850:
URL: https://github.com/apache/arrow-datafusion/issues/4850

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Similarly to https://github.com/apache/arrow-datafusion/issues/4580, I think 
systems built with datafusion would like to allow their users to quickly 
explore a parquet file with minimal typing
   
   Today  have to type a verbose `CREATE EXTERNAL TABLE`... command
   
   It is critical that this feature can be enabled/disabled so that DataFusion 
can provide read only access (rather than access to the file system as that 
would be a security hole)
   
   I am marking this as a good first issue because I think all the code needed 
exists and there is a solution sketch below -- it should be a matter of coding 
that doesn't require huge existing knowledge of the datafusion codebase and 
would be a good exercise in getting familiar
   
   ```sql
   ❯ CREATE EXTERNAL TABLE all_types_tiny_pages STORED AS PARQUET LOCATION 
'./parquet-testing/data/alltypes_tiny_pages.parquet';
   
   ❯ select * from all_types_tiny_pages limit 10;
   
+-----+----------+-------------+--------------+---------+------------+-----------+--------------------+-----------------+------------+-------------------------+------+-------+
   | id  | bool_col | tinyint_col | smallint_col | int_col | bigint_col | 
float_col | double_col         | date_string_col | string_col | timestamp_col   
        | year | month |
   
+-----+----------+-------------+--------------+---------+------------+-----------+--------------------+-----------------+------------+-------------------------+------+-------+
   | 122 | true     | 2           | 2            | 2       | 20         | 2.2   
    | 20.2               | 01/13/09        | 2          | 
2009-01-13T01:02:05.410 | 2009 | 1     |
   | 123 | false    | 3           | 3            | 3       | 30         | 3.3   
    | 30.299999999999997 | 01/13/09        | 3          | 
2009-01-13T01:03:05.430 | 2009 | 1     |
   | 124 | true     | 4           | 4            | 4       | 40         | 4.4   
    | 40.4               | 01/13/09        | 4          | 
2009-01-13T01:04:05.460 | 2009 | 1     |
   | 125 | false    | 5           | 5            | 5       | 50         | 5.5   
    | 50.5               | 01/13/09        | 5          | 
2009-01-13T01:05:05.500 | 2009 | 1     |
   | 126 | true     | 6           | 6            | 6       | 60         | 6.6   
    | 60.599999999999994 | 01/13/09        | 6          | 
2009-01-13T01:06:05.550 | 2009 | 1     |
   | 127 | false    | 7           | 7            | 7       | 70         | 7.7   
    | 70.7               | 01/13/09        | 7          | 
2009-01-13T01:07:05.610 | 2009 | 1     |
   | 128 | true     | 8           | 8            | 8       | 80         | 8.8   
    | 80.8               | 01/13/09        | 8          | 
2009-01-13T01:08:05.680 | 2009 | 1     |
   | 129 | false    | 9           | 9            | 9       | 90         | 9.9   
    | 90.89999999999999  | 01/13/09        | 9          | 
2009-01-13T01:09:05.760 | 2009 | 1     |
   | 130 | true     | 0           | 0            | 0       | 0          | 0     
    | 0                  | 01/14/09        | 0          | 
2009-01-14T01:10:05.850 | 2009 | 1     |
   | 131 | false    | 1           | 1            | 1       | 10         | 1.1   
    | 10.1               | 01/14/09        | 1          | 
2009-01-14T01:11:05.850 | 2009 | 1     |
   
+-----+----------+-------------+--------------+---------+------------+-----------+--------------------+-----------------+------------+-------------------------+------+-------+
   10 rows in set. Query took 0.039 seconds.
   ```
   
   @unconsolable  added this ability into `datafusion-cli`  as part of 
https://github.com/apache/arrow-datafusion/pull/4838 (❤️ )
   
   ```sql
   ❯ select * from 
'/Users/alamb/.influxdb_iox//1/8/1/6/e24b6549-f76c-4fc0-a4f4-152ed60eb4e3.parquet';
   
+---------+---------------------+------+---------+----------+---------+---------------------+-------+---------+---------+
   | blocked | host                | idle | running | sleeping | stopped | time 
               | total | unknown | zombies |
   
+---------+---------------------+------+---------+----------+---------+---------------------+-------+---------+---------+
   | 0       | MacBook-Pro-8.local | 0    | 2       | 697      | 0       | 
2022-07-18T21:05:10 | 700   | 0       | 1       |
   | 0       | MacBook-Pro-8.local | 0    | 2       | 696      | 0       | 
2022-07-18T21:05:20 | 699   | 0       | 1       |
   
+---------+---------------------+------+---------+----------+---------+---------------------+-------+---------+---------+
   2 rows in set. Query took 0.079 seconds.
   ```
   
   **Describe the solution you'd like**
   
   I would like to be able to select directly from files (parquet, or other) 
from *any* datafusion session context, controlled by a setting.  For example
   
   ```sql
   ❯ select * from 'foo.parquet';
   Plan("table 'foo.parquet' not found")
   ❯ -- turn on the setting
   ❯ set datafusion.catalog.files_as_tables = true;
   0 rows in set. Query took 0.000 seconds.
   ❯ select * from 'foo.parquet';
   
+---------+---------------------+------+---------+----------+---------+---------------------+-------+---------+---------+
   | blocked | host                | idle | running | sleeping | stopped | time 
               | total | unknown | zombies |
   
+---------+---------------------+------+---------+----------+---------+---------------------+-------+---------+---------+
   | 0       | MacBook-Pro-8.local | 0    | 2       | 697      | 0       | 
2022-07-18T21:05:10 | 700   | 0       | 1       |
   | 0       | MacBook-Pro-8.local | 0    | 2       | 696      | 0       | 
2022-07-18T21:05:20 | 699   | 0       | 1       |
   
+---------+---------------------+------+---------+----------+---------+---------------------+-------+---------+---------+
   2 rows in set. Query took 0.079 seconds.
   ```
   
   
   **Suggested Solution Sketch**
   1. Add a new config setting `files_as_tables` similar to 
`information_schema`: 
https://github.com/apache/arrow-datafusion/blob/f9b72f4230687b884a92f79d21762578d3d56281/datafusion/common/src/config.rs#L167-L169
   2. Add code to make a `ListingTable` in `resolve_table_ref`: 
https://github.com/apache/arrow-datafusion/blob/f9b72f4230687b884a92f79d21762578d3d56281/datafusion/core/src/execution/context.rs#L1551-L1560
 (follow the model in 
https://github.com/apache/arrow-datafusion/pull/4838/files#diff-6353c2268d4d11abf8c1b8804a263db74a3b765a7302fc61caea3924256b52c7R142-R155)
   
   2. Move implementation from datafusion-cli; remove provider added in 
https://github.com/apache/arrow-datafusion/pull/4838 and use new setting 
instead 
https://github.com/apache/arrow-datafusion/blob/f9b72f4230687b884a92f79d21762578d3d56281/datafusion-cli/src/main.rs#L100
   Add slt tests, similar to existing ones (should be able to refer to existing 
.parquet / .csv files in testing directories): 
https://github.com/apache/arrow-datafusion/blob/f9b72f4230687b884a92f79d21762578d3d56281/datafusion/core/tests/sqllogictests/test_files/information_schema.slt#L46
   
   **Describe alternatives you've considered**
   A clear and concise description of any alternative solutions or features 
you've considered.
   
   **Additional context**
   
   
   Here is how information schema works, for reference.  
   
   ```
   ❯ set datafusion.catalog.information_schema = false;
   0 rows in set. Query took 0.002 seconds.
   ❯ show tables;
   Plan("SHOW TABLES is not supported unless information_schema is enabled")
   ❯ set datafusion.catalog.information_schema = true;
   0 rows in set. Query took 0.000 seconds.
   ❯ show tables;
   +---------------+--------------------+-------------+------------+
   | table_catalog | table_schema       | table_name  | table_type |
   +---------------+--------------------+-------------+------------+
   | datafusion    | information_schema | tables      | VIEW       |
   | datafusion    | information_schema | views       | VIEW       |
   | datafusion    | information_schema | columns     | VIEW       |
   | datafusion    | information_schema | df_settings | VIEW       |
   +---------------+--------------------+-------------+------------+
   4 rows in set. Query took 0.005 seconds.
   ❯ 
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to