alamb commented on code in PR #19388:
URL: https://github.com/apache/datafusion/pull/19388#discussion_r2661635045
##########
docs/source/user-guide/cli/functions.md:
##########
@@ -170,5 +170,43 @@ The columns of the returned table are:
| table_size_bytes | Utf8 | Size of the table, in bytes
|
| statistics_size_bytes | UInt64 | Size of the cached statistics in memory
|
+## `list_files_cache`
+
+The `list_files_cache` function shows information about the `ListFilesCache`
that is used by the [`ListingTable`] implementation in DataFusion. When
creating a [`ListingTable`], DataFusion lists the files in the table's location
and caches results in the `ListFilesCache`. Subsequent queries against the same
table can reuse this cached information instead of re-listing the files.
+
+You can inspect the cache by querying the `list_files_cache` function. For
example,
Review Comment:
It might be cool to add the code to make this a full self-contained
reproducer
For example
```sql
> CREATE EXTERNAL TABLE nyc_taxi_rides
STORED AS PARQUET LOCATION
's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/';
...
```
And then change the results of the query below to
```sql
> select split_part(path, '/', -1) as folder, metadata_size_bytes,
expires_in, unnest(metadata_list)['file_size_bytes'] as file_size_bytes,
unnest(metadata_list)['e_tag'] as e_tag from list_files_cache();
+------------------+---------------------+------------+-----------------+---------------------------------------+
| folder | metadata_size_bytes | expires_in | file_size_bytes |
e_tag |
+------------------+---------------------+------------+-----------------+---------------------------------------+
| tripdata_parquet | 18138 | NULL | 222192983 |
"e8d016c3c7af80bf911d96387febe2c1-13" |
| tripdata_parquet | 18138 | NULL | 211023080 |
"1021626ff5ef606422aa7121edd69f3b-12" |
| tripdata_parquet | 18138 | NULL | 229202874 |
"96e7494b217099c6a07e9c4298cbe783-13" |
| tripdata_parquet | 18138 | NULL | 225659965 |
"728c45fabdcd8e40bdef4dfc28df9b0f-13" |
| tripdata_parquet | 18138 | NULL | 232847306 |
"f59e45bd8bd1d77cd7ae8ab6ab468bcc-13" |
| tripdata_parquet | 18138 | NULL | 224226575 |
"8ebb698eea85f9af87065ac333efc449-13" |
| tripdata_parquet | 18138 | NULL | 217168413 |
"7d7ee77f6cac4adc18aa3a9e74600dd3-12" |
| tripdata_parquet | 18138 | NULL | 217303109 |
"e9883055d92a33b941aab971423e681b-12" |
| tripdata_parquet | 18138 | NULL | 223333499 |
"6f0917e6003b38df9060d71c004eb961-13" |
| tripdata_parquet | 18138 | NULL | 246300471 |
"8928b29da44e041021e10077683b7817-14" |
| tripdata_parquet | 18138 | NULL | 227920860 |
"4cd26a1a7f82af080c33e890dc1fef27-13" |
| tripdata_parquet | 18138 | NULL | 233873308 |
"23f4584e494e3c065c777c270c9eedbc-13" |
| tripdata_parquet | 18138 | NULL | 235166925 |
"effcc8cc41b40cf7ac466f911d7b9459-14" |
| tripdata_parquet | 18138 | NULL | 177367931 |
"ce8b7817ecc47da86ccbfa6b51ffa06b-10" |
| tripdata_parquet | 18138 | NULL | 205857224 |
"94a078b61e3b652387e6f2a673dc3f4e-12" |
| tripdata_parquet | 18138 | NULL | 243024246 |
"a1efbebfdabc204e0041d8714aaec01a-14" |
| tripdata_parquet | 18138 | NULL | 248130090 |
"d3cf585e00ce627a807348c84a42d0a6-14" |
| tripdata_parquet | 18138 | NULL | 237068130 |
"831db33281a5c017f8ffc466bd47546b-14" |
| tripdata_parquet | 18138 | NULL | 234826090 |
"790e05983e6592e4920c88fbd2bfe774-13" |
| tripdata_parquet | 18138 | NULL | 197990272 |
"d87ddb446e5cbc0f6831fafd95cfd027-11" |
| tripdata_parquet | 18138 | NULL | 243408943 |
"abfbe3b29942bcd68d131d95540278d3-14" |
| tripdata_parquet | 18138 | NULL | 225277041 |
"f768c7b77497b2bf3efd5cb2a4362977-13" |
| tripdata_parquet | 18138 | NULL | 220010577 |
"c6830cbe1f3ae918f9280db3aa847b03-13" |
| tripdata_parquet | 18138 | NULL | 219773352 |
"264f7ea433076690a3bbe5566168e5c5-13" |
| tripdata_parquet | 18138 | NULL | 212535107 |
"ca3bdc2707b29667c78c39517781eac4-12" |
| tripdata_parquet | 18138 | NULL | 223138164 |
"e2b3c0fd0c0d66ac6363600de0c8b2ad-13" |
| tripdata_parquet | 18138 | NULL | 252843261 |
"fd5d4e01568cd6e7ef1e00de76441e5b-15" |
| tripdata_parquet | 18138 | NULL | 233123935 |
"2b510cc2c0c73d9ec7374c9e6d56c388-13" |
| tripdata_parquet | 18138 | NULL | 246843111 |
"abc2f58bd520b2013aa1a333d317c70c-14" |
| tripdata_parquet | 18138 | NULL | 238786647 |
"0e456698dc42a850ff7b764506cb511d-14" |
| tripdata_parquet | 18138 | NULL | 233249259 |
"28177227cbff94a6a819a0568a14e9b2-13" |
| tripdata_parquet | 18138 | NULL | 212681184 |
"fdcb442e1010630c0553a7018762a8ba-12" |
| tripdata_parquet | 18138 | NULL | 232399266 |
"ccca37be5a3579a8bc644490226ed29a-13" |
| tripdata_parquet | 18138 | NULL | 248471033 |
"eebe34c1bb74f63433eb607810969553-14" |
| tripdata_parquet | 18138 | NULL | 231103826 |
"7c76b9fc111462b76336d63bce3253c7-13" |
| tripdata_parquet | 18138 | NULL | 236102882 |
"26c10d1d85c4565cbb9e8fc6a7bc745c-14" |
| tripdata_parquet | 18138 | NULL | 236184052 |
"8cdc15a22462579dcf90d669cea0f04b-14" |
| tripdata_parquet | 18138 | NULL | 238377570 |
"4e6734c5c2e77c68dde5155a45dac81c-14" |
| tripdata_parquet | 18138 | NULL | 258226172 |
"b7b07fa0f4fefcf0ba0fc69ba344b5c8-15" |
| tripdata_parquet | 18138 | NULL | 248190698 |
"968c13850fa9a7cb46337bc8fc9d13fa-14" |
| .
|
| .
|
| .
|
+------------------+---------------------+------------+-----------------+---------------------------------------+
96 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
Elapsed 0.024 seconds.
```
##########
docs/source/user-guide/cli/functions.md:
##########
@@ -170,5 +170,43 @@ The columns of the returned table are:
| table_size_bytes | Utf8 | Size of the table, in bytes
|
| statistics_size_bytes | UInt64 | Size of the cached statistics in memory
|
+## `list_files_cache`
+
+The `list_files_cache` function shows information about the `ListFilesCache`
that is used by the [`ListingTable`] implementation in DataFusion. When
creating a [`ListingTable`], DataFusion lists the files in the table's location
and caches results in the `ListFilesCache`. Subsequent queries against the same
table can reuse this cached information instead of re-listing the files.
+
+You can inspect the cache by querying the `list_files_cache` function. For
example,
+
+```sql
+> select split_part(path, '/', -1) as folder, metadata_size_bytes, expires_in,
unnest(metadata_list)['file_size_bytes'] as file_size_bytes,
unnest(metadata_list)['e_tag'] as e_tag from list_files_cache();
++----------+---------------------+-----------------------------------+-----------------+-------------------------------+
+| folder | metadata_size_bytes | expires_in |
file_size_bytes | e_tag |
++----------+---------------------+-----------------------------------+-----------------+-------------------------------+
+| customer | 1592 | 0 days 0 hours 0 mins 18.488 secs | 1233969
| 7041136-643a7bfeeec9b-12d431 |
+| customer | 1592 | 0 days 0 hours 0 mins 18.488 secs | 1234756
| 7041137-643a7bfeef2df-12d744 |
+| customer | 1592 | 0 days 0 hours 0 mins 18.488 secs | 1232554
| 7041139-643a7bfeef86a-12ceaa |
+| customer | 1592 | 0 days 0 hours 0 mins 18.488 secs | 1238676
| 704113a-643a7bfeef914-12e694 |
+| customer | 1592 | 0 days 0 hours 0 mins 18.488 secs | 1232186
| 704113b-643a7bfeefb22-12cd3a |
+| customer | 1592 | 0 days 0 hours 0 mins 18.488 secs | 1237506
| 7041138-643a7bfeef775-12e202 |
+| customer | 1592 | 0 days 0 hours 0 mins 18.488 secs | 1228756
| 7041134-643a7bfeec2d8-12bfd4 |
+| customer | 1592 | 0 days 0 hours 0 mins 18.488 secs | 1228509
| 7041135-643a7bfeed599-12bedd |
+| lineitem | 1600 | 0 days 0 hours 0 mins 16.758 secs |
20124715 | 704114a-643a7c00bb560-133142b |
+| lineitem | 1600 | 0 days 0 hours 0 mins 16.758 secs |
20131024 | 7041149-643a7c00b90b7-1332cd0 |
+| lineitem | 1600 | 0 days 0 hours 0 mins 16.758 secs |
20179217 | 704114b-643a7c00bb93e-133e911 |
+| lineitem | 1600 | 0 days 0 hours 0 mins 16.758 secs |
20296819 | 704114f-643a7c00ccefd-135b473 |
+| lineitem | 1600 | 0 days 0 hours 0 mins 16.758 secs |
20110730 | 7041148-643a7c00b9832-132dd8a |
+| lineitem | 1600 | 0 days 0 hours 0 mins 16.758 secs |
20128346 | 704114c-643a7c00bc00a-133225a |
+| lineitem | 1600 | 0 days 0 hours 0 mins 16.758 secs |
20130133 | 7041147-643a7c00b3901-1332955 |
+| lineitem | 1600 | 0 days 0 hours 0 mins 16.758 secs |
20139830 | 7041146-643a7c00abbe8-1334f36 |
++----------+---------------------+-----------------------------------+-----------------+-------------------------------+
+```
+
+The columns of the returned table are:
+| column_name | data_type | Description |
+| ------------------- | ------------ |
-----------------------------------------------------------------------------------------
|
+| path | Utf8 | File path relative to the object store / filesystem root |
+| metadata_size_bytes | UInt64 | Size of the cached metadata in memory (not
its thrift encoded form) |
+| expires_in | Duration(ms) | Last modified time of the file |
+| metadata_list | List(Struct) | List of metadatas, one for each file under
the path. |
Review Comment:
I also recommend adding an example output structure fields, for example
```json
{
"file_path": "nyc_taxi_rides/data/tripdata_parquet/data-200901.parquet",
"file_modified": "2025-05-30T09:44:23",
"file_size_bytes": 222192983,
"e_tag": "e8d016c3c7af80bf911d96387febe2c1-13",
"version": null
}
```
##########
datafusion-cli/src/functions.rs:
##########
@@ -697,3 +702,158 @@ impl TableFunctionImpl for StatisticsCacheFunc {
Ok(Arc::new(statistics_cache))
}
}
+
+#[derive(Debug)]
+struct ListFilesCacheTable {
Review Comment:
I think it might help to add some comments and context on what this function
does here, to help future readers
Something like this perhaps:
```suggestion
/// Implementation of the `list_files_cache` table function in datafusion-cli
///
/// This function returns the cached results of running a LIST command on a
particular object
/// store path. The object metadata is returned as a List of Structs, with
one Struct for each object.
/// DataFusion uses these cached results to plan queries against external
tables
///
/// # Schema
/// ```sql
/// > describe select * from list_files_cache();
///
+---------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
/// | column_name | data_type
| is_nullable |
///
+---------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
/// | path | Utf8
| NO |
/// | metadata_size_bytes | UInt64
| NO |
/// | expires_in | Duration(ms)
| YES |
/// | metadata_list | List(Struct("file_path": non-null Utf8,
"file_modified": non-null Timestamp(ms), "file_size_bytes": non-null UInt64,
"e_tag": Utf8, "version": Utf8), field: 'metadata') | YES |
///
+---------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
/// ```
struct ListFilesCacheTable {
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]