[I] Support file row index / row id for each file in a `ListingTableProvider` [datafusion]

via GitHub Tue, 29 Apr 2025 04:52:48 -0700


alamb opened a new issue, #15892:
URL: https://github.com/apache/datafusion/issues/15892


   ### Is your feature request related to a problem or challenge?
   
   - Quoting @daphnenhuch-at from 
https://github.com/apache/datafusion/discussions/15711:
   
   > My goal is that I will have a fully sorted file sorted by primary key 
where each fileRowNumber is the index of that row in the file. 
   
   I am not sure what @daphnenhuch-at 's use case is, but getting row numbers 
from a file is used for several  use cases I know of:
   1. Implementing delete vectors (aka filtering out row by row_id has been 
deleted from a file)
   2. Implementing external indexes (e.g. having a full text index that tells 
you document 10001, and 10003 match and then wanting to fetch (only) those rows 
from the file)
   
   Today there are ways to compute this, but they are inefficient (for example, 
the workaround below will read all rows from the file, so if you are trying to 
select only one based on row number a huge amount of work is wasted)
   
   Today you can kind of get this information, by
   
   1.  disable repartitioning by setting 
[datafusion.execution.target_partitions config setting 
](https://datafusion.apache.org/user-guide/configs.html) to `1`. This is 
important to disable repartitioning otherwise large tables will be scanned in 
parallel  and data from multiple parallel chunks will be interleaved
   
   Running a query for each file using the `row_number` window function.  
Something like:
   
   ```rust
   ctx
           .read_parquet("file1.parquet")
           .await?
           .window(vec![row_number().alias(DATA_FUSION_ROW_NUMBER)])
   ```
   
   
   In SQL
   ```sql
   > set datafusion.execution.target_partitions = 1;
   0 row(s) fetched.
   Elapsed 0.001 seconds.
   
   > select "VendorID", row_number() OVER () from 
'yellow_tripdata_2025-01.parquet' limit 10;
   
+----------+-----------------------------------------------------------------------+
   | VendorID | row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED 
FOLLOWING |
   
+----------+-----------------------------------------------------------------------+
   | 1        | 1                                                               
      |
   | 1        | 2                                                               
      |
   | 1        | 3                                                               
      |
   | 2        | 4                                                               
      |
   | 2        | 5                                                               
      |
   | 2        | 6                                                               
      |
   | 1        | 7                                                               
      |
   | 1        | 8                                                               
      |
   | 1        | 9                                                               
      |
   | 2        | 10                                                              
      |
   
+----------+-----------------------------------------------------------------------+
   10 row(s) fetched.
   Elapsed 0.005 seconds.
   ```
   
   ### Describe the solution you'd like
   
   I would like to consider a nicer way to get the row number from the file and 
then write queries against it.
   
   Something like
   ```sql
   select * from my_table where row_number IN (10002, 10003)
   ```
   
   Which would return the 10,002 and 10,003 row in the file respectively. The 
idea is that then we could:
   1. Do predicate pushdown on those row numbers
   2. Figure out how to still scan the file in parallel
   
   ### Describe alternatives you've considered
   
   I think we would need to add some sort of special column (similar to 
partitioning columns) to the listing table provider
   
   Another alternative would be to keep this kind of functionality out of the 
core and implement it in external table providers
   
   ### Additional context
   
   - Related: https://github.com/apache/datafusion/issues/15173
   - https://github.com/apache/arrow-rs/pull/7307 (adding support for row 
numbers in a parquet reader)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[I] Support file row index / row id for each file in a `ListingTableProvider` [datafusion]

Reply via email to