[I] Fast parquet order inversion [datafusion]

via GitHub Wed, 13 Aug 2025 05:14:14 -0700


crepererum opened a new issue, #17172:
URL: https://github.com/apache/datafusion/issues/17172


   ### Is your feature request related to a problem or challenge?
   
   Often parquet files are written in time-ascending order, i.e. new values are 
appended at the end. However many users would like to know the last value(s) of 
a given context. Currently this means that DataFusion would scan the entire 
parquet file and just keep the relevant bit at the end. However that seems 
wasteful and I think we can do better.
   
   ### Describe the solution you'd like
   
   The solution here overlaps with things that we would need to implement in 
the arrow-rs repository, but I am using an umbrella ticket to illustrate the 
rough idea.
   
   ## Part 1 -- Order Inversion
   If a parquet file is order `col1 ASC, col2 DESC, col3 ASC`, then reading it 
"back to front" should result in data ordered `col1 DESC, col2 ASC, col3 DESC`, 
i.e. the sequence of expressions is the same, but the polarity (ascending or 
descending) of each expression is flipped. So the DataFusion planner should 
(after all of this is implemented) that it is better to inverse the order and 
potentially apply a top-k on that than reading an entire file and perform a 
simple `Sort` operation.
   
   ## Part 2 -- Decode Parquet Pages in Reverse Order
   Ideally we would be able to decode the parquet files backwards, i.e. "last 
value first". However this only works for a subset of 
[encodings](https://parquet.apache.org/docs/file-format/data-pages/encodings/), 
while others are clearly order-dependent. But if we look at a parquet file, 
this order only matters on a page level and technically pages (of a column) can 
be decoded in reverse order. That is true for ALL encodings.
   
   ## Part 3 -- Flip a Page
   Reading the pages in reverse order however is still not enough. Imagine the 
following layout:
   
   ```yaml
   pages:
    - values:
      - A
      - B
      - C
    - values:
      - D
      - E
   ```
   
   Decoding pages in reverse order would yield `D, E, A, B C`. That's not right 
and we need to reverse the order within a page. Now my hypothesis is that 
"flipping" a page would only require inverting the order of an arrow buffer and 
in contrast to the generic 
[`take`](https://docs.rs/arrow/latest/arrow/compute/kernels/take/fn.take.html) 
kernel, this can be implemented in-place and without random access. This highly 
simplified code for an integer array shows the principle:
   
   ```rust
   fn flip_u64_array(array: &mut [u64]) {
       // if you do this iterator instead of index-based,
       // you can even get rid of the bounds-checking
       for idx in 0..(array.len() / 2) {
           array.swap(idx, array.len() - idx);
       }
   }
   ```
   
   One can imagine how this even works for more complex types:
   - dictionary encoded data: flip the array that contains the per-row dict 
keys, but NOT the actual dict
   - nested data (like list): flip top-level array, but not the children (i.e. 
the lists themselves are NOT flipped)
   - run-offset encoded data: flip both the offset and the value array
   - string views: flip the offset array but NOT the values
   
   A `flip` kernel would even be useful for DataFusion in general (outside the 
parquet context), i.e. when we know that we can use this optimization instead 
of a full `sort`.
   
   ### Describe alternatives you've considered
   
   \-
   
   ### Additional context
   
   \-


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Fast parquet order inversion [datafusion]

Reply via email to