timwaizenegger opened a new issue, #388:
URL: https://github.com/apache/arrow-rs-object-store/issues/388

   **Describe the bug**
   The `list` / [`list_with_offset`](https://docs.rs/object_store/0.12.0/object_store/trait.ObjectStore.html#method.list_with_offset) functions on `LocalFileSystem` stores do not (reliably) return results in sorted order.
   The [interface docs](https://docs.rs/object_store/0.12.0/object_store/trait.ObjectStore.html#method.list_with_offset) for this method state that:
   
   > Note: the order of returned [ObjectMeta](https://docs.rs/object_store/0.12.0/object_store/struct.ObjectMeta.html) is not guaranteed
   
   so the current behavior is consistent with the docs, but I believe it is not useful:
   - offset filtering/listing works by skipping any objects that are not (lexicographically) greater than the offset
   - to use offsets to list all objects over multiple iterations, each iteration must return a contiguous slice of the overall sorted results
   - given such a slice, its last element is the correct offset for fetching the next slice/batch
   - if results come back in arbitrary order, offset listing is impossible to use: none of the paths I received is a usable offset
     - unless I list everything and sort it myself, but that defeats the purpose
   - object stores typically list results in sorted order specifically to allow for offset listing
   - a usable offset listing feature is key for object_store to achieve its design goal of a stateless API; the only alternative for users is a stateful iterator.
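
To make the argument above concrete, here is a minimal std-only sketch. It does not use `object_store` at all: `list_with_offset` and `paginate` below are hypothetical stand-ins that only model "return every key greater than the offset, in the store's own order" and "use the last key of each batch as the next offset".

```rust
/// Hypothetical model of offset listing (not the object_store API): yields
/// every key strictly greater than `offset`, in the store's own order.
fn list_with_offset<'a>(
    store_order: &'a [&'a str],
    offset: Option<&'a str>,
) -> impl Iterator<Item = &'a str> + 'a {
    store_order
        .iter()
        .copied()
        .filter(move |key| offset.map_or(true, |off| *key > off))
}

/// Pages through the store `batch_size` keys at a time, using the last key
/// of each batch as the offset for the next one.
fn paginate(store_order: &[&str], batch_size: usize) -> Vec<String> {
    let mut seen = Vec::new();
    let mut offset: Option<String> = None;
    loop {
        let batch: Vec<String> = list_with_offset(store_order, offset.as_deref())
            .take(batch_size)
            .map(str::to_owned)
            .collect();
        // A short batch means there is nothing left to fetch.
        let done = batch.len() < batch_size;
        if let Some(last) = batch.last() {
            offset = Some(last.clone());
        }
        seen.extend(batch);
        if done {
            break;
        }
    }
    seen
}
```

With a sorted store such as `["a", "b", "c", "d", "e"]` and a batch size of 2, `paginate` visits every key exactly once. With the same keys in the order `["d", "a", "e", "b", "c"]`, it returns `d, a, d, e`: `d` is seen twice and `b` and `c` are never seen, which is the failure mode described in the list above.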
   
   **To Reproduce**
   Generate a set of files:
   ```
   mkdir -p /tmp/manyfiles/
   cd /tmp/manyfiles/
   for i in $(seq 1 5000); do
     echo "hello world" > "hello world.txt.$i"
   done
   ```
   Run this code to show the (random) ordering and that offset listing isn't usable:
   ```
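   // Imports assumed for this repro; `Result` is taken to be object_store's alias.
   use futures::StreamExt;
   use object_store::{local::LocalFileSystem, path::Path, ObjectStore, Result};

   #[tokio::main]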
   async fn main() -> Result<()> {
       let store = LocalFileSystem::new_with_prefix("/tmp/manyfiles")?;
   
       // list() returns a Stream of Result<ObjectMeta>
       let mut list_stream = store.list(None);
   
       // pull each ObjectMeta out of the stream
       while let Some(result) = list_stream.next().await {
           let meta = result?;
           // print its path
           println!("-> {} - {}", meta.location, meta.last_modified);
       }
   
   
   
       println!("\nListing files in batches of 10 using list_with_offset...");
       let mut offset: Option<String> = None;
       loop {
           // Choose list or list_with_offset based on whether we have an offset
           let mut batch_stream = if let Some(ref off) = offset {
               store.list_with_offset(None, &Path::from(off.as_str()))
           } else {
               store.list(None)
           };
   
           let mut count = 0;
           let mut last_path: Option<String> = None;
           while let Some(result) = batch_stream.next().await {
               let meta = result?;
               println!("-> {} - {}", meta.location, meta.last_modified);
               count += 1;
               last_path = Some(meta.location.to_string());
               if count >= 10 {
                   break;
               }
           }
           // Stop if fewer than 10 items were returned
           if count < 10 {
               break;
           }
           // Update offset for the next batch
           offset = last_path;
       }
   
       Ok(())
   }
   ```
   
   Here is an example of the results:
   ```
   cargo run
       Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.26s
        Running `target/debug/objectstore-list-bug`
   -> hello world.txt.822 - 2025-05-29 19:55:29.630128009 UTC
   -> hello world.txt.3678 - 2025-05-29 19:55:29.979447372 UTC
   -> hello world.txt.1913 - 2025-05-29 19:55:29.765314276 UTC
   -> hello world.txt.2598 - 2025-05-29 19:55:29.844496358 UTC
   -> hello world.txt.1779 - 2025-05-29 19:55:29.749811871 UTC
   -> hello world.txt.3812 - 2025-05-29 19:55:29.994586274 UTC
   ...
   ```
   
   
   
   **Expected behavior**
   I should be able to use offset listing with local files just like it works with S3 and other stores.
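
As a rough illustration of the expected behavior, the sketch below lists a local directory in sorted order and applies the offset afterwards, using only the standard library. `sorted_page` is a hypothetical helper, not part of `object_store`, and it handles a single flat directory only. It also reads the whole directory on every call, so it shows the ordering contract rather than an efficient implementation.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: returns up to `limit` file names in `dir` that sort
/// strictly after `offset`, in lexicographic order.
fn sorted_page(dir: &Path, offset: Option<&str>, limit: usize) -> io::Result<Vec<String>> {
    let mut names = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        // Only regular files are listed; subdirectories are ignored here.
        if entry.file_type()?.is_file() {
            names.push(entry.file_name().to_string_lossy().into_owned());
        }
    }
    // read_dir makes no ordering promise, so sort before applying the offset.
    names.sort();
    // Index of the first name strictly greater than the offset.
    let start = match offset {
        Some(off) => names.partition_point(|name| name.as_str() <= off),
        None => 0,
    };
    Ok(names.into_iter().skip(start).take(limit).collect())
}
```

Because every page is a contiguous slice of the sorted names, the last name of one page is a valid offset for the next call, and the caller needs no state beyond that string. The order is lexicographic, so `hello world.txt.10` sorts before `hello world.txt.2`.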
   
   
   **Additional context**
   I found this behavior on macOS and Red Hat Linux.

