djouallah opened a new issue, #695:
URL: https://github.com/apache/arrow-rs-object-store/issues/695

   # `MicrosoftAzure::list_with_offset` returns empty on OneLake since 0.13.0 (regression from #623)
   
   ## Describe the bug
   
   Against Microsoft Fabric OneLake (`*.dfs.fabric.microsoft.com`), 
`ObjectStore::list_with_offset(prefix, offset)` returns **zero** entries even 
when the prefix contains files strictly greater than `offset`. The equivalent 
`list(prefix)` on the same store returns the correct files, so the data is 
reachable — only the offset-based listing is broken.
   
   This regressed in `object_store` 0.13.0 via #623, which replaced the default 
fallback with an Azure-specific implementation that uses the ADLS Gen2 
`startFrom` URI parameter. OneLake's REST surface does not appear to handle 
`startFrom` the way the standard ADLS Gen2 endpoint does: the listing comes 
back empty.
   
   ## Impact
   
   Every downstream project that uses `list_with_offset` against OneLake is 
broken on `object_store >= 0.13.0`:
   
   - `delta-kernel-rs` (used by DuckDB's delta extension, delta-rs): loading a 
Delta table with a `_last_checkpoint` hint fails with `Invalid Checkpoint: Had 
a _last_checkpoint hint but didn't find any checkpoints`. See 
[delta-io/delta-kernel-rs#2433](https://github.com/delta-io/delta-kernel-rs/issues/2433)
 and the (now-closed) workaround attempt 
[#2437](https://github.com/delta-io/delta-kernel-rs/pull/2437).
   - `lakehq/sail` (does NOT use delta-kernel-rs; independently hits the same 
bug): [lakehq/sail#1730](https://github.com/lakehq/sail/issues/1730).
   
   ## To Reproduce
   
   Minimal, no-delta-kernel reproducer below. The only thing swapped between 
the two runs is the `object_store` pin.
   
   `Cargo.toml`:
   ```toml
   [package]
   name = "onelake-repro"
   version = "0.0.1"
   edition = "2021"
   
   [dependencies]
   # Swap between "=0.12.5" (works) and "=0.13.2" (broken)
   object_store = { version = "=0.13.2", features = ["azure"] }
   futures = "0.3"
   tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
   url = "2"
   anyhow = "1"
   ```
   
   `src/main.rs`:
   ```rust
   use std::env;
   
   use anyhow::{anyhow, Context, Result};
   use futures::stream::StreamExt;
   use object_store::azure::{AzureConfigKey, MicrosoftAzureBuilder};
   use object_store::path::Path;
   use object_store::{ObjectMeta, ObjectStore};
   
   #[tokio::main(flavor = "multi_thread", worker_threads = 2)]
   async fn main() -> Result<()> {
       let args: Vec<String> = env::args().collect();
       if args.len() != 5 {
           return Err(anyhow!("usage: onelake-repro <workspace> <lakehouse> <table> <checkpoint_version>"));
       }
       let workspace = &args[1];
       let lakehouse = &args[2];
       let table = &args[3];
       let ckpt_version: u64 = args[4].parse()?;
   
       let token = env::var("AZURE_STORAGE_TOKEN")
           .context("AZURE_STORAGE_TOKEN not set")?;
   
       let url = format!(
           "abfss://{workspace}@onelake.dfs.fabric.microsoft.com/{lakehouse}.Lakehouse/Tables/{table}/"
       );
   
       let store = MicrosoftAzureBuilder::new()
           .with_url(url.as_str())
           .with_config(AzureConfigKey::Token, token)
           .build()?;
   
       let prefix_str = format!("{lakehouse}.Lakehouse/Tables/{table}/_delta_log");
       let prefix = Path::from(prefix_str.as_str());
       let offset = Path::from(format!("{prefix_str}/{ckpt_version:020}").as_str());
   
       let a = collect(store.list(Some(&prefix))).await?;
       println!("A) list(prefix): {} entries", a.len());
       for loc in &a { println!("   {loc}"); }
   
       let b = collect(store.list_with_offset(Some(&prefix), &offset)).await?;
       println!("\nB) list_with_offset(prefix, offset): {} entries", b.len());
       for loc in &b { println!("   {loc}"); }
   
       Ok(())
   }
   
   async fn collect<S>(mut s: S) -> Result<Vec<String>>
   where S: futures::Stream<Item = object_store::Result<ObjectMeta>> + Unpin
   {
       let mut out = vec![];
       while let Some(m) = s.next().await { out.push(m?.location.to_string()); }
       out.sort();
       Ok(out)
   }
   ```
   
   Run:
   ```bash
   export AZURE_STORAGE_TOKEN=$(az account get-access-token --resource https://storage.azure.com/ --query accessToken -o tsv)
   cargo run --release -- <workspace> <lakehouse> <table> <checkpoint_version>
   ```
   
   ## Expected behavior
   
   `list_with_offset(prefix, offset)` should return exactly the files in 
`list(prefix)` whose location is lexicographically greater than `offset`.
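
The expected semantics can be sketched without any network access. This is a minimal illustration only, assuming plain byte-wise string comparison stands in for `Path` ordering; `checkpoint_offset` and `expected_after_offset` are hypothetical helper names used here for illustration and are not part of `object_store`.

```rust
/// Hypothetical helper: builds the offset the same way the reproducer does,
/// i.e. the prefix plus the version zero-padded to 20 digits.
fn checkpoint_offset(prefix: &str, version: u64) -> String {
    format!("{prefix}/{version:020}")
}

/// Hypothetical helper: the result a caller would expect from
/// `list_with_offset`, computed client-side from a full listing.
/// Keeps only locations strictly greater than `offset` (byte-wise),
/// returned in sorted order.
fn expected_after_offset<'a>(listing: &[&'a str], offset: &str) -> Vec<&'a str> {
    let mut out: Vec<&'a str> = listing
        .iter()
        .copied()
        .filter(|loc| *loc > offset)
        .collect();
    out.sort();
    out
}
```

Applied to the 11-entry `_delta_log` listing shown under "Actual behavior" with an offset built for version 10, this selects the same six entries that the 0.12.5 run prints for B. The offset itself (`_delta_log/00000000000000000010`) is a strict prefix of the version-10 file names, so it sorts before them and they are included.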
   
   ## Actual behavior
   
   Against the same OneLake table (a Delta table with `_last_checkpoint` at 
v10):
   
   **With `object_store = "=0.12.5"`** (works):
   ```
   A) list(prefix): 11 entries
      _delta_log/00000000000000000005.json
      _delta_log/00000000000000000006.json
      _delta_log/00000000000000000007.json
      _delta_log/00000000000000000008.json
      _delta_log/00000000000000000009.json
      _delta_log/00000000000000000010.checkpoint.parquet
      _delta_log/00000000000000000010.json
      _delta_log/00000000000000000011.json
      _delta_log/00000000000000000012.json
      _delta_log/00000000000000000013.json
      _delta_log/_last_checkpoint
   
   B) list_with_offset(prefix, _delta_log/00000000000000000010): 6 entries
      _delta_log/00000000000000000010.checkpoint.parquet
      _delta_log/00000000000000000010.json
      _delta_log/00000000000000000011.json
      _delta_log/00000000000000000012.json
      _delta_log/00000000000000000013.json
      _delta_log/_last_checkpoint
   ```
   
   **With `object_store = "=0.13.2"`** (broken):
   ```
   A) list(prefix): 11 entries       <-- identical to above
      ...
   
   B) list_with_offset(prefix, _delta_log/00000000000000000010): 0 entries
   ```
   
   (Only `list_with_offset` differs between the two runs.)
   
   ## Suspected cause
   
   [#623](https://github.com/apache/arrow-rs-object-store/pull/623) added a 
direct `list_with_offset` implementation for Azure that sends 
`startFrom=<offset>` per the [ADLS Gen2 list-blobs 
API](https://learn.microsoft.com/en-us/rest/api/storageservices/list-blobs?view=rest-storageservices-datalakestoragegen2-2019-12-12&tabs=microsoft-entra-id#uri-parameters).
 OneLake's endpoint apparently does not implement `startFrom` compatibly — it 
returns an empty list regardless of the offset value.
   
   This matches `lonless9`'s analysis on 
[lakehq/sail#1730](https://github.com/lakehq/sail/issues/1730) and the related 
[Azurite#2619](https://github.com/Azure/Azurite/issues/2619#issuecomment-3660701055).
   
   ## Environment
   
   - `object_store` 0.13.2 (and 0.13.0, 0.13.1 — all contain #623)
   - OneLake endpoint `onelake.dfs.fabric.microsoft.com`
   - Service-principal / Azure CLI bearer token (same auth in both runs; auth 
is not the issue)
   - Observed on Windows 11 / rustc 1.95.0, but not platform-dependent
   
   ---
   
   Repro and report co-drafted with [Claude 
Code](https://claude.com/claude-code) (Claude Opus 4.7).
   

