cj-zhukov commented on PR #18747: URL: https://github.com/apache/datafusion/pull/18747#issuecomment-3569072227
> I tried it out locally
>
> ```shell
> (venv) andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ cargo run --example external_dependency -- dataframe_to_s3
>     Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.33s
>      Running `target/debug/examples/external_dependency dataframe_to_s3`
>
> thread 'main' (45830553) panicked at datafusion-examples/examples/external_dependency/dataframe_to_s3.rs:51:59:
> called `Result::unwrap()` on an `Err` value: NotPresent
> note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
> (venv) andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ cargo run --example external_dependency -- query_aws_s3
>     Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.19s
>      Running `target/debug/examples/external_dependency query_aws_s3`
>
> thread 'main' (45831058) panicked at datafusion-examples/examples/external_dependency/query_aws_s3.rs:45:59:
> called `Result::unwrap()` on an `Err` value: NotPresent
> note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
> ```
>
> I think the failures are due to the fact I don't have AWS set up

Good catch, and thank you for testing this.

After digging into this further (including the discussion in awslabs/open-data-registry#1418), it turns out the `s3://nyc-tlc` bucket no longer allows anonymous access. Both `ListBucket` and `GetObject` now require credentials, which explains what we are seeing:

```bash
aws s3 ls s3://nyc-tlc/ --no-sign-request
```

- The command above fails with `AccessDenied`
- Your local run panics with `NotPresent`
- My machine shows the same behavior

I’m also not fully sure why the example used to work. My best guess is that CI was running it with temporary AWS credentials in the environment, which made the requests signed, and AWS allowed them at the time.

But since the bucket now rejects anonymous access entirely, relying on it is no longer viable. Even if it starts working again, it’s outside our control.
If it changes permissions again (as it just did), we’ll silently break the example for users.

So I suggest we stop depending on nyc-tlc altogether and instead:

- Use a user-controlled bucket for the example (as implemented in this PR). This avoids relying on external datasets with changing policies. Also update the docs and inline comments to clearly explain that:
  - users must provide their own S3 bucket and Parquet file
  - the example expects valid AWS credentials to be configured
- Add a comment and a README note explaining why the example can’t use NYC TLC anymore, and link to the GitHub issue for context.
- Optionally, we could switch to another public dataset, but personally I think that’s risky, since we can’t guarantee its permissions won’t change in the future.

This way, the example will always behave predictably and won’t require users to debug `AccessDenied` errors caused by external policy changes.

Let me know what you think, happy to refine it.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

For additional commands, e-mail: [email protected]
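For reference, the `NotPresent` panics in the quoted log most likely come from calling `unwrap()` on the result of `std::env::var` for a variable that is not set. Below is a minimal sketch of how the example could report that more clearly once it requires user-provided credentials. It assumes the example reads the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` variables; the `AwsCredentials`, `MissingVar`, `require`, and `load_credentials` names are hypothetical and not part of this PR.

```rust
use std::env;
use std::fmt;

/// Hypothetical holder for the credentials the example would need.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
struct AwsCredentials {
    access_key_id: String,
    secret_access_key: String,
}

/// Error naming the environment variable that was missing or empty.
#[derive(Debug, PartialEq)]
struct MissingVar(&'static str);

impl fmt::Display for MissingVar {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "environment variable `{}` is not set; configure AWS credentials before running this example",
            self.0
        )
    }
}

/// Look up one variable through `lookup`, treating an empty value as missing.
fn require(
    name: &'static str,
    lookup: &dyn Fn(&str) -> Option<String>,
) -> Result<String, MissingVar> {
    lookup(name)
        .filter(|value| !value.is_empty())
        .ok_or(MissingVar(name))
}

/// Collect both credentials, stopping at the first missing variable.
/// `lookup` is injected so the logic can be exercised without touching
/// the real process environment.
fn load_credentials(
    lookup: &dyn Fn(&str) -> Option<String>,
) -> Result<AwsCredentials, MissingVar> {
    Ok(AwsCredentials {
        access_key_id: require("AWS_ACCESS_KEY_ID", lookup)?,
        secret_access_key: require("AWS_SECRET_ACCESS_KEY", lookup)?,
    })
}

fn main() {
    // Read from the real environment instead of calling `unwrap()`.
    match load_credentials(&|name: &str| env::var(name).ok()) {
        Ok(_) => println!("AWS credentials found in the environment"),
        Err(err) => eprintln!("{err}"),
    }
}
```

With this shape, a user without AWS configured would see a message naming the missing variable instead of a bare `NotPresent` panic.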
