seanhagen opened a new issue, #34151:
URL: https://github.com/apache/beam/issues/34151

   ### What happened?
   
   Is there something special I need to do in order to use the ReadFromSnowflake IO source in Python when the `staging_bucket_name` is in AWS S3?
   
   I've documented the issue I'm encountering more fully in [this StackOverflow question](https://stackoverflow.com/questions/79476753/apache-beam-use-s3-bucket-for-snowflake-csv-output-when-using-apache-beam-io-sn), but the gist is that when I try to use an S3 bucket as the staging bucket, Beam throws an error saying "no filesystem found for scheme s3".
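
   For reference, here's a minimal sketch of roughly what I'm running -- the server name, credentials, table, storage integration, and csv_mapper below are placeholders, not my real values:

```python
import apache_beam as beam
from apache_beam.io.snowflake import ReadFromSnowflake
from apache_beam.options.pipeline_options import PipelineOptions


# Placeholder mapper -- the real one parses the CSV columns Snowflake writes out.
def csv_mapper(strings_array):
    return {"id": strings_array[0], "name": strings_array[1]}


# In my real code the options also include AWS credentials/region settings.
options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    rows = p | ReadFromSnowflake(
        server_name="myaccount.snowflakecomputing.com",
        schema="PUBLIC",
        database="MY_DB",
        staging_bucket_name="s3://my-bucket/",  # trailing slash required, see below
        storage_integration_name="my_s3_integration",
        csv_mapper=csv_mapper,
        table="MY_TABLE",
        username="user",
        password="password",
    )
    rows | beam.Map(print)
```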
   
   The Snowflake side of things seems to be working properly, because I can see the gzipped CSVs showing up in the bucket -- i.e., there's an object at `s3://my-bucket//sf_copy_csv_20250303_145430_11651ed9/run_825c058f/data_0_0_0.csv.gz` -- but the expansion service doesn't seem to be able to read the object, presumably because the s3 scheme isn't registered as a filesystem.
   
   Is there something extra I have to do in my Python code to fix this? Do I have to run the expansion service manually and pass it some additional arguments?
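
   If running it manually is the answer, is something like this the intended approach? (The jar name, the port, and the idea that the service would pick up S3/AWS configuration this way are all guesses on my part.)

```python
# Guess: start the Snowflake expansion service on a local port first, e.g.
#   java -jar beam-sdks-java-io-snowflake-expansion-service-2.63.0.jar 8097
# and then point the transform at it instead of letting Beam auto-start the
# bundled jar:
from apache_beam.io.snowflake import ReadFromSnowflake

read = ReadFromSnowflake(
    server_name="myaccount.snowflakecomputing.com",  # placeholder
    schema="PUBLIC",
    database="MY_DB",
    staging_bucket_name="s3://my-bucket/",
    storage_integration_name="my_s3_integration",    # placeholder
    csv_mapper=lambda cols: cols,                    # placeholder mapper
    table="MY_TABLE",
    username="user",                                 # placeholder credentials
    password="password",
    expansion_service="localhost:8097",              # manually started service
)
```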
   
   Additionally, I'm a bit confused by the docs, which seem contradictory. [These docs](https://beam.incubator.apache.org/documentation/io/built-in/snowflake/#reading-from-snowflake-1) state that I can use an S3 bucket, but the [pydocs](https://beam.apache.org/releases/pydoc/2.63.0/apache_beam.io.snowflake.html) for the Snowflake module don't mention S3 at all. If the pydocs are correct, could the site docs be updated so things are clearer? Or, if S3 really is supported for the staging bucket, could both sets of docs be updated with instructions on how to use it?
   
   The last thing is that if I leave the trailing `/` off the bucket name, Beam complains and doesn't even run the pipeline. But as you can see above, keeping it results in a path with a double slash: `s3://my-bucket//sf_copy...`. I doubt this has anything to do with the error I'm encountering, but it would be nice if it were fixed so the files are easier to find in S3. I'm not sure whether this also happens with GCS.
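
   To illustrate (this is my guess at the path construction, based on the object key I see in S3):

```python
# The trailing slash is required on staging_bucket_name, but the copy directory
# appears to be joined with another "/", producing a double slash:
staging_bucket_name = "s3://my-bucket/"
stage_dir = staging_bucket_name + "/sf_copy_csv_20250303_145430_11651ed9"
print(stage_dir)  # s3://my-bucket//sf_copy_csv_20250303_145430_11651ed9
```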
   
   I've also posted this to the `[email protected]` mailing list, but I'm not sure whether I needed to register with the list before sending the email, so I'm filing this issue as well just in case.
   
   ### Issue Priority
   
   Priority: 2 (default / most bugs should be filed as P2)
   
   ### Issue Components
   
   - [x] Component: Python SDK
   - [ ] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [x] Component: IO connector
   - [ ] Component: Beam YAML
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Infrastructure
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner

