I ran into the same issue and ended up building a separate operator that works as you describe, though I haven't submitted it as a PR. Happy to share my implementation with you.

I found that it's useful to have both ways of transferring data. Initially, I migrated all of my S3ToGCS tasks to use the transfer service, but I found that its performance can be unreliable with some combination of 1) transferring smaller datasets and 2) invoking many transfers in parallel. The transfer service is a bit of a black box, so when it doesn't work as expected you're stuck. Because of this, I ended up migrating some of my tasks back to the original implementation.

I would definitely keep both options around--I don't think I have a preference between a new operator vs a param on the existing operator.

Chris

On Fri, Oct 19, 2018, at 7:09 AM, Conrad Lee wrote:
> Hello Airflow community,
>
> I'm interested in transferring data between S3 and Google Cloud
> Storage. I want to transfer data on the scale of hundreds of gigabytes
> to a few terabytes.
>
> Airflow already has an operator that could be used for this use-case:
> the S3ToGoogleCloudStorageOperator. However, looking over its
> implementation it appears that all the data to be transferred actually
> passes through the machine running airflow. That seems completely
> unnecessary to me, and will place a lot of burden on the airflow
> workers and will be bottlenecked by the bandwidth of the workers. It
> could even lead to out of disk errors like this one
> <https://stackoverflow.com/questions/52400144/airflow-s3togooglecloudstorageoperator-no-space-left-on-device>.
>
> I would much rather use Google Cloud's 'Transfer Service' for doing
> this--that way the airflow operator just needs to make an API call and
> (optionally) keep polling the API until the transfer is done (this last
> bit could be done in a sensor). The heavy work of performing the
> transfer is offloaded to the Transfer Service.
>
> Was it an intentional design decision to avoid using the Google
> Transfer Service? If I create a PR that adds the ability to perform
> transfers with the Google Transfer Service, should it
>
> - replace the existing operator
> - be an option on the existing operator (i.e., add an argument that
>   toggles between 'local worker transfer' and 'google hosted transfer')
> - make a new operator
>
> Thanks,
> Conrad Lee
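
For reference, the "make an API call, then poll until done" flow from the quoted message could look roughly like the sketch below. It is not the operator mentioned in the reply, which was not shared in this thread. The function names `build_s3_to_gcs_transfer_job` and `wait_for_transfer` are hypothetical, and the request-body fields and status strings follow my understanding of the Storage Transfer Service v1 REST API, so check them against the current documentation before relying on them. The status lookup is passed in as a callable so the sketch runs without credentials.

```python
import datetime
import time


def build_s3_to_gcs_transfer_job(project_id, s3_bucket, gcs_bucket,
                                 aws_access_key_id, aws_secret_access_key,
                                 run_date=None):
    """Build a request body for a one-time S3 -> GCS transfer job.

    Field names are assumed from the Storage Transfer Service v1 REST API.
    """
    run_date = run_date or datetime.date.today()
    day = {"year": run_date.year, "month": run_date.month, "day": run_date.day}
    return {
        "description": "s3://%s -> gs://%s" % (s3_bucket, gcs_bucket),
        "status": "ENABLED",
        "projectId": project_id,
        # Identical start and end dates request a single run.
        "schedule": {"scheduleStartDate": day, "scheduleEndDate": day},
        "transferSpec": {
            "awsS3DataSource": {
                "bucketName": s3_bucket,
                "awsAccessKey": {
                    "accessKeyId": aws_access_key_id,
                    "secretAccessKey": aws_secret_access_key,
                },
            },
            "gcsDataSink": {"bucketName": gcs_bucket},
        },
    }


def wait_for_transfer(get_status, poll_interval=30, timeout=3600,
                      sleep=time.sleep):
    """Call get_status() repeatedly until the transfer finishes.

    get_status is a caller-supplied function returning a status string.
    Returns "SUCCESS", or raises RuntimeError on failure or timeout.
    """
    waited = 0
    while True:
        status = get_status()
        if status == "SUCCESS":
            return status
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError("Transfer ended with status %s" % status)
        if waited >= timeout:
            raise RuntimeError("Timed out after %s seconds" % waited)
        sleep(poll_interval)
        waited += poll_interval
```

In a real operator the body would presumably be sent with something like `transferJobs().create(body=...)` on a client built via `googleapiclient.discovery.build("storagetransfer", "v1")`, and `get_status` would wrap a `transferOperations().list(...)` call. The polling half could also live in a sensor, as the quoted message suggests, so the worker slot is not held for the whole transfer.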