Hi Shirshanka,

Thanks for explaining the possible workaround for this. I agree, it may add some burden and complexity to the platform, but provided it's done right, it may be helpful.
Let me try the suggested approach as well.

Thanks
Jay

On Sat, Mar 30, 2019 at 12:31 PM Shirshanka Das <[email protected]> wrote:

> Hi Jay,
>
> This would create unnecessary burden on Gobblin to support multi-cluster
> data movement, buffering etc. inside the infrastructure. From what I
> remember, systems such as NiFi and Flume have attempted this and IMO
> generally build a poor-man's version of Kafka inside the system which
> doesn't have all the bells and whistles.
>
> The way I would recommend solving this would be to have two hops:
> src -> Gobblin -> [globally replicated stream / blob store] --> Gobblin -> dst
>
> The globally replicated stream/blob store could be Kafka / HDFS / S3 /
> ADLS / etc.
>
> If you need to run the replication yourself, you might need to run another
> Gobblin pipe between the two colos.
>
> We actually have pipelines in production that do:
> src -> Gobblin -> HDFS (colo1) -> Gobblin (distcp-ng) -> HDFS (colo2) -> Gobblin -> dst
>
> The Gobblin-as-a-service project automates the provisioning and
> orchestration of these multi-hop pipelines. So you just say src -> Gobblin
> -> dst and the intermediate hops are provisioned for you.
>
> HTH,
> Shirshanka
>
> On Mon, Mar 25, 2019 at 11:53 PM Jayesh Senjaliya <[email protected]> wrote:
>
> > Yes, exactly Shirshanka.
> >
> > Typically, companies with large infrastructure will have some levels of
> > security zones in addition to different data centers.
> >
> > If there is a source or target in a higher security zone, you would have to
> > open the firewall for each source or target, which could run to 100s or
> > 1000s of firewall rules.
> > Instead, if we can have cluster-to-cluster communication (activated based
> > on the given source/target), then we only need firewall rules between
> > Gobblin clusters.
> >
> > So, as you rightly understood, it would be:
> > src --> Gobblin --(cross-colo)---> Gobblin --> dst
> >
> > Any idea if this is doable in Gobblin?
> >
> > Thanks
> > Jay
> >
> > On Mon, Mar 25, 2019 at 11:03 PM Shirshanka Das <[email protected]> wrote:
> >
> > > Hi Jay,
> > > That's an interesting idea. Can you describe the use-case a bit more?
> > > Do you have a use-case where you are pulling from a source into a
> > > specific data-center and then need to ship data to the destination from a
> > > different data-center or security zone?
> > >
> > > So are you proposing a pipeline like:
> > > src --> Gobblin --(cross-colo)---> Gobblin --> dst ?
> > >
> > > versus a typical: src --> Gobblin --> dst
> > >
> > > thanks,
> > > Shirshanka
> > >
> > > On Mon, Mar 25, 2019 at 8:32 PM Jay Sen <[email protected]> wrote:
> > >
> > > > Hi Gobblin Devs,
> > > >
> > > > Is there a way to do data movement between Gobblin clusters to support
> > > > read/write of data across data centers or security zones, so that
> > > > firewall/encryption and other tight measures are only required between
> > > > Gobblin clusters and not from/to all sources/targets?
> > > >
> > > > Thanks
> > > > Jay
