Hi Vicky,

This is a good summary. While more people add their stories, why don't you start a wiki page with these details? I will also chime in with the different use cases we use Gobblin for once I clear a bit of my backlog.
Regards,
Abhishek

On Wed, Mar 28, 2018 at 3:29 AM, Vicky Kak <[email protected]> wrote:
> Hi Guys,
>
> I am in the process of using the Gobblin cluster to address the streaming
> use case; I have yet to look at the code. However, I would like to validate
> my understanding and design approaches based on the quantum of data to be
> ingested via Gobblin. Following is how I would classify the Gobblin
> solutions based on the quantum of data:
>
> Quantum of Data:
>
> - Small/Medium - This can be processed using the standalone mode.
> - Large - This can be processed using the MR or YARN mode.
> - Unbounded (Stream) - Gobblin Cluster.
>
> Bounded data (small/medium/large) to be ingested using Gobblin may have
> metadata which helps the Source partition the data and create the
> WorkUnits. However, in some cases we don't have metadata about the source
> data, so partitioning the data upfront is not possible. In the latter case
> we may iterate over the entire source and create the partitions while
> doing so; this is discussed here:
> https://mail-archives.apache.org/mod_mbox/incubator-gobblin-user/201709.mbox/browser
>
> For the case of unbounded data we need to address the following:
> - Have Gobblin nodes which will process the partitioned data.
> - The data should be partitioned so that it can be processed quickly and
> continuously.
> - Fault tolerance.
> - Scaling of the Gobblin processing nodes.
>
> Currently it seems that the unbounded data use case is handled via Gobblin
> Cluster and Kafka. Here is how it is addressed as of now:
> - The unbounded data is pushed to Kafka, which partitions the data.
> - The Source implementation can create the WorkUnits which will read the
> data per Kafka partition.
> - Starting the job creates the WorkUnits based on the existing partitions.
> For each partition a Task is pushed to the distributed task queue based on
> Helix.
> - Gobblin Cluster is based on the master/worker architecture and uses the
> Helix Task Framework under the hood.
>
> I would like to hear more about the usage patterns from the community and
> the developer team so that I can consolidate the information and post it
> to the wiki for the use of others too.
>
> Thanks,
> Vicky
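
For reference, below is a minimal sketch of the per-partition WorkUnit creation described in Vicky's mail: a Source that asks Kafka for a topic's partitions and emits one WorkUnit per partition. It is only an illustration of the idea, not Gobblin's actual Kafka connector (the project ships its own KafkaSource); the class name PerPartitionKafkaSource and the property names kafka.topic, kafka.brokers, and kafka.partition.id are made up for the example, and a real implementation would also track per-partition offsets as watermarks.

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;

import org.apache.gobblin.configuration.SourceState;
import org.apache.gobblin.configuration.WorkUnitState;
import org.apache.gobblin.source.Source;
import org.apache.gobblin.source.extractor.Extractor;
import org.apache.gobblin.source.workunit.Extract;
import org.apache.gobblin.source.workunit.WorkUnit;

/**
 * Sketch only: creates one WorkUnit per Kafka partition of a single topic.
 * The property names "kafka.topic" and "kafka.brokers" are hypothetical.
 */
public class PerPartitionKafkaSource implements Source<String, String> {

  @Override
  public List<WorkUnit> getWorkunits(SourceState state) {
    String topic = state.getProp("kafka.topic");
    String brokers = state.getProp("kafka.brokers");

    Properties consumerProps = new Properties();
    consumerProps.put("bootstrap.servers", brokers);
    consumerProps.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    consumerProps.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    List<WorkUnit> workUnits = new ArrayList<>();
    // Ask Kafka for the topic's partitions and emit one WorkUnit per partition;
    // each WorkUnit records which partition its Task should read.
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
      for (PartitionInfo partition : consumer.partitionsFor(topic)) {
        Extract extract = new Extract(Extract.TableType.APPEND_ONLY, "kafka", topic);
        WorkUnit workUnit = WorkUnit.create(extract);
        workUnit.setProp("kafka.topic", topic);
        workUnit.setProp("kafka.partition.id", partition.partition());
        workUnits.add(workUnit);
      }
    }
    return workUnits;
  }

  @Override
  public Extractor<String, String> getExtractor(WorkUnitState workUnitState) {
    // A real implementation would return an Extractor that polls only the
    // partition recorded in the WorkUnit and tracks offsets as watermarks.
    throw new UnsupportedOperationException("sketch only");
  }

  @Override
  public void shutdown(SourceState state) {
    // Nothing to clean up in this sketch.
  }
}

In cluster mode each of these WorkUnits then becomes a Helix task, which is how the per-partition work ends up distributed across the worker nodes, as described in the mail above.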
