Hi Guys,

I am in the process of using Gobblin cluster to address a streaming use case; I have yet to look at the code. However, I would like to validate my understanding and design approaches based on the quantum of data to be ingested via Gobblin. Here is how I would classify the Gobblin solutions based on the quantum of data:

- Small/Medium: can be processed using the standalone mode.
- Large: can be processed using the MR or YARN mode.
- Unbounded (stream): Gobblin cluster.

Bounded data (small/medium/large) to be ingested with Gobblin may come with metadata that helps the Source partition the data and create the WorkUnits. In some cases, however, we have no metadata about the source data, so partitioning it upfront is not possible. In the latter case we may iterate over the entire source and create the partitions while doing so; this is discussed here:
https://mail-archives.apache.org/mod_mbox/incubator-gobblin-user/201709.mbox/browser
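To make the no-metadata case concrete, below is a minimal sketch of a Source that scans the source once and buckets items into WorkUnits as it goes. This is only an illustration of the idea, not code taken from Gobblin: it assumes the Source/WorkUnit/Extract APIs from gobblin-api (package names may be gobblin.* or org.apache.gobblin.* depending on the version), and listSourceItems() plus the demo.* property keys are made-up placeholders.

// Sketch only: assumes the gobblin-api Source/WorkUnit/Extract classes;
// listSourceItems() and the demo.* keys are placeholders for illustration.
import java.util.ArrayList;
import java.util.List;

import org.apache.gobblin.configuration.SourceState;
import org.apache.gobblin.configuration.WorkUnitState;
import org.apache.gobblin.source.Source;
import org.apache.gobblin.source.extractor.Extractor;
import org.apache.gobblin.source.workunit.Extract;
import org.apache.gobblin.source.workunit.WorkUnit;

public class NoMetadataSource implements Source<String, String> {

  // Hypothetical knob: how many scanned items go into one work unit.
  private static final int ITEMS_PER_WORK_UNIT = 100;

  @Override
  public List<WorkUnit> getWorkunits(SourceState state) {
    Extract extract =
        state.createExtract(Extract.TableType.APPEND_ONLY, "demo.namespace", "demo_table");

    List<WorkUnit> workUnits = new ArrayList<>();
    WorkUnit current = null;
    int itemsInCurrent = 0;

    // Iterate the source once; since there is no metadata to partition on,
    // the partitions (work units) are created on the fly while scanning.
    for (String item : listSourceItems(state)) {
      if (current == null || itemsInCurrent == ITEMS_PER_WORK_UNIT) {
        current = WorkUnit.create(extract);
        workUnits.add(current);
        itemsInCurrent = 0;
      }
      // Record the item in the work unit so the Extractor knows what to read.
      current.setProp("demo.items." + itemsInCurrent, item);
      itemsInCurrent++;
    }
    return workUnits;
  }

  @Override
  public Extractor<String, String> getExtractor(WorkUnitState state) {
    throw new UnsupportedOperationException("Extractor omitted in this sketch");
  }

  @Override
  public void shutdown(SourceState state) {
    // Nothing to release in this sketch.
  }

  // Placeholder: in a real source this would list files, rows, ids, etc.
  private List<String> listSourceItems(SourceState state) {
    return new ArrayList<>();
  }
}

The Extractor for such a WorkUnit would then read only the items recorded in its own WorkUnit, so the one-time scan cost is paid in getWorkunits() and the actual ingestion still runs in parallel.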
For the case of unbounded data we need to address the following:
- Have Gobblin nodes that process the partitioned data.
- The data should be partitioned so that it can be processed faster and continuously.
- Fault tolerance.
- Scaling of the Gobblin processing nodes.

Currently it seems that the unbounded data use case is handled via Gobblin cluster and Kafka. Here is how it is addressed as of now:
- The unbounded data is pushed to Kafka, which partitions the data.
- The Source implementation can create WorkUnits that read the data per Kafka partition.
- Starting the job creates the WorkUnits based on the existing partitions; for each partition a Task is pushed to the distributed task queue backed by Helix.
- Gobblin cluster is based on a master/worker architecture and uses the Helix Task Framework under the hood.
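For the Kafka case, the per-partition WorkUnit creation looks roughly like the sketch below. Again, this is a simplified illustration of the pattern rather than Gobblin's actual KafkaSource: it uses the plain Kafka consumer API only to discover partitions, reuses the Gobblin WorkUnit/Extract classes from the previous sketch, and the demo.kafka.* property keys are invented for the example.

// Sketch only: one work unit per Kafka partition, discovered via the plain
// Kafka consumer API; property keys are placeholders for illustration.
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.apache.gobblin.configuration.SourceState;
import org.apache.gobblin.source.workunit.Extract;
import org.apache.gobblin.source.workunit.WorkUnit;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class PerPartitionWorkUnits {

  // One work unit per Kafka partition; each one later runs as one task.
  public static List<WorkUnit> workUnitsForTopic(SourceState state, String brokers, String topic) {
    Properties props = new Properties();
    props.put("bootstrap.servers", brokers);
    props.put("key.deserializer", ByteArrayDeserializer.class.getName());
    props.put("value.deserializer", ByteArrayDeserializer.class.getName());

    List<WorkUnit> workUnits = new ArrayList<>();
    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
      Extract extract =
          state.createExtract(Extract.TableType.APPEND_ONLY, "demo.kafka", topic);
      for (PartitionInfo partition : consumer.partitionsFor(topic)) {
        WorkUnit workUnit = WorkUnit.create(extract);
        workUnit.setProp("demo.kafka.topic", topic);
        workUnit.setProp("demo.kafka.partition", partition.partition());
        workUnits.add(workUnit);
      }
    }
    return workUnits;
  }
}

Each such WorkUnit then maps to one Task in the Helix-backed queue, so adding Kafka partitions (or Gobblin worker nodes) scales the ingestion out.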
I would like to hear more about the usage patterns from the community and the developer team, so that I can consolidate the information and post it to the wiki for the benefit of others too.

Thanks,
Vicky