Hi Guys,

I am in the process of using Gobblin cluster to address a streaming use case; I have yet to look at the code. However, I would like to validate my understanding and design approaches based on the quantum of data to be ingested via Gobblin. Here is how I would classify the Gobblin solutions based on the quantum of data:

- Small/Medium: can be processed using the standalone mode.
- Large: can be processed using the MR or YARN mode.
- Unbounded (stream): Gobblin cluster.

Bounded data (small/medium/large) to be ingested with Gobblin may come with metadata that helps the Source partition the data and create the WorkUnits. In some cases, however, we have no metadata about the source data, so partitioning it upfront is not possible. In the latter case we may iterate over the entire source and create the partitions while doing so; this is discussed here:
https://mail-archives.apache.org/mod_mbox/incubator-gobblin-user/201709.mbox/browser
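To make the no-metadata case concrete, below is a minimal sketch of a Source that scans the source once and buckets items into WorkUnits as it goes. This is only an illustration of the idea, not code taken from Gobblin: it assumes the Source/WorkUnit/Extract APIs from gobblin-api (package names may be gobblin.* or org.apache.gobblin.* depending on the version), and listSourceItems() plus the demo.* property keys are made-up placeholders.

// Sketch only: assumes the gobblin-api Source/WorkUnit/Extract classes;
// listSourceItems() and the demo.* keys are placeholders for illustration.
import java.util.ArrayList;
import java.util.List;

import org.apache.gobblin.configuration.SourceState;
import org.apache.gobblin.configuration.WorkUnitState;
import org.apache.gobblin.source.Source;
import org.apache.gobblin.source.extractor.Extractor;
import org.apache.gobblin.source.workunit.Extract;
import org.apache.gobblin.source.workunit.WorkUnit;

public class NoMetadataSource implements Source<String, String> {

  // Hypothetical knob: how many scanned items go into one work unit.
  private static final int ITEMS_PER_WORK_UNIT = 100;

  @Override
  public List<WorkUnit> getWorkunits(SourceState state) {
    Extract extract =
        state.createExtract(Extract.TableType.APPEND_ONLY, "demo.namespace", "demo_table");

    List<WorkUnit> workUnits = new ArrayList<>();
    WorkUnit current = null;
    int itemsInCurrent = 0;

    // Iterate the source once; since there is no metadata to partition on,
    // the partitions (work units) are created on the fly while scanning.
    for (String item : listSourceItems(state)) {
      if (current == null || itemsInCurrent == ITEMS_PER_WORK_UNIT) {
        current = WorkUnit.create(extract);
        workUnits.add(current);
        itemsInCurrent = 0;
      }
      // Record the item in the work unit so the Extractor knows what to read.
      current.setProp("demo.items." + itemsInCurrent, item);
      itemsInCurrent++;
    }
    return workUnits;
  }

  @Override
  public Extractor<String, String> getExtractor(WorkUnitState state) {
    throw new UnsupportedOperationException("Extractor omitted in this sketch");
  }

  @Override
  public void shutdown(SourceState state) {
    // Nothing to release in this sketch.
  }

  // Placeholder: in a real source this would list files, rows, ids, etc.
  private List<String> listSourceItems(SourceState state) {
    return new ArrayList<>();
  }
}

The Extractor for such a WorkUnit would then read only the items recorded in its own WorkUnit, so the one-time scan cost is paid in getWorkunits() and the actual ingestion still runs in parallel.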
For the case of unbounded data we need to address the following:
- Have Gobblin nodes that process the partitioned data.
- The data should be partitioned so that it can be processed faster and continuously.
- Fault tolerance.
- Scaling of the Gobblin processing nodes.

Currently it seems that the unbounded data use case is handled via Gobblin cluster and Kafka. Here is how it is addressed as of now:
- The unbounded data is pushed to Kafka, which partitions the data.
- The Source implementation can create WorkUnits that read the data per Kafka partition.
- Starting the job creates the WorkUnits based on the existing partitions; for each partition a Task is pushed to the distributed task queue backed by Helix.
- Gobblin cluster is based on a master/worker architecture and uses the Helix Task Framework under the hood.
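For the Kafka case, the per-partition WorkUnit creation looks roughly like the sketch below. Again, this is a simplified illustration of the pattern rather than Gobblin's actual KafkaSource: it uses the plain Kafka consumer API only to discover partitions, reuses the Gobblin WorkUnit/Extract classes from the previous sketch, and the demo.kafka.* property keys are invented for the example.

// Sketch only: one work unit per Kafka partition, discovered via the plain
// Kafka consumer API; property keys are placeholders for illustration.
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.apache.gobblin.configuration.SourceState;
import org.apache.gobblin.source.workunit.Extract;
import org.apache.gobblin.source.workunit.WorkUnit;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class PerPartitionWorkUnits {

  // One work unit per Kafka partition; each one later runs as one task.
  public static List<WorkUnit> workUnitsForTopic(SourceState state, String brokers, String topic) {
    Properties props = new Properties();
    props.put("bootstrap.servers", brokers);
    props.put("key.deserializer", ByteArrayDeserializer.class.getName());
    props.put("value.deserializer", ByteArrayDeserializer.class.getName());

    List<WorkUnit> workUnits = new ArrayList<>();
    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
      Extract extract =
          state.createExtract(Extract.TableType.APPEND_ONLY, "demo.kafka", topic);
      for (PartitionInfo partition : consumer.partitionsFor(topic)) {
        WorkUnit workUnit = WorkUnit.create(extract);
        workUnit.setProp("demo.kafka.topic", topic);
        workUnit.setProp("demo.kafka.partition", partition.partition());
        workUnits.add(workUnit);
      }
    }
    return workUnits;
  }
}

Each such WorkUnit then maps to one Task in the Helix-backed queue, so adding Kafka partitions (or Gobblin worker nodes) scales the ingestion out.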
I would like to hear more about the usage patterns from the community and the developer team, so that I can consolidate the information and post it to the wiki for the benefit of others too.

Thanks,
Vicky