Hi Yan,
I currently use one process in a Docker container. My assumption was that it
already spawned multiple processes so I was comfortable with that solution
until we have to scale some job beyond one machine.
Docker containers are restarted automatically on failure and we collect
metrics from the metrics log in Kafka in case there is any serious fault and
we need to be alerted. So far this has worked fine, but our setup is fairly
small. I would like to see the ProcessJobFactory spawn multiple processes
automatically since the amount of data we're processing is growing quickly.
Lukas
-----Original Message-----
From: Yan Fang
Sent: Wednesday, September 16, 2015 3:45 PM
To: dev@samza.apache.org
Subject: Re: Runtime Execution Model
-- Hi Lukas,
I want to learn more about your production environment. How do you use
ProcessJobFactory in Docker containers? Do you use one ProcessJobFactory
process for all the tasks, or spawn as many threads as there are tasks? How
do you handle fault tolerance?
-- Hi Yi,
* Any progress on your side with the standalone job? (Chris' patch is
big :)
* Invert the JobCoordinator to the standalone Samza process s.t. the leader
process of the Samza job becomes the JobCoordinator
Currently, we run the JobCoordinator first, and then Yarn talks to
the JobCoordinator. Isn't it enough so far?
* Make the partition assignment a pluggable model to distribute the tasks
to all Samza processes in a job-group in coordination.
I think the reason for this is Kafka's new feature. The API design needs
to be compatible with Kafka.
* Make Samza process multi-threaded while maintaining the per-task
single-threaded programming model for the users
Do we already have this, or do we need to add it? I think this can be done
in the current ProcessJob: we can have the same number of threads as tasks.
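As a rough illustration of the last point (this is not Samza code; the
function name, task names, and message values are all hypothetical), a
process could start one thread per task while each task still sees its own
messages strictly one at a time:

```python
import threading


def run_one_thread_per_task(messages_by_task, process):
    """Start one worker thread per task and return the outputs per task.

    Hypothetical sketch: `messages_by_task` maps a task name to the ordered
    list of messages for that task, and `process(task, message)` stands in
    for the user's per-message processing logic.
    """
    # Pre-create one output list per task so no two threads write to the
    # same list.
    outputs = {task: [] for task in messages_by_task}

    def worker(task):
        # A task's messages are handled in order by this single thread, so
        # the user code for one task never runs concurrently with itself.
        for message in messages_by_task[task]:
            outputs[task].append(process(task, message))

    threads = [
        threading.Thread(target=worker, args=(task,), name=f"task-{task}")
        for task in messages_by_task
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return outputs


if __name__ == "__main__":
    demo = run_one_thread_per_task(
        {"Partition 0": [1, 2, 3], "Partition 1": [10, 20]},
        lambda task, message: message * 2,
    )
    print(demo)
```

The parallelism here is across tasks only; within a task the programming
model stays single-threaded.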
Thanks,
Fang, Yan
yanfang...@gmail.com
On Tue, Sep 15, 2015 at 10:54 AM, Yi Pan <nickpa...@gmail.com> wrote:
Hi, all,
Thanks for pitching in on the improvement plan. We have actually discussed
this for a while now. Taking the complete view, I think the following
issues need to be addressed:
1) Currently, the steps involved to launch a Samza process are too complex
and intertwined with YARN.
2) The Samza partition assignment is embedded within the YARN AppMaster
implementation, which makes it difficult to run the job outside a YARN
environment.
We have actually already started some work to address the above issues:
1) SAMZA-516: support standalone Samza jobs. Chris has started this work
and has a prototype patch available. This allows ZK-based coordination
to start standalone Samza processes w/o YARN.
There are also planned changes to allow de-coupling of Samza job
coordination logic from YARN AppMaster:
1) SAMZA-680 Invert the JobCoordinator and AM logic. This would allow us
to keep the Samza-specific JobCoordinator logic independent from
cluster-management systems.
There is one more thing I am thinking about: we may want to make the
partition assignment logic a pluggable module, so that we can choose a
different coordination mechanism for partition assignment as needed (e.g.
ZK-based, cluster-management based, or Kafka-based coordination).
Ultimately, I think that we should try to refactor the current job
launching model to the following:
1) Make standalone Samza process the standard Samza process model
2) Invert the JobCoordinator to the standalone Samza process s.t. the
leader process of the Samza job becomes the JobCoordinator
3) Make the partition assignment a pluggable model to distribute the tasks
to all Samza processes in a job-group in coordination
4) Make launching of Samza process agnostic of cluster-management systems.
The cluster-management systems will simply provide the functionality of
placing the standard Samza processes to actual available nodes
5) Make Samza process multi-threaded while maintaining the per-task
single-threaded programming model for the users.
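To make points 2) and 3) more concrete, here is a minimal sketch of what a
pluggable assignment could look like. It is only an illustration under
stated assumptions: the names TaskAssignor, RoundRobinAssignor, and
elect_leader are hypothetical and do not exist in Samza, and the
"smallest id wins" rule is a stand-in for real leader election (e.g. via
ZK).

```python
from abc import ABC, abstractmethod


class TaskAssignor(ABC):
    """Hypothetical plug-in point: decides which process runs which tasks."""

    @abstractmethod
    def assign(self, tasks, processes):
        """Return a dict mapping each process id to the list of its tasks."""


class RoundRobinAssignor(TaskAssignor):
    """One possible strategy: deal the tasks out to the processes in turn."""

    def assign(self, tasks, processes):
        members = sorted(processes)
        if not members:
            raise ValueError("at least one process is required")
        assignment = {process: [] for process in members}
        # Sorting makes the result deterministic, so every process that
        # computes the assignment from the same inputs gets the same answer.
        for index, task in enumerate(sorted(tasks)):
            assignment[members[index % len(members)]].append(task)
        return assignment


def elect_leader(processes):
    # Stand-in for real leader election: the smallest process id acts as
    # the JobCoordinator for the job-group.
    return min(processes)


if __name__ == "__main__":
    group = ["process-2", "process-1"]
    tasks = ["task-0", "task-1", "task-2", "task-3", "task-4"]
    print("leader:", elect_leader(group))
    print(RoundRobinAssignor().assign(tasks, group))
```

A different coordination mechanism would then be a different TaskAssignor
implementation, chosen without touching the process-launching code.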
Thoughts?
-Yi
On Tue, Sep 15, 2015 at 9:50 AM, Hannes Stockner <hannes.stock...@gmail.com> wrote:
> +1
>
>
> On Tue, Sep 15, 2015 at 5:43 PM, Bruno Bonacci <bruno.bona...@gmail.com>
> wrote:
>
> > Hi,
> >
> > I support what Lukas is saying. Samza's packaging requirements are not
> > friendly; I use the ThreadJobFactory for the same reason.
> >
> > Bruno
> >
> > On Tue, Sep 15, 2015 at 5:39 PM, Lukas Steiblys <lu...@doubledutch.me>
> > wrote:
> >
> > > Hi Yan,
> > >
> > > We use Samza in a production environment using ProcessJobFactory in
> > > Docker containers because it greatly simplifies our deployment process
> > > and makes much better use of resources.
> > >
> > > Is there any plan to make the ThreadJobFactory or ProcessJobFactory
> > > multithreaded? I will look into doing that myself, but I think it might
> > > be useful to implement this for everyone. I am sure there are plenty of
> > > cases where people do not want to use YARN, but want more parallelism
> > > in their tasks.
> > >
> > > Lukas
> > >
> > > -----Original Message----- From: Yan Fang
> > > Sent: Monday, September 14, 2015 11:08 AM
> > > To: dev@samza.apache.org
> > > Subject: Re: Runtime Execution Model
> > >
> > >
> > > Hi Bruno,
> > >
> > > AFAIK, there is no existing JobFactory that starts as many threads as
> > > there are partitions. But I think nothing stops you from implementing
> > > this: you can get the partition information from the JobCoordinator,
> > > and then start as many threads as there are partitions/tasks.
> > >
> > > Since the two local factories (ThreadJobFactory and ProcessJobFactory)
> > > are mainly for development, there is no additional documentation. But
> > > most of the code here
> > > <https://github.com/apache/samza/tree/master/samza-core/src/main/scala/org/apache/samza/job/local>
> > > is self-explanatory.
> > >
> > > Thanks,
> > >
> > > Fang, Yan
> > > yanfang...@gmail.com
> > >
> > > On Sat, Sep 12, 2015 at 1:47 PM, Bruno Bonacci <bruno.bona...@gmail.com> wrote:
> > >
> > > >> Hi,
> > >> I'm looking for additional documentation on the different RUNTIME
> > >> EXECUTION MODELS of the different `job.factory.class`.
> > >>
> > > >> I'm particularly interested in how each factory (ThreadJobFactory,
> > > >> ProcessJobFactory and YarnJobFactory) creates tasks that consume and
> > > >> process messages out of Kafka, and in the threading model used.
> > > >>
> > > >> I did a few tests with the ThreadJobFactory consuming out of a Kafka
> > > >> topic with 5 partitions and I was expecting that it would use
> > > >> multiple threads to consume/process the different partitions;
> > > >> however, it is using only one thread at runtime.
> > > >>
> > > >> Is there any way to tell Samza to use multiple processing threads
> > > >> (1 per partition)?
> > >>
> > >>
> > >> Thanks
> > >> Bruno
> > >>
> > >>
> > >
> >
>