Running Apache Samza on Kubernetes cluster with Zookeeper

2019-04-30 Thread Stefano Rebora
Hello,

I’m trying to deploy Apache Samza on a Kubernetes cluster using Zookeeper for 
coordination.
Is there a suggested or working example to follow?

Thank you in advance,

Stefano



Re: Running Apache Samza on Kubernetes cluster with Zookeeper

2019-04-30 Thread Jagadish Venkatraman
+Weiqing Yang, who gave a community talk on this at KubeCon this year.

Hi Stefano,

Running Samza-standalone + Zk on Kubernetes should be no different than
running any other application on Kubernetes.

At a high level, you would:

   1. Package your application as a container image, just as you would for
   any other application.
   2. Deploy it using kubectl.

From this post sharing how to run Samza + Zk on K8s (it predates Samza 1.1,
so please skip the section on low-level jobs):

*"By comparison, using the high-level StreamApplication API is considerably
simpler. There's actually very little to it: just package the application
in a way that can be executed and deploy it using StatefulSet directly,
instead of via the KubernetesJob.*

*Running gradle dockerDistTar will produce a
folder example/build/docker which contains a Dockerfile and everything
necessary to run the example application in Kubernetes. Running gradle
dockerBuildImage will install said image locally, at which point you can
run kubectl create -f example/k8s/app.yaml*

*If you're planning to use the KV store with a volume mount, keep in mind
that Samza hard-codes the location to {user.dir}/state. The combination of
the generated Docker image and app.yaml mount point works because the image
runs the application from / and the mount point is /state."*

A related (slightly outdated) email thread:
http://mail-archives.apache.org/mod_mbox/samza-dev/201802.mbox/%3ccamd3yjgpannxxnwyzkz-eqnvj3_ukpf8aak7qubt9ck9zjh...@mail.gmail.com%3E
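
On the state-directory note in the quoted post: here is a minimal sketch of the path arithmetic, assuming (as the post says) that the store location is the working directory joined with `state`. `state_dir` is a hypothetical helper written for illustration, not a Samza API, and Python is used only to show how the path resolves.

```python
from pathlib import PurePosixPath


def state_dir(user_dir: str) -> str:
    """Return the directory where state would land for a given working dir.

    Assumption taken from the quoted post: the location is {user.dir}/state.
    """
    return str(PurePosixPath(user_dir) / "state")


# The generated image runs the application from "/", so state resolves to
# "/state", which matches the mount point in app.yaml.
print(state_dir("/"))         # /state
# Started from another directory, state would miss a volume mounted at /state.
print(state_dir("/opt/app"))  # /opt/app/state
```

If you change the working directory of the container, the volume mount point would need to change with it.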




Re: Samza for text processing

2019-04-30 Thread Jagadish Venkatraman
Hi Rob,

*>> We are currently deciding between kafka streams and Samza. Which do you*
*think would be more appropriate?*

Roughly, the two are similar. The design of Samza certainly influenced what
went into Kafka Streams. However, here are some key differences:

- Native support for non-Kafka sources and sinks: Samza has open-source
connectors for various systems such as Elasticsearch, AWS Kinesis, Azure
EventHubs, and HDFS. This saves cost if you don't want to import the data
into Kafka and maintain a second copy.

- Async mode: At LinkedIn, we have observed that jobs are often bottlenecked
on remote I/O. For this reason, we built native async processing into Samza.
As far as I can remember, Samza is the only stream processor that supports
this feature.

- Stability at LinkedIn: We run Samza in production at LinkedIn, where it is
battle-tested at scale and powers all of our near-realtime processing
use-cases. Samza supports durable local state and host-affinity for state
recovery. We have improved on this by adding incremental checkpointing.

- Single API and SQL for streaming and batch processing: Samza can run the
same code on both batch and streaming sources. We've also added SQL support
in open source.
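
The async-mode point can be illustrated outside Samza. The sketch below uses Python's asyncio rather than Samza's own API: when each message needs a remote call, keeping the calls in flight concurrently lets the waits overlap instead of adding up. `fake_remote_lookup` and its 50 ms delay are hypothetical stand-ins for a real remote service.

```python
import asyncio
import time


async def fake_remote_lookup(message: str) -> str:
    # Hypothetical stand-in for a remote call (database, REST service, ...).
    await asyncio.sleep(0.05)
    return message.upper()


async def process_sequentially(messages):
    # One call at a time: total latency is roughly the sum of the waits.
    return [await fake_remote_lookup(m) for m in messages]


async def process_concurrently(messages):
    # All calls in flight at once: the waits overlap.
    # gather() returns results in the order of its arguments.
    return await asyncio.gather(*(fake_remote_lookup(m) for m in messages))


def timed(coro_fn, messages):
    # Run one of the coroutines above and measure wall-clock time.
    start = time.perf_counter()
    result = asyncio.run(coro_fn(messages))
    return list(result), time.perf_counter() - start


messages = [f"msg-{i}" for i in range(10)]
seq_result, seq_secs = timed(process_sequentially, messages)
con_result, con_secs = timed(process_concurrently, messages)
print(f"sequential: {seq_secs:.2f}s, concurrent: {con_secs:.2f}s")
```

Both variants produce the same results in the same order; only the elapsed time differs.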

Related threads:
[1] http://mail-archives.apache.org/mod_mbox/samza-dev/201608.mbox/%3CCAFvExu1KghxR1dN7Awwr70k3b4aMmfBVLhKFjFd2smsUAt3rDg%40mail.gmail.com%3E
[2] http://mail-archives.apache.org/mod_mbox/samza-dev/201605.mbox/%3CCACsAj_XZZBohSz7Cf9%3DLO5MDOn2vEzfMrDF6Te%3DwrpeMEab1dQ%40mail.gmail.com%3E

*>> for files over 1mb would you increase the default kafka limit? Break*
*the document into chunks or pass a reference in the message?*

It depends; all are valid options.

At LinkedIn scale, we have observed that keeping the default Kafka limit has
served us well. For this reason, we've built and open-sourced a compatible
Kafka client that supports chunking + assembly of large messages.
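
The chunking + assembly idea can be sketched as follows. This is plain Python for illustration only: it does not use or mirror the API of the client mentioned above, and the `Chunk` fields, the function names, and the small `max_chunk_bytes` value are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional
import uuid


@dataclass(frozen=True)
class Chunk:
    message_id: str  # groups the chunks of one original message
    index: int       # position of this chunk within the message
    total: int       # number of chunks the message was split into
    payload: bytes


def split_message(data: bytes, max_chunk_bytes: int) -> List[Chunk]:
    """Split a payload into chunks no larger than max_chunk_bytes."""
    if max_chunk_bytes <= 0:
        raise ValueError("max_chunk_bytes must be positive")
    message_id = uuid.uuid4().hex
    # An empty payload still yields one (empty) chunk.
    pieces = [data[i:i + max_chunk_bytes]
              for i in range(0, len(data), max_chunk_bytes)] or [b""]
    return [Chunk(message_id, i, len(pieces), p) for i, p in enumerate(pieces)]


class Assembler:
    """Buffers chunks per message and returns the payload once complete."""

    def __init__(self) -> None:
        self._pending: Dict[str, Dict[int, bytes]] = {}

    def add(self, chunk: Chunk) -> Optional[bytes]:
        parts = self._pending.setdefault(chunk.message_id, {})
        parts[chunk.index] = chunk.payload
        if len(parts) < chunk.total:
            return None  # still waiting for more chunks
        del self._pending[chunk.message_id]
        return b"".join(parts[i] for i in range(chunk.total))


document = b"x" * 2500
chunks = split_message(document, max_chunk_bytes=1000)
assembler = Assembler()
# Feed chunks in reverse to show that assembly does not rely on arrival order.
results = [assembler.add(c) for c in reversed(chunks)]
print(len(chunks), results[-1] == document)  # 3 True
```

A real client would also need to bound the buffer and expire incomplete messages; the sketch leaves that out.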

You can also put documents > 1 MB into a store and pass them by reference.
Samza offers built-in support for parallel/async I/O, so if you're OK with
querying your document store, this may be simpler.
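
A minimal sketch of the pass-by-reference option, assuming an in-memory dict as a stand-in for the document store. `DocumentStore`, `make_reference_message`, and `resolve_message` are hypothetical names. The message carries only a key and a size; the consumer fetches the body from the store.

```python
import hashlib
import json
from typing import Dict


class DocumentStore:
    """In-memory stand-in for an external document store (hypothetical)."""

    def __init__(self) -> None:
        self._docs: Dict[str, bytes] = {}

    def put(self, body: bytes) -> str:
        # Content-addressed key, so storing the same document twice is idempotent.
        key = hashlib.sha256(body).hexdigest()
        self._docs[key] = body
        return key

    def get(self, key: str) -> bytes:
        return self._docs[key]


def make_reference_message(store: DocumentStore, body: bytes) -> bytes:
    """Store the document and return a small message that points to it."""
    key = store.put(body)
    return json.dumps({"doc_key": key, "size": len(body)}).encode("utf-8")


def resolve_message(store: DocumentStore, message: bytes) -> bytes:
    """Look up the document that a reference message points to."""
    ref = json.loads(message.decode("utf-8"))
    return store.get(ref["doc_key"])


store = DocumentStore()
large_document = b"lorem ipsum " * 200_000  # about 2.4 MB
message = make_reference_message(store, large_document)
print(len(message), "bytes in the message instead of", len(large_document))
```

The trade-off is an extra lookup per message on the consumer side, which is where async I/O helps.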



On Sun, Apr 28, 2019 at 10:33 AM Rob Martin wrote:

> Thanks for the reply. We are currently deciding between kafka streams and
> Samza. Which do you think would be more appropriate?
>
> Also for files over 1mb would you increase the default kafka limit? Break
> the document into chunks or pass a reference in the message?
>
> Thanks again
>
>
>
> On Sun, 28 Apr 2019, 16:20 Jagadish Venkatraman wrote:
>
> > Hi Rob,
> >
> > Yes, your use-case is a good fit. You can use Samza for fault-tolerant
> > stream processing.
> >
> > We have document (eg: member profiles, articles/blogs) standardization
> > use-cases at LinkedIn powered by Samza.
> >
> > Please let us know should you have further questions!
> >
> > On Sun, Apr 28, 2019 at 7:09 AM Rob Martin wrote:
> >
> > > I'm looking at creating a distributed streaming pipeline for processing
> > > text documents (e.g. cleaning, NER and machine learning). Documents will
> > > generally be under 1 MB and processing will be stateless. I was aiming to
> > > feed documents from various sources and additional data into Kafka to be
> > > streamed to the processing pipeline in Samza. Would this be an
> > > appropriate use case for Samza?
> > >
> >
> >
> > --
> > Jagadish V,
> > Graduate Student,
> > Department of Computer Science,
> > Stanford University
> >
>


-- 
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University