Hi Rob,

*>> We are currently deciding between kafka streams and Samza. Which do
you think would be more appropriate?*

Roughly, the two are similar - the design of Samza certainly influenced
what went into Kafka Streams. However, here are some key differences:

- Native support for non-Kafka sources and sinks: Samza has native
connectors for various systems like ElasticSearch, AWS Kinesis, Azure
EventHubs, and HDFS in the open source. This saves cost if you don't want
to maintain dual copies and import the data into Kafka.

- Async mode: At LinkedIn, we have observed that jobs are bottlenecked by
remote I/O. For this reason, we built native async processing into Samza.
As far as I can remember, Samza is the only stream processor that supports
this feature.

- Stability at LinkedIn: We run Samza in production at LinkedIn, and it's
battle-tested at scale, powering all of our near-realtime processing
use-cases. Samza supports durable local state and host affinity for state
recovery. We have improved this further by adding incremental
checkpointing.

- Single API and SQL for streaming and batch processing: Samza can run the
same code on both batch and streaming sources. We've also added SQL
support in the open source.

Related threads:

[1] http://mail-archives.apache.org/mod_mbox/samza-dev/201608.mbox/%3CCAFvExu1KghxR1dN7Awwr70k3b4aMmfBVLhKFjFd2smsUAt3rDg%40mail.gmail.com%3E
[2] http://mail-archives.apache.org/mod_mbox/samza-dev/201605.mbox/%3CCACsAj_XZZBohSz7Cf9%3DLO5MDOn2vEzfMrDF6Te%3DwrpeMEab1dQ%40mail.gmail.com%3E

*>> for files over 1mb would you increase the default kafka limit? Break
the document into chunks or pass a reference in the message?*

It depends - all are valid options. At LinkedIn scale, we have observed
that keeping the default Kafka limit has served us well. For this reason,
we've built and open-sourced
https://github.com/linkedin/li-apache-kafka-clients, a compatible Kafka
client that supports chunking and reassembly of large messages. You can
also store documents larger than 1 MB in a store and pass them by
reference.
Samza offers built-in support for parallel/async I/O - so, if you're OK
with querying your document store, this may be simpler.

On Sun, Apr 28, 2019 at 10:33 AM Rob Martin <rob.mart...@gmail.com> wrote:

> Thanks for the reply. We are currently deciding between kafka streams and
> Samza. Which do you think would be more appropriate?
>
> Also, for files over 1mb would you increase the default kafka limit? Break
> the document into chunks or pass a reference in the message?
>
> Thanks again
>
> On Sun, 28 Apr 2019, 16:20 Jagadish Venkatraman, <jagadish1...@gmail.com>
> wrote:
>
> > Hi Rob,
> >
> > Yes, your use-case is a good fit. You can use Samza for fault-tolerant
> > stream processing.
> >
> > We have document (e.g., member profiles, articles/blogs) standardization
> > use-cases at LinkedIn powered by Samza.
> >
> > Please let us know should you have further questions!
> >
> > On Sun, Apr 28, 2019 at 7:09 AM Rob Martin <rob.mart...@gmail.com>
> > wrote:
> >
> > > I'm looking at creating a distributed streaming pipeline for
> > > processing text documents (e.g., cleaning, NER and machine learning).
> > > Documents will generally be under 1mb and processing will be
> > > stateless. Was aiming to feed documents from various sources and
> > > additional data into Kafka to be streamed to the processing pipeline
> > > in Samza. Would this be an appropriate use case for Samza?
> >
> > --
> > Jagadish V,
> > Graduate Student,
> > Department of Computer Science,
> > Stanford University

--
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University