Hi Rob,

*>> We are currently deciding between kafka streams and Samza. Which do
you think would be more appropriate?*

Roughly, the two are similar - the design of Samza certainly influenced
what went into Kafka Streams. However, here are some key differences:

- Native support for non-Kafka sources and sinks: Samza has native
connectors for various systems like ElasticSearch, AWS Kinesis, Azure
EventHubs, and HDFS in the open source. This saves cost if you don't want
to maintain dual copies and import the data into Kafka.

- Async mode: At LinkedIn, we have observed that jobs are bottlenecked by
remote I/O. For this reason, we built native async processing into Samza.
As far as I can remember, Samza is the only stream processor that supports
this feature.

- Stability at LinkedIn: We run Samza in production at LinkedIn, and it's
battle-tested at scale, powering all of our near-realtime processing
use-cases. Samza supports durable local state and host affinity for state
recovery. We have improved this further by adding incremental
checkpointing.

- Single API and SQL for streaming and batch processing: Samza can run the
same code on both batch and streaming sources. We've also added SQL
support in the open source.

Related threads:

[1] http://mail-archives.apache.org/mod_mbox/samza-dev/201608.mbox/%3CCAFvExu1KghxR1dN7Awwr70k3b4aMmfBVLhKFjFd2smsUAt3rDg%40mail.gmail.com%3E
[2] http://mail-archives.apache.org/mod_mbox/samza-dev/201605.mbox/%3CCACsAj_XZZBohSz7Cf9%3DLO5MDOn2vEzfMrDF6Te%3DwrpeMEab1dQ%40mail.gmail.com%3E

*>> for files over 1mb would you increase the default kafka limit? Break
the document into chunks or pass a reference in the message?*

It depends - all are valid options. At LinkedIn scale, we have observed
that keeping the default Kafka limit has served us well. For this reason,
we've built and open-sourced
https://github.com/linkedin/li-apache-kafka-clients, a compatible Kafka
client that supports chunking and reassembly of large messages. You can
also store documents larger than 1 MB in a store and pass them by
reference.
Samza offers built-in support for parallel/async I/O - so, if you're OK
with querying your document store, this may be simpler.

On Sun, Apr 28, 2019 at 10:33 AM Rob Martin <rob.mart...@gmail.com> wrote:

> Thanks for the reply. We are currently deciding between kafka streams and
> Samza. Which do you think would be more appropriate?
>
> Also, for files over 1mb would you increase the default kafka limit? Break
> the document into chunks or pass a reference in the message?
>
> Thanks again
>
> On Sun, 28 Apr 2019, 16:20 Jagadish Venkatraman, <jagadish1...@gmail.com>
> wrote:
>
> > Hi Rob,
> >
> > Yes, your use-case is a good fit. You can use Samza for fault-tolerant
> > stream processing.
> >
> > We have document (e.g., member profiles, articles/blogs) standardization
> > use-cases at LinkedIn powered by Samza.
> >
> > Please let us know should you have further questions!
> >
> > On Sun, Apr 28, 2019 at 7:09 AM Rob Martin <rob.mart...@gmail.com>
> > wrote:
> >
> > > I'm looking at creating a distributed streaming pipeline for
> > > processing text documents (e.g., cleaning, NER and machine learning).
> > > Documents will generally be under 1mb and processing will be
> > > stateless. Was aiming to feed documents from various sources and
> > > additional data into Kafka to be streamed to the processing pipeline
> > > in Samza. Would this be an appropriate use case for Samza?
> >
> > --
> > Jagadish V,
> > Graduate Student,
> > Department of Computer Science,
> > Stanford University

--
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University