Hi Shekar. Are you running the job in local mode or YARN mode? If YARN mode, the log is in the YARN container log.
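Your process() code itself looks fine. The thing to remember is that under YARN, log.info output from the task goes to the container's log files, not to the console where you submitted the job. With the hello-samza local grid the container logs usually land under the NodeManager's log directory (something like deploy/yarn/logs/userlogs/<application-id>/<container-id>/ -- the exact path depends on your YARN config), and if log aggregation is enabled you can also fetch them with "yarn logs -applicationId <application-id>". Also double-check that a log4j configuration is packaged with your job (hello-samza ships a log4j.xml under src/main/resources, if I remember correctly); without it the statements may never be written anywhere.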
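Also, to make Chris's steps (1) and (2) below concrete: here is a rough sketch of a task that does nothing but log what it receives, plus a minimal config modeled on the hello-samza property files. The class name, package, and the "alerts" topic are made up for illustration, so adjust them to your project.

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class AlertPrinterStreamTask implements StreamTask {
  private static final Logger log = LoggerFactory.getLogger(AlertPrinterStreamTask.class);

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
    // Step 1: just prove the plumbing works by logging every incoming message.
    log.info("Got message: {}", envelope.getMessage());
  }
}

# alert-printer.properties -- a sketch cribbed from hello-samza's configs; trim to taste
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=alert-printer
task.class=samza.examples.alerts.task.AlertPrinterStreamTask
task.inputs=kafka.alerts
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.samza.msg.serde=json
serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory
systems.kafka.consumer.zookeeper.connect=localhost:2181/
systems.kafka.producer.metadata.broker.list=localhost:9092

Once you see the alerts flowing through the container log, you are ready for the join step.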
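For the join itself (Chris's option 4 below), the task ends up handling two input streams: the log-compacted context topic, which it writes into a local key-value store, and the event stream, which it enriches from that store. A rough sketch, assuming the config declares a store named "ip-geo" (stores.ip-geo.factory=org.apache.samza.storage.kv.KeyValueStorageEngineFactory) and marks the ip-geo topic as a bootstrap stream -- the stream names and the "ip" field are invented for illustration:

import java.util.Map;

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class PageViewGeoJoinTask implements StreamTask, InitableTask {
  private static final SystemStream OUTPUT = new SystemStream("kafka", "PageViewEventsWithGeo");

  // Local store, bootstrapped from the log-compacted ip-geo topic.
  private KeyValueStore<String, Map<String, Object>> ipGeoStore;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    ipGeoStore = (KeyValueStore<String, Map<String, Object>>) context.getStore("ip-geo");
  }

  @Override
  @SuppressWarnings("unchecked")
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
    String stream = envelope.getSystemStreamPartition().getSystemStream().getStream();
    Map<String, Object> message = (Map<String, Object>) envelope.getMessage();

    if ("ip-geo".equals(stream)) {
      // Context update: remember the latest geo record for this IP.
      ipGeoStore.put((String) envelope.getKey(), message);
    } else {
      // Page view: enrich with whatever geo data we have, then forward.
      Map<String, Object> geo = ipGeoStore.get((String) message.get("ip"));
      if (geo != null) {
        message.putAll(geo);
      }
      collector.send(new OutgoingMessageEnvelope(OUTPUT, message));
    }
  }
}

And if your event stream is not already partitioned by IP (or hostname, in your case), the upstream repartition job Chris describes is essentially one send per message with the join key as the partition key, e.g. collector.send(new OutgoingMessageEnvelope(REPARTITIONED_STREAM, ip, ip, message)).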
Thanks,

Fang, Yan
[email protected]
+1 (206) 849-4108

On Tue, Sep 2, 2014 at 11:36 AM, Shekar Tippur <[email protected]> wrote:

> Chris,
>
> Got some time to play around a bit more. I tried to edit
> samza-wikipedia/src/main/java/samza/examples/wikipedia/task/WikipediaFeedStreamTask.java
> to add a logger info statement to tap the incoming message. I don't see
> the messages being printed to the log file.
>
> Is this the right place to start?
>
> public class WikipediaFeedStreamTask implements StreamTask {
>
>   private static final SystemStream OUTPUT_STREAM =
>       new SystemStream("kafka", "wikipedia-raw");
>
>   private static final Logger log =
>       LoggerFactory.getLogger(WikipediaFeedStreamTask.class);
>
>   @Override
>   public void process(IncomingMessageEnvelope envelope, MessageCollector
>       collector, TaskCoordinator coordinator) {
>
>     Map<String, Object> outgoingMap =
>         WikipediaFeedEvent.toMap((WikipediaFeedEvent) envelope.getMessage());
>
>     log.info(envelope.getMessage().toString());
>
>     collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, outgoingMap));
>   }
> }
>
> On Mon, Aug 25, 2014 at 9:01 AM, Chris Riccomini <[email protected]> wrote:
>
> > Hey Shekar,
> >
> > Your thought process is on the right track. It's probably best to start
> > with hello-samza, and modify it to get what you want. To start with,
> > you'll want to:
> >
> > 1. Write a simple StreamTask that just does something silly like print
> > the messages that it receives.
> > 2. Write a configuration for the job that consumes from just the stream
> > (alerts from different sources).
> > 3. Run this to make sure you've got it working.
> > 4. Now add your table join. This can be either a change-data capture
> > (CDC) stream, or via a remote DB call.
> >
> > That should get you to a point where you've got your job up and running.
> > From there, you could create your own Maven project, and migrate your
> > code over accordingly.
> >
> > Cheers,
> > Chris
> >
> > On 8/24/14 1:42 AM, "Shekar Tippur" <[email protected]> wrote:
> >
> > > Chris,
> > >
> > > I have gone through the documentation and decided that the option
> > > that is most suitable for me is stream-table.
> > >
> > > I see the following things:
> > >
> > > 1. Point Samza to a table (database)
> > > 2. Point Samza to a stream - alert stream from different sources
> > > 3. Join key like a hostname
> > >
> > > I have Hello Samza working. To extend that to do what my needs are,
> > > I am not sure where to start (more code changes, configuration
> > > changes, or both)?
> > >
> > > I have gone through
> > > http://samza.incubator.apache.org/learn/documentation/latest/api/overview.html
> > >
> > > Is my thought process on the right track? Can you please point me in
> > > the right direction?
> > >
> > > - Shekar
> > >
> > > On Thu, Aug 21, 2014 at 1:08 PM, Shekar Tippur <[email protected]> wrote:
> > >
> > > > Chris,
> > > >
> > > > This is a perfectly good answer. I will start poking more into
> > > > option #4.
> > > >
> > > > - Shekar
> > > >
> > > > On Thu, Aug 21, 2014 at 1:05 PM, Chris Riccomini <[email protected]> wrote:
> > > >
> > > > > Hey Shekar,
> > > > >
> > > > > Your two options are really (3) or (4), then. You can either run
> > > > > some external DB that holds the data set, and query it from a
> > > > > StreamTask, or you can use Samza's state store feature to push
> > > > > data into a stream that you can then store in a partitioned
> > > > > key-value store along with your StreamTasks.
> > > > > There is some documentation here about the state store approach:
> > > > >
> > > > > http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html
> > > > >
> > > > > (4) is going to require more up-front effort from you, since
> > > > > you'll have to understand how Kafka's partitioning model works,
> > > > > and set up some pipeline to push the updates for your state. In
> > > > > the long run, I believe it's the better approach, though. Local
> > > > > lookups on a key-value store should be faster than doing remote
> > > > > RPC calls to a DB for every message.
> > > > >
> > > > > I'm sorry I can't give you a more definitive answer. It's really
> > > > > about trade-offs.
> > > > >
> > > > > Cheers,
> > > > > Chris
> > > > >
> > > > > On 8/21/14 12:22 PM, "Shekar Tippur" <[email protected]> wrote:
> > > > >
> > > > > > Chris,
> > > > > >
> > > > > > A big thanks for a swift response. The data set is huge and the
> > > > > > events come in bursts.
> > > > > > What do you suggest?
> > > > > >
> > > > > > - Shekar
> > > > > >
> > > > > > On Thu, Aug 21, 2014 at 11:05 AM, Chris Riccomini <[email protected]> wrote:
> > > > > >
> > > > > > > Hey Shekar,
> > > > > > >
> > > > > > > This is feasible, and you are on the right thought process.
> > > > > > >
> > > > > > > For the sake of discussion, I'm going to pretend that you
> > > > > > > have a Kafka topic called "PageViewEvent", which has just the
> > > > > > > IP address that was used to view a page. These messages will
> > > > > > > be logged every time a page view happens. I'm also going to
> > > > > > > pretend that you have some state called "IPGeo" (e.g. the
> > > > > > > MaxMind data set). In this example, we'll want to join the
> > > > > > > long/lat geo information from IPGeo to the PageViewEvent, and
> > > > > > > send it to a new topic: PageViewEventsWithGeo.
> > > > > > >
> > > > > > > You have several options on how to implement this example.
> > > > > > >
> > > > > > > 1. If your joining data set (IPGeo) is relatively small and
> > > > > > > changes infrequently, you can just pack it up in your jar or
> > > > > > > .tgz file, and open it in every StreamTask.
> > > > > > > 2. If your data set is small, but changes somewhat
> > > > > > > frequently, you can throw the data set on some HTTP/HDFS/S3
> > > > > > > server somewhere, and have your StreamTask refresh it
> > > > > > > periodically by re-downloading it.
> > > > > > > 3. You can do remote RPC calls for the IPGeo data on every
> > > > > > > page view event by querying some remote service or DB (e.g.
> > > > > > > Cassandra).
> > > > > > > 4. You can use Samza's state feature to send your IPGeo data
> > > > > > > as a series of messages to a log-compacted Kafka topic
> > > > > > > (https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction),
> > > > > > > and configure your Samza job to read this topic as a
> > > > > > > bootstrap stream
> > > > > > > (http://samza.incubator.apache.org/learn/documentation/0.7.0/container/streams.html).
> > > > > > >
> > > > > > > For (4), you'd have to partition the IPGeo state topic
> > > > > > > according to the same key as PageViewEvent.
> > > > > > > If PageViewEvent were partitioned by, say, member ID, but
> > > > > > > you want your IPGeo state topic to be partitioned by IP
> > > > > > > address, then you'd have to have an upstream job that
> > > > > > > re-partitioned PageViewEvent into some new topic by IP
> > > > > > > address. This new topic will have to have the same number of
> > > > > > > partitions as the IPGeo state topic (if IPGeo has 8
> > > > > > > partitions, then the new PageViewEventRepartitioned topic
> > > > > > > needs 8 as well). This will cause your
> > > > > > > PageViewEventRepartitioned topic and your IPGeo state topic
> > > > > > > to be aligned such that the StreamTask that gets page views
> > > > > > > for IP address X will also have the IPGeo information for IP
> > > > > > > address X.
> > > > > > >
> > > > > > > Which strategy you pick is really up to you. :) (4) is the
> > > > > > > most complicated, but also the most flexible, and the most
> > > > > > > operationally sound. (1) is the easiest if it fits your
> > > > > > > needs.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Chris
> > > > > > >
> > > > > > > On 8/21/14 10:15 AM, "Shekar Tippur" <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > I am new to Samza. I have just installed Hello Samza and
> > > > > > > > got it working.
> > > > > > > >
> > > > > > > > Here is the use case for which I am trying to use Samza:
> > > > > > > >
> > > > > > > > 1. Cache the contextual information which contains more
> > > > > > > > information about the hostname or IP address using
> > > > > > > > Samza/YARN/Kafka
> > > > > > > > 2. Collect alert and metric events which contain either a
> > > > > > > > hostname or IP address
> > > > > > > > 3. Append contextual information to the alert and metric,
> > > > > > > > and insert into a Kafka queue from which other subscribers
> > > > > > > > read off of.
> > > > > > > >
> > > > > > > > Can you please shed some light on
> > > > > > > >
> > > > > > > > 1. Is this feasible?
> > > > > > > > 2. Am I on the right thought process?
> > > > > > > > 3. How do I start?
> > > > > > > >
> > > > > > > > I now have 1 & 2 working separately. I need to integrate
> > > > > > > > them.
> > > > > > > >
> > > > > > > > Appreciate any input.
> > > > > > > >
> > > > > > > > - Shekar
