Hey Shekar,

Can you attach your log files? I'm wondering if it's a misconfigured log4j.xml (or a missing slf4j-log4j jar) that's leading to nearly empty log files. I'm also wondering whether the job starts fully. Anything you can attach would be helpful.

Cheers,
Chris
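(For reference, a minimal log4j.xml along the lines Chris describes might look like the sketch below. This is illustrative rather than the project's actual file: the console appender and pattern are assumptions, and the slf4j-log4j12 binding jar still has to be on the classpath for SLF4J to route into log4j.)

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
  <!-- Write everything to the console so messages are easy to find while debugging. -->
  <appender name="console" class="org.apache.log4j.ConsoleAppender">
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d{ISO8601} %-5p %c{1} - %m%n" />
    </layout>
  </appender>
  <root>
    <priority value="info" />
    <appender-ref ref="console" />
  </root>
</log4j:configuration>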
On 9/2/14 11:43 AM, "Shekar Tippur" <[email protected]> wrote:

I am running in local mode.

S

On Sep 2, 2014 11:42 AM, "Yan Fang" <[email protected]> wrote:

Hi Shekar.

Are you running the job in local mode or YARN mode? If YARN mode, the log is in the YARN container log.

Thanks,

Fang, Yan
[email protected]
+1 (206) 849-4108

On Tue, Sep 2, 2014 at 11:36 AM, Shekar Tippur <[email protected]> wrote:

Chris,

Got some time to play around a bit more. I tried to edit
samza-wikipedia/src/main/java/samza/examples/wikipedia/task/WikipediaFeedStreamTask.java
to add a logger info statement to tap the incoming message. I don't see the messages being printed to the log file.

Is this the right place to start?

import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

import samza.examples.wikipedia.system.WikipediaFeed.WikipediaFeedEvent;

public class WikipediaFeedStreamTask implements StreamTask {

  private static final SystemStream OUTPUT_STREAM =
      new SystemStream("kafka", "wikipedia-raw");

  private static final Logger log =
      LoggerFactory.getLogger(WikipediaFeedStreamTask.class);

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
      TaskCoordinator coordinator) {
    Map<String, Object> outgoingMap =
        WikipediaFeedEvent.toMap((WikipediaFeedEvent) envelope.getMessage());

    log.info(envelope.getMessage().toString());

    collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, outgoingMap));
  }
}

On Mon, Aug 25, 2014 at 9:01 AM, Chris Riccomini <[email protected]> wrote:

Hey Shekar,

Your thought process is on the right track. It's probably best to start with hello-samza and modify it to get what you want. To start with, you'll want to:

1. Write a simple StreamTask that just does something silly, like printing the messages it receives.
2. Write a configuration for the job that consumes from just the stream (alerts from different sources).
3. Run this to make sure you've got it working.
4. Now add your table join. This can be either a change-data capture (CDC) stream, or via a remote DB call.

That should get you to a point where you've got your job up and running. From there, you could create your own Maven project and migrate your code over accordingly.

Cheers,
Chris

On 8/24/14 1:42 AM, "Shekar Tippur" <[email protected]> wrote:

Chris,

I have gone through the documentation and decided that the option most suitable for me is stream-table.

I see the following things:

1. Point Samza to a table (database)
2. Point Samza to a stream - alert stream from different sources
3. Join key, like a hostname

I have Hello Samza working. To extend it to my needs, I am not sure where to start (more code changes, configuration changes, or both)?

I have gone through
http://samza.incubator.apache.org/learn/documentation/latest/api/overview.html

Is my thought process on the right track? Can you please point me in the right direction?

- Shekar
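(As a concrete starting point for steps 1-2 in Chris's list above, a minimal 0.7-era job config might look roughly like the sketch below. The job name, task class, and topic names are illustrative, not from this thread, and the hosts assume a local hello-samza-style setup.)

# Run the job in-process ("local mode") rather than on YARN.
job.factory.class=org.apache.samza.job.local.LocalJobFactory
job.name=alert-print

# The StreamTask to run, and the single stream it consumes.
task.class=samza.examples.alerts.AlertPrintStreamTask
task.inputs=kafka.alerts

# Kafka system definition (0.7-era producer/consumer properties).
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.consumer.zookeeper.connect=localhost:2181
systems.kafka.producer.metadata.broker.list=localhost:9092

# Deserialize messages as JSON maps.
serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory
systems.kafka.samza.msg.serde=json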
On Thu, Aug 21, 2014 at 1:08 PM, Shekar Tippur <[email protected]> wrote:

Chris,

This is a perfectly good answer. I will start poking more into option #4.

- Shekar

On Thu, Aug 21, 2014 at 1:05 PM, Chris Riccomini <[email protected]> wrote:

Hey Shekar,

Your two options are really (3) or (4), then. You can either run some external DB that holds the data set and query it from a StreamTask, or you can use Samza's state store feature to push the data into a stream that you can then store in a partitioned key-value store along with your StreamTasks. There is some documentation here about the state store approach:

http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html

(4) is going to require more up-front effort from you, since you'll have to understand how Kafka's partitioning model works and set up some pipeline to push the updates for your state. In the long run, I believe it's the better approach, though. Local lookups on a key-value store should be faster than doing remote RPC calls to a DB for every message.

I'm sorry I can't give you a more definitive answer. It's really about trade-offs.

Cheers,
Chris
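(To make the state-store approach concrete for the alert use case in this thread, a stream-table join task might look roughly like the sketch below, written against the 0.7-era API. The "host-context" store and stream names, the "hostname" join key, and the "enriched-alerts" output topic are illustrative assumptions, not from the thread.)

import java.util.Map;

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class AlertEnricherStreamTask implements StreamTask, InitableTask {

  private static final SystemStream OUTPUT_STREAM =
      new SystemStream("kafka", "enriched-alerts");

  // Local, partitioned key-value store holding per-host context.
  private KeyValueStore<String, Map<String, Object>> contextStore;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    contextStore = (KeyValueStore<String, Map<String, Object>>) context.getStore("host-context");
  }

  @Override
  @SuppressWarnings("unchecked")
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
      TaskCoordinator coordinator) {
    String stream = envelope.getSystemStreamPartition().getStream();
    Map<String, Object> message = (Map<String, Object>) envelope.getMessage();

    if ("host-context".equals(stream)) {
      // Context (bootstrap) stream: cache the latest metadata for this host.
      contextStore.put((String) message.get("hostname"), message);
    } else {
      // Alert/metric stream: attach cached context, if any, and re-emit.
      Map<String, Object> hostContext = contextStore.get((String) message.get("hostname"));
      if (hostContext != null) {
        message.put("context", hostContext);
      }
      collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, message));
    }
  }
}

(The join only works if the alert stream and the context stream are partitioned by the same key, so that the task holding alerts for host X also holds the context for host X; that is exactly the alignment Chris describes below.)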
On 8/21/14 12:22 PM, "Shekar Tippur" <[email protected]> wrote:

Chris,

A big thanks for a swift response. The data set is huge and the frequency is bursty. What do you suggest?

- Shekar

On Thu, Aug 21, 2014 at 11:05 AM, Chris Riccomini <[email protected]> wrote:

Hey Shekar,

This is feasible, and you are on the right thought process.

For the sake of discussion, I'm going to pretend that you have a Kafka topic called "PageViewEvent", which has just the IP address that was used to view a page. These messages will be logged every time a page view happens. I'm also going to pretend that you have some state called "IPGeo" (e.g. the MaxMind data set). In this example, we'll want to join the long/lat geo information from IPGeo to the PageViewEvent, and send it to a new topic: PageViewEventsWithGeo.

You have several options on how to implement this example.

1. If your joining data set (IPGeo) is relatively small and changes infrequently, you can just pack it up in your jar or .tgz file, and open it in every StreamTask.
2. If your data set is small, but changes somewhat frequently, you can throw the data set on some HTTP/HDFS/S3 server somewhere, and have your StreamTask refresh it periodically by re-downloading it.
3. You can do remote RPC calls for the IPGeo data on every page view event by querying some remote service or DB (e.g. Cassandra).
4. You can use Samza's state feature to send your IPGeo data as a series of messages to a log-compacted Kafka topic
(https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction), and configure your Samza job to read this topic as a bootstrap stream
(http://samza.incubator.apache.org/learn/documentation/0.7.0/container/streams.html).

For (4), you'd have to partition the IPGeo state topic according to the same key as PageViewEvent. If PageViewEvent were partitioned by, say, member ID, but you want your IPGeo state topic to be partitioned by IP address, then you'd have to have an upstream job that re-partitioned PageViewEvent into some new topic by IP address. This new topic will have to have the same number of partitions as the IPGeo state topic (if IPGeo has 8 partitions, then the new PageViewEventRepartitioned topic needs 8 as well). This will cause your PageViewEventRepartitioned topic and your IPGeo state topic to be aligned, such that the StreamTask that gets page views for IP address X will also have the IPGeo information for IP address X.

Which strategy you pick is really up to you. :) (4) is the most complicated, but also the most flexible and most operationally sound. (1) is the easiest if it fits your needs.

Cheers,
Chris
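(To make option (4) concrete for the alert use case in this thread, the store and bootstrap stream might be wired up roughly as below, using 0.7-era property names. The "host-context" store/topic names are illustrative assumptions matching the task sketch earlier in the thread.)

# Consume both the alert stream and the context stream.
task.inputs=kafka.alerts,kafka.host-context

# Read the context topic from the oldest offset, and fully catch up
# on it before processing any other input.
systems.kafka.streams.host-context.samza.bootstrap=true
systems.kafka.streams.host-context.samza.reset.offset=true
systems.kafka.streams.host-context.samza.offset.default=oldest

# Local key-value store backing the join; the changelog makes it recoverable.
stores.host-context.factory=org.apache.samza.storage.kv.KeyValueStorageEngineFactory
stores.host-context.key.serde=string
stores.host-context.msg.serde=json
stores.host-context.changelog=kafka.host-context-changelog

serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory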
>> > > >>> >> > >> > > >>> >> >Appreciate any input. >> > > >>> >> > >> > > >>> >> >- Shekar >> > > >>> >> >> > > >>> >> >> > > >>> >> > > >>> >> > > >> >> > > >> > > >> > >>
