Chris ..

$ cat ./deploy/samza/undefined-samza-container-name.log
2014-09-02 11:17:58 JobRunner [INFO] job factory: org.apache.samza.job.yarn.YarnJobFactory
2014-09-02 11:17:59 ClientHelper [INFO] trying to connect to RM 127.0.0.1:8032
2014-09-02 11:17:59 NativeCodeLoader [WARN] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-09-02 11:17:59 RMProxy [INFO] Connecting to ResourceManager at /127.0.0.1:8032

and Log4j ..

<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
  <appender name="RollingAppender" class="org.apache.log4j.DailyRollingFileAppender">
    <param name="File" value="${samza.log.dir}/${samza.container.name}.log" />
    <param name="DatePattern" value="'.'yyyy-MM-dd" />
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d{yyyy-MM-dd HH:mm:ss} %c{1} [%p] %m%n" />
    </layout>
  </appender>
  <root>
    <priority value="info" />
    <appender-ref ref="RollingAppender"/>
  </root>
</log4j:configuration>

On Tue, Sep 2, 2014 at 12:02 PM, Chris Riccomini <[email protected]> wrote:

> Hey Shekar,
>
> Can you attach your log files? I'm wondering if it's a mis-configured
> log4j.xml (or a missing slf4j-log4j jar), which is leading to nearly empty
> log files. Also, I'm wondering if the job starts fully. Anything you can
> attach would be helpful.
>
> Cheers,
> Chris
>
> On 9/2/14 11:43 AM, "Shekar Tippur" <[email protected]> wrote:
>
> >I am running in local mode.
> >
> >S
> >On Sep 2, 2014 11:42 AM, "Yan Fang" <[email protected]> wrote:
> >
> >> Hi Shekar,
> >>
> >> Are you running the job in local mode or yarn mode? If yarn mode, the
> >> log is in the yarn's container log.
> >>
> >> Thanks,
> >>
> >> Fang, Yan
> >> [email protected]
> >> +1 (206) 849-4108
> >>
> >>
> >> On Tue, Sep 2, 2014 at 11:36 AM, Shekar Tippur <[email protected]> wrote:
> >>
> >> > Chris,
> >> >
> >> > Got some time to play around a bit more.
> >> > I tried to edit
> >> > samza-wikipedia/src/main/java/samza/examples/wikipedia/task/WikipediaFeedStreamTask.java
> >> > to add a logger info statement to tap the incoming message. I don't
> >> > see the messages being printed to the log file.
> >> >
> >> > Is this the right place to start?
> >> >
> >> > public class WikipediaFeedStreamTask implements StreamTask {
> >> >
> >> >   private static final SystemStream OUTPUT_STREAM =
> >> >       new SystemStream("kafka", "wikipedia-raw");
> >> >
> >> >   private static final Logger log =
> >> >       LoggerFactory.getLogger(WikipediaFeedStreamTask.class);
> >> >
> >> >   @Override
> >> >   public void process(IncomingMessageEnvelope envelope,
> >> >       MessageCollector collector, TaskCoordinator coordinator) {
> >> >     Map<String, Object> outgoingMap =
> >> >         WikipediaFeedEvent.toMap((WikipediaFeedEvent) envelope.getMessage());
> >> >     log.info(envelope.getMessage().toString());
> >> >     collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, outgoingMap));
> >> >   }
> >> > }
> >> >
> >> >
> >> > On Mon, Aug 25, 2014 at 9:01 AM, Chris Riccomini <[email protected]> wrote:
> >> >
> >> > > Hey Shekar,
> >> > >
> >> > > Your thought process is on the right track. It's probably best to
> >> > > start with hello-samza, and modify it to get what you want. To start
> >> > > with, you'll want to:
> >> > >
> >> > > 1. Write a simple StreamTask that just does something silly like
> >> > > printing the messages that it receives.
> >> > > 2. Write a configuration for the job that consumes from just the
> >> > > stream (alerts from different sources).
> >> > > 3. Run this to make sure you've got it working.
> >> > > 4. Now add your table join. This can be either a change-data capture
> >> > > (CDC) stream, or via a remote DB call.
> >> > >
> >> > > That should get you to a point where you've got your job up and
> >> > > running.
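[A config sketch for step 2 above: a minimal properties file for a job that consumes a single stream. The job name and input topic are placeholders, and the property set follows the hello-samza examples of this era; check them against your Samza version before use.]

```properties
# Hypothetical minimal Samza job config (names are placeholders).
job.factory.class=org.apache.samza.job.local.LocalJobFactory
job.name=alert-printer
task.class=samza.examples.wikipedia.task.WikipediaFeedStreamTask
task.inputs=kafka.wikipedia-raw

# Kafka system definition.
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.consumer.zookeeper.connect=localhost:2181
systems.kafka.producer.metadata.broker.list=localhost:9092
serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory
systems.kafka.samza.msg.serde=json
```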
> >> > > From there, you could create your own Maven project, and migrate
> >> > > your code over accordingly.
> >> > >
> >> > > Cheers,
> >> > > Chris
> >> > >
> >> > > On 8/24/14 1:42 AM, "Shekar Tippur" <[email protected]> wrote:
> >> > >
> >> > > >Chris,
> >> > > >
> >> > > >I have gone through the documentation and decided that the option
> >> > > >that is most suitable for me is stream-table.
> >> > > >
> >> > > >I see the following things:
> >> > > >
> >> > > >1. Point Samza to a table (database)
> >> > > >2. Point Samza to a stream - alert stream from different sources
> >> > > >3. Join on a key like a hostname
> >> > > >
> >> > > >I have Hello Samza working. To extend that to do what my needs are,
> >> > > >I am not sure where to start (more code changes OR configuration
> >> > > >changes OR both)?
> >> > > >
> >> > > >I have gone through
> >> > > >http://samza.incubator.apache.org/learn/documentation/latest/api/overview.html
> >> > > >
> >> > > >Is my thought process on the right track? Can you please point me in
> >> > > >the right direction?
> >> > > >
> >> > > >- Shekar
> >> > > >
> >> > > >
> >> > > >On Thu, Aug 21, 2014 at 1:08 PM, Shekar Tippur <[email protected]> wrote:
> >> > > >
> >> > > >> Chris,
> >> > > >>
> >> > > >> This is a perfectly good answer. I will start poking more into
> >> > > >> option #4.
> >> > > >>
> >> > > >> - Shekar
> >> > > >>
> >> > > >>
> >> > > >> On Thu, Aug 21, 2014 at 1:05 PM, Chris Riccomini <[email protected]> wrote:
> >> > > >>
> >> > > >>> Hey Shekar,
> >> > > >>>
> >> > > >>> Your two options are really (3) or (4), then.
> >> > > >>> You can either run some external DB that holds the data set and
> >> > > >>> query it from a StreamTask, or you can use Samza's state store
> >> > > >>> feature to push data into a stream that you can then store in a
> >> > > >>> partitioned key-value store along with your StreamTasks. There is
> >> > > >>> some documentation here about the state store approach:
> >> > > >>>
> >> > > >>> http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html
> >> > > >>>
> >> > > >>> (4) is going to require more up-front effort from you, since
> >> > > >>> you'll have to understand how Kafka's partitioning model works,
> >> > > >>> and set up some pipeline to push the updates for your state. In
> >> > > >>> the long run, I believe it's the better approach, though. Local
> >> > > >>> lookups on a key-value store should be faster than doing remote
> >> > > >>> RPC calls to a DB for every message.
> >> > > >>>
> >> > > >>> I'm sorry I can't give you a more definitive answer. It's really
> >> > > >>> about trade-offs.
> >> > > >>>
> >> > > >>> Cheers,
> >> > > >>> Chris
> >> > > >>>
> >> > > >>> On 8/21/14 12:22 PM, "Shekar Tippur" <[email protected]> wrote:
> >> > > >>>
> >> > > >>> >Chris,
> >> > > >>> >
> >> > > >>> >A big thanks for a swift response. The data set is huge and the
> >> > > >>> >frequency is in bursts.
> >> > > >>> >What do you suggest?
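[The local-lookup-versus-RPC point above can be sketched in plain Java. This is only an illustration of the join step: the HashMap stands in for the partitioned key-value store a real StreamTask would obtain from its task context, and the field names ("ip", "latLong") are hypothetical.]

```java
import java.util.HashMap;
import java.util.Map;

public class GeoJoinSketch {
    // Stands in for Samza's partitioned key-value store (the IPGeo state).
    private final Map<String, String> ipGeoStore = new HashMap<>();

    public void put(String ip, String latLong) {
        ipGeoStore.put(ip, latLong);
    }

    // Enrich a page-view event with geo data, as a StreamTask's process()
    // would before sending the result downstream.
    public Map<String, Object> enrich(Map<String, Object> pageView) {
        Map<String, Object> out = new HashMap<>(pageView);
        String ip = (String) pageView.get("ip");
        // Local lookup instead of a remote RPC call per message.
        out.put("latLong", ipGeoStore.getOrDefault(ip, "unknown"));
        return out;
    }

    public static void main(String[] args) {
        GeoJoinSketch task = new GeoJoinSketch();
        task.put("10.0.0.1", "37.4,-122.1");
        Map<String, Object> event = new HashMap<>();
        event.put("ip", "10.0.0.1");
        System.out.println(task.enrich(event).get("latLong"));
    }
}
```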
> >> > > >>> >
> >> > > >>> >- Shekar
> >> > > >>> >
> >> > > >>> >
> >> > > >>> >On Thu, Aug 21, 2014 at 11:05 AM, Chris Riccomini <[email protected]> wrote:
> >> > > >>> >
> >> > > >>> >> Hey Shekar,
> >> > > >>> >>
> >> > > >>> >> This is feasible, and you are on the right thought process.
> >> > > >>> >>
> >> > > >>> >> For the sake of discussion, I'm going to pretend that you have
> >> > > >>> >> a Kafka topic called "PageViewEvent", which has just the IP
> >> > > >>> >> address that was used to view a page. These messages will be
> >> > > >>> >> logged every time a page view happens. I'm also going to
> >> > > >>> >> pretend that you have some state called "IPGeo" (e.g. the
> >> > > >>> >> MaxMind data set). In this example, we'll want to join the
> >> > > >>> >> long/lat geo information from IPGeo to the PageViewEvent, and
> >> > > >>> >> send it to a new topic: PageViewEventsWithGeo.
> >> > > >>> >>
> >> > > >>> >> You have several options on how to implement this example.
> >> > > >>> >>
> >> > > >>> >> 1. If your joining data set (IPGeo) is relatively small and
> >> > > >>> >> changes infrequently, you can just pack it up in your jar or
> >> > > >>> >> .tgz file, and open it in every StreamTask.
> >> > > >>> >> 2. If your data set is small, but changes somewhat frequently,
> >> > > >>> >> you can throw the data set on some HTTP/HDFS/S3 server
> >> > > >>> >> somewhere, and have your StreamTask refresh it periodically by
> >> > > >>> >> re-downloading it.
> >> > > >>> >> 3. You can do remote RPC calls for the IPGeo data on every
> >> > > >>> >> page view event by querying some remote service or DB (e.g.
> >> > > >>> >> Cassandra).
> >> > > >>> >> 4.
> >> > > >>> >> You can use Samza's state feature to send your IPGeo data as
> >> > > >>> >> a series of messages to a log-compacted Kafka topic
> >> > > >>> >> (https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction),
> >> > > >>> >> and configure your Samza job to read this topic as a bootstrap
> >> > > >>> >> stream
> >> > > >>> >> (http://samza.incubator.apache.org/learn/documentation/0.7.0/container/streams.html).
> >> > > >>> >>
> >> > > >>> >> For (4), you'd have to partition the IPGeo state topic
> >> > > >>> >> according to the same key as PageViewEvent. If PageViewEvent
> >> > > >>> >> were partitioned by, say, member ID, but you want your IPGeo
> >> > > >>> >> state topic to be partitioned by IP address, then you'd have
> >> > > >>> >> to have an upstream job that re-partitioned PageViewEvent into
> >> > > >>> >> some new topic by IP address. This new topic will have to have
> >> > > >>> >> the same number of partitions as the IPGeo state topic (if
> >> > > >>> >> IPGeo has 8 partitions, then the new PageViewEventRepartitioned
> >> > > >>> >> topic needs 8 as well). This will cause your
> >> > > >>> >> PageViewEventRepartitioned topic and your IPGeo state topic to
> >> > > >>> >> be aligned such that the StreamTask that gets page views for
> >> > > >>> >> IP address X will also have the IPGeo information for IP
> >> > > >>> >> address X.
> >> > > >>> >>
> >> > > >>> >> Which strategy you pick is really up to you. :) (4) is the
> >> > > >>> >> most complicated, but also the most flexible, and most
> >> > > >>> >> operationally sound. (1) is the easiest if it fits your needs.
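[A config sketch of what option (4) above might look like. The property names follow the 0.7-era config style of the streams documentation linked above and should be verified against it; the topic names are from Chris's example.]

```properties
# Consume the repartitioned page views alongside the IPGeo state topic.
task.inputs=kafka.PageViewEventRepartitioned,kafka.IPGeo

# Read the IPGeo topic fully, from the beginning, before processing
# page views (bootstrap semantics).
systems.kafka.streams.IPGeo.samza.bootstrap=true
systems.kafka.streams.IPGeo.samza.reset.offset=true
systems.kafka.streams.IPGeo.samza.offset.default=oldest
```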
> >> > > >>> >>
> >> > > >>> >> Cheers,
> >> > > >>> >> Chris
> >> > > >>> >>
> >> > > >>> >> On 8/21/14 10:15 AM, "Shekar Tippur" <[email protected]> wrote:
> >> > > >>> >>
> >> > > >>> >> >Hello,
> >> > > >>> >> >
> >> > > >>> >> >I am new to Samza. I have just installed Hello Samza and got
> >> > > >>> >> >it working.
> >> > > >>> >> >
> >> > > >>> >> >Here is the use case for which I am trying to use Samza:
> >> > > >>> >> >
> >> > > >>> >> >1. Cache the contextual information, which contains more
> >> > > >>> >> >information about the hostname or IP address, using
> >> > > >>> >> >Samza/Yarn/Kafka
> >> > > >>> >> >2. Collect alert and metric events which contain either a
> >> > > >>> >> >hostname or an IP address
> >> > > >>> >> >3. Append contextual information to the alert and metric, and
> >> > > >>> >> >insert it into a Kafka queue from which other subscribers
> >> > > >>> >> >read off of.
> >> > > >>> >> >
> >> > > >>> >> >Can you please shed some light on:
> >> > > >>> >> >
> >> > > >>> >> >1. Is this feasible?
> >> > > >>> >> >2. Am I on the right thought process?
> >> > > >>> >> >3. How do I start?
> >> > > >>> >> >
> >> > > >>> >> >I now have 1 & 2 of them working disparately. I need to
> >> > > >>> >> >integrate them.
> >> > > >>> >> >
> >> > > >>> >> >Appreciate any input.
> >> > > >>> >> >
> >> > > >>> >> >- Shekar
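[Option (2) from Chris's list earlier in the thread, periodically re-downloading a small data set, can be sketched with plain JDK scheduling. The URL and the "ip,latLong" CSV format here are hypothetical; the point is swapping in a fresh snapshot atomically so lookups never see a half-built map.]

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class RefreshingGeoData {
    // Latest snapshot; swapped atomically so lookups never see a partial map.
    private final AtomicReference<Map<String, String>> snapshot =
        new AtomicReference<>(new HashMap<>());

    // Parse lines of "ip,latLong" into a lookup map.
    static Map<String, String> parse(BufferedReader reader) throws Exception {
        Map<String, String> map = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            int comma = line.indexOf(',');
            if (comma > 0) {
                map.put(line.substring(0, comma), line.substring(comma + 1));
            }
        }
        return map;
    }

    // Re-download every 10 minutes from a (hypothetical) HTTP location.
    void start(String url) {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream()))) {
                snapshot.set(parse(r));
            } catch (Exception e) {
                // Keep serving the previous snapshot on failure.
            }
        }, 0, 10, TimeUnit.MINUTES);
    }

    String lookup(String ip) {
        return snapshot.get().getOrDefault(ip, "unknown");
    }
}
```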
