Chris ..

$ cat ./deploy/samza/undefined-samza-container-name.log
2014-09-02 11:17:58 JobRunner [INFO] job factory: org.apache.samza.job.yarn.YarnJobFactory
2014-09-02 11:17:59 ClientHelper [INFO] trying to connect to RM 127.0.0.1:8032
2014-09-02 11:17:59 NativeCodeLoader [WARN] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-09-02 11:17:59 RMProxy [INFO] Connecting to ResourceManager at /127.0.0.1:8032

and Log4j ..

<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
  <appender name="RollingAppender" class="org.apache.log4j.DailyRollingFileAppender">
    <param name="File" value="${samza.log.dir}/${samza.container.name}.log" />
    <param name="DatePattern" value="'.'yyyy-MM-dd" />
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d{yyyy-MM-dd HH:mm:ss} %c{1} [%p] %m%n" />
    </layout>
  </appender>
  <root>
    <priority value="info" />
    <appender-ref ref="RollingAppender"/>
  </root>
</log4j:configuration>

On Tue, Sep 2, 2014 at 12:02 PM, Chris Riccomini <[email protected]> wrote:

> Hey Shekar,
>
> Can you attach your log files? I'm wondering if it's a mis-configured
> log4j.xml (or a missing slf4j-log4j jar), which is leading to nearly empty
> log files. Also, I'm wondering if the job starts fully. Anything you can
> attach would be helpful.
>
> Cheers,
> Chris
>
> On 9/2/14 11:43 AM, "Shekar Tippur" <[email protected]> wrote:
>
> >I am running in local mode.
> >
> >S
> >On Sep 2, 2014 11:42 AM, "Yan Fang" <[email protected]> wrote:
> >
> >> Hi Shekar,
> >>
> >> Are you running the job in local mode or yarn mode? If yarn mode, the
> >> log is in the yarn's container log.
> >>
> >> Thanks,
> >>
> >> Fang, Yan
> >> [email protected]
> >> +1 (206) 849-4108
> >>
> >>
> >> On Tue, Sep 2, 2014 at 11:36 AM, Shekar Tippur <[email protected]> wrote:
> >>
> >> > Chris,
> >> >
> >> > Got some time to play around a bit more.
> >> > I tried to edit
> >> > samza-wikipedia/src/main/java/samza/examples/wikipedia/task/WikipediaFeedStreamTask.java
> >> > to add a logger info statement to tap the incoming message. I don't
> >> > see the messages being printed to the log file.
> >> >
> >> > Is this the right place to start?
> >> >
> >> > public class WikipediaFeedStreamTask implements StreamTask {
> >> >
> >> >   private static final SystemStream OUTPUT_STREAM =
> >> >       new SystemStream("kafka", "wikipedia-raw");
> >> >
> >> >   private static final Logger log =
> >> >       LoggerFactory.getLogger(WikipediaFeedStreamTask.class);
> >> >
> >> >   @Override
> >> >   public void process(IncomingMessageEnvelope envelope,
> >> >       MessageCollector collector, TaskCoordinator coordinator) {
> >> >     Map<String, Object> outgoingMap =
> >> >         WikipediaFeedEvent.toMap((WikipediaFeedEvent) envelope.getMessage());
> >> >     log.info(envelope.getMessage().toString());
> >> >     collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, outgoingMap));
> >> >   }
> >> > }
> >> >
> >> >
> >> > On Mon, Aug 25, 2014 at 9:01 AM, Chris Riccomini <[email protected]> wrote:
> >> >
> >> > > Hey Shekar,
> >> > >
> >> > > Your thought process is on the right track. It's probably best to
> >> > > start with hello-samza, and modify it to get what you want. To start
> >> > > with, you'll want to:
> >> > >
> >> > > 1. Write a simple StreamTask that just does something silly like
> >> > > printing the messages that it receives.
> >> > > 2. Write a configuration for the job that consumes from just the
> >> > > stream (alerts from different sources).
> >> > > 3. Run this to make sure you've got it working.
> >> > > 4. Now add your table join. This can be either a change-data capture
> >> > > (CDC) stream, or via a remote DB call.
> >> > >
> >> > > That should get you to a point where you've got your job up and
> >> > > running.
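[A config sketch for step 2 above: a minimal properties file for a job that consumes a single stream. The job name and input topic are placeholders, and the property set follows the hello-samza examples of this era; check them against your Samza version before use.]

```properties
# Hypothetical minimal Samza job config (names are placeholders).
job.factory.class=org.apache.samza.job.local.LocalJobFactory
job.name=alert-printer
task.class=samza.examples.wikipedia.task.WikipediaFeedStreamTask
task.inputs=kafka.wikipedia-raw

# Kafka system definition.
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.consumer.zookeeper.connect=localhost:2181
systems.kafka.producer.metadata.broker.list=localhost:9092
serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory
systems.kafka.samza.msg.serde=json
```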
> >> > > From there, you could create your own Maven project, and migrate
> >> > > your code over accordingly.
> >> > >
> >> > > Cheers,
> >> > > Chris
> >> > >
> >> > > On 8/24/14 1:42 AM, "Shekar Tippur" <[email protected]> wrote:
> >> > >
> >> > > >Chris,
> >> > > >
> >> > > >I have gone through the documentation and decided that the option
> >> > > >that is most suitable for me is stream-table.
> >> > > >
> >> > > >I see the following things:
> >> > > >
> >> > > >1. Point Samza to a table (database)
> >> > > >2. Point Samza to a stream - alert stream from different sources
> >> > > >3. Join on a key like a hostname
> >> > > >
> >> > > >I have Hello Samza working. To extend that to do what my needs are,
> >> > > >I am not sure where to start (more code changes OR configuration
> >> > > >changes OR both)?
> >> > > >
> >> > > >I have gone through
> >> > > >http://samza.incubator.apache.org/learn/documentation/latest/api/overview.html
> >> > > >
> >> > > >Is my thought process on the right track? Can you please point me in
> >> > > >the right direction?
> >> > > >
> >> > > >- Shekar
> >> > > >
> >> > > >
> >> > > >On Thu, Aug 21, 2014 at 1:08 PM, Shekar Tippur <[email protected]> wrote:
> >> > > >
> >> > > >> Chris,
> >> > > >>
> >> > > >> This is a perfectly good answer. I will start poking more into
> >> > > >> option #4.
> >> > > >>
> >> > > >> - Shekar
> >> > > >>
> >> > > >>
> >> > > >> On Thu, Aug 21, 2014 at 1:05 PM, Chris Riccomini <[email protected]> wrote:
> >> > > >>
> >> > > >>> Hey Shekar,
> >> > > >>>
> >> > > >>> Your two options are really (3) or (4), then.
> >> > > >>> You can either run some external DB that holds the data set and
> >> > > >>> query it from a StreamTask, or you can use Samza's state store
> >> > > >>> feature to push data into a stream that you can then store in a
> >> > > >>> partitioned key-value store along with your StreamTasks. There is
> >> > > >>> some documentation here about the state store approach:
> >> > > >>>
> >> > > >>> http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html
> >> > > >>>
> >> > > >>> (4) is going to require more up-front effort from you, since
> >> > > >>> you'll have to understand how Kafka's partitioning model works,
> >> > > >>> and set up some pipeline to push the updates for your state. In
> >> > > >>> the long run, I believe it's the better approach, though. Local
> >> > > >>> lookups on a key-value store should be faster than doing remote
> >> > > >>> RPC calls to a DB for every message.
> >> > > >>>
> >> > > >>> I'm sorry I can't give you a more definitive answer. It's really
> >> > > >>> about trade-offs.
> >> > > >>>
> >> > > >>> Cheers,
> >> > > >>> Chris
> >> > > >>>
> >> > > >>> On 8/21/14 12:22 PM, "Shekar Tippur" <[email protected]> wrote:
> >> > > >>>
> >> > > >>> >Chris,
> >> > > >>> >
> >> > > >>> >A big thanks for a swift response. The data set is huge and the
> >> > > >>> >frequency is in bursts.
> >> > > >>> >What do you suggest?
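[The local-lookup-versus-RPC point above can be sketched in plain Java. This is only an illustration of the join step: the HashMap stands in for the partitioned key-value store a real StreamTask would obtain from its task context, and the field names ("ip", "latLong") are hypothetical.]

```java
import java.util.HashMap;
import java.util.Map;

public class GeoJoinSketch {
    // Stands in for Samza's partitioned key-value store (the IPGeo state).
    private final Map<String, String> ipGeoStore = new HashMap<>();

    public void put(String ip, String latLong) {
        ipGeoStore.put(ip, latLong);
    }

    // Enrich a page-view event with geo data, as a StreamTask's process()
    // would before sending the result downstream.
    public Map<String, Object> enrich(Map<String, Object> pageView) {
        Map<String, Object> out = new HashMap<>(pageView);
        String ip = (String) pageView.get("ip");
        // Local lookup instead of a remote RPC call per message.
        out.put("latLong", ipGeoStore.getOrDefault(ip, "unknown"));
        return out;
    }

    public static void main(String[] args) {
        GeoJoinSketch task = new GeoJoinSketch();
        task.put("10.0.0.1", "37.4,-122.1");
        Map<String, Object> event = new HashMap<>();
        event.put("ip", "10.0.0.1");
        System.out.println(task.enrich(event).get("latLong"));
    }
}
```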
> >> > > >>> >
> >> > > >>> >- Shekar
> >> > > >>> >
> >> > > >>> >
> >> > > >>> >On Thu, Aug 21, 2014 at 11:05 AM, Chris Riccomini <[email protected]> wrote:
> >> > > >>> >
> >> > > >>> >> Hey Shekar,
> >> > > >>> >>
> >> > > >>> >> This is feasible, and you are on the right thought process.
> >> > > >>> >>
> >> > > >>> >> For the sake of discussion, I'm going to pretend that you have
> >> > > >>> >> a Kafka topic called "PageViewEvent", which has just the IP
> >> > > >>> >> address that was used to view a page. These messages will be
> >> > > >>> >> logged every time a page view happens. I'm also going to
> >> > > >>> >> pretend that you have some state called "IPGeo" (e.g. the
> >> > > >>> >> MaxMind data set). In this example, we'll want to join the
> >> > > >>> >> long/lat geo information from IPGeo to the PageViewEvent, and
> >> > > >>> >> send it to a new topic: PageViewEventsWithGeo.
> >> > > >>> >>
> >> > > >>> >> You have several options on how to implement this example.
> >> > > >>> >>
> >> > > >>> >> 1. If your joining data set (IPGeo) is relatively small and
> >> > > >>> >> changes infrequently, you can just pack it up in your jar or
> >> > > >>> >> .tgz file, and open it in every StreamTask.
> >> > > >>> >> 2. If your data set is small, but changes somewhat frequently,
> >> > > >>> >> you can throw the data set on some HTTP/HDFS/S3 server
> >> > > >>> >> somewhere, and have your StreamTask refresh it periodically by
> >> > > >>> >> re-downloading it.
> >> > > >>> >> 3. You can do remote RPC calls for the IPGeo data on every
> >> > > >>> >> page view event by querying some remote service or DB (e.g.
> >> > > >>> >> Cassandra).
> >> > > >>> >> 4.
> >> > > >>> >> You can use Samza's state feature to send your IPGeo data as
> >> > > >>> >> a series of messages to a log-compacted Kafka topic
> >> > > >>> >> (https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction),
> >> > > >>> >> and configure your Samza job to read this topic as a bootstrap
> >> > > >>> >> stream
> >> > > >>> >> (http://samza.incubator.apache.org/learn/documentation/0.7.0/container/streams.html).
> >> > > >>> >>
> >> > > >>> >> For (4), you'd have to partition the IPGeo state topic
> >> > > >>> >> according to the same key as PageViewEvent. If PageViewEvent
> >> > > >>> >> were partitioned by, say, member ID, but you want your IPGeo
> >> > > >>> >> state topic to be partitioned by IP address, then you'd have
> >> > > >>> >> to have an upstream job that re-partitioned PageViewEvent into
> >> > > >>> >> some new topic by IP address. This new topic will have to have
> >> > > >>> >> the same number of partitions as the IPGeo state topic (if
> >> > > >>> >> IPGeo has 8 partitions, then the new PageViewEventRepartitioned
> >> > > >>> >> topic needs 8 as well). This will cause your
> >> > > >>> >> PageViewEventRepartitioned topic and your IPGeo state topic to
> >> > > >>> >> be aligned such that the StreamTask that gets page views for
> >> > > >>> >> IP address X will also have the IPGeo information for IP
> >> > > >>> >> address X.
> >> > > >>> >>
> >> > > >>> >> Which strategy you pick is really up to you. :) (4) is the
> >> > > >>> >> most complicated, but also the most flexible, and most
> >> > > >>> >> operationally sound. (1) is the easiest if it fits your needs.
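[A config sketch of what option (4) above might look like. The property names follow the 0.7-era config style of the streams documentation linked above and should be verified against it; the topic names are from Chris's example.]

```properties
# Consume the repartitioned page views alongside the IPGeo state topic.
task.inputs=kafka.PageViewEventRepartitioned,kafka.IPGeo

# Read the IPGeo topic fully, from the beginning, before processing
# page views (bootstrap semantics).
systems.kafka.streams.IPGeo.samza.bootstrap=true
systems.kafka.streams.IPGeo.samza.reset.offset=true
systems.kafka.streams.IPGeo.samza.offset.default=oldest
```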
> >> > > >>> >>
> >> > > >>> >> Cheers,
> >> > > >>> >> Chris
> >> > > >>> >>
> >> > > >>> >> On 8/21/14 10:15 AM, "Shekar Tippur" <[email protected]> wrote:
> >> > > >>> >>
> >> > > >>> >> >Hello,
> >> > > >>> >> >
> >> > > >>> >> >I am new to Samza. I have just installed Hello Samza and got
> >> > > >>> >> >it working.
> >> > > >>> >> >
> >> > > >>> >> >Here is the use case for which I am trying to use Samza:
> >> > > >>> >> >
> >> > > >>> >> >1. Cache the contextual information, which contains more
> >> > > >>> >> >information about the hostname or IP address, using
> >> > > >>> >> >Samza/Yarn/Kafka
> >> > > >>> >> >2. Collect alert and metric events which contain either a
> >> > > >>> >> >hostname or an IP address
> >> > > >>> >> >3. Append contextual information to the alert and metric, and
> >> > > >>> >> >insert it into a Kafka queue from which other subscribers
> >> > > >>> >> >read off of.
> >> > > >>> >> >
> >> > > >>> >> >Can you please shed some light on:
> >> > > >>> >> >
> >> > > >>> >> >1. Is this feasible?
> >> > > >>> >> >2. Am I on the right thought process?
> >> > > >>> >> >3. How do I start?
> >> > > >>> >> >
> >> > > >>> >> >I now have 1 & 2 of them working disparately. I need to
> >> > > >>> >> >integrate them.
> >> > > >>> >> >
> >> > > >>> >> >Appreciate any input.
> >> > > >>> >> >
> >> > > >>> >> >- Shekar
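[Option (2) from Chris's list earlier in the thread, periodically re-downloading a small data set, can be sketched with plain JDK scheduling. The URL and the "ip,latLong" CSV format here are hypothetical; the point is swapping in a fresh snapshot atomically so lookups never see a half-built map.]

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class RefreshingGeoData {
    // Latest snapshot; swapped atomically so lookups never see a partial map.
    private final AtomicReference<Map<String, String>> snapshot =
        new AtomicReference<>(new HashMap<>());

    // Parse lines of "ip,latLong" into a lookup map.
    static Map<String, String> parse(BufferedReader reader) throws Exception {
        Map<String, String> map = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            int comma = line.indexOf(',');
            if (comma > 0) {
                map.put(line.substring(0, comma), line.substring(comma + 1));
            }
        }
        return map;
    }

    // Re-download every 10 minutes from a (hypothetical) HTTP location.
    void start(String url) {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream()))) {
                snapshot.set(parse(r));
            } catch (Exception e) {
                // Keep serving the previous snapshot on failure.
            }
        }, 0, 10, TimeUnit.MINUTES);
    }

    String lookup(String ip) {
        return snapshot.get().getOrDefault(ip, "unknown");
    }
}
```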
