Hey Shekar,

Can you attach your log files? I'm wondering if it's a misconfigured log4j.xml (or a missing slf4j-log4j jar) that's leading to nearly empty log files. I'm also wondering whether the job starts fully. Anything you can attach would be helpful.

Cheers,
Chris
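(For reference, a minimal log4j.xml along the lines Chris describes might look like the sketch below. This is illustrative rather than the project's actual file: the console appender and pattern are assumptions, and the slf4j-log4j12 binding jar still has to be on the classpath for SLF4J to route into log4j.)

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
  <!-- Write everything to the console so messages are easy to find while debugging. -->
  <appender name="console" class="org.apache.log4j.ConsoleAppender">
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d{ISO8601} %-5p %c{1} - %m%n" />
    </layout>
  </appender>
  <root>
    <priority value="info" />
    <appender-ref ref="console" />
  </root>
</log4j:configuration>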
On 9/2/14 11:43 AM, "Shekar Tippur" <[email protected]> wrote:

I am running in local mode.

S

On Sep 2, 2014 11:42 AM, "Yan Fang" <[email protected]> wrote:

Hi Shekar.

Are you running the job in local mode or YARN mode? If YARN mode, the log is in the YARN container log.

Thanks,

Fang, Yan
[email protected]
+1 (206) 849-4108

On Tue, Sep 2, 2014 at 11:36 AM, Shekar Tippur <[email protected]> wrote:

Chris,

Got some time to play around a bit more. I tried to edit
samza-wikipedia/src/main/java/samza/examples/wikipedia/task/WikipediaFeedStreamTask.java
to add a logger info statement to tap the incoming message. I don't see the messages being printed to the log file.

Is this the right place to start?

import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

import samza.examples.wikipedia.system.WikipediaFeed.WikipediaFeedEvent;

public class WikipediaFeedStreamTask implements StreamTask {

  private static final SystemStream OUTPUT_STREAM =
      new SystemStream("kafka", "wikipedia-raw");

  private static final Logger log =
      LoggerFactory.getLogger(WikipediaFeedStreamTask.class);

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
      TaskCoordinator coordinator) {
    Map<String, Object> outgoingMap =
        WikipediaFeedEvent.toMap((WikipediaFeedEvent) envelope.getMessage());

    log.info(envelope.getMessage().toString());

    collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, outgoingMap));
  }
}

On Mon, Aug 25, 2014 at 9:01 AM, Chris Riccomini <[email protected]> wrote:

Hey Shekar,

Your thought process is on the right track. It's probably best to start with hello-samza and modify it to get what you want. To start with, you'll want to:

1. Write a simple StreamTask that just does something silly, like printing the messages it receives.
2. Write a configuration for the job that consumes from just the stream (alerts from different sources).
3. Run this to make sure you've got it working.
4. Now add your table join. This can be either a change-data capture (CDC) stream, or via a remote DB call.

That should get you to a point where you've got your job up and running. From there, you could create your own Maven project and migrate your code over accordingly.

Cheers,
Chris

On 8/24/14 1:42 AM, "Shekar Tippur" <[email protected]> wrote:

Chris,

I have gone through the documentation and decided that the option most suitable for me is stream-table.

I see the following things:

1. Point Samza to a table (database)
2. Point Samza to a stream - alert stream from different sources
3. Join key, like a hostname

I have Hello Samza working. To extend it to my needs, I am not sure where to start (more code changes, configuration changes, or both)?

I have gone through
http://samza.incubator.apache.org/learn/documentation/latest/api/overview.html

Is my thought process on the right track? Can you please point me in the right direction?

- Shekar
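(As a concrete starting point for steps 1-2 in Chris's list above, a minimal 0.7-era job config might look roughly like the sketch below. The job name, task class, and topic names are illustrative, not from this thread, and the hosts assume a local hello-samza-style setup.)

# Run the job in-process ("local mode") rather than on YARN.
job.factory.class=org.apache.samza.job.local.LocalJobFactory
job.name=alert-print

# The StreamTask to run, and the single stream it consumes.
task.class=samza.examples.alerts.AlertPrintStreamTask
task.inputs=kafka.alerts

# Kafka system definition (0.7-era producer/consumer properties).
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.consumer.zookeeper.connect=localhost:2181
systems.kafka.producer.metadata.broker.list=localhost:9092

# Deserialize messages as JSON maps.
serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory
systems.kafka.samza.msg.serde=json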
On Thu, Aug 21, 2014 at 1:08 PM, Shekar Tippur <[email protected]> wrote:

Chris,

This is a perfectly good answer. I will start poking more into option #4.

- Shekar

On Thu, Aug 21, 2014 at 1:05 PM, Chris Riccomini <[email protected]> wrote:

Hey Shekar,

Your two options are really (3) or (4), then. You can either run some external DB that holds the data set and query it from a StreamTask, or you can use Samza's state store feature to push the data into a stream that you can then store in a partitioned key-value store along with your StreamTasks. There is some documentation here about the state store approach:

http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html

(4) is going to require more up-front effort from you, since you'll have to understand how Kafka's partitioning model works and set up some pipeline to push the updates for your state. In the long run, I believe it's the better approach, though. Local lookups on a key-value store should be faster than doing remote RPC calls to a DB for every message.

I'm sorry I can't give you a more definitive answer. It's really about trade-offs.

Cheers,
Chris
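(To make the state-store approach concrete for the alert use case in this thread, a stream-table join task might look roughly like the sketch below, written against the 0.7-era API. The "host-context" store and stream names, the "hostname" join key, and the "enriched-alerts" output topic are illustrative assumptions, not from the thread.)

import java.util.Map;

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class AlertEnricherStreamTask implements StreamTask, InitableTask {

  private static final SystemStream OUTPUT_STREAM =
      new SystemStream("kafka", "enriched-alerts");

  // Local, partitioned key-value store holding per-host context.
  private KeyValueStore<String, Map<String, Object>> contextStore;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    contextStore = (KeyValueStore<String, Map<String, Object>>) context.getStore("host-context");
  }

  @Override
  @SuppressWarnings("unchecked")
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
      TaskCoordinator coordinator) {
    String stream = envelope.getSystemStreamPartition().getStream();
    Map<String, Object> message = (Map<String, Object>) envelope.getMessage();

    if ("host-context".equals(stream)) {
      // Context (bootstrap) stream: cache the latest metadata for this host.
      contextStore.put((String) message.get("hostname"), message);
    } else {
      // Alert/metric stream: attach cached context, if any, and re-emit.
      Map<String, Object> hostContext = contextStore.get((String) message.get("hostname"));
      if (hostContext != null) {
        message.put("context", hostContext);
      }
      collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, message));
    }
  }
}

(The join only works if the alert stream and the context stream are partitioned by the same key, so that the task holding alerts for host X also holds the context for host X; that is exactly the alignment Chris describes below.)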
On 8/21/14 12:22 PM, "Shekar Tippur" <[email protected]> wrote:

Chris,

A big thanks for a swift response. The data set is huge and the frequency is bursty. What do you suggest?

- Shekar

On Thu, Aug 21, 2014 at 11:05 AM, Chris Riccomini <[email protected]> wrote:

Hey Shekar,

This is feasible, and you are on the right thought process.

For the sake of discussion, I'm going to pretend that you have a Kafka topic called "PageViewEvent", which has just the IP address that was used to view a page. These messages will be logged every time a page view happens. I'm also going to pretend that you have some state called "IPGeo" (e.g. the MaxMind data set). In this example, we'll want to join the long/lat geo information from IPGeo to the PageViewEvent, and send it to a new topic: PageViewEventsWithGeo.

You have several options on how to implement this example.

1. If your joining data set (IPGeo) is relatively small and changes infrequently, you can just pack it up in your jar or .tgz file, and open it in every StreamTask.
2. If your data set is small, but changes somewhat frequently, you can throw the data set on some HTTP/HDFS/S3 server somewhere, and have your StreamTask refresh it periodically by re-downloading it.
3. You can do remote RPC calls for the IPGeo data on every page view event by querying some remote service or DB (e.g. Cassandra).
4. You can use Samza's state feature to send your IPGeo data as a series of messages to a log-compacted Kafka topic
(https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction), and configure your Samza job to read this topic as a bootstrap stream
(http://samza.incubator.apache.org/learn/documentation/0.7.0/container/streams.html).

For (4), you'd have to partition the IPGeo state topic according to the same key as PageViewEvent. If PageViewEvent were partitioned by, say, member ID, but you want your IPGeo state topic to be partitioned by IP address, then you'd have to have an upstream job that re-partitioned PageViewEvent into some new topic by IP address. This new topic will have to have the same number of partitions as the IPGeo state topic (if IPGeo has 8 partitions, then the new PageViewEventRepartitioned topic needs 8 as well). This will cause your PageViewEventRepartitioned topic and your IPGeo state topic to be aligned, such that the StreamTask that gets page views for IP address X will also have the IPGeo information for IP address X.

Which strategy you pick is really up to you. :) (4) is the most complicated, but also the most flexible and most operationally sound. (1) is the easiest if it fits your needs.

Cheers,
Chris
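(To make option (4) concrete for the alert use case in this thread, the store and bootstrap stream might be wired up roughly as below, using 0.7-era property names. The "host-context" store/topic names are illustrative assumptions matching the task sketch earlier in the thread.)

# Consume both the alert stream and the context stream.
task.inputs=kafka.alerts,kafka.host-context

# Read the context topic from the oldest offset, and fully catch up
# on it before processing any other input.
systems.kafka.streams.host-context.samza.bootstrap=true
systems.kafka.streams.host-context.samza.reset.offset=true
systems.kafka.streams.host-context.samza.offset.default=oldest

# Local key-value store backing the join; the changelog makes it recoverable.
stores.host-context.factory=org.apache.samza.storage.kv.KeyValueStorageEngineFactory
stores.host-context.key.serde=string
stores.host-context.msg.serde=json
stores.host-context.changelog=kafka.host-context-changelog

serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory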
>> > > >>> >> > >> > > >>> >> >Appreciate any input. >> > > >>> >> > >> > > >>> >> >- Shekar >> > > >>> >> >> > > >>> >> >> > > >>> >> > > >>> >> > > >> >> > > >> > > >> > >>
