Chris, I have gone thro the documentation and decided that the option that is most suitable for me is stream-table.
I see the following things: 1. Point samza to a table (database) 2. Point Samza to a stream - Alert stream from different sources 3. Join key like a hostname I have Hello Samza working. To extend that to do what my needs are, I am not sure where to start (Needs more code change OR configuration changes OR both)? I have gone thro http://samza.incubator.apache.org/learn/documentation/latest/api/overview.html Is my thought process on the right track? Can you please point me to the right direction? - Shekar On Thu, Aug 21, 2014 at 1:08 PM, Shekar Tippur <[email protected]> wrote: > Chris, > > This is perfectly good answer. I will start poking more into option #4. > > - Shekar > > > On Thu, Aug 21, 2014 at 1:05 PM, Chris Riccomini < > [email protected]> wrote: > >> Hey Shekar, >> >> Your two options are really (3) or (4), then. You can either run some >> external DB that holds the data set, and you can query it from a >> StreamTask, or you can use Samza's state store feature to push data into a >> stream that you can then store in a partitioned key-value store along with >> your StreamTasks. There is some documentation here about the state store >> approach: >> >> >> http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state >> -management.html >> <http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html> >> >> >> (4) is going to require more up front effort from you, since you'll have >> to understand how Kafka's partitioning model works, and setup some >> pipeline to push the updates for your state. In the long run, I believe >> it's the better approach, though. Local lookups on a key-value store >> should be faster than doing remote RPC calls to a DB for every message. >> >> I'm sorry I can't give you a more definitive answer. It's really about >> trade-offs. >> >> Cheers, >> Chris >> >> On 8/21/14 12:22 PM, "Shekar Tippur" <[email protected]> wrote: >> >> >Chris, >> > >> >A big thanks for a swift response. The data set is huge and the frequency >> >is in burst. >> >What do you suggest? >> > >> >- Shekar >> > >> > >> >On Thu, Aug 21, 2014 at 11:05 AM, Chris Riccomini < >> >[email protected]> wrote: >> > >> >> Hey Shekar, >> >> >> >> This is feasible, and you are on the right thought process. >> >> >> >> For the sake of discussion, I'm going to pretend that you have a Kafka >> >> topic called "PageViewEvent", which has just the IP address that was >> >>used >> >> to view a page. These messages will be logged every time a page view >> >> happens. I'm also going to pretend that you have some state called >> >>"IPGeo" >> >> (e.g. The maxmind data set). In this example, we'll want to join the >> >> long/lat geo information from IPGeo to the PageViewEvent, and send it >> >>to a >> >> new topic: PageViewEventsWithGeo. >> >> >> >> You have several options on how to implement this example. >> >> >> >> 1. If your joining data set (IPGeo) is relatively small and changes >> >> infrequently, you can just pack it up in your jar or .tgz file, and >> open >> >> it open in every StreamTask. >> >> 2. If your data set is small, but changes somewhat frequently, you can >> >> throw the data set on some HTTP/HDFS/S3 server somewhere, and have your >> >> StreamTask refresh it periodically by re-downloading it. >> >> 3. You can do remote RPC calls for the IPGeo data on every page view >> >>event >> >> by query some remote service or DB (e.g. Cassandra). >> >> 4. You can use Samza's state feature to set your IPGeo data as a series >> >>of >> >> messages to a log-compacted Kafka topic >> >> (https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction), >> and >> >> configure your Samza job to read this topic as a bootstrap stream >> >> ( >> >> >> >> >> http://samza.incubator.apache.org/learn/documentation/0.7.0/container/str >> >>e >> >> ams.html). >> >> >> >> For (4), you'd have to partition the IPGeo state topic according to the >> >> same key as PageViewEvent. If PageViewEvent were partitioned by, say, >> >> member ID, but you want your IPGeo state topic to be partitioned by IP >> >> address, then you'd have to have an upstream job that re-partitioned >> >> PageViewEvent into some new topic by IP address. This new topic will >> >>have >> >> to have the same number of partitions as the IPGeo state topic (if >> IPGeo >> >> has 8 partitions, then the new PageViewEventRepartitioned topic needs 8 >> >>as >> >> well). This will cause your PageViewEventRepartitioned topic and your >> >> IPGeo state topic to be aligned such that the StreamTask that gets page >> >> views for IP address X will also have the IPGeo information for IP >> >>address >> >> X. >> >> >> >> Which strategy you pick is really up to you. :) (4) is the most >> >> complicated, but also the most flexible, and most operationally sound. >> >>(1) >> >> is the easiest if it fits your needs. >> >> >> >> Cheers, >> >> Chris >> >> >> >> On 8/21/14 10:15 AM, "Shekar Tippur" <[email protected]> wrote: >> >> >> >> >Hello, >> >> > >> >> >I am new to Samza. I have just installed Hello Samza and got it >> >>working. >> >> > >> >> >Here is the use case for which I am trying to use Samza: >> >> > >> >> > >> >> >1. Cache the contextual information which contains more information >> >>about >> >> >the hostname or IP address using Samza/Yarn/Kafka >> >> >2. Collect alert and metric events which contain either hostname or IP >> >> >address >> >> >3. Append contextual information to the alert and metric and insert to >> >>a >> >> >Kafka queue from which other subscribers read off of. >> >> > >> >> >Can you please shed some light on >> >> > >> >> >1. Is this feasible? >> >> >2. Am I on the right thought process >> >> >3. How do I start >> >> > >> >> >I now have 1 & 2 of them working disparately. I need to integrate >> them. >> >> > >> >> >Appreciate any input. >> >> > >> >> >- Shekar >> >> >> >> >> >> >
