Hi Bobby,

Thank you. Great help.
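In case it is useful to anyone following the thread below, here is a minimal sketch, in plain Java, of the kind of stateful lookup we discussed. It is not tied to Storm or Map/Reduce: it just remembers the most recent BBBB and, for each incoming bbbb, checks whether an earlier BBBB exists, which is only one possible reading of P(bbbb | BBBB). The class and method names are made up for illustration; only the event names and the timestamp layout come from the example records quoted below.

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Sketch only: feed events in timestamp order and estimate how often a
// "bbbb" record is preceded by an earlier "BBBB" record.
public class ConditionalOccurrenceTracker {

    // Matches timestamps like "20:30:21 01/April/2012" from the example records.
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("HH:mm:ss dd/MMMM/yyyy", Locale.ENGLISH);

    private LocalDateTime lastBBBB;  // time of the most recent BBBB, if any
    private long bbbbTotal;          // all bbbb events seen so far
    private long bbbbAfterBBBB;      // bbbb events that had an earlier BBBB

    // Consume one event; events are assumed to arrive in timestamp order.
    public void onEvent(String timestamp, String name) {
        LocalDateTime ts = LocalDateTime.parse(timestamp, TS);
        if (name.equals("BBBB")) {
            lastBBBB = ts;                          // remember the latest BBBB
        } else if (name.equals("bbbb")) {
            bbbbTotal++;
            if (lastBBBB != null && lastBBBB.isBefore(ts)) {
                bbbbAfterBBBB++;                    // bbbb preceded by an earlier BBBB
            }
        }
    }

    // Fraction of bbbb events that followed some earlier BBBB.
    public double probability() {
        return bbbbTotal == 0 ? 0.0 : (double) bbbbAfterBBBB / bbbbTotal;
    }

    public static void main(String[] args) {
        ConditionalOccurrenceTracker t = new ConditionalOccurrenceTracker();
        t.onEvent("20:30:21 01/April/2012", "AAAAA");
        t.onEvent("20:30:51 01/April/2012", "BBBB");
        t.onEvent("21:30:21 01/April/2012", "bbbb");
        System.out.println("P(bbbb | BBBB) = " + t.probability());  // 1.0 for this toy input
    }
}

I imagine that in Storm the same state would live inside a bolt's execute() method, and lastBBBB would need a window or expiry if the look-back time has to be bounded.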
Zhiwei

On 22 May 2012 14:52, Robert Evans <ev...@yahoo-inc.com> wrote:

> If you want the results to come out instantly, Map/Reduce is not the
> proper choice. Map/Reduce is designed for batch processing. It can do
> small batches, but the overhead of launching the map/reduce jobs can be
> very high compared to the amount of processing you are doing. I
> personally would look into using either Storm, S4, or some other
> realtime stream processing framework. From what you have said it sounds
> like you probably want to use Storm, as it can be used to guarantee that
> each event is processed once and only once. You can also store your
> results into HDFS if you want, perhaps through HBase, if you need to do
> further processing on the data.
>
> --Bobby Evans
>
> On 5/22/12 5:02 AM, "Zhiwei Lin" <zhiwei...@gmail.com> wrote:
>
> Hi Robert,
> Thank you.
>
> How quickly do you have to get the result out once the new data is added?
> If possible, I hope to get the result instantly.
>
> How far back in time do you have to look for BBBB from the occurrence of
> bbbb?
> The time slot is not constant. It depends on the "last" occurrence of
> BBBB in front of bbbb. So I need to look up the history to get the last
> BBBB in this case.
>
> Do you have to do this for all combinations of values or is it just a
> small subset of values?
> I think this depends on the time of the last occurrence of BBBB in the
> history. If BBBB rarely occurred, then the early-stage data has to be
> taken into account.
>
> Definitely, I think HDFS is a good place to store the data I have (the
> size of the daily log is above 1 GB). But I am not sure if Map/Reduce
> can help to handle the stated problem.
>
> Zhiwei
>
> On 21 May 2012 22:07, Robert Evans <ev...@yahoo-inc.com> wrote:
>
> > Zhiwei,
> >
> > How quickly do you have to get the result out once the new data is
> > added? How far back in time do you have to look for BBBB from the
> > occurrence of bbbb? Do you have to do this for all combinations of
> > values or is it just a small subset of values?
> >
> > --Bobby Evans
> >
> > On 5/21/12 3:01 PM, "Zhiwei Lin" <zhiwei...@gmail.com> wrote:
> >
> > I have a large volume of stream log data. Each data record contains a
> > time stamp, which is very important to the analysis.
> > For example, I have data in a format like this:
> > (1) 20:30:21 01/April/2012 AAAAA.............
> > (2) 20:30:51 01/April/2012 BBBB.............
> > (3) 21:30:21 01/April/2012 bbbb.............
> >
> > Moreover, new data comes every few minutes.
> > I have to calculate the probability of the occurrence of "bbbb" given
> > the occurrence of "BBBB" (where BBBB occurs earlier than bbbb). So it
> > is really time-dependent.
> >
> > I wonder if Hadoop is the right platform for this job? Is there any
> > package available for this kind of work?
> >
> > Thank you.
> >
> > Zhiwei
> >
> > --
> > Best wishes.
> > Zhiwei

--
Best wishes.

Zhiwei