The Hadoop Aggregate package (o.a.h.mapred.lib.aggregate) is a good fit for your aggregation problem: a single job can compute many aggregators (sums, distinct counts, histograms) over many key types in one reduce pass.
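Roughly, you write a ValueAggregatorDescriptor that emits several aggregation keys per input record, and the framework's generic reducer computes all of them together. Below is an untested sketch from memory against the old mapred-era API (the SessionAggDescriptor name, the key naming scheme, and the tab-separated user/video/day record layout are just placeholders for illustration); it only covers the simple count and distinct-count cases, not the fancier distributional measures:

// Sketch only: emits several aggregation keys per session record so that
// one map/reduce job computes all of the aggregates in a single reduce.
// The tab-separated layout (user, video, day) is a made-up placeholder.
import java.util.ArrayList;
import java.util.Map.Entry;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorBaseDescriptor;

public class SessionAggDescriptor extends ValueAggregatorBaseDescriptor {

  public ArrayList<Entry<Text, Text>> generateKeyValPairs(Object key, Object val) {
    ArrayList<Entry<Text, Text>> out = new ArrayList<Entry<Text, Text>>();
    String[] f = val.toString().split("\t");   // user \t video \t yyyy-MM-dd ...
    if (f.length < 3) {
      return out;                              // skip malformed records
    }
    String user = f[0], video = f[1], day = f[2];

    // Plain view counts per user, per video, and per video per day.
    out.add(generateEntry(LONG_VALUE_SUM, "views:user:" + user, new Text("1")));
    out.add(generateEntry(LONG_VALUE_SUM, "views:video:" + video, new Text("1")));
    out.add(generateEntry(LONG_VALUE_SUM, "views:video_day:" + video + ":" + day, new Text("1")));

    // Distinct viewers per video.
    out.add(generateEntry(UNIQ_VALUE_COUNT, "uniq:video:" + video, new Text(user)));
    return out;
  }
}

A job for it can then be built with ValueAggregatorJob.createValueAggregatorJob() (if I remember the helper right), which plugs in the stock aggregate mapper, combiner, and reducer, so every aggregation key emitted above is handled in the same reduce task.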
Runping

> -----Original Message-----
> From: Ted Dunning [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, August 07, 2007 12:09 PM
> To: [email protected]
> Subject: Re: Loading data into HDFS
>
> One issue that we see in building log aggregators using hadoop is that we
> often want to do several aggregations in a single reduce task.
>
> For instance, we have viewers who view videos and sometimes watch them to
> completion, sometimes scrub to different points in the video, and sometimes
> close a browser without a session completion event. We want to run a map to
> extract session id as a reduce key and then have a reduce that summarizes
> the basics of the session (user + video + what happened). The interesting
> stuff happens in the second map/reduce, where we want to have the map emit
> records for user aggregation, user-by-day aggregation, video aggregation,
> and video-by-day aggregation (this is somewhat simplified, of course). In
> the second reduce, we want to compute various aggregation functions such as
> total count, total distinct count, and a few distributional measures such
> as estimated non-robot distinct views, long-tail coefficients, or day-part
> usage patterns.
>
> Systems like Pig seem to be built with the assumption that there should be
> a separate reduce task per aggregation. Since we have about a dozen
> aggregate functions that we need to compute per aggregation type, that
> would entail about an order of magnitude decrease in performance for us
> unless the language is clever enough to put all aggregation functions for
> the same aggregation into the same reduce task.
>
> Also, our current ingestion rates are completely dominated by our
> downstream processing, so log file collection isn't a huge deal. For
> backfilling old data, we have some need for higher bandwidth, but even a
> single NFS server suffices as a source for heavily compressed and
> consolidated log files. Our transaction volumes are pretty modest compared
> to something like Yahoo, of course, since we still have less than 20
> million monthly uniques, but we expect this to continue the strong growth
> that we have been seeing.
>
> On 8/7/07 10:45 AM, "Eric Baldeschwieler" <[EMAIL PROTECTED]> wrote:
>
> > I'll have our operations folks comment on our current techniques.
> >
> > We use map-reduce jobs to copy from all nodes in the cluster from the
> > source, generally using either HTTP(S) or HDFS protocol.
> >
> > We've seen write rates as high as 8.3 GBytes/sec on 900 nodes. This is
> > network limited. We see roughly 20 MBytes/sec/node (double the other
> > rate) on one-rack clusters, with everything connected with gigabit.
> >
> > We (the Yahoo grid team) are planning to put some more energy into
> > making the system more useful for real-time log handling in the next
> > few releases. For example, I would like to be able to tail -f a file as
> > it is written, I would like to have a generic log aggregation system,
> > and I would like to have the map-reduce framework log directly into
> > HDFS using that system.
> >
> > I'd love to hear thoughts on other achievable improvements that would
> > really help in this area.
> >
> > On Aug 3, 2007, at 1:42 AM, Jeff Hammerbacher wrote:
> >
> >> We have a service which writes one copy of a logfile directly into
> >> HDFS (writes go to the namenode). As Dennis mentions, since HDFS does
> >> not support atomic appends, if a failure occurs before closing a file,
> >> it never appears in the file system. Thus we have to rotate logfiles
> >> at a greater frequency than we'd like in order to "checkpoint" the
> >> data into HDFS. The system certainly isn't perfect, but bulk-loading
> >> the data into HDFS was proving rather slow. I'd be curious to hear
> >> actual performance numbers and methodologies for bulk loads. I'll try
> >> to dig some up myself on Monday.
> >>
> >> On 8/2/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >>>
> >>> You can copy data from any node, so if you can do it from multiple
> >>> nodes your performance would be better (although be sure not to
> >>> overlap files). The master node is updated once the block is copied
> >>> its replication number of times. So if default replication is 3 then
> >>> the 3 replicates must be active before the master is updated and the
> >>> data "appears" in the dfs.
> >>>
> >>> How long the updates take to happen is a function of your server
> >>> load, network speed, and file size. Generally it is fast.
> >>>
> >>> So the process is: the data is loaded into the dfs, replicates are
> >>> created, and the master node is updated. In terms of consistency, if
> >>> the data node crashes before the data is loaded then the data won't
> >>> appear in the dfs. If the name node crashes before it is updated but
> >>> all replicates are active, the data would appear once the name node
> >>> has been fixed and updated through block reports. If a single node
> >>> that has a replicate crashes after the namenode has been updated,
> >>> then the data will be replicated from one of the other 2 replicates
> >>> to another system, if available.
> >>>
> >>> Dennis Kubes
> >>>
> >>> Venkates .P.B. wrote:
> >>>> Am I missing something very fundamental? Can someone comment on
> >>>> these queries?
> >>>>
> >>>> Thanks,
> >>>> Venkates P B
> >>>>
> >>>> On 8/1/07, Venkates .P.B. <[EMAIL PROTECTED]> wrote:
> >>>>>
> >>>>> A few queries regarding the way data is loaded into HDFS.
> >>>>>
> >>>>> - Is it a common practice to load the data into HDFS only through
> >>>>> the master node? We are able to copy only around 35 logs (64K each)
> >>>>> per minute in a 2-slave configuration.
> >>>>>
> >>>>> - We are concerned about the time it would take to update filenames
> >>>>> and block maps in the master node when data is loaded from few/all
> >>>>> of the slave nodes. Can anyone let me know how long this update
> >>>>> generally takes to happen.
> >>>>>
> >>>>> And one more question: what if the node crashes soon after the data
> >>>>> is copied onto it? How is data consistency maintained here?
> >>>>>
> >>>>> Thanks in advance,
> >>>>> Venkates P B
