Not quite sure what you want to know. We've been using it successfully. Our total data rates aren't enormous (a few MB/sec per collector, I think), but it's been benchmarked well past that. The SocketTee was designed specifically for the case where some data loss is OK: it won't buffer chunks for later delivery, so if your consumer misses something, you have to wait until the HDFS copy is available.

--Ari
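For anyone wiring a consumer up to the Tee, here is a minimal client sketch. It assumes the SocketTeeWriter wire protocol as the collector docs describe it (connect to the tee port, send a RAW <filter> command, read an "OK" ack, then read length-prefixed chunk bodies). The host name and the datatype filter are hypothetical, and 9094 is only the default tee port (I believe the config option is chukwaCollector.tee.port), so check your collector's configuration.

    import java.io.DataInputStream;
    import java.io.PrintStream;
    import java.net.Socket;

    // Minimal SocketTee consumer sketch. Assumptions: the collector's
    // SocketTeeWriter is enabled, listens on the default tee port (9094),
    // and speaks the RAW protocol: a filter command, an "OK" ack, then
    // each chunk body prefixed with a big-endian 32-bit length. The host
    // and the datatype filter below are hypothetical.
    public class TeeClient {
      public static void main(String[] args) throws Exception {
        Socket sock = new Socket("collector.example.com", 9094);
        PrintStream out = new PrintStream(sock.getOutputStream());
        DataInputStream in = new DataInputStream(sock.getInputStream());

        // Ask for raw chunk bodies matching a datatype filter.
        out.print("RAW datatype=ApacheAccessLog\n");
        out.flush();

        String ack = readLine(in);
        if (!"OK".equals(ack)) {
          throw new RuntimeException("Tee rejected filter: " + ack);
        }

        while (true) {
          int len = in.readInt();        // 32-bit big-endian length prefix
          byte[] body = new byte[len];
          in.readFully(body);
          process(body);                 // hand off to the low-latency system
        }
      }

      private static String readLine(DataInputStream in) throws Exception {
        StringBuilder sb = new StringBuilder();
        for (int c = in.read(); c != -1 && c != '\n'; c = in.read()) {
          sb.append((char) c);
        }
        return sb.toString();
      }

      private static void process(byte[] body) {
        // Placeholder: push into whatever store serves the real-time feeds.
        System.out.println(new String(body));
      }
    }

The caveat above applies: if this client falls behind or drops the connection, the missed chunks are gone from the Tee; the complete copy exists only in HDFS.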
On Tue, May 11, 2010 at 11:37 AM, Jerome Boulon <jbou...@netflix.com> wrote:
> Hey Corbin,
>
> What kind of partitioner do you need?
> I'm using one based on a hashing function of the key.
> Let me know if that would work for you.
>
> Regarding the TeeWriter, I would also like to get feedback on it. Ari?
>
> /Jerome.
>
> On 5/11/10 11:24 AM, "Corbin Hoenes" <cor...@tynt.com> wrote:
>
>> Eric,
>>
>> Thanks, you guys are spot on with your analysis of our demux issue: right
>> now we have a single data type. We can probably split that into two
>> different types later, but that still won't help much until the
>> partitioning is either pluggable or somewhat configurable, as CHUKWA-481
>> states.
>>
>> My questions about the Tee are more about the low-latency requirement of
>> creating more real-time feeds of our data. My initial thought was that if
>> I could get data out of Hadoop at 5- or 10-minute intervals, that might
>> be "good enough", so I was interested in speeding up demux a bit. But now
>> I think the right thing is to use the Tee to get the data into a
>> different system that creates these feeds, and to let Hadoop handle only
>> the large-scale analysis.
>>
>> The Tee seems perfect. We'll have to try it out; I'm hoping to get
>> feedback from people who are using it like this. Sounds like Ari is.
>>
>> On May 11, 2010, at 12:03 PM, Eric Yang wrote:
>>
>>> Corbin,
>>>
>>> Multiple collectors will improve the mapper processing speed, but the
>>> reducer is still the long tail of the demux processing. It sounds like
>>> you have a large amount of a single type of data. Your processing will
>>> definitely speed up once CHUKWA-481 is addressed.
>>>
>>> Regards,
>>> Eric
>>>
>>> On 5/10/10 7:34 PM, "Corbin Hoenes" <cor...@tynt.com> wrote:
>>>
>>>> We are processing Apache log files. The current scale is 70-80 GB per
>>>> day, but we'd like to have a story for scaling up to more. Checking my
>>>> collector logs, it appears the data rate ranges from 600 KB to
>>>> 1.2 MB/sec, all from one collector. Does your setup use multiple
>>>> collectors? My thought is that multiple collectors could be used to
>>>> scale out once we reach a data rate that causes issues for a single
>>>> collector.
>>>>
>>>> Any chance you know where that single-collector limit is?
>>>>
>>>> On May 10, 2010, at 5:37 PM, Ariel Rabkin wrote:
>>>>
>>>>> That's how we use it at Berkeley, to process metrics from hundreds of
>>>>> machines; total data rate less than a megabyte per second, though.
>>>>> What scale of data are you looking at?
>>>>>
>>>>> The intent of SocketTee is for when you need some subset of the data
>>>>> now, while write-to-HDFS-and-process-with-Hadoop is still the default
>>>>> path. What sort of low-latency processing do you need?
>>>>>
>>>>> --Ari
>>>>>
>>>>> On Mon, May 10, 2010 at 4:28 PM, Corbin Hoenes <cor...@tynt.com> wrote:
>>>>>> Has anyone used the "Tee" in a larger-scale deployment to try to get
>>>>>> real-time, low-latency data? I'm interested in how feasible it would
>>>>>> be to use it to pipe data into another system that handles these
>>>>>> low-latency requests, leaving the long-term analysis to Hadoop.
>>>>>
>>>>> --
>>>>> Ari Rabkin asrab...@gmail.com
>>>>> UC Berkeley Computer Science Department

--
Ari Rabkin asrab...@gmail.com
UC Berkeley Computer Science Department
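On Jerome's hash-partitioner point in the thread above, here is a sketch of the idea against Hadoop's old mapred API, which demux is built on. The class name and the generic key/value parameters are placeholders (I believe demux's real key type is ChukwaRecordKey), and nothing in the sketch is Chukwa-specific.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Hash-based partitioner in the spirit Jerome describes: spread keys
    // evenly across reducers rather than concentrating one data type on a
    // single reducer. Key/value types are placeholders for illustration.
    public class HashingPartitioner<K, V> implements Partitioner<K, V> {

      public void configure(JobConf conf) {
        // No configuration needed for plain hashing.
      }

      public int getPartition(K key, V value, int numReduceTasks) {
        // Mask the sign bit so the result is non-negative, then take the
        // remainder modulo the reducer count.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }

This is essentially what Hadoop's stock HashPartitioner already does; the point of CHUKWA-481 is to let demux pick up a partitioner like this (via jobConf.setPartitionerClass(...)) instead of, as the thread suggests, letting a single data type become one reducer's long tail.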