Thanks a lot Min, this is indeed very useful. -- Greg
-----Original Message-----
From: Felix GV [mailto:fe...@mate1inc.com]
Sent: Wednesday, July 4, 2012 18:19
To: kafka-users@incubator.apache.org
Subject: Re: Hadoop Consumer

Thanks for the info, that's interesting :) ...

And thanks for the link Min :) Having a hadoop consumer that manages the offsets with ZK is cool :) ...

--
Felix

On Wed, Jul 4, 2012 at 9:04 AM, Sybrandy, Casey <casey.sybra...@six3systems.com> wrote:

> We're using CDH3 update 2 or 3. I don't know how much the version matters, so it may work on plain-old Hadoop.
>
> _____________________
> From: Murtaza Doctor [murt...@richrelevance.com]
> Sent: Tuesday, July 03, 2012 1:56 PM
> To: kafka-users@incubator.apache.org
> Subject: Re: Hadoop Consumer
>
> +1 This surely sounds interesting.
>
> On 7/3/12 10:05 AM, "Felix GV" <fe...@mate1inc.com> wrote:
>
> >Hmm that's surprising. I didn't know about that...!
> >
> >I wonder if it's a new feature... Judging from your email, I assume you're using CDH? What version?
> >
> >Interesting :) ...
> >
> >--
> >Felix
> >
> >On Tue, Jul 3, 2012 at 12:34 PM, Sybrandy, Casey <casey.sybra...@six3systems.com> wrote:
> >
> >> >> - Is there a version of consumer which appends to an existing file on HDFS until it reaches a specific size?
> >> >
> >> >No there isn't, as far as I know. Potential solutions to this would be:
> >> >
> >> >  1. Leave the data in the broker long enough for it to reach the size you want. Running the SimpleKafkaETLJob at those intervals would give you the file size you want. This is the simplest thing to do, but the drawback is that your data in HDFS will be less real-time.
> >> >  2. Run the SimpleKafkaETLJob as frequently as you want, and then roll up / compact your small files into one bigger file. You would need to come up with the hadoop job that does the roll up, or find one somewhere.
> >> >  3. Don't use the SimpleKafkaETLJob at all and write a new job that makes use of hadoop append instead...
> >> >
> >> >Also, you may be interested to take a look at these scripts <http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/> I posted a while ago. If you follow the links in this post, you can get more details about how the scripts work and why it was necessary to do the things it does... or you can just use them without reading. They should work pretty much out of the box...
> >>
> >> Where I work, we discovered that you can keep a file in HDFS open and still run MapReduce jobs against the data in that file. What you do is you flush the data periodically (every record for us), but you don't close the file right away. This allows us to have data files that contain 24 hours' worth of data, but not have to close the file to run the jobs or to schedule the jobs for after the file is closed. You can also check the file size periodically and rotate the files based on size. We use Avro files, but sequence files should work too according to Cloudera.
> >>
> >> It's a great compromise for when you want the latest and greatest data, but don't want to have to wait until all of the files are closed to get it.
> >>
> >> Casey
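
Two of the approaches discussed above are easy to sketch. First, Felix's option 2 (roll up / compact the small files the SimpleKafkaETLJob produces into one bigger file). This is not code from the thread; the input and output paths are hypothetical, it assumes a reasonably recent Hadoop client API, and plain byte-level concatenation like this only suits formats that can be concatenated safely (e.g. newline-delimited text), not Avro or sequence files.

// Rough sketch: concatenate all small files in one HDFS directory into a single file.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CompactSmallFiles {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path inputDir = new Path("/kafka-etl/output");             // hypothetical
        Path compacted = new Path("/kafka-etl/compacted/part-0");  // hypothetical

        try (FSDataOutputStream out = fs.create(compacted)) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isFile()) {
                    // Append each small file's bytes to the single output file.
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        IOUtils.copyBytes(in, out, 4096, false);
                    }
                }
            }
        }

        // Only remove the small files once the compacted file is safely written.
        for (FileStatus status : fs.listStatus(inputDir)) {
            if (status.isFile()) {
                fs.delete(status.getPath(), false);
            }
        }
    }
}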
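
Second, the "flush but don't close" pattern Casey describes. The sketch below is again only an illustration: the class name, output path, and roll size are made up, it writes plain text lines rather than the Avro files Casey's team uses, and the hflush() call assumes a Hadoop release where FSDataOutputStream exposes it (on CDH3-era 0.20-append builds the equivalent call was sync()).

// Minimal sketch: write records to an open HDFS file, flush after each record so
// readers (e.g. a MapReduce job) can see the data before the file is closed, and
// rotate to a new file once a size threshold is reached.
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RollingHdfsWriter {
    private static final long ROLL_SIZE_BYTES = 128L * 1024 * 1024; // arbitrary ~128 MB threshold

    private final FileSystem fs;
    private FSDataOutputStream out;
    private long bytesWritten;
    private int fileIndex;

    public RollingHdfsWriter(Configuration conf) throws IOException {
        this.fs = FileSystem.get(conf);
        roll(); // open the first file
    }

    public synchronized void append(String record) throws IOException {
        byte[] bytes = (record + "\n").getBytes(StandardCharsets.UTF_8);
        out.write(bytes);
        // hflush() pushes the buffered data to the datanodes so concurrent readers
        // can see it without the file being closed.
        out.hflush();
        bytesWritten += bytes.length;
        if (bytesWritten >= ROLL_SIZE_BYTES) {
            roll(); // close the current file and start a new one
        }
    }

    private void roll() throws IOException {
        if (out != null) {
            out.close();
        }
        Path path = new Path("/data/events/part-" + (fileIndex++)); // hypothetical layout
        out = fs.create(path);
        bytesWritten = 0;
    }

    public synchronized void close() throws IOException {
        out.close();
    }
}

Tracking bytesWritten locally, rather than asking HDFS for the length of a file that is still open, keeps the rotation decision simple and independent of how the NameNode reports in-progress blocks.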