Thanks for the info, that's interesting :) ... And thanks for the link, Min :) Having a Hadoop consumer that manages the offsets with ZK is cool :) ...
--
Felix

On Wed, Jul 4, 2012 at 9:04 AM, Sybrandy, Casey <casey.sybra...@six3systems.com> wrote:

> We're using CDH3 update 2 or 3. I don't know how much the version
> matters, so it may work on plain-old Hadoop.
> ________________________________________
> From: Murtaza Doctor [murt...@richrelevance.com]
> Sent: Tuesday, July 03, 2012 1:56 PM
> To: kafka-users@incubator.apache.org
> Subject: Re: Hadoop Consumer
>
> +1 This surely sounds interesting.
>
> On 7/3/12 10:05 AM, "Felix GV" <fe...@mate1inc.com> wrote:
>
> > Hmm, that's surprising. I didn't know about that...!
> >
> > I wonder if it's a new feature... Judging from your email, I assume
> > you're using CDH? What version?
> >
> > Interesting :) ...
> >
> > --
> > Felix
> >
> > On Tue, Jul 3, 2012 at 12:34 PM, Sybrandy, Casey <casey.sybra...@six3systems.com> wrote:
> >
> >> >> - Is there a version of consumer which appends to an existing file
> >> >> on HDFS until it reaches a specific size?
> >> >
> >> > No, there isn't, as far as I know. Potential solutions to this would be:
> >> >
> >> >   1. Leave the data in the broker long enough for it to reach the size
> >> >      you want. Running the SimpleKafkaETLJob at those intervals would
> >> >      give you the file size you want. This is the simplest thing to do,
> >> >      but the drawback is that your data in HDFS will be less real-time.
> >> >   2. Run the SimpleKafkaETLJob as frequently as you want, and then
> >> >      roll up / compact your small files into one bigger file. You
> >> >      would need to come up with the hadoop job that does the roll up,
> >> >      or find one somewhere.
> >> >   3. Don't use the SimpleKafkaETLJob at all and write a new job that
> >> >      makes use of hadoop append instead...
> >> >
> >> > Also, you may be interested to take a look at these scripts
> >> > <http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/>
> >> > I posted a while ago. If you follow the links in this post, you can
> >> > get more details about how the scripts work and why it was necessary
> >> > to do the things they do... or you can just use them without reading.
> >> > They should work pretty much out of the box...
> >>
> >> Where I work, we discovered that you can keep a file in HDFS open and
> >> still run MapReduce jobs against the data in that file. What you do is
> >> you flush the data periodically (every record for us), but you don't
> >> close the file right away. This allows us to have data files that
> >> contain 24 hours' worth of data, but not have to close the file to run
> >> the jobs or to schedule the jobs for after the file is closed. You can
> >> also check the file size periodically and rotate the files based on
> >> size. We use Avro files, but sequence files should work too, according
> >> to Cloudera.
> >>
> >> It's a great compromise for when you want the latest and greatest data,
> >> but don't want to have to wait until all of the files are closed to
> >> get it.
> >>
> >> Casey
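
A rough sketch of the keep-the-file-open pattern Casey describes, for anyone who wants to try it: it uses Hadoop's FSDataOutputStream together with Avro's DataFileWriter. The record schema, output path, and 1 GB rotation threshold are made-up placeholders, and older Hadoop releases (e.g. the 0.20-based CDH3) expose sync() rather than hflush(), so treat this as an illustration of the idea rather than a drop-in consumer.

    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /**
     * Keeps one Avro file in HDFS open, flushes after every record so that
     * readers (e.g. MapReduce jobs) can see the data, and rotates to a new
     * file once the current one grows past a size threshold.
     */
    public class RollingAvroHdfsWriter {

        private static final long MAX_BYTES = 1L << 30; // rotate at ~1 GB (arbitrary)

        private final FileSystem fs;
        private final Schema schema;
        private FSDataOutputStream out;
        private DataFileWriter<GenericRecord> writer;
        private int fileIndex = 0;

        public RollingAvroHdfsWriter(Configuration conf, Schema schema) throws IOException {
            this.fs = FileSystem.get(conf);
            this.schema = schema;
            openNextFile();
        }

        private void openNextFile() throws IOException {
            // Hypothetical path layout; the file stays open while jobs read it.
            Path path = new Path("/data/events/part-" + (fileIndex++) + ".avro");
            out = fs.create(path);
            writer = new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
            writer.create(schema, out);
        }

        public void write(GenericRecord record) throws IOException {
            writer.append(record);
            writer.flush(); // push Avro's block buffer into the HDFS stream
            out.hflush();   // make the bytes visible to readers (sync() on older Hadoop)

            // Rotate based on size instead of waiting for an end-of-day close.
            if (out.getPos() >= MAX_BYTES) {
                writer.close(); // also closes the underlying HDFS stream
                openNextFile();
            }
        }

        public void close() throws IOException {
            writer.close();
        }
    }

Flushing after every record (as in the thread) gives the freshest data at some throughput cost; flushing every N records or every few seconds is the usual compromise.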