+1 This surely sounds interesting.

On 7/3/12 10:05 AM, "Felix GV" <fe...@mate1inc.com> wrote:
>Hmm that's surprising. I didn't know about that...!
>
>I wonder if it's a new feature... Judging from your email, I assume you're
>using CDH? What version?
>
>Interesting :) ...
>
>--
>Felix
>
>
>On Tue, Jul 3, 2012 at 12:34 PM, Sybrandy, Casey <
>casey.sybra...@six3systems.com> wrote:
>
>> >> - Is there a version of consumer which appends to an existing file on
>> >> HDFS until it reaches a specific size?
>> >
>> >No there isn't, as far as I know. Potential solutions to this would be:
>> >
>> > 1. Leave the data in the broker long enough for it to reach the size
>> > you want. Running the SimpleKafkaETLJob at those intervals would give
>> > you the file size you want. This is the simplest thing to do, but the
>> > drawback is that your data in HDFS will be less real-time.
>> > 2. Run the SimpleKafkaETLJob as frequently as you want, and then roll
>> > up / compact your small files into one bigger file. You would need to
>> > come up with the Hadoop job that does the roll-up, or find one somewhere.
>> > 3. Don't use the SimpleKafkaETLJob at all and write a new job that
>> > makes use of Hadoop append instead...
>> >
>> >Also, you may be interested to take a look at these scripts
>> ><http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/>
>> >I posted a while ago. If you follow the links in this post, you can get
>> >more details about how the scripts work and why they do what they do...
>> >or you can just use them without reading. They should work pretty much
>> >out of the box...
>>
>> Where I work, we discovered that you can keep a file in HDFS open and
>> still run MapReduce jobs against the data in that file. What you do is
>> flush the data periodically (every record, in our case), but you don't
>> close the file right away. This lets us have data files that contain
>> 24 hours' worth of data without having to close the file to run the jobs,
>> or to schedule the jobs for after the file is closed. You can also check
>> the file size periodically and rotate the files based on size. We use Avro
>> files, but sequence files should work too, according to Cloudera.
>>
>> It's a great compromise for when you want the latest and greatest data,
>> but don't want to have to wait until all of the files are closed to get it.
>>
>> Casey
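
For reference, here is a minimal sketch of the write-side pattern Casey describes: keep an HDFS file open, flush after every record so readers (e.g. MapReduce jobs) can see the data before the file is closed, and rotate to a new file once it reaches a target size. It assumes the standard Hadoop FileSystem API; the class name, file-naming scheme, and 128 MB roll size are illustrative, not taken from any project mentioned in this thread.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical writer: one open file at a time, flushed per record,
// rotated by size rather than closed on a schedule.
public class RollingHdfsWriter {
    private static final long ROLL_SIZE_BYTES = 128L * 1024 * 1024; // assumed target size

    private final FileSystem fs;
    private final Path dir;
    private FSDataOutputStream out;
    private Path currentFile;

    public RollingHdfsWriter(Configuration conf, Path dir) throws IOException {
        this.fs = FileSystem.get(conf);
        this.dir = dir;
        openNewFile();
    }

    private void openNewFile() throws IOException {
        currentFile = new Path(dir, "data-" + System.currentTimeMillis() + ".log");
        out = fs.create(currentFile, false);
    }

    public synchronized void write(byte[] record) throws IOException {
        out.write(record);
        // hflush() pushes the buffered data out to the datanodes so that readers
        // opening the still-unclosed file can see it (older releases used sync()).
        out.hflush();

        // Size-based rotation: getPos() reflects the bytes written to this stream.
        if (out.getPos() >= ROLL_SIZE_BYTES) {
            out.close();
            openNewFile();
        }
    }

    public synchronized void close() throws IOException {
        out.close();
    }
}
```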