Hmm, that's surprising. I didn't know about that! I wonder if it's a new feature... Judging from your email, I assume you're using CDH? What version?
Interesting :) ...

--
Felix

On Tue, Jul 3, 2012 at 12:34 PM, Sybrandy, Casey <
casey.sybra...@six3systems.com> wrote:

> >> - Is there a version of consumer which appends to an existing file on
> >> HDFS until it reaches a specific size?
> >
> > No there isn't, as far as I know. Potential solutions to this would be:
> >
> >  1. Leave the data in the broker long enough for it to reach the size
> >     you want. Running the SimpleKafkaETLJob at those intervals would
> >     give you the file size you want. This is the simplest thing to do,
> >     but the drawback is that your data in HDFS will be less real-time.
> >  2. Run the SimpleKafkaETLJob as frequently as you want, and then roll
> >     up / compact your small files into one bigger file. You would need
> >     to come up with the hadoop job that does the roll up, or find one
> >     somewhere.
> >  3. Don't use the SimpleKafkaETLJob at all and write a new job that
> >     makes use of hadoop append instead...
> >
> > Also, you may be interested to take a look at these scripts
> > <http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/>
> > I posted a while ago. If you follow the links in this post, you can get
> > more details about how the scripts work and why it was necessary to do
> > the things they do... or you can just use them without reading. They
> > should work pretty much out of the box...
>
> Where I work, we discovered that you can keep a file in HDFS open and
> still run MapReduce jobs against the data in that file. What you do is
> you flush the data periodically (every record for us), but you don't
> close the file right away. This allows us to have data files that contain
> 24 hours' worth of data, but not have to close the file to run the jobs
> or to schedule the jobs for after the file is closed. You can also check
> the file size periodically and rotate the files based on size. We use
> Avro files, but sequence files should work too according to Cloudera.
>
> It's a great compromise for when you want the latest and greatest data,
> but don't want to have to wait until all of the files are closed to get
> it.
>
> Casey
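For anyone curious, here is a rough, untested Java sketch of the approach Casey describes: keep an HDFS file open, flush after every record so the bytes are visible to readers, and rotate to a new file once it passes a size threshold. The class name, file naming scheme, and the 64 MB threshold are just placeholders I made up; also note the flush call is hflush() on newer Hadoop releases, while the older 0.20.x append branch exposed it as sync(). Casey's setup writes Avro, which would layer an Avro DataFileWriter (and its own flush()) on top of the stream rather than writing raw bytes as below.

    // Hypothetical sketch: keep an HDFS file open, flush per record,
    // rotate by size. Names and threshold are placeholders.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RollingHdfsWriter {

        private static final long ROLL_SIZE_BYTES = 64L * 1024 * 1024; // rotate at ~64 MB

        private final FileSystem fs;
        private final Path dir;
        private FSDataOutputStream out;
        private int fileIndex = 0;

        public RollingHdfsWriter(Configuration conf, Path dir) throws IOException {
            this.fs = FileSystem.get(conf);
            this.dir = dir;
            openNextFile();
        }

        private void openNextFile() throws IOException {
            Path path = new Path(dir, "events-" + (fileIndex++) + ".dat");
            out = fs.create(path);
        }

        public void write(byte[] record) throws IOException {
            out.write(record);
            out.hflush();                        // make flushed bytes visible to readers
                                                 // without closing the file (sync() on old Hadoop)
            if (out.getPos() >= ROLL_SIZE_BYTES) {
                out.close();                     // rotate based on size
                openNextFile();
            }
        }

        public void close() throws IOException {
            out.close();
        }
    }

With something like this in place, a MapReduce job pointed at the directory can read everything flushed so far without waiting for the writer to close the current file, which is the compromise Casey is describing.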