>> - Is there a version of the consumer which appends to an existing file on HDFS
>> until it reaches a specific size?
>>
>
>No, there isn't, as far as I know. Potential solutions to this would be:
>
>   1. Leave the data in the broker long enough for it to reach the size you
>   want. Running the SimpleKafkaETLJob at those intervals would give you the
>   file size you want. This is the simplest thing to do, but the drawback is
>   that your data in HDFS will be less real-time.
>   2. Run the SimpleKafkaETLJob as frequently as you want, and then roll up
>   / compact your small files into one bigger file. You would need to come up
>   with the hadoop job that does the roll up, or find one somewhere (see the
>   sketch below).
>   3. Don't use the SimpleKafkaETLJob at all and write a new job that makes
>   use of hadoop append instead...
>
>Also, you may be interested in taking a look at these scripts
><http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/> I
>posted a while ago. If you follow the links in that post, you can get more
>details about how the scripts work and why they do the things they do... or
>you can just use them without reading. They should work pretty much out of
>the box...
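
On option 2: Hadoop 1.x/2.x already ships a helper, FileUtil.copyMerge, that
concatenates all the files in a directory into one bigger file, so for simple
formats you may not need a custom roll-up job at all. A minimal sketch, with
made-up paths and assuming your data is plain, concatenable text:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Hypothetical roll-up for option 2: merge the small files produced by
// repeated SimpleKafkaETLJob runs into one larger HDFS file.
public class KafkaEtlRollup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Example paths only; point these at your real ETL output.
        Path smallFilesDir = new Path("/kafka/etl/output/2012-05-14");
        Path mergedFile = new Path("/kafka/etl/rolled-up/2012-05-14.dat");

        // Concatenates every file under smallFilesDir into mergedFile and
        // deletes the originals once the merge succeeds.
        FileUtil.copyMerge(fs, smallFilesDir, fs, mergedFile,
                true /* deleteSource */, conf, null /* addString */);
    }
}

Since copyMerge just concatenates bytes, it only works for formats that can be
safely concatenated (e.g. newline-delimited text); for SequenceFile or Avro
output you would still need a small format-aware job to do the roll-up.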

Where I work, we discovered that you can keep a file in HDFS open and still run 
MapReduce jobs against the data in that file.  You flush the data periodically 
(after every record, in our case), but you don't close the file right away.  
This lets us have data files that contain 24 hours' worth of data without 
having to close the file before running the jobs, or having to schedule the 
jobs for after the file is closed.  You can also check the file size 
periodically and rotate the files based on size.  We use Avro files, but 
according to Cloudera, sequence files should work too.
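
Here is a minimal sketch of that pattern, assuming a Hadoop version whose
FSDataOutputStream supports hflush() (older releases called it sync()).  We
actually write Avro, but the sketch uses a raw output stream and made-up names
to keep it short:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Keeps one HDFS file open, flushes after every record so jobs can read the
// data before the file is closed, and rotates to a new file past a size limit.
public class OpenFileWriter {
    private static final long ROTATE_BYTES = 512L * 1024L * 1024L; // ~512 MB, pick your own

    private final FileSystem fs;
    private final Path dir;
    private FSDataOutputStream out;

    public OpenFileWriter(Configuration conf, Path dir) throws IOException {
        this.fs = FileSystem.get(conf);
        this.dir = dir;
        this.out = fs.create(nextFile());
    }

    public void append(byte[] record) throws IOException {
        out.write(record);
        out.write('\n');
        // Push the bytes to the datanodes without closing the file; this is
        // what makes the data visible to MapReduce while the file stays open.
        out.hflush();
        if (out.getPos() >= ROTATE_BYTES) {
            rotate();
        }
    }

    private void rotate() throws IOException {
        out.close();
        out = fs.create(nextFile());
    }

    private Path nextFile() {
        return new Path(dir, "events-" + System.currentTimeMillis() + ".dat");
    }

    public void close() throws IOException {
        out.close();
    }
}

Flushing every record keeps the data as fresh as possible; if that turns out to
be too slow, you can flush every N records instead and trade a little freshness
for throughput.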

It's a great compromise for when you want the latest and greatest data, but 
don't want to have to wait until all of the files are closed to get it.

Casey
