+1 This surely sounds interesting.

On 7/3/12 10:05 AM, "Felix GV" <fe...@mate1inc.com> wrote:
>Hmm that's surprising. I didn't know about that...!
>
>I wonder if it's a new feature... Judging from your email, I assume you're
>using CDH? What version?
>
>Interesting :) ...
>
>--
>Felix
>
>
>On Tue, Jul 3, 2012 at 12:34 PM, Sybrandy, Casey <
>casey.sybra...@six3systems.com> wrote:
>
>> >> - Is there a version of consumer which appends to an existing file on
>> >> HDFS until it reaches a specific size?
>> >
>> >No there isn't, as far as I know. Potential solutions to this would be:
>> >
>> > 1. Leave the data in the broker long enough for it to reach the size
>> > you want. Running the SimpleKafkaETLJob at those intervals would give
>> > you the file size you want. This is the simplest thing to do, but the
>> > drawback is that your data in HDFS will be less real-time.
>> > 2. Run the SimpleKafkaETLJob as frequently as you want, and then roll
>> > up / compact your small files into one bigger file. You would need to
>> > come up with the Hadoop job that does the roll-up, or find one somewhere.
>> > 3. Don't use the SimpleKafkaETLJob at all and write a new job that
>> > makes use of Hadoop append instead...
>> >
>> >Also, you may be interested to take a look at these scripts
>> ><http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/>
>> >I posted a while ago. If you follow the links in this post, you can get
>> >more details about how the scripts work and why they do what they do...
>> >or you can just use them without reading. They should work pretty much
>> >out of the box...
>>
>> Where I work, we discovered that you can keep a file in HDFS open and
>> still run MapReduce jobs against the data in that file. What you do is
>> flush the data periodically (every record, in our case), but you don't
>> close the file right away. This lets us have data files that contain
>> 24 hours' worth of data without having to close the file to run the jobs,
>> or to schedule the jobs for after the file is closed. You can also check
>> the file size periodically and rotate the files based on size. We use Avro
>> files, but sequence files should work too, according to Cloudera.
>>
>> It's a great compromise for when you want the latest and greatest data,
>> but don't want to have to wait until all of the files are closed to get it.
>>
>> Casey
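
For reference, here is a minimal sketch of the write-side pattern Casey describes: keep an HDFS file open, flush after every record so readers (e.g. MapReduce jobs) can see the data before the file is closed, and rotate to a new file once it reaches a target size. It assumes the standard Hadoop FileSystem API; the class name, file-naming scheme, and 128 MB roll size are illustrative, not taken from any project mentioned in this thread.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical writer: one open file at a time, flushed per record,
// rotated by size rather than closed on a schedule.
public class RollingHdfsWriter {
    private static final long ROLL_SIZE_BYTES = 128L * 1024 * 1024; // assumed target size

    private final FileSystem fs;
    private final Path dir;
    private FSDataOutputStream out;
    private Path currentFile;

    public RollingHdfsWriter(Configuration conf, Path dir) throws IOException {
        this.fs = FileSystem.get(conf);
        this.dir = dir;
        openNewFile();
    }

    private void openNewFile() throws IOException {
        currentFile = new Path(dir, "data-" + System.currentTimeMillis() + ".log");
        out = fs.create(currentFile, false);
    }

    public synchronized void write(byte[] record) throws IOException {
        out.write(record);
        // hflush() pushes the buffered data out to the datanodes so that readers
        // opening the still-unclosed file can see it (older releases used sync()).
        out.hflush();

        // Size-based rotation: getPos() reflects the bytes written to this stream.
        if (out.getPos() >= ROLL_SIZE_BYTES) {
            out.close();
            openNewFile();
        }
    }

    public synchronized void close() throws IOException {
        out.close();
    }
}
```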