I've created another Hadoop consumer that uses ZooKeeper.

https://github.com/miniway/kafka-hadoop-consumer

With a custom Hadoop OutputFormat, I could add new files to the existing
target directory.
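
Roughly, the trick is just to relax Hadoop's check that the output
directory must not already exist. A simplified sketch (illustration only,
not the exact code from the repo above; the class name is made up):

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    // A TextOutputFormat that does not complain when the output directory
    // already exists, so each run can add new part files to the same
    // HDFS target directory.
    public class ExistingDirTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
      @Override
      public void checkOutputSpecs(JobContext job) throws IOException {
        // Skip super.checkOutputSpecs(), which throws FileAlreadyExistsException
        // when the target directory exists.  Part file names still need to be
        // unique per run (e.g. include a timestamp) to avoid overwriting
        // earlier output.
      }
    }
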
Hope this helps.

Thanks
Min

2012/7/4 Murtaza Doctor <murt...@richrelevance.com>:
> +1 This surely sounds interesting.
>
> On 7/3/12 10:05 AM, "Felix GV" <fe...@mate1inc.com> wrote:
>
>>Hmm that's surprising. I didn't know about that...!
>>
>>I wonder if it's a new feature... Judging from your email, I assume you're
>>using CDH? What version?
>>
>>Interesting :) ...
>>
>>--
>>Felix
>>
>>
>>
>>On Tue, Jul 3, 2012 at 12:34 PM, Sybrandy, Casey <
>>casey.sybra...@six3systems.com> wrote:
>>
>>> >> - Is there a version of the consumer which appends to an existing
>>> >> file on HDFS until it reaches a specific size?
>>> >>
>>> >
>>> >No there isn't, as far as I know. Potential solutions to this would be:
>>> >
>>> >   1. Leave the data in the broker long enough for it to reach the size
>>> >   you want. Running the SimpleKafkaETLJob at those intervals would give
>>> >   you the file size you want. This is the simplest thing to do, but the
>>> >   drawback is that your data in HDFS will be less real-time.
>>> >   2. Run the SimpleKafkaETLJob as frequently as you want, and then roll
>>> >   up / compact your small files into one bigger file. You would need to
>>> >   come up with the Hadoop job that does the roll-up, or find one
>>> >   somewhere.
>>> >   3. Don't use the SimpleKafkaETLJob at all and write a new job that
>>> >   makes use of Hadoop append instead...
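>>> >
>>> >   For illustration only, a rough (untested) sketch of what option 3
>>> >   could look like with the HDFS append API, assuming the cluster has
>>> >   append support enabled (the path and variable names are made up):
>>> >
>>> >     // classes from org.apache.hadoop.fs and org.apache.hadoop.conf
>>> >     FileSystem fs = FileSystem.get(new Configuration());
>>> >     Path target = new Path("/kafka/topic1/current.dat");  // hypothetical path
>>> >     FSDataOutputStream out =
>>> >         fs.exists(target) ? fs.append(target) : fs.create(target);
>>> >     out.write(recordBytes);  // the message payload pulled from the broker
>>> >     out.close();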
>>> >
>>> >Also, you may be interested in taking a look at these scripts
>>> ><http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/>
>>> >that I posted a while ago. If you follow the links in this post, you can
>>> >get more details about how the scripts work and why it was necessary to
>>> >do the things they do... or you can just use them without reading. They
>>> >should work pretty much out of the box...
>>>
>>> Where I work, we discovered that you can keep a file in HDFS open and
>>> still run MapReduce jobs against the data in that file.  What you do is
>>> you flush the data periodically (every record, in our case), but you
>>> don't close the file right away.  This allows us to have data files that
>>> contain 24 hours' worth of data without having to close the file to run
>>> the jobs, or schedule the jobs for after the file is closed.  You can
>>> also check the file size periodically and rotate the files based on
>>> size.  We use Avro files, but sequence files should work too, according
>>> to Cloudera.
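>>>
>>> Roughly, the write loop looks like this (a simplified sketch, not the
>>> actual code; we use Avro writers, and it assumes an HDFS version that
>>> supports hflush()/sync()):
>>>
>>>     // classes from org.apache.hadoop.fs and org.apache.hadoop.conf
>>>     FileSystem fs = FileSystem.get(new Configuration());
>>>     FSDataOutputStream out = fs.create(new Path("/data/current/part-0"));  // made-up path
>>>     while (running) {
>>>       byte[] record = nextRecord();     // wherever the records come from
>>>       out.write(record);
>>>       out.hflush();                     // push bytes to the datanodes so readers/MR jobs see them
>>>       if (out.getPos() >= MAX_BYTES) {  // rotate based on size
>>>         break;
>>>       }
>>>     }
>>>     out.close();                        // closed only when rotating, not after every record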
>>>
>>> It's a great compromise for when you want the latest and greatest data,
>>> but don't want to have to wait until all of the files are closed to get
>>> it.
>>>
>>> Casey
>
