Hello Min,

Is the ConsumerConfig class missing from your GitHub project's source code? I was trying to download and play with the source code.
Thanks,
murtaza

On 7/3/12 6:29 PM, "Min" <mini...@gmail.com> wrote:

>I've created another hadoop consumer which uses zookeeper.
>
>https://github.com/miniway/kafka-hadoop-consumer
>
>With a hadoop OutputFormatter, I could add new files to the existing
>target directory.
>Hope this helps.
>
>Thanks
>Min
>
>2012/7/4 Murtaza Doctor <murt...@richrelevance.com>:
>> +1 This surely sounds interesting.
>>
>> On 7/3/12 10:05 AM, "Felix GV" <fe...@mate1inc.com> wrote:
>>
>>>Hmm, that's surprising. I didn't know about that!
>>>
>>>I wonder if it's a new feature. Judging from your email, I assume
>>>you're using CDH? What version?
>>>
>>>Interesting :) ...
>>>
>>>--
>>>Felix
>>>
>>>On Tue, Jul 3, 2012 at 12:34 PM, Sybrandy, Casey <
>>>casey.sybra...@six3systems.com> wrote:
>>>
>>>> >> - Is there a version of the consumer which appends to an existing
>>>> >> file on HDFS until it reaches a specific size?
>>>> >
>>>> >No, there isn't, as far as I know. Potential solutions would be:
>>>> >
>>>> > 1. Leave the data in the broker long enough for it to reach the size
>>>> > you want. Running the SimpleKafkaETLJob at those intervals would
>>>> > give you the file size you want. This is the simplest thing to do,
>>>> > but the drawback is that your data in HDFS will be less real-time.
>>>> > 2. Run the SimpleKafkaETLJob as frequently as you want, and then
>>>> > roll up / compact your small files into one bigger file. You would
>>>> > need to come up with the hadoop job that does the roll-up, or find
>>>> > one somewhere.
>>>> > 3. Don't use the SimpleKafkaETLJob at all and write a new job that
>>>> > makes use of hadoop append instead.
>>>> >
>>>> >Also, you may be interested to take a look at these scripts
>>>> ><http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/>
>>>> >that I posted a while ago. If you follow the links in this post, you
>>>> >can get more details about how the scripts work and why it was
>>>> >necessary to do the things they do... or you can just use them
>>>> >without reading. They should work pretty much out of the box.
>>>>
>>>> Where I work, we discovered that you can keep a file in HDFS open and
>>>> still run MapReduce jobs against the data in that file. What you do is
>>>> flush the data periodically (every record, in our case) but not close
>>>> the file right away. This lets us have data files that contain 24
>>>> hours' worth of data without having to close the file to run the jobs,
>>>> or schedule the jobs for after the file is closed. You can also check
>>>> the file size periodically and rotate the files based on size. We use
>>>> Avro files, but sequence files should work too, according to Cloudera.
>>>>
>>>> It's a great compromise for when you want the latest and greatest data
>>>> but don't want to wait until all of the files are closed to get it.
>>>>
>>>> Casey
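
For reference, here is a minimal sketch of the "flush but don't close" pattern Casey describes, assuming the plain Hadoop FileSystem API. The class name, file naming scheme, and 512 MB threshold below are made up for illustration, and hflush() is the Hadoop 0.21+/2.x name for what older releases exposed as sync(), so adjust for your version:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /** Hypothetical writer: keeps an HDFS file open, flushes every record,
     *  and rotates to a new file once a size threshold is crossed. */
    public class RollingHdfsWriter {

        private static final long MAX_FILE_BYTES = 512L * 1024 * 1024; // arbitrary rotation size

        private final FileSystem fs;
        private final Path dir;
        private FSDataOutputStream out;
        private int fileIndex = 0;

        public RollingHdfsWriter(Configuration conf, Path dir) throws IOException {
            this.fs = FileSystem.get(conf);
            this.dir = dir;
            this.out = openNextFile();
        }

        /** Write one record and flush it so readers (e.g. a MapReduce job) can see it. */
        public synchronized void append(byte[] record) throws IOException {
            out.write(record);
            out.write('\n');
            out.hflush();                        // data becomes visible without closing the file
            if (out.getPos() >= MAX_FILE_BYTES) {
                rotate();                        // size-based rotation, as Casey mentions
            }
        }

        private void rotate() throws IOException {
            out.close();
            out = openNextFile();
        }

        private FSDataOutputStream openNextFile() throws IOException {
            Path file = new Path(dir, String.format("events-%05d.log", fileIndex++));
            return fs.create(file);
        }

        public synchronized void close() throws IOException {
            out.close();
        }
    }

Because the stream stays open, jobs reading the file see everything written up to the last hflush(); closing only happens when getPos() crosses the threshold, which gives the size-based rotation without sacrificing freshness.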
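
And for option 2 in Felix's list (rolling the many small files produced by frequent SimpleKafkaETLJob runs into one bigger file), a simple non-MapReduce approach is Hadoop's FileUtil.copyMerge, which just concatenates every file in a directory. A rough sketch with made-up paths; note this only suits plain record-per-line output, since Avro or sequence files would need a proper merge job:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class CompactSmallFiles {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path smallFilesDir = new Path("/kafka/etl/2012-07-03");        // hypothetical input dir
            Path mergedFile    = new Path("/kafka/merged/2012-07-03.log"); // hypothetical output file

            // Concatenate every file under smallFilesDir into a single file,
            // deleting the small source files afterwards.
            FileUtil.copyMerge(fs, smallFilesDir, fs, mergedFile,
                               true /* deleteSource */, conf, null /* no separator string */);
        }
    }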