ConsumerConfig is in Kafka's main trunk.

Since I used the same package namespace, kafka.consumer (which, admittedly,
is not a great approach), I didn't have to import it explicitly.
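
From outside that package you would need an explicit import; constructing one
looks roughly like this (a sketch against the 0.7-era high-level consumer API;
the property names are from memory and the class name is made up):

  import java.util.Properties;
  import kafka.consumer.Consumer;
  import kafka.consumer.ConsumerConfig;
  import kafka.javaapi.consumer.ConsumerConnector;

  public class ConsumerConfigExample {
      public static void main(String[] args) {
          Properties props = new Properties();
          // ZooKeeper connection and consumer group (0.7 used these short names)
          props.put("zk.connect", "localhost:2181");
          props.put("groupid", "kafka-hadoop-consumer");

          ConsumerConfig config = new ConsumerConfig(props);
          ConsumerConnector connector = Consumer.createJavaConsumerConnector(config);
          // ... create message streams and consume, then shut down
          connector.shutdown();
      }
  }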

The kafka jar is not in the central Maven repository, so you might have to
install it into your local Maven repository:

> mvn install:install-file -Dfile=kafka-0.7.0.jar -DgroupId=kafka \
>     -DartifactId=kafka -Dversion=0.7.0 -Dpackaging=jar
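
After installing, the jar can be referenced from the project's pom.xml with the
same coordinates (a minimal sketch; adjust the version if you installed a
different one):

  <dependency>
    <groupId>kafka</groupId>
    <artifactId>kafka</artifactId>
    <version>0.7.0</version>
  </dependency>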

Thanks
Min

2012/7/13 Murtaza Doctor <murt...@richrelevance.com>:
> Hello Min,
>
> In your github project source code are you missing the ConsumerConfig
> class? I was trying to download and play with the source code.
>
> Thanks,
> murtaza
>
> On 7/3/12 6:29 PM, "Min" <mini...@gmail.com> wrote:
>
>>I've created another hadoop consumer which uses zookeeper.
>>
>>https://github.com/miniway/kafka-hadoop-consumer
>>
>>With a custom hadoop OutputFormat, I could add new files to the existing
>>target directory.
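>>
>>The rough idea is an OutputFormat along these lines (a sketch only, not the
>>exact code in the repo; it is written against the org.apache.hadoop.mapreduce
>>API and the class name is made up):
>>
>>  import java.io.IOException;
>>  import org.apache.hadoop.fs.Path;
>>  import org.apache.hadoop.mapreduce.JobContext;
>>  import org.apache.hadoop.mapreduce.TaskAttemptContext;
>>  import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
>>  import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
>>
>>  public class AppendingTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
>>
>>      @Override
>>      public void checkOutputSpecs(JobContext context) throws IOException {
>>          // Skip super.checkOutputSpecs(), which throws if the output
>>          // directory already exists, so new files can be added to it.
>>          if (getOutputPath(context) == null) {
>>              throw new IOException("Output directory not set.");
>>          }
>>      }
>>
>>      @Override
>>      public Path getDefaultWorkFile(TaskAttemptContext context, String extension)
>>              throws IOException {
>>          // Give every run a unique file name so it never collides with
>>          // part files already sitting in the target directory.
>>          FileOutputCommitter committer =
>>              (FileOutputCommitter) getOutputCommitter(context);
>>          String name = "part-" + System.currentTimeMillis()
>>              + "-" + context.getTaskAttemptID().getTaskID().getId();
>>          return new Path(committer.getWorkPath(), name + extension);
>>      }
>>  }
>>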
>>Hope this would help.
>>
>>Thanks
>>Min
>>
>>2012/7/4 Murtaza Doctor <murt...@richrelevance.com>:
>>> +1 This surely sounds interesting.
>>>
>>> On 7/3/12 10:05 AM, "Felix GV" <fe...@mate1inc.com> wrote:
>>>
>>>>Hmm that's surprising. I didn't know about that...!
>>>>
>>>>I wonder if it's a new feature... Judging from your email, I assume you're
>>>>using CDH? What version?
>>>>
>>>>Interesting :) ...
>>>>
>>>>--
>>>>Felix
>>>>
>>>>
>>>>
>>>>On Tue, Jul 3, 2012 at 12:34 PM, Sybrandy, Casey <casey.sybra...@six3systems.com> wrote:
>>>>
>>>>> >> - Is there a version of consumer which appends to an existing file
>>>>> >> on HDFS until it reaches a specific size?
>>>>> >>
>>>>> >
>>>>> >No there isn't, as far as I know. Potential solutions to this would be:
>>>>> >
>>>>> >   1. Leave the data in the broker long enough for it to reach the size
>>>>> >   you want. Running the SimpleKafkaETLJob at those intervals would give
>>>>> >   you the file size you want. This is the simplest thing to do, but the
>>>>> >   drawback is that your data in HDFS will be less real-time.
>>>>> >   2. Run the SimpleKafkaETLJob as frequently as you want, and then roll
>>>>> >   up / compact your small files into one bigger file. You would need to
>>>>> >   come up with the hadoop job that does the roll up, or find one
>>>>> >   somewhere (a rough sketch of such a job is below).
>>>>> >   3. Don't use the SimpleKafkaETLJob at all and write a new job that
>>>>> >   makes use of hadoop append instead...
>>>>> >
>>>>> >Also, you may be interested to take a look at these scripts
>>>>> ><http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/>
>>>>> >I posted a while ago. If you follow the links in this post, you can get
>>>>> >more details about how the scripts work and why it was necessary to do
>>>>> >the things it does... or you can just use them without reading. They
>>>>> >should work pretty much out of the box...
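>>>>> >
>>>>> >For option 2, the roll-up job can be as small as a pass-through mapper
>>>>> >plus a single identity reducer. A sketch only (this is not the
>>>>> >SimpleKafkaETLJob; class and path names are made up):
>>>>> >
>>>>> >  import java.io.IOException;
>>>>> >  import org.apache.hadoop.conf.Configuration;
>>>>> >  import org.apache.hadoop.fs.Path;
>>>>> >  import org.apache.hadoop.io.LongWritable;
>>>>> >  import org.apache.hadoop.io.NullWritable;
>>>>> >  import org.apache.hadoop.io.Text;
>>>>> >  import org.apache.hadoop.mapreduce.Job;
>>>>> >  import org.apache.hadoop.mapreduce.Mapper;
>>>>> >  import org.apache.hadoop.mapreduce.Reducer;
>>>>> >  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>>>>> >  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>>>>> >
>>>>> >  public class RollUpJob {
>>>>> >      // Key every line with NullWritable so TextOutputFormat writes the
>>>>> >      // value only (no byte offset, no tab separator).
>>>>> >      public static class PassThroughMapper
>>>>> >              extends Mapper<LongWritable, Text, NullWritable, Text> {
>>>>> >          @Override
>>>>> >          protected void map(LongWritable offset, Text line, Context ctx)
>>>>> >                  throws IOException, InterruptedException {
>>>>> >              ctx.write(NullWritable.get(), line);
>>>>> >          }
>>>>> >      }
>>>>> >
>>>>> >      public static void main(String[] args) throws Exception {
>>>>> >          Job job = new Job(new Configuration(), "roll-up-small-files");
>>>>> >          job.setJarByClass(RollUpJob.class);
>>>>> >          job.setMapperClass(PassThroughMapper.class);
>>>>> >          job.setReducerClass(Reducer.class);  // identity reducer
>>>>> >          job.setNumReduceTasks(1);            // one reducer -> one big output file
>>>>> >          job.setOutputKeyClass(NullWritable.class);
>>>>> >          job.setOutputValueClass(Text.class);
>>>>> >          FileInputFormat.addInputPath(job, new Path(args[0]));   // dir of small files
>>>>> >          FileOutputFormat.setOutputPath(job, new Path(args[1])); // compacted output dir
>>>>> >          System.exit(job.waitForCompletion(true) ? 0 : 1);
>>>>> >      }
>>>>> >  }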
>>>>>
>>>>> Where I work, we discovered that you can keep a file in HDFS open and
>>>>> still run MapReduce jobs against the data in that file.  What you do is
>>>>> you flush the data periodically (every record for us), but you don't
>>>>> close the file right away.  This allows us to have data files that
>>>>> contain 24 hours worth of data, but not have to close the file to run
>>>>> the jobs or to schedule the jobs for after the file is closed.  You can
>>>>> also check the file size periodically and rotate the files based on
>>>>> size.  We use Avro files, but sequence files should work too according
>>>>> to Cloudera.
>>>>>
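>>>>> In rough Java terms, the write loop looks something like this (an
>>>>> illustrative sketch only; on older Hadoop/CDH releases the flush call is
>>>>> sync() rather than hflush(), and the paths and size limit are made up):
>>>>>
>>>>>   import org.apache.hadoop.conf.Configuration;
>>>>>   import org.apache.hadoop.fs.FSDataOutputStream;
>>>>>   import org.apache.hadoop.fs.FileSystem;
>>>>>   import org.apache.hadoop.fs.Path;
>>>>>
>>>>>   public class RollingHdfsWriter {
>>>>>       private static final long MAX_BYTES = 256L * 1024 * 1024; // rotate at ~256 MB
>>>>>
>>>>>       public static void main(String[] args) throws Exception {
>>>>>           FileSystem fs = FileSystem.get(new Configuration());
>>>>>           FSDataOutputStream out = fs.create(new Path("/data/events/current"));
>>>>>           for (byte[] record : fetchRecords()) {
>>>>>               out.write(record);
>>>>>               out.hflush();  // make the bytes visible to readers without closing the file
>>>>>               if (out.getPos() >= MAX_BYTES) {  // rotate based on size
>>>>>                   out.close();
>>>>>                   out = fs.create(new Path("/data/events/" + System.currentTimeMillis()));
>>>>>               }
>>>>>           }
>>>>>           out.close();
>>>>>       }
>>>>>
>>>>>       // Stand-in for the real record source (e.g. a Kafka consumer loop).
>>>>>       private static Iterable<byte[]> fetchRecords() {
>>>>>           return java.util.Collections.emptyList();
>>>>>       }
>>>>>   }
>>>>>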
>>>>> It's a great compromise for when you want the latest and greatest data,
>>>>> but don't want to have to wait until all of the files are closed to get
>>>>> it.
>>>>>
>>>>> Casey
>>>
>
