Anyone have code that does incremental S3?

Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com

On Jan 25, 2012, at 8:36 AM, Felix GV <fe...@mate1inc.com> wrote:

> Yeah, those shell scripts are basically the continuation of what I was
> doing in my last blog posts. I planned to write new blog posts about them
> but just never got around to it. Then I saw your message and it gave me
> the little kick in the arse I needed to at least gist those things :) ...
>
> Hopefully it can save you some time :) !
>
> --
> Felix
>
> On Wed, Jan 25, 2012 at 3:30 AM, Paul Ingles <p...@forward.co.uk> wrote:
>
>> Thanks Felix, I found your blog posts before and they really helped me
>> figure out how to get things working, so I'll definitely give the shell
>> scripts a run.
>>
>> On 24 Jan 2012, at 19:05, Felix GV wrote:
>>
>>> Hello :)
>>>
>>> For question 1:
>>>
>>> The hadoop consumer in the contrib directory has almost everything it
>>> needs to do distributed incremental imports out of the box, but it
>>> requires a bit of hand-holding.
>>>
>>> I've created two scripts to automate the process. One of them generates
>>> initial offset files, and the other does incremental hadoop consumption.
>>>
>>> I personally use a cron job to periodically call the incremental
>>> consumer script with specific parameters (topic and HDFS output path).
>>>
>>> You can find all of the required files in this gist:
>>> https://gist.github.com/1671887
>>>
>>> The LinkedIn guys promised to eventually release their full Hadoop/Kafka
>>> ETL code, but I don't think they've had time to get around to it yet.
>>> When they do release it, it will probably be better than my scripts, but
>>> for now I think those scripts are the only publicly available way to do
>>> this without writing it yourself.
>>>
>>> I don't know about questions 2 and 3.
>>>
>>> I hope this helps :) !
>>>
>>> --
>>> Felix
>>>
>>> On Tue, Jan 24, 2012 at 3:24 AM, Paul Ingles <p...@forward.co.uk> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm investigating using Kafka and would really appreciate some more
>>>> experienced opinions on how these pieces work together.
>>>>
>>>> Our application instances create Protocol Buffer serialized messages
>>>> and push them to topics in Kafka:
>>>>
>>>> * Web log requests
>>>> * Product details viewed
>>>> * Search performed
>>>> * Email registered
>>>> etc...
>>>>
>>>> I would like to perform incremental loads from these topics into HDFS
>>>> and then into the rest of the batch processing. I guess I have three
>>>> broad questions:
>>>>
>>>> 1) How do people trigger the batch loads? Do you just point your
>>>> SimpleKafkaETLJob input at the previous run's outputted offset file?
>>>> Do you move files between runs of the SimpleKafkaETLJob, moving the
>>>> part-* files into one place and the offsets into an input directory
>>>> ready for the next run?
>>>>
>>>> 2) Yesterday I noticed that the hadoop-consumer's SimpleKafkaETLMapper
>>>> outputs Long/Text writables and is marked as deprecated (this is in
>>>> the 0.7 source). Is there an alternative class that should be used
>>>> instead, or is the hadoop-consumer being deprecated overall?
>>>>
>>>> 3) Given that SimpleKafkaETLMapper reads bytes in but outputs Text
>>>> lines, are most people using Kafka to pass text messages around, or
>>>> JSON data, etc.?
>>>>
>>>> Thanks,
>>>> Paul

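Pulling Felix's description and Paul's first question together: the pattern is to feed SimpleKafkaETLJob the offset files written by the previous run, then move that run's part-* data files and fresh offsets aside so the next invocation starts where the last one stopped. Below is a minimal bash sketch of such a wrapper; the jar name, job invocation, directory layout and offset file naming are assumptions made for illustration, not the actual contents of Felix's gist.

#!/bin/bash
# Minimal sketch of one incremental run of the Kafka 0.7 contrib hadoop-consumer.
# The jar name, job invocation and offset/part file naming are assumptions;
# adapt them to your own build and to the scripts in Felix's gist.
set -e

TOPIC="$1"                      # e.g. web_logs
BASE="$2"                       # HDFS base directory for this topic, e.g. /data/kafka/web_logs
RUN_ID="$(date +%Y%m%d%H%M%S)"

OFFSET_IN="$BASE/offsets-in"    # offset files written by the previous run
JOB_OUT="$BASE/run-$RUN_ID"     # this run's output: part-* data plus new offsets

# 1) Run the contrib job. The properties file is expected to point the job's
#    input at $OFFSET_IN and its output at $JOB_OUT (how you template that
#    file is up to you; this invocation is a placeholder).
hadoop jar kafka-hadoop-consumer.jar kafka.etl.impl.SimpleKafkaETLJob "$TOPIC.properties"

# 2) Promote the data files into a "loaded" area for the downstream batch jobs.
hadoop fs -mkdir "$BASE/loaded/$RUN_ID"
hadoop fs -mv "$JOB_OUT/part-*" "$BASE/loaded/$RUN_ID/"

# 3) Archive the offsets we just consumed from, then make the freshly written
#    offsets the input of the next run (the offset-* glob is an assumption).
hadoop fs -mv "$OFFSET_IN" "$BASE/offsets-consumed-$RUN_ID"
hadoop fs -mkdir "$OFFSET_IN"
hadoop fs -mv "$JOB_OUT/offset-*" "$OFFSET_IN/"

The offsets double as the job's bookmark: whatever the last run wrote becomes the next run's input, so no other state needs to be kept between runs.
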
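Felix's cron setup then comes down to one entry per topic, something like the following (script path, schedule and log locations are placeholders):

# Hypothetical crontab entries: one incremental pull per topic every 15 minutes.
*/15 * * * * /opt/kafka-etl/incremental-consume.sh web_logs /data/kafka/web_logs >> /var/log/kafka-etl/web_logs.log 2>&1
*/15 * * * * /opt/kafka-etl/incremental-consume.sh product_views /data/kafka/product_views >> /var/log/kafka-etl/product_views.log 2>&1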