I would strongly push you toward Flume 1.x! Answers inline.
--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF

On May 15, 2012, at 5:06 AM, lulynn_2008 wrote:

> Hi all,
> I'm trying to capture the results of a Twitter search left open. How
> do I do this correctly? What I have done is the following:
>
> I made a file called rp. It contains the line "track=ron paul"
>
> I called curl and the Twitter API to capture tweets:
> curl -d @rp https://stream.twitter.com/1/statuses/filter.json -umikedev10:password > twitter
>
> I set up Flume jobs to read this data and put it into Hadoop. They are
> named agent and collector, and their configurations are as follows:
>
> agent: tail("/home/flume/twitter") | agentSink("localhost",35853);
> collector: collectorSource(35853) | collectorSink("hdfs://localhost:9000/user/flume/twitter/", "twitter");
>
> On to questions/problems I've already found I'm having:
>
> 1. I see the files generated are /user/hdpadmin/twitter/twitterlog.00000025.20120419-....seq
> - they are generated every 30 seconds. A zero-byte file is generated
> if there are no new tweets. How do I generate fewer of these?

In Flume 0.9x you can use lazyOpen:

tail("foo") | batch(100) lazyOpen stubbornAppend logicalSink("hdfs://...");

URL: http://archive.cloudera.com/cdh/3/flume/UserGuide/index.html#_translations_of_high_level_sources_and_sinks

> 2. Going through Flume seems to add escape characters. The source
> file twitter has "retweet_count":0 while HDFS has loaded
> \"retweet_count\":0 instead. How do I bring things over without these
> new characters?

You can use a regex in a sink, or write your own decorators (plugins). Search on GitHub; you'll find some there.

> 3. Am I correct in understanding JSON can be in ASCII or binary format?

http://stackoverflow.com/questions/1443158/binary-data-in-json-string-something-better-than-base64

> 4. Each JSON ASCII file appears to need a starting/ending bracket, [ and
> ] - how do I get those in the files Flume is generating?
Use json instead of raw in flume-conf.xml.

> I did try changing my collector sink by appending the arguments
> 300000, json to the call, which I figured would give me 5-minute
> intervals and hopefully create JSON files that look exactly like the
> Twitter output but have [ ] surrounding the start and end of the
> file. Not sure if I should expect it to be ASCII or binary in HDFS.
> In any case, 300000 alone works, but it doesn't like me trying to
> pass the format in there, and it seems missing from the error message.
>
> com.cloudera.flume.conf.FlumeArgException: usage: collectorSink[(dfsdir, path[,rollmillis])]

Then the config is wrong or you did not use escape sequences.
http://archive.cloudera.com/cdh/3/flume/UserGuide/index.html

> Would love any help!

Use Flume 1.x ;)

- Alex
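P.S. As a stopgap, the escaping and bracket issues (questions 2 and 4) can also be fixed after the fact with a small post-processing script outside of Flume. This is only a sketch, not Flume functionality: the file paths are placeholders, and it assumes the sink escaped every double quote uniformly, so tweets that legitimately contain escaped quotes inside their text would need a real decoder instead of a blanket replace.

```python
import json

def wrap_tweets(in_path, out_path):
    """Read newline-delimited JSON tweet records, undo uniform
    backslash-escaping of quotes, and write the records back out as a
    single JSON array (i.e. with the [ ] brackets from question 4)."""
    tweets = []
    with open(in_path) as src:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines from empty roll intervals
            # Undo the extra \" escaping observed in question 2.
            # Assumption: every quote was escaped; quotes that were
            # already escaped in the original tweet would be mangled.
            line = line.replace('\\"', '"')
            # Parse each record so invalid lines fail loudly here
            # rather than producing a broken output file.
            tweets.append(json.loads(line))
    with open(out_path, "w") as dst:
        json.dump(tweets, dst)
```

The resulting file is a plain ASCII JSON array, which answers the question 3 concern as well: nothing binary is involved unless you embed base64 payloads yourself.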
