Hi all,
I'm trying to capture the results of a long-running Twitter streaming search. How
do I do this correctly? Here is what I have done so far:
I made a file called rp. It contains the line "track=ron paul"
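For reference, that is the entire file:

    $ cat rp
    track=ron paul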
I called curl against the Twitter streaming API to capture tweets:

    curl -d @rp https://stream.twitter.com/1/statuses/filter.json -umikedev10:password > twitter
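Each tweet comes back as one JSON object per line, as far as I can tell. A quick sanity check that data is flowing (just printing the start of the most recent line of the capture file):

    tail -n1 /home/flume/twitter | head -c 80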
I set up flume jobs to read this data and put it into Hadoop. They are named
agent and collector, and they are configured as follows:

    agent: tail("/home/flume/twitter") | agentSink("localhost",35853);
    collector: collectorSource(35853) | collectorSink("hdfs://localhost:9000/user/flume/twitter/", "twitter");
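In case the exact wiring matters: I loaded those specs onto the two logical nodes through the flume shell, roughly like this (paraphrasing from memory, so the quoting may be slightly off):

    exec config agent 'tail("/home/flume/twitter")' 'agentSink("localhost",35853)'
    exec config collector 'collectorSource(35853)' 'collectorSink("hdfs://localhost:9000/user/flume/twitter/","twitter")'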
On to the questions/problems I've run into so far:
1. The files generated look like /user/hdpadmin/twitter/twitterlog.00000025.20120419- .... .seq,
and a new one appears every 30 seconds; a zero-byte file is generated even
when there are no new tweets. How do I generate fewer of these? (My attempt
at this is at the bottom of this mail.)
2. Going through flume seems to add escape characters: the source file
twitter has "retweet_count":0, while what lands in HDFS has
\"retweet_count\":0 instead. How do I bring the data over without these
extra characters? (How I'm comparing the two sides is sketched right after
this list.)
3. Am I correct in understanding that JSON can be in either ASCII or binary form?
4. Each ASCII JSON file apparently needs a starting and ending bracket, [ and
]. How do I get those into the files flume is generating? (My fallback idea
is also sketched below.)
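On 2, this is roughly how I'm comparing the two sides (the HDFS filename is globbed rather than typed out in full, and I'm assuming hadoop fs -cat is a fair way to peek at the text payload of these files):

    grep -c '"retweet_count":0' /home/flume/twitter
    hadoop fs -cat /user/hdpadmin/twitter/twitterlog.00000025.20120419-*.seq | grep -c '\\"retweet_count\\":0'

On 4, worst case I figure I could post-process each file into a proper JSON array myself, something like this (made-up filenames):

    { echo '['; sed '$!s/$/,/' twitterlog-part.txt; echo ']'; } > twitterlog-part.json

i.e. wrap the whole file in [ ] and put a comma after every line except the last. But I'd much rather have flume write valid JSON in the first place.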
I did try changing my collector sink by appending the arguments 300000, json
to the call, which I figured would give me 5-minute roll intervals and
hopefully produce JSON files that look exactly like the twitter output but
with [ and ] surrounding the start and end of each file. (I'm still not sure
whether I should expect ASCII or binary in HDFS.) In any case, 300000 alone
works, but it rejects my attempt to pass the format, and the format
parameter seems to be missing from the usage string in the error message.
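For the record, the exact variants (the first is accepted, the second throws):

    collectorSink("hdfs://localhost:9000/user/flume/twitter/", "twitter", 300000)
    collectorSink("hdfs://localhost:9000/user/flume/twitter/", "twitter", 300000, json)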
The error:

    com.cloudera.flume.conf.FlumeArgException: usage: collectorSink[(dfsdir, path[,rollmillis])]
Would love any help!