Hi all,
I'm trying to capture the results of a Twitter search left running. How
do I do this correctly? Here is what I have done so far:
                                                                        
I made a file called rp.  It contains the line "track=ron paul"         
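
For the record, creating the file is trivial (any editor works just as well):

```shell
# Create the POST-body file used by curl below; the contents are
# exactly the tracking parameter described above.
printf 'track=ron paul\n' > rp
```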
                                                                        
I used curl against the Twitter streaming API to capture tweets:

curl -d @rp https://stream.twitter.com/1/statuses/filter.json -umikedev10:password > twitter
                                                                        
I set up Flume jobs to read this data and put it into Hadoop. They are
named agent and collector, and their specs are as follows:

agent: tail("/home/flume/twitter") | agentSink("localhost",35853);
collector: collectorSource(35853) | collectorSink("hdfs://localhost:9000/user/flume/twitter/", "twitter");
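
I push these configs to the master with the Flume (0.9.x) shell, roughly as follows - I'm typing this from memory, so the exact shell syntax may be slightly off:

```shell
# Sketch (from memory): configure the two logical nodes via the
# Flume 0.9.x shell against a master running on localhost.
flume shell -c localhost -e "exec config agent 'tail(\"/home/flume/twitter\")' 'agentSink(\"localhost\",35853)'"
flume shell -c localhost -e "exec config collector 'collectorSource(35853)' 'collectorSink(\"hdfs://localhost:9000/user/flume/twitter/\", \"twitter\")'"
```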
                                                                        
On to questions/problems I've already found I'm having:                 
                                                                        
1.  The generated files look like /user/hdpadmin/twitter/twitterlog.00000025.20120419-....seq,
and one is created every 30 seconds; a zero-byte file is written if
there are no new tweets. How do I generate fewer of these?
2.  Going through Flume seems to add escape characters. The source
file twitter has "retweet_count":0 while the file loaded into HDFS has
\"retweet_count\":0 instead. How do I bring the data over without
these extra characters?
3.  Am I correct in understanding that JSON can be in either ASCII (text) or binary format?
4.  Each JSON text file appears to need a starting and ending bracket,
[ and ]. How do I get those into the files Flume is generating?
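
Until I find the right knobs for 2 and 4, I can at least post-process a collected file on the local side - a sketch, assuming one JSON object per line; the file names tweets.raw and tweets.json are just placeholders for illustration:

```shell
# Build a sample "escaped" file like the one Flume writes to HDFS
# (tweets.raw is a placeholder name for illustration).
printf '%s\n' '{\"retweet_count\":0}' '{\"retweet_count\":1}' > tweets.raw

# 1) Undo the added backslash escapes (question 2).
sed 's/\\"/"/g' tweets.raw > tweets.unescaped

# 2) Wrap the objects in [ ] and join them with commas so the
#    result is one valid JSON array (question 4).
{
  printf '['
  paste -sd, tweets.unescaped
  printf ']\n'
} > tweets.json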
                                                                        
I did try changing my collector sink by appending the arguments
300000, json to the call, which I figured would give me 5-minute
intervals and, hopefully, JSON files that look exactly like the
Twitter output but with [ ] surrounding the start and end of each
file. I'm not sure whether I should expect ASCII or binary in HDFS.
In any case, 300000 alone works, but the sink rejects the format
argument, and a format option seems to be missing from the error
message's usage string:
                                                                        
com.cloudera.flume.conf.FlumeArgException: usage: collectorSink[(dfsdir,
path[,rollmillis])]
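
For reference, the collector spec that triggers the error (reconstructed from my description above; the exact quoting may differ) was:

```
collector: collectorSource(35853) | collectorSink("hdfs://localhost:9000/user/flume/twitter/", "twitter", 300000, json);
```

Per the usage string, this build only seems to accept an optional rollmillis after the path, so presumably the output format has to be set somewhere else.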

Would love any help! 
