I would strongly encourage you to move to Flume 1.x!

Answers inline


--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF

On May 15, 2012, at 5:06 AM, lulynn_2008 wrote:

> Hi all,
> I'm trying to capture the results of a twitter search left open.  How    
> do I do this correctly?  What I have done is the following:              
>                                                                          
> I made a file called rp.  It contains the line "track=ron paul"          
>                                                                          
> I called curl and the twitter API to capture tweets:                     
> curl -d @rp https://stream.twitter.com/1/statuses/filter.json -umikedev10:password > twitter
>                                                                          
> I set up flume jobs to read this data and put it into hadoop.  They are  
> named agent and collector and their format is as follows:                
>                                                                          
> agent: tail("/home/flume/twitter") | agentSink("localhost",35853);    
> collector: collectorSource(35853) | collectorSink("hdfs://localhost:9000/user/flume/twitter/", "twitter");
>                                                                          
> On to questions/problems I've already found I'm having:                  
>                                                                          
> 1.  I see the files generated are /user/hdpadmin/twitter/twitterlog.00000025.20120419- .... .seq - they are generated every 30 seconds.  A zero-byte file is generated if there are no new tweets.  How do I generate fewer of these?

In Flume 0.9.x you can use the lazyOpen decorator (together with batch and stubbornAppend), e.g.:
tail("foo") | { batch(100) => { lazyOpen => { stubbornAppend => logicalSink("hdfs://...") } } };

URL:
http://archive.cloudera.com/cdh/3/flume/UserGuide/index.html#_translations_of_high_level_sources_and_sinks
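
A longer roll interval on the collector also means fewer (and larger) files. Since you found that a rollmillis of 300000 works, something like this (reusing your node names, a sketch only) should roll every five minutes instead of every 30 seconds:

collector : collectorSource(35853) | collectorSink("hdfs://localhost:9000/user/flume/twitter/", "twitter", 300000);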

>                                                
> 2.  Going through flume seems to add escape characters.  The source file twitter has "retweet_count":0 while hdfs has loaded \"retweet_count\":0 instead.  How do I bring things over without these new characters?

You can use a regex in a sink, or write your own decorator (a plugin). Search GitHub; you'll find several examples there.
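
As a sketch of where such a decorator would sit in the dataflow spec: unescapeJson below is a made-up name, not something Flume ships - you would implement it yourself and register it via the flume.plugin.classes property.

collector : collectorSource(35853) | { unescapeJson => collectorSink("hdfs://localhost:9000/user/flume/twitter/", "twitter") };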

>                              
> 3.  Am I correct in understanding JSON can be in ascii or binary format? 

In short: JSON itself is text (Unicode); binary data has to be encoded into it first, commonly as Base64. See:
http://stackoverflow.com/questions/1443158/binary-data-in-json-string-something-better-than-base64

> 4.  Each JSON ascii file appears to need a start/ending bracket, [ and   
> ] - how do I get those in the files flume is generating?    

Use the json output format instead of raw in flume-conf.xml.
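
For example, in flume-site.xml/flume-conf.xml (assuming the standard flume.collector.output.format property; its avrojson default is most likely also what added the escape characters in question 2):

<property>
  <name>flume.collector.output.format</name>
  <value>json</value>
</property>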

>             
>                                                                          
>  I did try changing my collector sink by appending the arguments         
> 300000, json to the call, which I figured would give me 5 minute         
> intervals and hopefully create json files that simply look exactly like  
> the twitter output but have [ ] surrounding the start and end of the     
> file.  Not sure if I should expect it to be ascii or binary in hdfs.     
> In any case, 300000 alone works, but it doesn't like me trying to pass   
> the format in there and it seems missing from the error message.         
>                                                                          
> com.cloudera.flume.conf.FlumeArgException: usage: collectorSink[(dfsdir, 
> path[,rollmillis])]

Then the syntax is wrong, or you did not quote/escape the arguments. Note that the usage string in your error only lists (dfsdir, path[,rollmillis]), so your build may not accept a format argument at all; in that case set the output format in flume-conf.xml as described above. See:
http://archive.cloudera.com/cdh/3/flume/UserGuide/index.html
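
If your build does accept the optional format argument (later 0.9.x releases document collectorSink("dfsdir","prefix"[, rollmillis[, format]])), the call would look like the sketch below; note the quotes around json, and verify against the usage string your version prints:

collector : collectorSource(35853) | collectorSink("hdfs://localhost:9000/user/flume/twitter/", "twitter", 300000, "json");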

> 
> Would love any help!  
> 
> 

Use Flume 1.x ;)
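
For reference, a rough Flume 1.x (NG) sketch of your agent/collector pair; the node names are placeholders matching your setup, and the properties are the standard exec source / memory channel / HDFS sink settings - adjust to taste:

# tail the curl output and write it to HDFS, rolling every 5 minutes
agent.sources = twitterTail
agent.channels = mem
agent.sinks = hdfsOut

agent.sources.twitterTail.type = exec
agent.sources.twitterTail.command = tail -F /home/flume/twitter
agent.sources.twitterTail.channels = mem

agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

agent.sinks.hdfsOut.type = hdfs
agent.sinks.hdfsOut.channel = mem
agent.sinks.hdfsOut.hdfs.path = hdfs://localhost:9000/user/flume/twitter/
agent.sinks.hdfsOut.hdfs.filePrefix = twitter
agent.sinks.hdfsOut.hdfs.fileType = DataStream
agent.sinks.hdfsOut.hdfs.rollInterval = 300

Start it with something like: flume-ng agent -n agent -f <your-conf-file>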

- Alex

