Let's say I have an index, my_twitter_river, which has been populated by 
the Twitter river plugin.  I want to do some analysis on the data using a 
Hadoop streaming job (http://hadoop.apache.org/docs/r1.2.1/streaming.html), 
which lets ordinary command-line programs act as the mapper and reducer of 
a map/reduce job.

Is there a good way to have the mapper for that streaming job receive 
parsable JSON using only the available configuration options?  Here is 
what I see:

my-macbook:misc psheridan$ hadoop jar \
  /usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar \
  -D es.resource=my_twitter_river/status -D mapred.reduce.tasks=0 \
  -inputformat org.elasticsearch.hadoop.mr.EsInputFormat \
  -mapper /bin/cat -input in -output out

14/05/05 08:54:13 INFO mr.EsInputFormat: Discovered mapping 
{my_twitter_river=[mappings=[status=[created_at=DATE, hashtag=[end=LONG, 
start=LONG, text=STRING], in_reply=[status=LONG, user_id=LONG, 
user_screen_name=STRING], language=STRING, link=[display_url=STRING, 
end=LONG, expand_url=STRING, start=LONG, url=STRING], location=GEO_POINT, 
mention=[end=LONG, id=LONG, name=STRING, screen_name=STRING, start=LONG], 
place=[country=STRING, country_code=STRING, full_name=STRING, id=STRING, 
name=STRING, type=STRING, url=STRING], retweet=[id=LONG, 
retweet_count=LONG, user_id=LONG, user_screen_name=STRING], 
retweet_count=LONG, source=STRING, text=STRING, truncated=BOOLEAN, 
user=[description=STRING, id=LONG, location=STRING, name=STRING, 
profile_image_url=STRING, profile_image_url_https=STRING, 
screen_name=STRING]]]]} for [my_twitter_river/status]
14/05/05 08:54:13 INFO mr.EsInputFormat: Created [5] shard-splits

 [...output snipped...]

my-macbook:misc psheridan$ hadoop dfs -ls out
Found 7 items
-rw-r--r--   1 psheridan supergroup          0 2014-05-05 08:54 /user/psheridan/out/_SUCCESS
drwxr-xr-x   - psheridan supergroup          0 2014-05-05 08:54 /user/psheridan/out/_logs
-rw-r--r--   1 psheridan supergroup   10541885 2014-05-05 08:54 /user/psheridan/out/part-00000
-rw-r--r--   1 psheridan supergroup   10252834 2014-05-05 08:54 /user/psheridan/out/part-00001
-rw-r--r--   1 psheridan supergroup   10492008 2014-05-05 08:54 /user/psheridan/out/part-00002
-rw-r--r--   1 psheridan supergroup   10497346 2014-05-05 08:54 /user/psheridan/out/part-00003
-rw-r--r--   1 psheridan supergroup   10489611 2014-05-05 08:54 /user/psheridan/out/part-00004

my-macbook:misc psheridan$ hadoop dfs -cat /user/psheridan/out/part-00000 | head -2

458635348029886464 {text=RT @adrianrmante: Having lunch with my great 
friend OH MR. MOOSSEEBBYY!! I love that guy for life!! 
#thesuitelifeofzackandcody http://t.co/…, 
created_at=2014-04-22T15:56:02.000Z, source=<a 
href="http://twitter.com/download/iphone" rel="nofollow">Twitter for 
iPhone</a>, truncated=false, language=en, mention=[{id=38270235, 
name=Adrian R'Mante, screen_name=adrianrmante, start=3, end=16}], 
retweet_count=0, retweet={id=456201545562873856, user_id=38270235, 
user_screen_name=adrianrmante, retweet_count=3279}, 
hashtag=[{text=thesuitelifeofzackandcody, start=100, end=126}], link=[], 
user={id=467951508, name=Marenna Nonya, screen_name=ThatGirlish, 
location=(null), description=(null), 
profile_image_url=http://pbs.twimg.com/profile_images/431121751456890880/pItAYIpY_normal.jpeg, 
profile_image_url_https=https://pbs.twimg.com/profile_images/431121751456890880/pItAYIpY_normal.jpeg}}
458635352232976384 {text=Dae vc sobe com seu amigão, ele fica de love e vc 
separando briga --' K, created_at=2014-04-22T15:56:03.000Z, source=<a 
href="http://twitter.com/download/android" rel="nofollow">Twitter for 
Android</a>, truncated=false, language=pt, mention=[], retweet_count=0, 
hashtag=[], link=[], user={id=2187296905, name=Carlinhos , 
screen_name=AngeloChionpat0, location=(null), description=(null), 
profile_image_url=http://pbs.twimg.com/profile_images/448849601844772864/ZNwdGDC7_normal.jpeg, 
profile_image_url_https=https://pbs.twimg.com/profile_images/448849601844772864/ZNwdGDC7_normal.jpeg}}


When I cat the output files, I'd like the value on each line to be valid 
JSON rather than the {key=value, ...} rendering shown above.
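
To make the goal concrete, here is a rough sketch of the kind of mapper I'd 
like to be able to write.  It assumes something the connector does not give 
me today: that each input line is "<id><TAB><json document>".  The tab is 
the usual Hadoop streaming key/value separator; the "language" field comes 
from the mapping above; map_line is just a name I made up for illustration.

```python
#!/usr/bin/env python
# Hypothetical streaming mapper: emits "<language>\t1" per tweet.
# ASSUMPTION: each input line is "<id>\t<valid JSON object>", which is
# the format I am asking for, not what the job currently produces.
import json
import sys


def map_line(line):
    """Return '<language>\\t1' for one input line, or None if it can't be parsed."""
    parts = line.rstrip("\n").split("\t", 1)
    if len(parts) != 2:
        return None  # no key/value separator found
    try:
        doc = json.loads(parts[1])
    except ValueError:
        return None  # value is not JSON (e.g. the {key=value} rendering)
    if not isinstance(doc, dict):
        return None  # JSON, but not an object
    return "%s\t1" % doc.get("language", "unknown")


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Stream line by line so large splits are never held in memory.
    for line in stdin:
        out = map_line(line)
        if out is not None:
            stdout.write(out + "\n")


if __name__ == "__main__":
    main()
```

With the output I get now, a mapper like this would skip every line, since 
the values are not JSON.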

Thanks for any assistance.  If this isn't possible, I'll likely submit a 
pull request when I get to it, which will not be as soon as I'd like.  :)

--Pete
