Hi,

Reading data in JSON format from ES (which I think is what you are interested
in doing) is not available out of the box, simply because you can do the same
thing directly from the command line with curl or any other HTTP client.
One of the reasons behind hadoop-streaming is to allow native clients to
interact with Hadoop, primarily with HDFS; since you are interacting with ES,
why not talk to it directly?

Am I missing something?
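
To illustrate what "talking to it directly" could look like, here is a minimal Python sketch that calls the search endpoint over HTTP and prints each hit as an id, a tab, and the document as valid JSON. The host and port (localhost:9200), the page size, and the function names are assumptions made for the example; only the index and type (my_twitter_river/status) come from this thread.

```python
import json
import sys
from urllib.request import urlopen

# Assumed for illustration: a node reachable on localhost:9200, small page size.
ES_URL = "http://localhost:9200/my_twitter_river/status/_search?size=10"


def hits_to_lines(response):
    """Turn a parsed search response into 'id<TAB>json' lines."""
    lines = []
    for hit in response["hits"]["hits"]:
        # _source holds the original document; dump it back out as JSON.
        lines.append(hit["_id"] + "\t" + json.dumps(hit["_source"], sort_keys=True))
    return lines


def main(url=ES_URL):
    """Fetch one page of results and write one line per document to stdout."""
    with urlopen(url) as resp:
        response = json.load(resp)
    for line in hits_to_lines(response):
        sys.stdout.write(line + "\n")
```

Calling main() against a reachable node writes one line per document. This only fetches a single page of results; walking a whole index would need scrolling, which is left out of the sketch.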


On 5/5/14 4:06 PM, Peter Sheridan wrote:
Let's say I have an index, my_twitter_river, which has been populated by the
Twitter river plugin.  I want to do some analysis on the data using a Hadoop
streaming job (http://hadoop.apache.org/docs/r1.2.1/streaming.html), where
essentially command line programs can be used to write map/reduce jobs.

Is there a good way for me to have the mapper for that streaming job receive
parsable json using only available configuration options?  What I see is this:

my-macbook:misc psheridan$ hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar \
   -D es.resource=my_twitter_river/status -D mapred.reduce.tasks=0 \
   -inputformat org.elasticsearch.hadoop.mr.EsInputFormat \
   -mapper /bin/cat -input in -output out

14/05/05 08:54:13 INFO mr.EsInputFormat: Discovered mapping 
{my_twitter_river=[mappings=[status=[created_at=DATE,
hashtag=[end=LONG, start=LONG, text=STRING], in_reply=[status=LONG, 
user_id=LONG, user_screen_name=STRING],
language=STRING, link=[display_url=STRING, end=LONG, expand_url=STRING, 
start=LONG, url=STRING], location=GEO_POINT,
mention=[end=LONG, id=LONG, name=STRING, screen_name=STRING, start=LONG], 
place=[country=STRING, country_code=STRING,
full_name=STRING, id=STRING, name=STRING, type=STRING, url=STRING], 
retweet=[id=LONG, retweet_count=LONG, user_id=LONG,
user_screen_name=STRING], retweet_count=LONG, source=STRING, text=STRING, 
truncated=BOOLEAN, user=[description=STRING,
id=LONG, location=STRING, name=STRING, profile_image_url=STRING, 
profile_image_url_https=STRING, screen_name=STRING]]]]}
for [my_twitter_river/status]
14/05/05 08:54:13 INFO mr.EsInputFormat: Created [5] shard-splits

  [...output snipped...]

my-macbook:misc psheridan$ hadoop dfs -ls out
Found 7 items
-rw-r--r--   1 psheridan supergroup          0 2014-05-05 08:54 /user/psheridan/out/_SUCCESS
drwxr-xr-x   - psheridan supergroup          0 2014-05-05 08:54 /user/psheridan/out/_logs
-rw-r--r--   1 psheridan supergroup   10541885 2014-05-05 08:54 /user/psheridan/out/part-00000
-rw-r--r--   1 psheridan supergroup   10252834 2014-05-05 08:54 /user/psheridan/out/part-00001
-rw-r--r--   1 psheridan supergroup   10492008 2014-05-05 08:54 /user/psheridan/out/part-00002
-rw-r--r--   1 psheridan supergroup   10497346 2014-05-05 08:54 /user/psheridan/out/part-00003
-rw-r--r--   1 psheridan supergroup   10489611 2014-05-05 08:54 /user/psheridan/out/part-00004

my-macbook:misc psheridan$ dfs -cat /user/psheridan/out/part-00000 | head -2

458635348029886464{text=RT @adrianrmante: Having lunch with my great friend OH 
MR. MOOSSEEBBYY!! I love that guy for
life!! #thesuitelifeofzackandcody http://t.co/…, 
created_at=2014-04-22T15:56:02.000Z, source=<a
href="http://twitter.com/download/iphone" rel="nofollow">Twitter for 
iPhone</a>, truncated=false, language=en,
mention=[{id=38270235, name=Adrian R'Mante, screen_name=adrianrmante, start=3, 
end=16}], retweet_count=0,
retweet={id=456201545562873856, user_id=38270235, 
user_screen_name=adrianrmante, retweet_count=3279},
hashtag=[{text=thesuitelifeofzackandcody, start=100, end=126}], link=[], 
user={id=467951508, name=Marenna Nonya,
screen_name=ThatGirlish, location=(null), description=(null),
profile_image_url=http://pbs.twimg.com/profile_images/431121751456890880/pItAYIpY_normal.jpeg,
profile_image_url_https=https://pbs.twimg.com/profile_images/431121751456890880/pItAYIpY_normal.jpeg}}
458635352232976384{text=Dae vc sobe com seu amigão, ele fica de love e vc 
separando briga --' K,
created_at=2014-04-22T15:56:03.000Z, source=<a href="http://twitter.com/download/android" 
rel="nofollow">Twitter for
Android</a>, truncated=false, language=pt, mention=[], retweet_count=0, 
hashtag=[], link=[], user={id=2187296905,
name=Carlinhos , screen_name=AngeloChionpat0, location=(null), 
description=(null),
profile_image_url=http://pbs.twimg.com/profile_images/448849601844772864/ZNwdGDC7_normal.jpeg,
profile_image_url_https=https://pbs.twimg.com/profile_images/448849601844772864/ZNwdGDC7_normal.jpeg}}


When I cat the output files, I'd like to see valid json for the value.
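
For context, here is a rough sketch of the kind of streaming mapper this would enable, written in Python. It assumes each input line arrives as a key, a tab, and a JSON value, which is not what the output above contains; the function names and the per-language count are made up for the example.

```python
import json
import sys


def map_lines(lines):
    """Yield 'language<TAB>1' for each 'key<TAB>json' input line."""
    for line in lines:
        key, sep, value = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # no tab, so there is no value to parse
        try:
            doc = json.loads(value)
        except ValueError:
            continue  # skip values that are not valid JSON
        if not isinstance(doc, dict):
            continue  # only JSON objects carry fields
        yield "%s\t1" % doc.get("language", "unknown")


def main():
    """Read records from stdin and emit one count line per parsed document."""
    for out in map_lines(sys.stdin):
        print(out)
```

With the output as it stands today, every value would fail to parse and be skipped, which is the problem I'm trying to get around.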

Thanks for any assistance...if this isn't possible, I'll likely submit a pull
request when I get to it, which will not be as soon as I'd like.  :)

--Pete

--
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to
[email protected] 
<mailto:[email protected]>.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/a3fc229e-4bd3-4c31-ae0a-bec7362ba724%40googlegroups.com
<https://groups.google.com/d/msgid/elasticsearch/a3fc229e-4bd3-4c31-ae0a-bec7362ba724%40googlegroups.com?utm_medium=email&utm_source=footer>.
For more options, visit https://groups.google.com/d/optout.

--
Costin
