Let's say I have an index, my_twitter_river, that has been populated by the
Twitter river plugin. I want to analyze the data with a Hadoop streaming job
(http://hadoop.apache.org/docs/r1.2.1/streaming.html), in which ordinary
command-line programs serve as the mapper and reducer.

Is there a good way to have the mapper for that streaming job receive
parsable JSON using only the available configuration options? What I see is
this:
my-macbook:misc psheridan$ hadoop jar \
    /usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar \
    -D es.resource=my_twitter_river/status -D mapred.reduce.tasks=0 \
    -inputformat org.elasticsearch.hadoop.mr.EsInputFormat \
    -mapper /bin/cat -input in -output out
14/05/05 08:54:13 INFO mr.EsInputFormat: Discovered mapping
{my_twitter_river=[mappings=[status=[created_at=DATE,
hashtag=[end=LONG, start=LONG, text=STRING], in_reply=[status=LONG,
user_id=LONG, user_screen_name=STRING],
language=STRING, link=[display_url=STRING, end=LONG, expand_url=STRING,
start=LONG, url=STRING], location=GEO_POINT,
mention=[end=LONG, id=LONG, name=STRING, screen_name=STRING, start=LONG],
place=[country=STRING, country_code=STRING,
full_name=STRING, id=STRING, name=STRING, type=STRING, url=STRING],
retweet=[id=LONG, retweet_count=LONG, user_id=LONG,
user_screen_name=STRING], retweet_count=LONG, source=STRING, text=STRING,
truncated=BOOLEAN, user=[description=STRING,
id=LONG, location=STRING, name=STRING, profile_image_url=STRING,
profile_image_url_https=STRING, screen_name=STRING]]]]}
for [my_twitter_river/status]
14/05/05 08:54:13 INFO mr.EsInputFormat: Created [5] shard-splits
[...output snipped...]
my-macbook:misc psheridan$ hadoop dfs -ls out
Found 7 items
-rw-r--r-- 1 psheridan supergroup 0 2014-05-05 08:54
/user/psheridan/out/_SUCCESS
drwxr-xr-x - psheridan supergroup 0 2014-05-05 08:54
/user/psheridan/out/_logs
-rw-r--r-- 1 psheridan supergroup 10541885 2014-05-05 08:54
/user/psheridan/out/part-00000
-rw-r--r-- 1 psheridan supergroup 10252834 2014-05-05 08:54
/user/psheridan/out/part-00001
-rw-r--r-- 1 psheridan supergroup 10492008 2014-05-05 08:54
/user/psheridan/out/part-00002
-rw-r--r-- 1 psheridan supergroup 10497346 2014-05-05 08:54
/user/psheridan/out/part-00003
-rw-r--r-- 1 psheridan supergroup 10489611 2014-05-05 08:54
/user/psheridan/out/part-00004
my-macbook:misc psheridan$ hadoop dfs -cat /user/psheridan/out/part-00000 | head -2
458635348029886464{text=RT @adrianrmante: Having lunch with my great friend OH
MR. MOOSSEEBBYY!! I love that guy for
life!! #thesuitelifeofzackandcody http://t.co/…,
created_at=2014-04-22T15:56:02.000Z, source=<a
href="http://twitter.com/download/iphone" rel="nofollow">Twitter for
iPhone</a>, truncated=false, language=en,
mention=[{id=38270235, name=Adrian R'Mante, screen_name=adrianrmante, start=3,
end=16}], retweet_count=0,
retweet={id=456201545562873856, user_id=38270235,
user_screen_name=adrianrmante, retweet_count=3279},
hashtag=[{text=thesuitelifeofzackandcody, start=100, end=126}], link=[],
user={id=467951508, name=Marenna Nonya,
screen_name=ThatGirlish, location=(null), description=(null),
profile_image_url=http://pbs.twimg.com/profile_images/431121751456890880/pItAYIpY_normal.jpeg,
profile_image_url_https=https://pbs.twimg.com/profile_images/431121751456890880/pItAYIpY_normal.jpeg}}
458635352232976384{text=Dae vc sobe com seu amigão, ele fica de love e vc
separando briga --' K,
created_at=2014-04-22T15:56:03.000Z, source=<a href="http://twitter.com/download/android"
rel="nofollow">Twitter for
Android</a>, truncated=false, language=pt, mention=[], retweet_count=0,
hashtag=[], link=[], user={id=2187296905,
name=Carlinhos , screen_name=AngeloChionpat0, location=(null),
description=(null),
profile_image_url=http://pbs.twimg.com/profile_images/448849601844772864/ZNwdGDC7_normal.jpeg,
profile_image_url_https=https://pbs.twimg.com/profile_images/448849601844772864/ZNwdGDC7_normal.jpeg}}
When I cat the output files, I'd like the value to be valid JSON rather than
the key=value rendering shown above.

Thanks for any assistance. If this isn't possible, I'll likely submit a pull
request when I get to it, which will not be as soon as I'd like. :)
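
To make the goal concrete, here is a rough sketch of the kind of streaming
mapper this would enable. It is only an illustration under assumptions: it
assumes each input line arrives as the document id, a tab, and then the
document as a JSON object, and it picks the "language" field purely as an
example. With the key=value output shown above, json.loads() fails on every
value, so this mapper would skip every record.

```python
#!/usr/bin/env python
"""Hypothetical streaming mapper: emits "<id>\t<language>" per document."""
import json
import sys


def parse_record(line):
    """Split one input line into (key, document).

    Assumes "<key>\t<value>" framing. Returns (key, None) when the value
    is not a valid JSON object.
    """
    key, _, value = line.rstrip("\n").partition("\t")
    try:
        doc = json.loads(value)
    except ValueError:
        return key, None
    # Only JSON objects are treated as documents.
    return key, doc if isinstance(doc, dict) else None


def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        key, doc = parse_record(line)
        if doc is None:
            continue  # value was not parsable JSON; skip it
        stdout.write("%s\t%s\n" % (key, doc.get("language")))


if __name__ == "__main__":
    main()
```

Saved under a name such as mapper.py (hypothetical), it would be passed to
the streaming job with -mapper in place of /bin/cat.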
--Pete