On 3 May 2012, at 23:47, Himanshu Vijay wrote:

> Pedro,
>
> Thanks for the response. Unfortunately I am running it on an in-house cluster
> and from there I need to upload to S3.
Hi,

Last night I was thinking about this... what happens if you copy

  s3://region.elasticmapreduce/libs/s3distcp/1.0.1/s3distcp.jar

to your cluster and run

  hadoop jar s3distcp.jar --src hdfs:///path/to/files \
    --dest s3://bucket/path --outputCodec lzo

(or whichever codec you prefer)?

Alternatively, you could run the following Pig or Hive jobs (using output compression):

--- pig ---
local_data = load '/path/to/files' as ( ... );
store local_data into 's3://bucket/path' using ...;

--- hive ---
create external table foo ( ... )
  [row format ... | serde]
  location '/path/to/files';

create external table s3_foo ( ... )
  [row format ... | serde]
  location 's3://bucket/path';

insert overwrite table s3_foo
select * from foo;

Obviously, an equivalent native (Java MapReduce) or Streaming job is trivial to write, too.

Cheers,

Pedro Figueiredo
Skype: pfig.89clouds
http://89clouds.com/ - Big Data Consulting
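
P.S. A minimal Streaming sketch along those lines, assuming Hadoop 1.x mapred.* property names, the stock contrib streaming jar path (adjust for your install), and the bundled gzip codec (the LZO codec class is only there if hadoop-lzo is deployed). On a non-EMR cluster the s3n:// filesystem, with fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey set in core-site.xml, is usually how you write to S3:

--- streaming ---
# map-only identity job: read text from HDFS, write gzip-compressed output to S3
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -D mapred.reduce.tasks=0 \
  -D mapred.output.compress=true \
  -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -input hdfs:///path/to/files \
  -output s3n://bucket/path \
  -mapper /bin/cat

Bear in mind this only suits line-oriented text: Streaming's tab-based key/value handling can, depending on version, add a trailing tab to lines that contain none, so for a byte-for-byte copy s3distcp is the safer choice.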