Parallel Loading to S3! (This is not a bug but it could become one).
--------------------------------------------------------------------
Key: HIVE-1714
URL: https://issues.apache.org/jira/browse/HIVE-1714
Project: Hadoop Hive
Issue Type: Task
Affects Versions: 0.6.0
Environment: Hadoop 0.20, Hive (latest from the trunk; 0.6 or higher
is fine), S3 cluster with 5 nodes and Hive running on the master (cluster
launched by CDH scripts).
Reporter: Appan
Here is my scenario:
I am trying to load data from S3, partition it, and load the partitioned data
back to S3 for later use. I tried the steps below, and the data is partitioned
correctly. My question: is the upload back to S3 also done in parallel, or is
it sequential?
Here is the set of queries I am running on the master node.
#S3 configurations for Hive.
set fs.s3n.awsAccessKeyId=<mykey>;
set fs.s3n.awsSecretAccessKey=<myaccesskey>;
set mapred.map.tasks=<N>;
#setting properties for dynamic partition
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions.pernode=200000;
# creating external pkvs3 table.
create external table pkvs3 (key int, values string) location
's3n://data.s3ndemo.hive/kv';
#create an external table on s3 which will have the partitioned data.
create external table pkvs3part (values string) partitioned by (key int)
location 's3n://<myhivebucket>/pkvs3part';
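#read from s3 and write back to s3 (Hive matches dynamic partition columns by
#position, so the partition column key goes last in the select).
insert overwrite table pkvs3part partition(key) select values,key from pkvs3
limit 10000; ##*******CHOOSE A SMALLER N for testing*********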
The above steps worked fine on an S3 cluster with 5 nodes, but they took more
than 10 minutes to complete for 10K records :-(
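One way to investigate the question, offered as a rough sketch rather than a
confirmed diagnosis: as far as I understand, Hive applies a global limit
through a single reducer, so the partition files above may all be written by
one task. The statements below first print the query plan, then rerun the
insert without the limit while spreading the partition keys over several
reducers. <R> is a placeholder for the reducer count, and the distribute by
clause is an assumption of this sketch, not part of the original steps. Hive
matches dynamic partition columns by position, so key is listed last in the
select here.
#print the plan to see the stages the insert goes through.
explain insert overwrite table pkvs3part partition(key) select values,key
from pkvs3 limit 10000;
#variant without the limit: rows with the same key go to the same reducer,
#so up to <R> reducers can write partitions at the same time.
set mapred.reduce.tasks=<R>;
insert overwrite table pkvs3part partition(key) select values,key from pkvs3
distribute by key;
If the second insert finishes noticeably faster on the same data, that would
suggest the single-reducer limit stage, rather than S3 itself, accounts for
much of the 10 minutes.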