Parallel Loading to S3! (This is not a bug but it could become one).
--------------------------------------------------------------------

                 Key: HIVE-1714
                 URL: https://issues.apache.org/jira/browse/HIVE-1714
             Project: Hadoop Hive
          Issue Type: Task
    Affects Versions: 0.6.0
         Environment: Hadoop 0.20, Hive (latest from trunk; 0.6 or higher is
                      fine), S3 cluster with 5 nodes and Hive running on the
                      master (cluster launched by the CDH scripts).
            Reporter: Appan


Here is my scenario:

I am trying to load data from S3, partition it, and write the partitioned data 
back to S3 for later use. I tried the steps below and the data is partitioned 
correctly. My question: is the upload back to S3 also done in parallel, or is it 
sequential?

Here is the set of queries I am running on the "Master Node".

# S3 configuration for Hive.
set fs.s3n.awsAccessKeyId=<mykey>;
set fs.s3n.awsSecretAccessKey=<myaccesskey>;
set mapred.map.tasks=<N>;

# Setting properties for dynamic partitioning.
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions.pernode=200000;

# Creating the external pkvs3 table.
create external table pkvs3 (key int, values string)
location 's3n://data.s3ndemo.hive/kv';

# Create an external table on S3 that will hold the partitioned data.
create external table pkvs3part (values string) partitioned by (key int)
location 's3n://<myhivebucket>/pkvs3part';

# Read from S3 and write back to S3. Choose a smaller limit for testing.
# The dynamic partition column (key) has to come last in the select list.
insert overwrite table pkvs3part partition(key)
select values, key from pkvs3 limit 10000;

The steps above worked on an S3 cluster with 5 nodes, but the job took more than 
10 minutes to complete for 10K records :-(
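
One way to check how the final write is scheduled, sketched here as a 
suggestion rather than something from the run above, is to prefix the insert 
with EXPLAIN. The plan lists the stages of the job, so it should show whether 
the rows reach the output from map tasks or from a reduce stage. My 
understanding is that the limit clause adds a reduce stage, which may funnel 
the write through very few tasks; treat that as an assumption to verify against 
the plan, not as a measured result.

```sql
-- Hypothetical check: print the query plan instead of running the job.
-- The partition column (key) is listed last, as dynamic partitioning expects.
explain
insert overwrite table pkvs3part partition(key)
select values, key from pkvs3 limit 10000;
```

Comparing this plan with the plan for the same statement without the limit 
clause would show whether the limit changes how the write stage is laid out.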


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
