Parallel Loading to S3! (This is not a bug but it could become one).
--------------------------------------------------------------------
                 Key: HIVE-1714
                 URL: https://issues.apache.org/jira/browse/HIVE-1714
             Project: Hadoop Hive
          Issue Type: Task
    Affects Versions: 0.6.0
         Environment: Hadoop 0.20, Hive (Latest from the Trunk, 0.6 or higher is good), S3 cluster with 5 nodes and Hive running on the master. (Cluster launched by CDH Scripts).
            Reporter: Appan

Here is my scenario: I am trying to load data from S3, partition it, and load the partitioned data back to S3 for later use. I tried the steps below and the data is partitioned correctly. My question is: is the upload back to S3 also parallel, or is it sequential?

Here is the set of queries I am running on the "Master Node".

S3 configuration for Hive:

    set fs.s3n.awsAccessKeyId=<mykey>;
    set fs.s3n.awsSecretAccessKey=<myaccesskey>;
    set mapred.map.tasks=<N>;

Properties for dynamic partitioning:

    set hive.exec.dynamic.partition.mode=nonstrict;
    set hive.exec.dynamic.partition=true;
    set hive.exec.max.dynamic.partitions.pernode=200000;

Create the external pkvs3 table:

    create external table pkvs3 (key int, values string) location 's3n://data.s3ndemo.hive/kv';

Create an external table on S3 that will hold the partitioned data:

    create external table pkvs3part (values string) partitioned by (key int) location 's3n://<myhivebucket>/pkvs3part';

Read from S3 and write back to S3:

    insert overwrite table pkvs3part partition(key) select key,values from pkvs3 limit 10000;

(Choose a smaller N for testing.)

The above steps worked fine on an S3 cluster with 5 nodes, but the insert took more than 10 minutes to complete for 10K records :-(
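
One way to look into the parallel-versus-sequential question might be to ask Hive for the query plan before running the insert. The sketch below reuses the statement from the steps above unchanged and prefixes it with EXPLAIN, which is a standard Hive statement. What the plan looks like on this particular cluster is not shown in the report, so this is a starting point for investigation and not an answer.

```sql
-- Print the plan for the insert without executing it.
-- Assumption: the dynamic partition settings listed above are already
-- set in this session, and the tables pkvs3 and pkvs3part exist.
explain
insert overwrite table pkvs3part partition(key)
select key,values from pkvs3 limit 10000;
```

The output lists the stages of the job and the operators in each stage. Locating the stage that writes to pkvs3part should indicate whether the write happens inside the map or reduce tasks, or in a separate final step, which is the information needed to reason about how the upload to S3 is spread across the nodes.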