Map-side aggregation and map-side joins on the clustered keys should perform better?
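For instance (a sketch: small_dim and its column d.label are made up, and whether the join actually exploits the bucketing depends on the Hive version):

-- Map-side partial aggregation for the GROUP BY.
set hive.map.aggr = true;
SELECT userid, count(1)
FROM user_info_bucketed
GROUP BY userid;

-- The MAPJOIN hint asks Hive to perform the join in the mappers by
-- loading the hinted (small) table into memory.
SELECT /*+ MAPJOIN(d) */ u.userid, d.label
FROM user_info_bucketed u JOIN small_dim d ON (u.userid = d.userid);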
________________________________
From: Zheng Shao <[email protected]>
Reply-To: <[email protected]>
Date: Wed, 28 Oct 2009 15:25:37 -0700
To: <[email protected]>
Subject: Re: Creating and populating bucketed tables

Hi Ryan,

We never tried multi-column buckets, so I am not sure.

You will get better performance when you do "SELECT ... FROM mytable TABLESAMPLE(BUCKET 1 OUT OF 256);". Hive will submit only one of the table's 256 files to the map-reduce job.

Zheng

On Wed, Oct 28, 2009 at 3:19 PM, Ryan LeCompte <[email protected]> wrote:

Zheng,

Thanks for your response. Another question: the example on that wiki page has:

CREATE TABLE user_info_bucketed(userid BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(ds STRING)
CLUSTERED BY(userid) INTO 256 BUCKETS;

Is it possible to specify more than one key in the CLUSTERED BY(...) clause? Also, if I am clustering my tables, where and when should I expect improved performance in Hive queries?

Thanks,
Ryan

On Sat, Oct 24, 2009 at 6:56 PM, Zheng Shao <[email protected]> wrote:

I think Hive internally uses the file name to guess which bucket a file belongs to. If your data has the same number of files as the table has buckets, and the files are named "part-00000", "part-00001", ..., then it will work if you just load those files into the table.

This also requires you to know the hash code of the bucket key. If the bucket key is an integer/long, the row with key "x" belongs to bucket "x % BUCKETS", so make sure you put that row into "part-0000<y>" where y = x % BUCKETS.

Zheng

On Sat, Oct 24, 2009 at 3:00 AM, Ryan LeCompte <[email protected]> wrote:

Hello,

I am trying to create a table that is bucketed and sorted by various columns. The table is created as a sequence file, and I'm populating it with the LOAD DATA command. However, I just came across this wiki page (http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL/BucketedTables), which says that the data will NOT be bucketed when loaded this way. It gives an example that uses a CLUSTER BY clause in a SELECT statement to insert the data into the table.

Is it possible to get the same effect with the LOAD DATA command, or do I have to keep a separate non-bucketed copy of the data and move it across, as the example in the link above does?

Thanks,
Ryan

--
Yours,
Zheng
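For reference, the wiki pattern Ryan asks about, written out against the table in this thread (a sketch; the ds value is illustrative, and the reducer count matches the 256 buckets so that each reducer writes one bucket file):

set mapred.reduce.tasks = 256;
FROM user_info
INSERT OVERWRITE TABLE user_info_bucketed
PARTITION (ds = '2009-02-25')
SELECT userid, firstname, lastname
WHERE ds = '2009-02-25'
CLUSTER BY userid;

CLUSTER BY userid routes rows to reducers by hash(userid) % 256, which is exactly the bucket assignment the table declares.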

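Similarly, a worked instance of Zheng's file-naming rule (a sketch; the staging path is made up, and it assumes LOAD DATA keeps the original file names when it moves the files into the table directory): with 256 buckets, a row with userid 513 belongs to bucket 513 % 256 = 1, so it has to sit in part-00001 before loading.

-- Sketch only: /staging/user_info must already contain exactly 256
-- files named part-00000 .. part-00255, with each row placed in the
-- file whose number equals userid % 256.
LOAD DATA INPATH '/staging/user_info'
INTO TABLE user_info_bucketed PARTITION (ds = '2009-02-25');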