Hi Ryan,

We never tried multi-column buckets, so I am not sure.
You will get better performance when you do "SELECT ... FROM mytable
TABLESAMPLE(BUCKET 1 OUT OF 256);". Hive will submit only one of the 256
files in the table to the map-reduce job.

Zheng

On Wed, Oct 28, 2009 at 3:19 PM, Ryan LeCompte <[email protected]> wrote:

> Zheng,
>
> Thanks for your response.
>
> Another question: in the example on that wiki page, there is:
>
> CREATE TABLE user_info_bucketed(userid BIGINT, firstname STRING, lastname STRING)
> COMMENT 'A bucketed copy of user_info'
> PARTITIONED BY(ds STRING)
> CLUSTERED BY(userid) INTO 256 BUCKETS;
>
> Is it possible to specify more than one key in the CLUSTERED BY(...)
> clause?
>
> Also, if I am clustering my tables, where/when would I expect to get
> improved performance in Hive queries?
>
> Thanks,
> Ryan
>
> On Sat, Oct 24, 2009 at 6:56 PM, Zheng Shao <[email protected]> wrote:
>
>> I think Hive internally uses the file name to guess which bucket the
>> file should belong to.
>>
>> If your data have the same number of files as the table has buckets, and
>> the files are named "part-00000", "part-00001", ..., then it will work to
>> just load the files into the table.
>>
>> This also requires you to know the hashcode of the "bucket key". If the
>> "bucket key" is an integer/long, the row with key "x" should belong to
>> bucket "x % BUCKETS", so make sure you put that row into "part-0000<y>",
>> where y = "x % BUCKETS".
>>
>> Zheng
>>
>> On Sat, Oct 24, 2009 at 3:00 AM, Ryan LeCompte <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I am trying to create a table that is bucketed and sorted by various
>>> columns. My table is created as a sequence file, and I'm populating it
>>> with the LOAD DATA command. However, I just came across this wiki page
>>> (http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL/BucketedTables),
>>> which says that the data will NOT be bucketed when inserted into the
>>> table. It gives an example of using the CLUSTER BY clause in a SELECT
>>> statement to insert the data into the table.
>>>
>>> Is it possible to somehow get the same effect by using the LOAD DATA
>>> command? Or do I have to create a separate bucketed and non-bucketed
>>> table for my data and move it around like the example in the link above
>>> indicates?
>>>
>>> Thanks,
>>> Ryan
>>
>> --
>> Yours,
>> Zheng

--
Yours,
Zheng
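[Editorial note appended to the thread] On the LOAD DATA question: the wiki page Ryan links recommends writing into the bucketed table through a query, so that the rows actually get hashed into buckets. A sketch of that pattern for the table in the thread, assuming a hypothetical staging table `user_info_staging` with the same columns (populated via LOAD DATA) and a placeholder partition value; the exact statement is on the wiki page:

```sql
-- Make the reducer count match the bucket count, so each reducer
-- writes exactly one bucket file (the approach used in that era of Hive).
set mapred.reduce.tasks = 256;

-- user_info_staging is a hypothetical unbucketed table loaded via LOAD DATA.
FROM user_info_staging
INSERT OVERWRITE TABLE user_info_bucketed PARTITION (ds = '2009-10-28')
SELECT userid, firstname, lastname
CLUSTER BY userid;
```

CLUSTER BY hashes the rows on `userid` across the reducers, which is what makes the resulting files real buckets rather than arbitrary file splits.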
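The manual bucketing rule in Zheng's earlier reply (the row with integer key x goes to file "part-0000<y>" where y = x % BUCKETS, since Hive's hash of an integer/long is the value itself) can be sketched as follows. This is a minimal illustration with made-up keys and a small bucket count, not Hive's actual code:

```python
# Illustration of manual bucket-file assignment for integer bucket keys.
# The modulo rule and file-name pattern follow the thread; BUCKETS is
# kept small for readability (the thread uses 256).
BUCKETS = 4

def bucket_file(key: int, buckets: int = BUCKETS) -> str:
    """Return the part-file a row with this key must be written to."""
    y = key % buckets  # Hive's hash of an integer/long is the value itself
    return "part-%05d" % y

# Group some sample rows into their bucket files.
rows = [0, 1, 5, 6, 256, 257]
files = {}
for key in rows:
    files.setdefault(bucket_file(key), []).append(key)

# TABLESAMPLE(BUCKET 1 OUT OF 4) would read only the first file,
# i.e. the rows where key % 4 == 0.
print(files["part-00000"])  # -> [0, 256]
```

If the files loaded with LOAD DATA were not written with exactly this placement, queries that rely on the bucketing (sampling, bucketed joins) can silently return wrong results, which is why the wiki recommends the CLUSTER BY insert instead.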
