Zheng,

Thanks for your response.
Another question: in the example on that wiki page, there is:

CREATE TABLE user_info_bucketed(userid BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(ds STRING)
CLUSTERED BY(userid) INTO 256 BUCKETS;

Is it possible to specify more than one key in the CLUSTERED BY(...) clause?
Also, if I am clustering my tables, where/when would I expect to get improved
performance in Hive queries?

Thanks,
Ryan

On Sat, Oct 24, 2009 at 6:56 PM, Zheng Shao <[email protected]> wrote:
> I think Hive internally uses the file name to guess which bucket the
> file should belong to.
>
> If your data have the same number of files as the table has buckets, and the
> files are named "part-00000", "part-00001", ..., then it will work by just
> loading the files into the table.
>
> This also requires you to know the hashcode of the "bucket key". If the
> "bucket key" is an integer/long, the row with key "x" should belong to
> bucket "x % BUCKETS", so make sure you put that row into "part-0000<y>"
> where y = "x % BUCKETS".
>
> Zheng
>
>
> On Sat, Oct 24, 2009 at 3:00 AM, Ryan LeCompte <[email protected]> wrote:
>
>> Hello,
>>
>> I am trying to create a table that is bucketed and sorted by various
>> columns. My table is created as a sequence file, and I'm populating it with
>> the LOAD DATA command. However, I just came across this wiki page (
>> http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL/BucketedTables)
>> which says that the data will NOT be bucketed when inserted into the table.
>> It gives an example of using the CLUSTER BY command in a SELECT statement to
>> insert the data into the table.
>>
>> Is it possible to somehow get the same effect by using the LOAD DATA
>> command? Or do I have to create a separate bucketed and non-bucketed table
>> for my data and move it around like the example in the link above indicates?
>>
>> Thanks,
>> Ryan
>>
>>
>
>
> --
> Yours,
> Zheng
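[Editor's note: Zheng's rule above, that a row with integer key x belongs in file "part-0000<y>" where y = x % BUCKETS, can be sketched in plain Python. This is a standalone illustration of the arithmetic, not a Hive API; the 5-digit zero-padded "part-%05d" naming follows the standard Hadoop output-file convention, and the function name is hypothetical.]

```python
BUCKETS = 256  # must match the INTO 256 BUCKETS clause in the table's DDL

def bucket_file_for(key, buckets=BUCKETS):
    """Return the part-file name a row with integer bucket key `key`
    should be written to, per the y = key % BUCKETS rule."""
    y = key % buckets
    # Hadoop part files are conventionally zero-padded to 5 digits.
    return "part-%05d" % y

# A row with key 257 lands in bucket 257 % 256 = 1, i.e. "part-00001",
# so for LOAD DATA to produce a correctly bucketed table, that row's
# file must be named "part-00001" before loading.
```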
