I think Hive internally uses the file name to determine which bucket a file belongs to.

If your data has the same number of files as the table has buckets, and the files are named "part-00000", "part-00001", ..., then simply loading the files into the table will work. This also requires you to know the hash code of the bucket key: if the bucket key is an integer/long, the row with key x belongs to bucket x % BUCKETS, so make sure you put that row into "part-0000<y>" where y = x % BUCKETS.
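For example, with 4 buckets on an integer user_id, a row with user_id = 10 must land in part-00002 because 10 % 4 = 2. A minimal sketch (the table, columns, and paths here are made up):

  CREATE TABLE user_actions (user_id INT, action STRING)
  CLUSTERED BY (user_id) INTO 4 BUCKETS
  STORED AS SEQUENCEFILE;

  -- The files must already be split by user_id % 4 and named
  -- part-00000 .. part-00003; Hive will not verify or fix the
  -- bucketing on load.
  LOAD DATA INPATH '/staging/part-00000' INTO TABLE user_actions;
  LOAD DATA INPATH '/staging/part-00001' INTO TABLE user_actions;
  LOAD DATA INPATH '/staging/part-00002' INTO TABLE user_actions;
  LOAD DATA INPATH '/staging/part-00003' INTO TABLE user_actions;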
Zheng

On Sat, Oct 24, 2009 at 3:00 AM, Ryan LeCompte <[email protected]> wrote:

> Hello,
>
> I am trying to create a table that is bucketed and sorted by various
> columns. My table is created as a sequence file, and I'm populating it
> with the LOAD DATA command. However, I just came across this wiki page
> (http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL/BucketedTables),
> which says that the data will NOT be bucketed when inserted into the
> table. It gives an example of using the CLUSTER BY command in a SELECT
> statement to insert the data into the table.
>
> Is it possible to somehow get the same effect by using the LOAD DATA
> command? Or do I have to create a separate bucketed and non-bucketed
> table for my data and move it around like the example in the link above
> indicates?
>
> Thanks,
> Ryan

--
Yours,
Zheng
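For reference, the alternative the wiki page describes (loading into an unbucketed staging table and letting Hive bucket the data on insert) would look roughly like this, reusing the made-up user_actions table from the sketch above:

  CREATE TABLE user_actions_staging (user_id INT, action STRING)
  STORED AS SEQUENCEFILE;

  LOAD DATA INPATH '/staging/data' INTO TABLE user_actions_staging;

  -- One reducer per bucket, as the wiki's example suggests:
  SET mapred.reduce.tasks=4;

  INSERT OVERWRITE TABLE user_actions
  SELECT user_id, action
  FROM user_actions_staging
  CLUSTER BY user_id;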
