[ 
https://issues.apache.org/jira/browse/HIVE-16223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926923#comment-15926923
 ] 

Edward Capriolo commented on HIVE-16223:
----------------------------------------

+1. It would also be good if WHERE clauses could target specific buckets more 
easily, e.g. WHERE bucket=1of24 or WHERE bucketof(column)=5

> deterministic file naming for bucketing in Hive
> -----------------------------------------------
>
>                 Key: HIVE-16223
>                 URL: https://issues.apache.org/jira/browse/HIVE-16223
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>
> Bucketing in Hive is currently very fragile.
> 1) Some places determine bucket number from file name.
> 2) Some places determine bucket number from a file's "index" in a sorted list 
> of files in the directory.
> 3) It is possible to import files into a bucketed table without any regard 
> for either.
> On top of that, weird rename paths (like _copy_1), subdirectories (e.g. from 
> Tez union, or just tables read with recursive input enabled), repeated 
> inserts into the same table, etc. can mess with either scheme.
> Therefore I propose we include bucket index and count explicitly in the file 
> name (e.g. 000003_0_bucket_3of32). It will alleviate the above, and also may 
> simplify some pieces of code that try to account for missing bucket files, 
> multiple files, etc.
> This will require changes to the load table logic that is used in CTAS, 
> insert, load, import, etc.; changes to the logic for getting buckets; and 
> changes to altering table bucketing (to rename the files).
> Users will still be able to use old-style bucketing by specifying a 
> non-strict config setting (not on by default).
> The conversion of existing tables is the biggest issue. Perhaps the existing 
> tables can be "grandfathered" into the non-strict bucketing, with some 
> warnings asking the users to convert, and a command to do so in alter 
> table/analyze table.
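
To make the proposed naming concrete, here is a minimal sketch of how a reader 
could recover the bucket index and count from a file name. It is an 
illustration only, not Hive code: the pattern is inferred from the single 
example in the description (000003_0_bucket_3of32), and the names 
parse_bucket_file_name and BucketInfo are hypothetical. Whether the index is 
zero-based is not stated in the issue, so the sketch does not check it 
against the count.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern, modeled on the one example in the issue description:
# <task id>_<attempt>_bucket_<index>of<count>, e.g. 000003_0_bucket_3of32.
_BUCKET_NAME = re.compile(r"\d+_\d+_bucket_(?P<index>\d+)of(?P<count>\d+)")


class BucketInfo(NamedTuple):
    index: int
    count: int


def parse_bucket_file_name(name: str) -> Optional[BucketInfo]:
    """Return the bucket index and count encoded in the file name.

    Returns None for names that do not follow the proposed scheme
    (old-style names, _copy_N renames, and so on), so the caller can
    decide whether to fall back to non-strict handling or reject the file.
    """
    match = _BUCKET_NAME.fullmatch(name)
    if match is None:
        return None
    index = int(match.group("index"))
    count = int(match.group("count"))
    if count <= 0:
        # A bucket count of zero cannot describe a bucketed table.
        return None
    return BucketInfo(index, count)


if __name__ == "__main__":
    for file_name in ("000003_0_bucket_3of32", "000003_0", "000003_0_copy_1"):
        print(file_name, "->", parse_bucket_file_name(file_name))
```

With names like this, the bucket number no longer depends on the file's 
position in a sorted directory listing, and a file that carries no explicit 
bucket information is detectable rather than silently misassigned.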



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
