[ 
https://issues.apache.org/jira/browse/HIVE-16223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-16223:
------------------------------------
    Description: 
Bucketing in Hive is currently very fragile.
1) Some places determine bucket number from file name.
2) Some places determine bucket number from a file's "index" in a sorted list 
of files in the directory.
3) It is possible to import files into a bucketed table without any regard for 
either.

On top of that, weird rename paths (like _copy_1), subdirectories (e.g. from 
Tez union, or just tables read with recursive input enabled), repeated inserts 
into the same table, etc. can mess with either scheme.
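
To illustrate how schemes 1) and 2) can disagree, here is a minimal sketch in Java. It is not Hive's actual code: the class and method names are hypothetical, and it assumes a simplified legacy layout in which the leading digits of the file name are the bucket number.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BucketSchemes {
    // Assumption: legacy names look like "000002_0" or "000002_0_copy_1",
    // and the leading digits are the bucket number.
    private static final Pattern LEADING_DIGITS = Pattern.compile("^(\\d{1,9})_.*");

    /** Scheme 1: bucket number parsed from the file name; -1 if the name does not match. */
    static int bucketFromName(String fileName) {
        Matcher m = LEADING_DIGITS.matcher(fileName);
        return m.matches() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** Scheme 2: bucket number is the file's position in the sorted directory listing. */
    static int bucketFromSortedIndex(List<String> dirListing, String fileName) {
        List<String> sorted = new ArrayList<>(dirListing);
        Collections.sort(sorted);
        return sorted.indexOf(fileName);
    }

    public static void main(String[] args) {
        // Hypothetical directory in which no file was written for bucket 1.
        List<String> files = new ArrayList<>();
        files.add("000000_0");
        files.add("000002_0");
        for (String f : files) {
            System.out.println(f + ": by name = " + bucketFromName(f)
                    + ", by sorted index = " + bucketFromSortedIndex(files, f));
        }
    }
}
```

With a missing bucket file, the two schemes assign different bucket numbers to 000002_0. A _copy_1 file from a repeated insert shifts the sorted index in the same way.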

Therefore I propose we include bucket index and count explicitly in the file 
name (e.g. 000003_0_bucket_3of32). It will alleviate the above, and also may 
simplify some pieces of code that try to account for missing bucket files, 
multiple files, etc.
This will require changes to the load-table logic used in CTAS, insert, load, 
import, etc.; to the logic that gets buckets; and to the logic that alters 
table bucketing (to rename the files).
Users will still be able to use old-style bucketing by specifying a non-strict 
config setting (not on by default).
The conversion of existing tables is the biggest issue. Perhaps the existing 
tables can be "grandfathered" into the non-strict bucketing, with some warnings 
asking the users to convert, and a command to do so in alter table/analyze 
table.
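
As a rough sketch of the proposed naming, the following Java helper formats and parses names shaped like the example 000003_0_bucket_3of32. The grammar is inferred from that single example, so the following are all assumptions: the first two fields are a task id and an attempt number, the bucket index is zero-based and less than the count, and the marker may be followed by a rename suffix. The class name BucketFileName is hypothetical and is not an existing Hive API.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class BucketFileName {
    // Assumed marker: "_bucket_<index>of<count>", possibly followed by a
    // rename suffix such as "_copy_1".
    private static final Pattern MARKER =
            Pattern.compile("_bucket_(\\d{1,9})of(\\d{1,9})(?:_|$)");

    final int index; // assumed zero-based
    final int count;

    private BucketFileName(int index, int count) {
        this.index = index;
        this.count = count;
    }

    /** Builds a name such as "000003_0_bucket_3of32". */
    static String format(int taskId, int attempt, int index, int count) {
        if (count <= 0 || index < 0 || index >= count) {
            throw new IllegalArgumentException("bad bucket " + index + " of " + count);
        }
        return String.format(Locale.ROOT, "%06d_%d_bucket_%dof%d",
                taskId, attempt, index, count);
    }

    /** Returns the explicit bucket info, or empty for old-style or invalid names. */
    static Optional<BucketFileName> parse(String fileName) {
        Matcher m = MARKER.matcher(fileName);
        if (!m.find()) {
            return Optional.empty();
        }
        int index = Integer.parseInt(m.group(1));
        int count = Integer.parseInt(m.group(2));
        if (count <= 0 || index >= count) {
            return Optional.empty();
        }
        return Optional.of(new BucketFileName(index, count));
    }
}
```

Under this sketch, a reader would no longer depend on directory order or on the leading digits. An old-style name such as 000003_0 parses to empty, which is where a non-strict fallback or a conversion warning could hook in.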

  was:
Bucketing in Hive is currently very fragile.
1) Some places determine bucket number from file name.
2) Some places determine bucket number from a file's "index" in a sorted list 
of file in the directory.
3) It is possible to import files into a bucketed table without any regard for 
either.

On top of that, weird rename paths (like _copy_1), subdirectories (e.g. from 
Tez union, or just tables read with recursive input enabled), repeated inserts 
into the same table, etc. can mess with either scheme.

Therefore I propose we include bucket index and count explicitly in the file 
name (e.g. 000003_0_bucket_3of32). It will alleviate the above, and also may 
simplify some pieces of code that try to account for missing bucket files, 
multiple files, etc.
This will require changes to load table logic that is used in ctas, insert, 
load, import etc.; change in logic when getting buckets, as well as when 
altering table bucketing (to rename the files).
Users will still be able to use old-style bucketing by specifying a non-strict 
config setting (not on by default).
The conversion of existing tables is the biggest issue. Perhaps the existing 
tables can be "grandfathered" into the non-strict bucketing, with some warnings 
asking the users to convert, and a command to do so in alter table/analyze 
table.


> deterministic file naming for bucketing in Hive
> -----------------------------------------------
>
>                 Key: HIVE-16223
>                 URL: https://issues.apache.org/jira/browse/HIVE-16223
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>
> Bucketing in Hive is currently very fragile.
> 1) Some places determine bucket number from file name.
> 2) Some places determine bucket number from a file's "index" in a sorted list 
> of files in the directory.
> 3) It is possible to import files into a bucketed table without any regard 
> for either.
> On top of that, weird rename paths (like _copy_1), subdirectories (e.g. from 
> Tez union, or just tables read with recursive input enabled), repeated 
> inserts into the same table, etc. can mess with either scheme.
> Therefore I propose we include bucket index and count explicitly in the file 
> name (e.g. 000003_0_bucket_3of32). It will alleviate the above, and also may 
> simplify some pieces of code that try to account for missing bucket files, 
> multiple files, etc.
> This will require changes to the load-table logic used in CTAS, insert, 
> load, import, etc.; to the logic that gets buckets; and to the logic that 
> alters table bucketing (to rename the files).
> Users will still be able to use old-style bucketing by specifying a 
> non-strict config setting (not on by default).
> The conversion of existing tables is the biggest issue. Perhaps the existing 
> tables can be "grandfathered" into the non-strict bucketing, with some 
> warnings asking the users to convert, and a command to do so in alter 
> table/analyze table.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
