[ https://issues.apache.org/jira/browse/HIVE-16223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926991#comment-15926991 ]
Eugene Koifman commented on HIVE-16223:
---------------------------------------
+1. We must have well-defined (and enforced) rules about directory structure
and file names.
> deterministic file naming for bucketing in Hive
> -----------------------------------------------
>
> Key: HIVE-16223
> URL: https://issues.apache.org/jira/browse/HIVE-16223
> Project: Hive
> Issue Type: Bug
> Reporter: Sergey Shelukhin
>
> Bucketing in Hive is currently very fragile.
> 1) Some places determine bucket number from file name.
> 2) Some places determine bucket number from a file's "index" in a sorted list
> of files in the directory.
> 3) It is possible to import files into a bucketed table without any regard
> for either.
> On top of that, irregular rename suffixes (like _copy_1), subdirectories
> (e.g. from Tez union, or just tables read with recursive input enabled),
> repeated inserts into the same table, etc. can break either scheme.
> Therefore I propose we include bucket index and count explicitly in the file
> name (e.g. 000003_0_bucket_3of32). It will alleviate the above, and also may
> simplify some pieces of code that try to account for missing bucket files,
> multiple files, etc.
> This will require changes to the load table logic that is used in CTAS,
> insert, load, import, etc.; to the logic that gets buckets; and to the logic
> that alters table bucketing (to rename the files).
> Users will still be able to use old-style bucketing by specifying a
> non-strict config setting (not on by default).
> The conversion of existing tables is the biggest issue. Perhaps the existing
> tables can be "grandfathered" into the non-strict bucketing, with some
> warnings asking the users to convert, and a command to do so in alter
> table/analyze table.
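
To make the quoted proposal concrete, here is a minimal Python sketch (not Hive code). It assumes the proposed name has the shape <task>_<attempt>_bucket_<index>of<count>, inferred only from the single example 000003_0_bucket_3of32, and that the index is zero-based; the issue does not spell out either point. The helper names (parse_proposed, bucket_from_name, bucket_from_position) are hypothetical. The last two helpers mimic schemes 1) and 2) from the description to show how they can disagree when a bucket file is missing.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed shape of the proposed name, e.g. "000003_0_bucket_3of32".
# Inferred from the one example in the issue; not an actual Hive format.
PROPOSED_NAME = re.compile(r"^(\d+)_(\d+)_bucket_(\d+)of(\d+)$")

# Old-style name: leading digits are read as the bucket number.
# Anything after "<digits>_<digits>" (such as "_copy_1") is ignored.
LEGACY_NAME = re.compile(r"^(\d+)_(\d+)")


class BucketInfo(NamedTuple):
    index: int  # assumed zero-based
    count: int


def parse_proposed(name: str) -> Optional[BucketInfo]:
    """Return the explicit bucket index and count, or None if the name
    does not follow the assumed proposed format."""
    m = PROPOSED_NAME.match(name)
    if m is None:
        return None
    index, count = int(m.group(3)), int(m.group(4))
    if not 0 <= index < count:
        return None  # reject inconsistent names such as "..._32of32"
    return BucketInfo(index, count)


def bucket_from_name(name: str) -> Optional[int]:
    """Scheme 1: derive the bucket number from the file name prefix."""
    m = LEGACY_NAME.match(name)
    return int(m.group(1)) if m else None


def bucket_from_position(files: List[str], name: str) -> int:
    """Scheme 2: derive the bucket number from the file's position in the
    sorted directory listing."""
    return sorted(files).index(name)


# A directory where the file for bucket 1 is missing.
files = ["000000_0", "000002_0", "000003_0"]
print(bucket_from_name("000002_0"))             # 2
print(bucket_from_position(files, "000002_0"))  # 1, disagrees with scheme 1
print(parse_proposed("000003_0_bucket_3of32"))  # BucketInfo(index=3, count=32)
```

With the index and count carried in the name itself, a reader no longer has to infer the bucket from the directory listing, so a missing or extra file cannot shift the numbering.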