[ 
https://issues.apache.org/jira/browse/HIVE-22004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditya Shah reassigned HIVE-22004:
----------------------------------

    Assignee: Aditya Shah

> Non-acid to acid conversion doesn't handle random filenames
> -----------------------------------------------------------
>
>                 Key: HIVE-22004
>                 URL: https://issues.apache.org/jira/browse/HIVE-22004
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>            Reporter: Aditya Shah
>            Assignee: Aditya Shah
>            Priority: Major
>
> Right now, the only filename patterns supported for the files of a non-acid
> table being converted to acid (the original files) are those created by Hive
> itself (e.g. 000000, 000000_COPY_1, bucket_00000). At the same time, Hive
> non-acid tables support reading files with random filenames. We should
> support the same for acid tables.
> One way to handle this would be to rename such files. Rename is not a
> costly operation on HDFS, but for non-acid tables located on a blobstore
> such as S3, renaming files with random filenames adds costly steps to the
> conversion to acid.
> Current scenario: each original file is assigned a logical bucket id; files
> with unrecognized patterns are assigned -1 and ignored.
> Proposed alternatives:
> 1) Assume logical bucket id 0 for all random files, so that they belong to
> the same bucket, similar to how we handle multiple files with the same
> bucket id (_copy_N).
> 2) Sort all random files lexicographically and assign them bucket ids
> sequentially, similar to the handling of multiple files for a non-bucketed
> table, where we extract the bucket id simply from the filenames.
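
The current behaviour and the two proposed alternatives could be sketched roughly as follows. This is an illustration only, not Hive's actual code: the class BucketIdSketch, its method names, and the simplified filename patterns are hypothetical, and starting the sequential ids at 0 in alternative 2 is an assumption that the description does not specify.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BucketIdSketch {
    // Simplified, hypothetical approximations of Hive-generated names such as
    // 000000, 000000_0, 000000_COPY_1 and bucket_00000 (not Hive's real regexes).
    private static final Pattern LEGACY =
        Pattern.compile("^(\\d+)(_\\d+)?(_copy_\\d+)?$", Pattern.CASE_INSENSITIVE);
    private static final Pattern BUCKET = Pattern.compile("^bucket_(\\d+)$");

    // Current scenario: recognized names yield a bucket id, anything else yields -1.
    static int parseBucketId(String fileName) {
        Matcher m = LEGACY.matcher(fileName);
        if (m.matches()) {
            return Integer.parseInt(m.group(1));
        }
        m = BUCKET.matcher(fileName);
        if (m.matches()) {
            return Integer.parseInt(m.group(1));
        }
        return -1;
    }

    // Alternative 1: every unrecognized file is treated as belonging to bucket 0.
    static Map<String, Integer> assignAllToBucketZero(List<String> fileNames) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String name : fileNames) {
            int id = parseBucketId(name);
            result.put(name, id >= 0 ? id : 0);
        }
        return result;
    }

    // Alternative 2: unrecognized files are sorted lexicographically and given
    // sequential bucket ids. Starting at 0 is an assumption of this sketch;
    // collisions with ids of recognized files are not handled here.
    static Map<String, Integer> assignSequential(List<String> fileNames) {
        Map<String, Integer> result = new LinkedHashMap<>();
        List<String> unrecognized = new ArrayList<>();
        for (String name : fileNames) {
            int id = parseBucketId(name);
            if (id >= 0) {
                result.put(name, id);
            } else {
                unrecognized.add(name);
            }
        }
        Collections.sort(unrecognized);
        int next = 0;
        for (String name : unrecognized) {
            result.put(name, next++);
        }
        return result;
    }
}
```

Under alternative 1 all random files share one logical bucket, while alternative 2 spreads them over as many buckets as there are random files; neither requires renaming anything on the blobstore.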



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
