[ 
https://issues.apache.org/jira/browse/HIVE-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782529#action_12782529
 ] 

Avram Aelony commented on HIVE-951:
-----------------------------------

I think the filename can carry important information (e.g. a datestamp or the 
name of the type of data it represents) that users will want to parse out and 
then group by.

Imagine a few years' worth of data where 4 or more filetypes (each filetype 
having a different set of columns) are output to a bucket every day (e.g. 
20091125_type_A.gz, 20091125_type_B.gz, 20091125_type_C.gz, 
20091125_type_D.gz).  In fact, each day can contain 20 or more large files per 
filetype (e.g. 20091125_type_A_01.gz, 20091125_type_A_02.gz, 
20091125_type_A_03.gz, ..., 20091125_type_A_20.gz, and likewise for B, C, D).

It would be nice to be able to parse out new variables for date, type, and 
type_number (e.g. 01, 02, ..., 20) and to compute various aggregated metrics 
via a group by on these variables parsed from the filenames. Hopefully this 
parsing would not be much of a performance bottleneck, though I am not sure.
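
To make the idea concrete, here is a minimal sketch in Python (not Hive) of the 
parsing and grouping described above. The regex, the function names, and the 
(filename, row_count) input are all assumptions made for illustration; they 
only mirror the example filenames in this comment and are not an existing Hive 
feature.

```python
import re
from collections import defaultdict

# Hypothetical pattern for names such as 20091125_type_A_01.gz or
# 20091125_type_A.gz (the numeric suffix is optional).
FILENAME_RE = re.compile(
    r"^(?P<date>\d{8})_type_(?P<type>[A-Z])(?:_(?P<number>\d{2}))?\.gz$"
)


def parse_filename(name):
    """Return the date, type and number parsed from a filename, or None."""
    match = FILENAME_RE.match(name)
    return match.groupdict() if match else None


def rows_per_group(records):
    """Sum row counts grouped by (date, type).

    `records` is an iterable of (filename, row_count) pairs; filenames that
    do not match the pattern are skipped.
    """
    totals = defaultdict(int)
    for filename, row_count in records:
        fields = parse_filename(filename)
        if fields is None:
            continue
        totals[(fields["date"], fields["type"])] += row_count
    return dict(totals)
```

The parse is one regex match per file rather than per row, so in a sketch like 
this its cost scales with the number of files, not the volume of data.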

So I think there are two needs: a way to select the files in an S3 bucket that 
match a regex, and a way to capture filename information so that it is 
available later for parsing and grouping.  It may be possible to meet both 
needs with one feature, but I don't know enough about Hive/Hadoop internals to 
judge.
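
As a rough illustration of how one regex could serve both needs, the Python 
sketch below filters a list of object keys and keeps the captured groups for 
each key that matches. The function name, the key list, and the pattern are 
hypothetical; this says nothing about how Hive or Hadoop would implement it.

```python
import re


def select_and_capture(keys, pattern):
    """Keep only keys matching `pattern`, paired with their named groups."""
    regex = re.compile(pattern)
    selected = []
    for key in keys:
        match = regex.search(key)
        if match:
            # The same match object both selects the file and carries the
            # captured filename fields.
            selected.append((key, match.groupdict()))
    return selected
```

For example, a pattern such as 
`(?P<date>2009\d{4})_type_(?P<type>A)_(?P<number>\d{2})\.gz$` would select only 
the type A files from 2009 and expose date, type, and number for each of them.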

> Selectively include EXTERNAL TABLE source files via REGEX
> ---------------------------------------------------------
>
>                 Key: HIVE-951
>                 URL: https://issues.apache.org/jira/browse/HIVE-951
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Carl Steinbach
>
> CREATE EXTERNAL TABLE should allow users to cherry-pick files via regular
> expression.
> CREATE EXTERNAL TABLE was designed to allow users to access data that exists
> outside of Hive, and currently makes the assumption that all of the files
> located under the supplied path should be included in the new table. Users
> frequently encounter directories containing multiple datasets, or directories
> that contain data in heterogeneous schemas, and it's often impractical or
> impossible to adjust the layout of the directory to meet the requirements of
> CREATE EXTERNAL TABLE. A good example of this problem is creating an external
> table based on the contents of an S3 bucket.
> One way to solve this problem is to extend the syntax of CREATE EXTERNAL TABLE
> as follows:
> CREATE EXTERNAL TABLE
> ...
> LOCATION path [file_regex]
> ...
> For example:
> {code:sql}
> CREATE EXTERNAL TABLE mytable1 ( a string, b string, c string )
> STORED AS TEXTFILE
> LOCATION 's3://my.bucket/' 'folder/2009.*\.bz2$';
> {code}
> Creates mytable1, which includes all files in s3://my.bucket/ with a filename 
> matching 'folder/2009*.bz2'
> {code:sql}
> CREATE EXTERNAL TABLE mytable2 ( d string, e int, f int, g int )
> STORED AS TEXTFILE 
> LOCATION 'hdfs://data/' 'xyz.*2009.{4}\.bz2$';
> {code}
> Creates mytable2 including all files matching 'xyz*2009????.bz2' located 
> under hdfs://data/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
