I just put in a related thread about this. This would be really nice.
It is just a virtual column; we don't need it in the metadata if we
also have a command like 'show files in partition' so we can inspect
what is there as well.
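
To make the idea concrete, here is a rough sketch of how a query against such a virtual column might look. It assumes a virtual column named filename on the table FOO from the thread below; that column name and the filter value are only assumptions for illustration, not existing Hive syntax.

```sql
-- Sketch only: "filename" is an assumed virtual column, not stored in the
-- data files and not part of the table metadata.
select
  substr(filename, 4, 7) as class_A,
  count(1) as cnt
from FOO
-- Filtering on the virtual column is where file-level pruning could apply.
where filename like 'abc%'   -- 'abc%' is a made-up pattern
group by
  substr(filename, 4, 7);
```

The column would not show up in describe or select *, so a query has to name it explicitly as above.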


On Wed, Sep 16, 2009 at 3:02 PM, Namit Jain <[email protected]> wrote:
> I don’t think it is a good idea to make it a part of table metadata in any
> way.
>
> What happens if the filename changes? It will be very difficult to
> maintain.
>
> But, we can definitely add some virtual columns (FILENAME can be one of them
> to start with). They should not show up in describe, select * etc.
>
> But the user can query based on them. This is mostly for advanced users and
> can be used for pruning etc. as well.
>
> I will open a new jira, and we can continue the discussion there.
>
> -namit
>
> From: Avram Aelony [mailto:[email protected]]
> Sent: Wednesday, September 16, 2009 11:39 AM
> To: [email protected]
> Subject: RE: adding filenames as new columns via Hive
>
> Very cool.  Looking forward to seeing this feature in action… :)
>
>
>
> Thanks,
>
> -A
>
> From: Prasad Chakka [mailto:[email protected]]
> Sent: Wednesday, September 16, 2009 11:33 AM
> To: [email protected]
> Subject: Re: adding filenames as new columns via Hive
>
>
>
> FYI, all partition columns can be used like any regular columns in select
> queries. So it should be fine.
>
> ________________________________
>
> From: Avram Aelony <[email protected]>
> Reply-To: <[email protected]>
> Date: Wed, 16 Sep 2009 11:23:45 -0700
> To: <[email protected]>
> Subject: RE: adding filenames as new columns via Hive
>
> Sounds great, Prasad.
>
> As long as I can further parse the filename field to piece out (new) derived
> fields, I will be happy… :)
> For example, in a later query I’d like to be able to do something like:
>
> select
> substr(filename, 4, 7) as  class_A,
> substr(filename,  8, 10) as class_B,
> count( x ) as cnt
> from FOO
> group by
> substr(filename, 4, 7),
> substr(filename,  8, 10) ;
>
>
> thanks,
> -A
>
>
>
> From: Prasad Chakka [mailto:[email protected]]
> Sent: Wednesday, September 16, 2009 11:10 AM
> To: [email protected]
> Subject: Re: adding filenames as new columns via Hive
>
> I think this can be a good feature though I would like the filename to be a
> partition column (one of such) instead of a separate type of column. Would
> that work?
>
> create external table FOO (  <list of fields and types> )
> row format delimited fields terminated by ','
> partitioned by (file_name FILENAME)
> stored as textfile location 's3:/somebucket/';
>
> Or table partitioned by datestamp and filename
>
> create external table FOO (  <list of fields and types> )
> row format delimited fields terminated by ','
> partitioned by (ds STRING, file_name FILENAME)
> stored as textfile location 's3:/somebucket/';
>
>
> So FILENAME becomes a new type. I like this because partition columns are
> virtual columns just like the filename column and do not exist along with
> data on the disk.
>
> Prasad
>
> ________________________________
>
> From: Avram Aelony <[email protected]>
> Reply-To: <[email protected]>
> Date: Wed, 16 Sep 2009 10:48:33 -0700
> To: <[email protected]>
> Subject: adding filenames as new columns via Hive
>
> Dear Hive list,
>
> I am processing a large volume of files (many files, roughly 500M compressed)
> with Hive that reside in an S3 bucket.  Although the files share the same
> schema,  they have individual filenames that provide useful information that
> does not get captured and does not exist separately as a column within each
> file’s data.  As a general problem, I’d like to be able to add a new column
> via Hive that contains the filename of the files read in that were present
> in the bucket.
>
> My Hive CREATE EXTERNAL TABLE command points to the S3 container bucket, and
> I am thinking that at some point Hadoop or Hive must have a file handle with
> the filenames that perhaps could be of use.  My hope is that this
> information could be added in (upon request) via Hive.  Perhaps this
> could be a new Hive feature request (if it does not currently exist)?
>
> Ideally, the syntax would look something like this:
>
> create external table FOO (  <list of fields and types> )
> row format delimited fields terminated by ','
> add_filename as 'filename'
> stored as textfile location 's3:/somebucket/';
>
>
> Has anyone thought of this?  Is there a way to add a new column within Hive
> that contains the filename?
>
>
>
> Many thanks in advance!!
> -Avram
>
>
>
> Avram Aelony
> Senior Analyst, Matching
> eHarmony.com
