I just put in a related thread about this. This would be really nice. It is just a virtual column, we dont need it in the metadata if we also have a command like 'show files in partition' so we can inspect what is there as well.
On Wed, Sep 16, 2009 at 3:02 PM, Namit Jain <[email protected]> wrote: > I don’t think it is a good idea to make it a part of table metadata in any > way. > > What happens if the filename changes ? It will be very difficult to > maintain. > > But, we can definitely add some virtual columns (FILENAME can be one of them > > to start with – it should not show up in describe, select * etc. > > > > But, the user can query based on them – this is mostly for advanced users > and > > can be used for pruning etc. also > > > > > > I will open a new jira, and we can continue the discussion there. > > > > > > -namit > > > > > > > > > > From: Avram Aelony [mailto:[email protected]] > Sent: Wednesday, September 16, 2009 11:39 AM > To: [email protected] > Subject: RE: adding filenames as new columns via Hive > > > > > > Very cool. Looking forward to seeing this feature in action… J > > > > Thanks, > > -A > > > > > > From: Prasad Chakka [mailto:[email protected]] > Sent: Wednesday, September 16, 2009 11:33 AM > To: [email protected] > Subject: Re: adding filenames as new columns via Hive > > > > FYI, all partition columns can be used as any regular columns select > queries. So it should be fine. > > ________________________________ > > From: Avram Aelony <[email protected]> > Reply-To: <[email protected]> > Date: Wed, 16 Sep 2009 11:23:45 -0700 > To: <[email protected]> > Subject: RE: adding filenames as new columns via Hive > > Sounds great, Prasad. > > As long as I can further parse the filename field to piece out (new) derived > fields, I will be happy… J > For example, in a later query I’d like to be able to do something like: > > select > substr(filename, 4, 7) as class_A, > substr(filename, 8, 10) as class_B > count( x ) as cnt > from FOO > group by > substr(filename, 4, 7), > substr(filename, 8, 10) ; > > > thanks, > -A > > > > From: Prasad Chakka [mailto:[email protected]] > Sent: Wednesday, September 16, 2009 11:10 AM > To: [email protected] > Subject: Re: adding filenames as new columns via Hive > > I think this can be a good feature though I would like the filename to be a > partition column (one of such) instead of a separate type of column. Would > that work? > > create external table FOO ( <list of fields and types> ) > row format delimited fields terminated by ',' > partitioned by (file_name FILENAME) > stored as textfile location 's3:/somebucket/’; > > Or table partitioned by datestamp and filename > > create external table FOO ( <list of fields and types> ) > row format delimited fields terminated by ',' > Partitioned by (ds STRING, file_name FILENAME) > stored as textfile location 's3:/somebucket/’; > > > So FILENAME becomes a new type. I like this because partition columns are > virtual columns just like the filename column and do not exist along with > data on the disk. > > Prasad > > ________________________________ > > From: Avram Aelony <[email protected]> > Reply-To: <[email protected]> > Date: Wed, 16 Sep 2009 10:48:33 -0700 > To: <[email protected]> > Subject: adding filenames as new columns via Hive > > Dear Hive list, > > I am processing a large volume of files (many files, roughly 500M compressed > ) with Hive that reside in an S3 bucket. Although the files share the same > schema, they have individual filenames that provide useful information that > does not get captured and does not exist separately as a column within each > file’s data. As a general problem, I’d like to be able to add a new column > via Hive that contains the filename of the files read in that were present > in the bucket. > > My Hive CREATE EXTERNAL TABLE command points to the S3 container bucket, and > I am thinking that at some point Hadoop or Hive must have a file handle with > the filenames that perhaps could be of use. My hope is that this > information could be added in (upon request) via Hive. Perhaps as this > could be a new Hive feature request (if it does not currently exist) ?? > > Ideally, the syntax would look something like this: > > create external table FOO ( <list of fields and types> ) > row format delimited fields terminated by ',' > add_filename as ‘filename’ > stored as textfile location 's3:/somebucket/’; > > > Has anyone thought of this? Is there a way to add a new column within Hive > that contains the filename? > > > > Many thanks in advance!! > -Avram > > > > Avram Aelony > Senior Analyst, Matching > eHarmony.com
