Yeah, udfs are better semantically. Will this be a problem for partition pruning?
________________________________ From: Ashish Thusoo <[email protected]> Reply-To: <[email protected]> Date: Wed, 16 Sep 2009 12:11:00 -0700 To: <[email protected]> Subject: RE: adding filenames as new columns via Hive You could also do this as a simple udf instead of a virtual column. Virtual columns do get shown in the describe command and I don't think it would make sense to show this in the describe command. So instead of Select FILENAME, xyz from T We could just do Select Filename(), xyz from T Thoughts? Ashish -----Original Message----- From: Edward Capriolo [mailto:[email protected]] Sent: Wednesday, September 16, 2009 12:05 PM To: [email protected] Subject: Re: adding filenames as new columns via Hive I just put in a related thread about this. This would be really nice. It is just a virtual column, we dont need it in the metadata if we also have a command like 'show files in partition' so we can inspect what is there as well. On Wed, Sep 16, 2009 at 3:02 PM, Namit Jain <[email protected]> wrote: > I don't think it is a good idea to make it a part of table metadata in > any way. > > What happens if the filename changes ? It will be very difficult to > maintain. > > But, we can definitely add some virtual columns (FILENAME can be one > of them > > to start with - it should not show up in describe, select * etc. > > > > But, the user can query based on them - this is mostly for advanced > users and > > can be used for pruning etc. also > > > > > > I will open a new jira, and we can continue the discussion there. > > > > > > -namit > > > > > > > > > > From: Avram Aelony [mailto:[email protected]] > Sent: Wednesday, September 16, 2009 11:39 AM > To: [email protected] > Subject: RE: adding filenames as new columns via Hive > > > > > > Very cool. Looking forward to seeing this feature in action. J > > > > Thanks, > > -A > > > > > > From: Prasad Chakka [mailto:[email protected]] > Sent: Wednesday, September 16, 2009 11:33 AM > To: [email protected] > Subject: Re: adding filenames as new columns via Hive > > > > FYI, all partition columns can be used as any regular columns select > queries. So it should be fine. > > ________________________________ > > From: Avram Aelony <[email protected]> > Reply-To: <[email protected]> > Date: Wed, 16 Sep 2009 11:23:45 -0700 > To: <[email protected]> > Subject: RE: adding filenames as new columns via Hive > > Sounds great, Prasad. > > As long as I can further parse the filename field to piece out (new) > derived fields, I will be happy. J For example, in a later query I'd > like to be able to do something like: > > select > substr(filename, 4, 7) as class_A, > substr(filename, 8, 10) as class_B > count( x ) as cnt > from FOO > group by > substr(filename, 4, 7), > substr(filename, 8, 10) ; > > > thanks, > -A > > > > From: Prasad Chakka [mailto:[email protected]] > Sent: Wednesday, September 16, 2009 11:10 AM > To: [email protected] > Subject: Re: adding filenames as new columns via Hive > > I think this can be a good feature though I would like the filename to > be a partition column (one of such) instead of a separate type of > column. Would that work? > > create external table FOO ( <list of fields and types> ) row format > delimited fields terminated by ',' > partitioned by (file_name FILENAME) > stored as textfile location 's3:/somebucket/'; > > Or table partitioned by datestamp and filename > > create external table FOO ( <list of fields and types> ) row format > delimited fields terminated by ',' > Partitioned by (ds STRING, file_name FILENAME) stored as textfile > location 's3:/somebucket/'; > > > So FILENAME becomes a new type. I like this because partition columns > are virtual columns just like the filename column and do not exist > along with data on the disk. > > Prasad > > ________________________________ > > From: Avram Aelony <[email protected]> > Reply-To: <[email protected]> > Date: Wed, 16 Sep 2009 10:48:33 -0700 > To: <[email protected]> > Subject: adding filenames as new columns via Hive > > Dear Hive list, > > I am processing a large volume of files (many files, roughly 500M > compressed > ) with Hive that reside in an S3 bucket. Although the files share the > same schema, they have individual filenames that provide useful > information that does not get captured and does not exist separately > as a column within each file's data. As a general problem, I'd like > to be able to add a new column via Hive that contains the filename of > the files read in that were present in the bucket. > > My Hive CREATE EXTERNAL TABLE command points to the S3 container > bucket, and I am thinking that at some point Hadoop or Hive must have > a file handle with the filenames that perhaps could be of use. My > hope is that this information could be added in (upon request) via > Hive. Perhaps as this could be a new Hive feature request (if it does not > currently exist) ?? > > Ideally, the syntax would look something like this: > > create external table FOO ( <list of fields and types> ) row format > delimited fields terminated by ',' > add_filename as 'filename' > stored as textfile location 's3:/somebucket/'; > > > Has anyone thought of this? Is there a way to add a new column within > Hive that contains the filename? > > > > Many thanks in advance!! > -Avram > > > > Avram Aelony > Senior Analyst, Matching > eHarmony.com
