Re: adding filenames as new columns via Hive

Prasad Chakka Wed, 16 Sep 2009 12:14:44 -0700

Yeah, udfs are better semantically. Will this be a problem for partition 
pruning?



________________________________
From: Ashish Thusoo <[email protected]>
Reply-To: <[email protected]>
Date: Wed, 16 Sep 2009 12:11:00 -0700
To: <[email protected]>
Subject: RE: adding filenames as new columns via Hive

You could also do this as a simple udf instead of a virtual column. Virtual 
columns do get shown in the describe command and I don't think it would make 
sense to show this in the describe command. So instead of


Select FILENAME, xyz from T

We could just do

Select Filename(), xyz from T

Thoughts?

Ashish

-----Original Message-----
From: Edward Capriolo [mailto:[email protected]]
Sent: Wednesday, September 16, 2009 12:05 PM
To: [email protected]
Subject: Re: adding filenames as new columns via Hive

I just put in a related thread about this. This would be really nice.
It is just a virtual column, we dont need it in the metadata if we also have a 
command like 'show files in partition' so we can inspect what is there as well.


On Wed, Sep 16, 2009 at 3:02 PM, Namit Jain <[email protected]> wrote:
> I don't think it is a good idea to make it a part of table metadata in
> any way.
>
> What happens if the filename changes ? It will be very difficult to
> maintain.
>
> But, we can definitely add some virtual columns (FILENAME can be one
> of them
>
> to start with - it should not show up in describe, select * etc.
>
>
>
> But, the user can query based on them - this is mostly for advanced
> users and
>
> can be used for pruning etc. also
>
>
>
>
>
> I will open a new jira, and we can continue the discussion there.
>
>
>
>
>
> -namit
>
>
>
>
>
>
>
>
>
> From: Avram Aelony [mailto:[email protected]]
> Sent: Wednesday, September 16, 2009 11:39 AM
> To: [email protected]
> Subject: RE: adding filenames as new columns via Hive
>
>
>
>
>
> Very cool.  Looking forward to seeing this feature in action. J
>
>
>
> Thanks,
>
> -A
>
>
>
>
>
> From: Prasad Chakka [mailto:[email protected]]
> Sent: Wednesday, September 16, 2009 11:33 AM
> To: [email protected]
> Subject: Re: adding filenames as new columns via Hive
>
>
>
> FYI, all partition columns can be used as any regular columns select
> queries. So it should be fine.
>
> ________________________________
>
> From: Avram Aelony <[email protected]>
> Reply-To: <[email protected]>
> Date: Wed, 16 Sep 2009 11:23:45 -0700
> To: <[email protected]>
> Subject: RE: adding filenames as new columns via Hive
>
> Sounds great, Prasad.
>
> As long as I can further parse the filename field to piece out (new)
> derived fields, I will be happy. J For example, in a later query I'd
> like to be able to do something like:
>
> select
> substr(filename, 4, 7) as  class_A,
> substr(filename,  8, 10) as class_B
> count( x ) as cnt
> from FOO
> group by
> substr(filename, 4, 7),
> substr(filename,  8, 10) ;
>
>
> thanks,
> -A
>
>
>
> From: Prasad Chakka [mailto:[email protected]]
> Sent: Wednesday, September 16, 2009 11:10 AM
> To: [email protected]
> Subject: Re: adding filenames as new columns via Hive
>
> I think this can be a good feature though I would like the filename to
> be a partition column (one of such) instead of a separate type of
> column. Would that work?
>
> create external table FOO (  <list of fields and types> ) row format
> delimited fields terminated by ','
> partitioned by (file_name FILENAME)
> stored as textfile location 's3:/somebucket/';
>
> Or table partitioned by datestamp and filename
>
> create external table FOO (  <list of fields and types> ) row format
> delimited fields terminated by ','
> Partitioned by (ds STRING, file_name FILENAME) stored as textfile
> location 's3:/somebucket/';
>
>
> So FILENAME becomes a new type. I like this because partition columns
> are virtual columns just like the filename column and do not exist
> along with data on the disk.
>
> Prasad
>
> ________________________________
>
> From: Avram Aelony <[email protected]>
> Reply-To: <[email protected]>
> Date: Wed, 16 Sep 2009 10:48:33 -0700
> To: <[email protected]>
> Subject: adding filenames as new columns via Hive
>
> Dear Hive list,
>
> I am processing a large volume of files (many files, roughly 500M
> compressed
> ) with Hive that reside in an S3 bucket.  Although the files share the
> same schema,  they have individual filenames that provide useful
> information that does not get captured and does not exist separately
> as a column within each file's data.  As a general problem, I'd like
> to be able to add a new column via Hive that contains the filename of
> the files read in that were present in the bucket.
>
> My Hive CREATE EXTERNAL TABLE command points to the S3 container
> bucket, and I am thinking that at some point Hadoop or Hive must have
> a file handle with the filenames that perhaps could be of use.  My
> hope is that this information could be added in (upon request) via
> Hive.   Perhaps as this could be a new Hive feature request (if it does not 
> currently exist) ??
>
> Ideally, the syntax would look something like this:
>
> create external table FOO (  <list of fields and types> ) row format
> delimited fields terminated by ','
> add_filename as 'filename'
> stored as textfile location 's3:/somebucket/';
>
>
> Has anyone thought of this?  Is there a way to add a new column within
> Hive that contains the filename?
>
>
>
> Many thanks in advance!!
> -Avram
>
>
>
> Avram Aelony
> Senior Analyst, Matching
> eHarmony.com

Re: adding filenames as new columns via Hive

Reply via email to