Hi Edward,

Nice explanation. Can you please describe your use case in the comments for HADOOP-4044? It will help make the case for getting this JIRA into trunk sooner rather than later.

thanks,
dhruba

On Mon, Apr 20, 2009 at 7:59 AM, Edward Capriolo <[email protected]> wrote:

> Actually, I am working to get the files moved into the warehouse by
> default :), but I still think there might be a general need for this.
>
> External tables will work in some cases but not in others. For example
> suppose a directory inside hadoop:
> /user/edward/weblogs/{web1.log,web2.log,web3.log}. I can use EXTERNAL
> to point to the parent directory. This will work unless a process
> creates another file in this directory with a different format that
> holds different data. say web_logsummary.csv. (this is my case)
>
> Being able to drop in a 'symlink' where a file would go could be used
> like an SQL VIEW. Or could be used to create structures from already
> existing data. Imagine a user that has a large hadoop deployment and
> wishing to migrate/ start using hive. They would need to recode
> application paths because external table is nice but not very
> flexible. If you had a 'symlink' concept anyone can start using hive
> without re-organizing or copying data.
>
> In the end managing the 'symlinks' could get cumbersome, but I think
> its a powerful concept. Right now hive has a lot of facilities to deal
> with all input formats, such as specifying delimiters etc, that is
> super helpful, but forcing the data either into a warehouse or into an
> external table is limiting.
>
> On Mon, Apr 20, 2009 at 5:29 AM, Jeff Hammerbacher <[email protected]> wrote:
> > Hey Edward,
> >
> > Can you just treat the files as external tables?
> >
> > Later,
> > Jeff
> >
> > On Sun, Apr 19, 2009 at 8:24 AM, Edward Capriolo <[email protected]> wrote:
> >
> >> On Sun, Apr 19, 2009 at 3:19 AM, Dhruba Borthakur <[email protected]> wrote:
> >> > HADOOP-4044 is scheduled to finally make it to 0.21 release. And 0.21 is
> >> > still a while away.
> >> >
> >> > That said, if one imports a data-set (set of files, or directory) into a
> >> > warehouse, isn't it safer to move that dataset into the warehouse itself
> >> > rather than letting it sit outside. For one thing, the target of the symlink
> >> > might not be accessible to all hadoop slave nodes.
> >> >
> >> > -dhruba
> >> >
> >> >
> >> > On Sat, Apr 18, 2009 at 7:41 PM, Edward Capriolo <[email protected]> wrote:
> >> >
> >> >> I was looking at HADOOP-4044. It would be nice to be able to work on
> >> >> files without moving them into the warehouse. Could a SerDe handle a
> >> >> similar task?
> >> >>
> >> >
> >>
> >> Yes it would be safer to move it inside.
> >>
> >> The reason I would like to do this is in our deployment map reduce
> >> programs are creating files outside of the warehouse. I do not want to
> >> move them into the warehouse and I do not want to copy them. Being
> >> able to 'symlink' would allow me to assemble virtual tables/ without
> >> moving data changing the flow of an already existing process.
> >>
> >> So I am only looking to symlink to other files in the same filesystem.
> >> On the extreme end a symlink to an external resource could be very
> >> useful to but that is not what I was thinking of.
> >>
> >
>
