Allow me to summarize my previous response: +1 adding filename +0 dir array -1 for not including in *
On Thu, Apr 23, 2015 at 7:28 PM, Tomer Shiran <[email protected]> wrote: > +1 to adding the filename (needed this last week, I had <user_id>.json > files and wanted to join with another table) > +1 to using an array dirs[] > +1 to not having it in * (but would "select dirs, *" work?) > > > > > On Apr 23, 2015, at 7:00 PM, Steven Phillips <[email protected]> > wrote: > > > > What you are showing for the current behavior seems wrong to me: > > > > $ tree mytdir > > mytdir > > └── mysdir > > └── myFile.json > > > > $ cat mytdir/mysdir/myFile.json > > {a:1,b:2,c:3} > > {a:4,b:5,c:6} > > > > 0: jdbc:drill:> select * from `mytdir/mysdir/myFile.json`; > > +------------+------------+------------+ > > | a | b | c | > > +------------+------------+------------+ > > | 1 | 2 | 3 | > > | 4 | 5 | 6 | > > +------------+------------+------------+ > > 2 rows selected (0.274 seconds) > > 0: jdbc:drill:> select * from `mytdir/mysdir/myFile.json`; > > +------------+------------+------------+ > > | a | b | c | > > +------------+------------+------------+ > > | 1 | 2 | 3 | > > | 4 | 5 | 6 | > > +------------+------------+------------+ > > 2 rows selected (0.152 seconds) > > 0: jdbc:drill:> select * from `/mytdir/mysdir`; > > +------------+------------+------------+ > > | a | b | c | > > +------------+------------+------------+ > > | 1 | 2 | 3 | > > | 4 | 5 | 6 | > > +------------+------------+------------+ > > 2 rows selected (0.157 seconds) > > 0: jdbc:drill:> select * from `mytdir`; > > +------------+------------+------------+------------+ > > | dir0 | a | b | c | > > +------------+------------+------------+------------+ > > | mysdir | 1 | 2 | 3 | > > | mysdir | 4 | 5 | 6 | > > +------------+------------+------------+------------+ > > > > I don't know why in your example, you are getting a dir0 directory when > > selecting a specific file. These directories should only be included when > > the specified table is a directory which contains subdirectories. Any > query > > to a specific file or to a directory that only contains regular files > > should not return dir* columns. > > I think this is the correct behavior. > > > > The fact that `mytidir` and `mytdir/mysdir` have different columns is > not a > > problem, because they are different tables. > > > > I do think Daniel's idea of adding the file name as well makes sense. I'm > > also open to Ted's idea for return a dir array instead of individual > > columns. > > > > On Thu, Apr 23, 2015 at 6:36 PM, Julian Hyde <[email protected]> > wrote: > > > >>> Ted wrote: > >>> > >>> For one thing, I can make a really slow version of [find] ! > >> > >> Why does it have to be slow? Seriously, so many of the tools we use > >> daily have quasi-query facilities (find, git log, du, ps, netstat) and > >> we cobble together queries using complex options and pipelines of unix > >> commands. Relational algebra is a potentially MORE efficient. > >> > >> I find myself writing ' ... | sort | uniq -c | sort -nr' almost daily > >> and wish I could write ' ... order by count(*) desc'. > >> > >>> On Thu, Apr 23, 2015 at 6:27 PM, Julian Hyde <[email protected]> > wrote: > >>> +1 to returning directories as context. Very useful feature. Could be > >>> used to return context for other adapters (e.g. an adapter that > >>> concatenates all versions of versioned logfiles). > >>> > >>> +1 making dir an array, per Ted's suggestion > >>> > >>> I think dir should not appear in *; thus you'd have to write > >>> > >>> select dir, * from `/mytdir/mysdir/myfile.json` > >>> > >>> This behavior is analogous to Oracle's ROWID. It is not a column as > >>> such, but a system function that you can apply to a row. > >>> > >>> You need to allow qualifiers: > >>> > >>> select x.dir, x.*, y.dir, y.* from `/mytdir/mysdir/myfile.json` as > >>> x, `/mytdir/mysdir/myfile2.json` as y > >>> > >>> and > >>> > >>> select dir from `/mytdir/mysdir/myfile.json` as x, > >>> `/mytdir/mysdir/myfile2.json` as y > >>> > >>> would be illegal because dir is ambiguous. > >>> > >>> You should make dir a reserved word (like ROWID). > >>> > >>> On Thu, Apr 23, 2015 at 5:12 PM, Ted Dunning <[email protected]> > >> wrote: > >>>> Great point. > >>>> > >>>> Having the file name itself is very handy. > >>>> > >>>> > >>>> For one thing, I can make a really slow version of [find] ! > >>>> > >>>> (seriously, I would love this) > >>>> > >>>> > >>>> On Thu, Apr 23, 2015 at 7:48 PM, rahul challapalli < > >>>> [email protected]> wrote: > >>>> > >>>>> I am also under the opinion that we should not assume knowledge on > the > >> user > >>>>> front for data discovery. So we should either have 'dir' columns in > >> 'select > >>>>> *' or support a variation that Ted suggested. > >>>>> Also the folder names compliment the actual data in some cases. > >>>>> > >>>>> - Rahul > >>>>> > >>>>> On Thu, Apr 23, 2015 at 4:38 PM, Daniel Barclay < > [email protected] > >>> > >>>>> wrote: > >>>>> > >>>>>> Regarding the use case in which the user stores information in > >> pathnames: > >>>>>> > >>>>>> Since Drill supports that use case partially, shouldn't it do so > more > >>>>>> completely? In particular, since Drill provides access to subtree > >>>>>> pathname segments before the last one (the segments for > directories), > >>>>>> should Drill provide access to the last one too (the simple file > >> name)? > >>>>>> > >>>>>> > >>>>>> We support reading cases like this: > >>>>>> - root/ > >>>>>> - root/2015/ > >>>>>> - root/2015/01/ > >>>>>> - root/2015/01/01/ > >>>>>> - root/2015/01/01/log.json > >>>>>> - root/2015/02/ > >>>>>> - root/2015/02/02/ > >>>>>> - root/2015/02/02/log.json > >>>>>> > >>>>>> In particular, querying "select ... from `root` ..." includes the > >>>>>> date-portion segments of the pathnames in the dir0, etc, columns. > >>>>>> > >>>>>> Note that the user might not redundantly store the dates inside the > >>>>>> files themselves, since the dates are known to exist in the > directory > >>>>>> names. > >>>>>> > >>>>>> > >>>>>> However, we don't support this variation of that case, right?: > >>>>>> > >>>>>> - root/ > >>>>>> - root/2015 > >>>>>> - root/2015/01/ > >>>>>> - root/2015/01/log_01.json > >>>>>> - root/2015/02/ > >>>>>> - root/2015/02/log_02.json > >>>>>> > >>>>>> In particular, Drill includes several segments of the pathname after > >>>>>> the root of the subtree, but does not include the last > segment--which > >>>>>> contains data just as the segments that _are_ included do. > >>>>>> > >>>>>> (Yes, the last segment usually contains artifacts besides the > >> contained > >>>>>> data (e.g., the file extension) and the user would have to specify > >> how > >>>>>> to interpret the file simple name segment as data, but the user has > >> to > >>>>>> specify the interpretation for the other segments anyway.) > >>>>>> > >>>>>> > >>>>>> Daniel > >>>>>> > >>>>>> > >>>>>> > >>>>>> Ted Dunning wrote: > >>>>>> > >>>>>>> I would propose that dir be an array that contains all of the > >>>>> directories > >>>>>>> rather than having multiple values. > >>>>>>> > >>>>>>> The multiple names are particularly inconvenient if files are are > >>>>>>> different > >>>>>>> depths. > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau < > [email protected] > >>> > >>>>>>> wrote: > >>>>>>> > >>>>>>> I'm specifically arguing that SELECT * doesn't return the columns. > >>>>>>>> > >>>>>>>> Here is current behavior: > >>>>>>>> > >>>>>>>> /mytdir/mysdir/myfile.json > >>>>>>>> {a:1,b:2,c:3} > >>>>>>>> {a:4,b:5,c:6} > >>>>>>>> > >>>>>>>> select * from `myfile.json` > >>>>>>>> > >>>>>>>> a, b, c > >>>>>>>> 1, 2, 3 > >>>>>>>> 4, 5, 6 > >>>>>>>> > >>>>>>>> select * from `/mysdir/myfile.json` > >>>>>>>> > >>>>>>>> dir0 a, b, c > >>>>>>>> mysdir, 1, 2, 3 > >>>>>>>> mysdir, 4, 5, 6 > >>>>>>>> > >>>>>>>> select * from `/mytdir/mysdir/myfile.json` > >>>>>>>> > >>>>>>>> dir0, dir1 a, b, c > >>>>>>>> mytdir, mysdir, 1, 2, 3 > >>>>>>>> mytdir, mysdir, 4, 5, 6 > >>>>>>>> > >>>>>>>> > >>>>>>>> ==================================== > >>>>>>>> My proposal: > >>>>>>>> > >>>>>>>> select * from `myfile.json` > >>>>>>>> select * from `/mysdir/myfile.json` > >>>>>>>> select * from `/mytdir/mysdir/myfile.json` > >>>>>>>> ::all produce:: > >>>>>>>> a, b, c > >>>>>>>> 1, 2, 3 > >>>>>>>> 4, 5, 6 > >>>>>>>> > >>>>>>>> select dir0, a, b, c from `/mysdir/myfile.json` > >>>>>>>> > >>>>>>>> dir0 a, b, c > >>>>>>>> mysdir, 1, 2, 3 > >>>>>>>> mysdir, 4, 5, 6 > >>>>>>>> > >>>>>>>> select dir0, a, b, c from `/mytdir/mysdir/myfile.json` > >>>>>>>> > >>>>>>>> dir0 a, b, c > >>>>>>>> mytdir, 1, 2, 3 > >>>>>>>> mytdir, 4, 5, 6 > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha <[email protected]> > >>>>> wrote: > >>>>>>>> > >>>>>>>> Seems reasonable, as long as SELECT * also returns the dir# > >> columns. > >>>>>>>>> > >>>>>>>>> On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau < > >> [email protected]> > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>> Hey guys, > >>>>>>>>>> > >>>>>>>>>> I've been thinking that always showing dir# columns seems to > >> alter > >>>>> data > >>>>>>>>>> returned from Drill depending on how you select the directory. > >> I'd > >>>>>>>>> propose > >>>>>>>>> > >>>>>>>>>> that we make it so that we only return dir# columns when they > are > >>>>>>>>>> explicitly requested. > >>>>>>>>>> > >>>>>>>>>> Thoughts? > >>>>>> > >>>>>> -- > >>>>>> Daniel Barclay > >>>>>> MapR Technologies > > > > > > > > -- > > Steven Phillips > > Software Engineer > > > > mapr.com > -- Steven Phillips Software Engineer mapr.com
