+1 to returning directories as context. Very useful feature. Could be used to return context for other adapters (e.g. an adapter that concatenates all versions of versioned logfiles).
+1 to making dir an array, per Ted's suggestion.

I think dir should not appear in *; thus you'd have to write

  select dir, * from `/mytdir/mysdir/myfile.json`

This behavior is analogous to Oracle's ROWID. It is not a column as such, but a system function that you can apply to a row.

You need to allow qualifiers:

  select x.dir, x.*, y.dir, y.*
  from `/mytdir/mysdir/myfile.json` as x,
    `/mytdir/mysdir/myfile2.json` as y

and

  select dir
  from `/mytdir/mysdir/myfile.json` as x,
    `/mytdir/mysdir/myfile2.json` as y

would be illegal because dir is ambiguous.

You should make dir a reserved word (like ROWID).

On Thu, Apr 23, 2015 at 5:12 PM, Ted Dunning <[email protected]> wrote:

> Great point.
>
> Having the file name itself is very handy.
>
> For one thing, I can make a really slow version of [find] !
>
> (seriously, I would love this)
>
>
> On Thu, Apr 23, 2015 at 7:48 PM, rahul challapalli <[email protected]> wrote:
>
>> I am also under the opinion that we should not assume knowledge on the user
>> front for data discovery. So we should either have 'dir' columns in 'select
>> *' or support a variation that Ted suggested.
>> Also the folder names complement the actual data in some cases.
>>
>> - Rahul
>>
>> On Thu, Apr 23, 2015 at 4:38 PM, Daniel Barclay <[email protected]> wrote:
>>
>> > Regarding the use case in which the user stores information in pathnames:
>> >
>> > Since Drill supports that use case partially, shouldn't it do so more
>> > completely? In particular, since Drill provides access to subtree
>> > pathname segments before the last one (the segments for directories),
>> > should Drill provide access to the last one too (the simple file name)?
>> >
>> >
>> > We support reading cases like this:
>> > - root/
>> > - root/2015/
>> > - root/2015/01/
>> > - root/2015/01/01/
>> > - root/2015/01/01/log.json
>> > - root/2015/02/
>> > - root/2015/02/02/
>> > - root/2015/02/02/log.json
>> >
>> > In particular, querying "select ... from `root` ..."
>> > includes the date-portion segments of the pathnames in the dir0, etc,
>> > columns.
>> >
>> > Note that the user might not redundantly store the dates inside the
>> > files themselves, since the dates are known to exist in the directory
>> > names.
>> >
>> >
>> > However, we don't support this variation of that case, right?:
>> >
>> > - root/
>> > - root/2015/
>> > - root/2015/01/
>> > - root/2015/01/log_01.json
>> > - root/2015/02/
>> > - root/2015/02/log_02.json
>> >
>> > In particular, Drill includes several segments of the pathname after
>> > the root of the subtree, but does not include the last segment--which
>> > contains data just as the segments that _are_ included do.
>> >
>> > (Yes, the last segment usually contains artifacts besides the contained
>> > data (e.g., the file extension) and the user would have to specify how
>> > to interpret the file simple name segment as data, but the user has to
>> > specify the interpretation for the other segments anyway.)
>> >
>> >
>> > Daniel
>> >
>> >
>> > Ted Dunning wrote:
>> >
>> >> I would propose that dir be an array that contains all of the directories
>> >> rather than having multiple values.
>> >>
>> >> The multiple names are particularly inconvenient if files are at
>> >> different depths.
>> >>
>> >>
>> >> On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau <[email protected]> wrote:
>> >>
>> >>> I'm specifically arguing that SELECT * doesn't return the columns.
>> >>>
>> >>> Here is current behavior:
>> >>>
>> >>> /mytdir/mysdir/myfile.json
>> >>> {a:1,b:2,c:3}
>> >>> {a:4,b:5,c:6}
>> >>>
>> >>> select * from `myfile.json`
>> >>>
>> >>> a, b, c
>> >>> 1, 2, 3
>> >>> 4, 5, 6
>> >>>
>> >>> select * from `/mysdir/myfile.json`
>> >>>
>> >>> dir0, a, b, c
>> >>> mysdir, 1, 2, 3
>> >>> mysdir, 4, 5, 6
>> >>>
>> >>> select * from `/mytdir/mysdir/myfile.json`
>> >>>
>> >>> dir0, dir1, a, b, c
>> >>> mytdir, mysdir, 1, 2, 3
>> >>> mytdir, mysdir, 4, 5, 6
>> >>>
>> >>>
>> >>> ====================================
>> >>> My proposal:
>> >>>
>> >>> select * from `myfile.json`
>> >>> select * from `/mysdir/myfile.json`
>> >>> select * from `/mytdir/mysdir/myfile.json`
>> >>> ::all produce::
>> >>> a, b, c
>> >>> 1, 2, 3
>> >>> 4, 5, 6
>> >>>
>> >>> select dir0, a, b, c from `/mysdir/myfile.json`
>> >>>
>> >>> dir0, a, b, c
>> >>> mysdir, 1, 2, 3
>> >>> mysdir, 4, 5, 6
>> >>>
>> >>> select dir0, a, b, c from `/mytdir/mysdir/myfile.json`
>> >>>
>> >>> dir0, a, b, c
>> >>> mytdir, 1, 2, 3
>> >>> mytdir, 4, 5, 6
>> >>>
>> >>>
>> >>> On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha <[email protected]> wrote:
>> >>>
>> >>>> Seems reasonable, as long as SELECT * also returns the dir# columns.
>> >>>>
>> >>>> On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau <[email protected]>
>> >>>> wrote:
>> >>>>
>> >>>>> Hey guys,
>> >>>>>
>> >>>>> I've been thinking that always showing dir# columns seems to alter data
>> >>>>> returned from Drill depending on how you select the directory. I'd propose
>> >>>>> that we make it so that we only return dir# columns when they are
>> >>>>> explicitly requested.
>> >>>>>
>> >>>>> Thoughts?
>> >>>>>
>> >
>> > --
>> > Daniel Barclay
>> > MapR Technologies
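P.S. To make the rules suggested at the top of this message concrete, here is a minimal Python sketch of the semantics. It is only a model under stated assumptions, not Drill code: the names dir_of, select, resolve_dir and AmbiguousColumnError are made up for illustration, and paths are assumed to be POSIX-style and taken relative to some root.

```python
from pathlib import PurePosixPath


class AmbiguousColumnError(Exception):
    """Hypothetical error for an unqualified `dir` over several tables."""


def dir_of(root, path):
    # All directory segments between the root and the file, as one list.
    # Files at different depths just produce lists of different lengths.
    relative = PurePosixPath(path).relative_to(root)
    return list(relative.parts[:-1])  # drop the file name itself


def select(record, dir_value, columns):
    # Project one record. `*` expands to the data columns only, so `dir`
    # appears in the output only when it is requested by name.
    out = {}
    for col in columns:
        if col == "*":
            out.update(record)
        elif col == "dir":
            out["dir"] = dir_value
        else:
            out[col] = record[col]
    return out


def resolve_dir(dirs_by_alias, qualifier=None):
    # dirs_by_alias maps each table alias in the FROM clause to its dir list.
    if qualifier is not None:
        return dirs_by_alias[qualifier]
    if len(dirs_by_alias) != 1:
        raise AmbiguousColumnError("dir is ambiguous; qualify it, e.g. x.dir")
    return next(iter(dirs_by_alias.values()))
```

In this model, select(record, d, ["*"]) returns only a, b and c, while select(record, d, ["dir", "*"]) also returns dir as a single array. Calling resolve_dir with two aliases and no qualifier raises the error, which corresponds to the "dir is ambiguous" case above.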
