> Ted wrote: > > For one thing, I can make a really slow version of [find] !
Why does it have to be slow? Seriously, so many of the tools we use daily have quasi-query facilities (find, git log, du, ps, netstat) and we cobble together queries using complex options and pipelines of unix commands. Relational algebra is a potentially MORE efficient. I find myself writing ' ... | sort | uniq -c | sort -nr' almost daily and wish I could write ' ... order by count(*) desc'. On Thu, Apr 23, 2015 at 6:27 PM, Julian Hyde <[email protected]> wrote: > +1 to returning directories as context. Very useful feature. Could be > used to return context for other adapters (e.g. an adapter that > concatenates all versions of versioned logfiles). > > +1 making dir an array, per Ted's suggestion > > I think dir should not appear in *; thus you'd have to write > > select dir, * from `/mytdir/mysdir/myfile.json` > > This behavior is analogous to Oracle's ROWID. It is not a column as > such, but a system function that you can apply to a row. > > You need to allow qualifiers: > > select x.dir, x.*, y.dir, y.* from `/mytdir/mysdir/myfile.json` as > x, `/mytdir/mysdir/myfile2.json` as y > > and > > select dir from `/mytdir/mysdir/myfile.json` as x, > `/mytdir/mysdir/myfile2.json` as y > > would be illegal because dir is ambiguous. > > You should make dir a reserved word (like ROWID). > > On Thu, Apr 23, 2015 at 5:12 PM, Ted Dunning <[email protected]> wrote: >> Great point. >> >> Having the file name itself is very handy. >> >> >> For one thing, I can make a really slow version of [find] ! >> >> (seriously, I would love this) >> >> >> On Thu, Apr 23, 2015 at 7:48 PM, rahul challapalli < >> [email protected]> wrote: >> >>> I am also under the opinion that we should not assume knowledge on the user >>> front for data discovery. So we should either have 'dir' columns in 'select >>> *' or support a variation that Ted suggested. >>> Also the folder names compliment the actual data in some cases. >>> >>> - Rahul >>> >>> On Thu, Apr 23, 2015 at 4:38 PM, Daniel Barclay <[email protected]> >>> wrote: >>> >>> > Regarding the use case in which the user stores information in pathnames: >>> > >>> > Since Drill supports that use case partially, shouldn't it do so more >>> > completely? In particular, since Drill provides access to subtree >>> > pathname segments before the last one (the segments for directories), >>> > should Drill provide access to the last one too (the simple file name)? >>> > >>> > >>> > We support reading cases like this: >>> > - root/ >>> > - root/2015/ >>> > - root/2015/01/ >>> > - root/2015/01/01/ >>> > - root/2015/01/01/log.json >>> > - root/2015/02/ >>> > - root/2015/02/02/ >>> > - root/2015/02/02/log.json >>> > >>> > In particular, querying "select ... from `root` ..." includes the >>> > date-portion segments of the pathnames in the dir0, etc, columns. >>> > >>> > Note that the user might not redundantly store the dates inside the >>> > files themselves, since the dates are known to exist in the directory >>> > names. >>> > >>> > >>> > However, we don't support this variation of that case, right?: >>> > >>> > - root/ >>> > - root/2015 >>> > - root/2015/01/ >>> > - root/2015/01/log_01.json >>> > - root/2015/02/ >>> > - root/2015/02/log_02.json >>> > >>> > In particular, Drill includes several segments of the pathname after >>> > the root of the subtree, but does not include the last segment--which >>> > contains data just as the segments that _are_ included do. >>> > >>> > (Yes, the last segment usually contains artifacts besides the contained >>> > data (e.g., the file extension) and the user would have to specify how >>> > to interpret the file simple name segment as data, but the user has to >>> > specify the interpretation for the other segments anyway.) >>> > >>> > >>> > Daniel >>> > >>> > >>> > >>> > Ted Dunning wrote: >>> > >>> >> I would propose that dir be an array that contains all of the >>> directories >>> >> rather than having multiple values. >>> >> >>> >> The multiple names are particularly inconvenient if files are are >>> >> different >>> >> depths. >>> >> >>> >> >>> >> >>> >> On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau <[email protected]> >>> >> wrote: >>> >> >>> >> I'm specifically arguing that SELECT * doesn't return the columns. >>> >>> >>> >>> Here is current behavior: >>> >>> >>> >>> /mytdir/mysdir/myfile.json >>> >>> {a:1,b:2,c:3} >>> >>> {a:4,b:5,c:6} >>> >>> >>> >>> select * from `myfile.json` >>> >>> >>> >>> a, b, c >>> >>> 1, 2, 3 >>> >>> 4, 5, 6 >>> >>> >>> >>> select * from `/mysdir/myfile.json` >>> >>> >>> >>> dir0 a, b, c >>> >>> mysdir, 1, 2, 3 >>> >>> mysdir, 4, 5, 6 >>> >>> >>> >>> select * from `/mytdir/mysdir/myfile.json` >>> >>> >>> >>> dir0, dir1 a, b, c >>> >>> mytdir, mysdir, 1, 2, 3 >>> >>> mytdir, mysdir, 4, 5, 6 >>> >>> >>> >>> >>> >>> ==================================== >>> >>> My proposal: >>> >>> >>> >>> select * from `myfile.json` >>> >>> select * from `/mysdir/myfile.json` >>> >>> select * from `/mytdir/mysdir/myfile.json` >>> >>> ::all produce:: >>> >>> a, b, c >>> >>> 1, 2, 3 >>> >>> 4, 5, 6 >>> >>> >>> >>> select dir0, a, b, c from `/mysdir/myfile.json` >>> >>> >>> >>> dir0 a, b, c >>> >>> mysdir, 1, 2, 3 >>> >>> mysdir, 4, 5, 6 >>> >>> >>> >>> select dir0, a, b, c from `/mytdir/mysdir/myfile.json` >>> >>> >>> >>> dir0 a, b, c >>> >>> mytdir, 1, 2, 3 >>> >>> mytdir, 4, 5, 6 >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha <[email protected]> >>> wrote: >>> >>> >>> >>> Seems reasonable, as long as SELECT * also returns the dir# columns. >>> >>>> >>> >>>> On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau <[email protected]> >>> >>>> wrote: >>> >>>> >>> >>>> Hey guys, >>> >>>>> >>> >>>>> I've been thinking that always showing dir# columns seems to alter >>> data >>> >>>>> returned from Drill depending on how you select the directory. I'd >>> >>>>> >>> >>>> propose >>> >>>> >>> >>>>> that we make it so that we only return dir# columns when they are >>> >>>>> explicitly requested. >>> >>>>> >>> >>>>> Thoughts? >>> >>>>> >>> >>>>> >>> >>>> >>> >>> >>> >> >>> > >>> > -- >>> > Daniel Barclay >>> > MapR Technologies >>> > >>>
