Regarding the use case in which the user stores information in pathnames:
Since Drill supports that use case partially, shouldn't it do so more completely? In particular, since Drill provides access to subtree pathname segments before the last one (the segments for directories), should Drill provide access to the last one too (the simple file name)?

We support reading cases like this:

- root/
- root/2015/
- root/2015/01/
- root/2015/01/01/
- root/2015/01/01/log.json
- root/2015/02/
- root/2015/02/02/
- root/2015/02/02/log.json

In particular, querying "select ... from `root` ..." includes the date-portion segments of the pathnames in the dir0, etc, columns. Note that the user might not redundantly store the dates inside the files themselves, since the dates are known to exist in the directory names.

However, we don't support this variation of that case, right?:

- root/
- root/2015
- root/2015/01/
- root/2015/01/log_01.json
- root/2015/02/
- root/2015/02/log_02.json

In particular, Drill includes several segments of the pathname after the root of the subtree, but does not include the last segment--which contains data just as the segments that _are_ included do. (Yes, the last segment usually contains artifacts besides the contained data (e.g., the file extension) and the user would have to specify how to interpret the file simple name segment as data, but the user has to specify the interpretation for the other segments anyway.)

Daniel

Ted Dunning wrote:
I would propose that dir be an array that contains all of the directories rather than having multiple values. The multiple names are particularly inconvenient if files are at different depths.

On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau <[email protected]> wrote:

I'm specifically arguing that SELECT * doesn't return the columns.

Here is current behavior:

/mytdir/mysdir/myfile.json
{a:1,b:2,c:3}
{a:4,b:5,c:6}

select * from `myfile.json`
a, b, c
1, 2, 3
4, 5, 6

select * from `/mysdir/myfile.json`
dir0, a, b, c
mysdir, 1, 2, 3
mysdir, 4, 5, 6

select * from `/mytdir/mysdir/myfile.json`
dir0, dir1, a, b, c
mytdir, mysdir, 1, 2, 3
mytdir, mysdir, 4, 5, 6

====================================

My proposal:

select * from `myfile.json`
select * from `/mysdir/myfile.json`
select * from `/mytdir/mysdir/myfile.json`

::all produce::
a, b, c
1, 2, 3
4, 5, 6

select dir0, a, b, c from `/mysdir/myfile.json`
dir0, a, b, c
mysdir, 1, 2, 3
mysdir, 4, 5, 6

select dir0, a, b, c from `/mytdir/mysdir/myfile.json`
dir0, a, b, c
mytdir, 1, 2, 3
mytdir, 4, 5, 6

On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha <[email protected]> wrote:

Seems reasonable, as long as SELECT * also returns the dir# columns.

On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau <[email protected]> wrote:

Hey guys,

I've been thinking that always showing dir# columns seems to alter data returned from Drill depending on how you select the directory. I'd propose that we make it so that we only return dir# columns when they are explicitly requested.

Thoughts?
--
Daniel Barclay
MapR Technologies
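
To make the ideas in this thread concrete, here is a small Python sketch of how a file's pathname under a queried root could be split into directory segments plus the simple file name. It is not Drill code. The helper names (`path_columns`, `numbered_dir_columns`) are hypothetical, and the sketch assumes POSIX-style paths and that the file lies strictly below the root. It shows both the numbered dir0, dir1, ... form and the single-array form suggested above, with the last segment (the file name) exposed as well.

```python
from pathlib import PurePosixPath


def path_columns(root, file_path):
    """Split file_path (relative to root) into directory segments and file name.

    Hypothetical helper for illustration only; assumes file_path is
    strictly below root, so at least one segment (the file name) remains.
    """
    rel = PurePosixPath(file_path).relative_to(PurePosixPath(root))
    *dirs, filename = rel.parts
    return list(dirs), filename


def numbered_dir_columns(dirs):
    """Map directory segments to dir0, dir1, ... style column names."""
    return {f"dir{i}": segment for i, segment in enumerate(dirs)}


if __name__ == "__main__":
    dirs, filename = path_columns("root", "root/2015/01/log_01.json")
    print(dirs)                        # directory segments as one array
    print(numbered_dir_columns(dirs))  # the same segments as numbered columns
    print(filename)                    # the last segment, which also carries data
```

With the array form, files at different depths simply yield arrays of different lengths, whereas the numbered form yields a different set of columns per depth.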
