+1 to adding the filename (needed this last week, I had <user_id>.json files and wanted to join with another table) +1 to using an array dirs[] +1 to not having it in * (but would "select dirs, *" work?)
> On Apr 23, 2015, at 7:00 PM, Steven Phillips <[email protected]> wrote: > > What you are showing for the current behavior seems wrong to me: > > $ tree mytdir > mytdir > └── mysdir > └── myFile.json > > $ cat mytdir/mysdir/myFile.json > {a:1,b:2,c:3} > {a:4,b:5,c:6} > > 0: jdbc:drill:> select * from `mytdir/mysdir/myFile.json`; > +------------+------------+------------+ > | a | b | c | > +------------+------------+------------+ > | 1 | 2 | 3 | > | 4 | 5 | 6 | > +------------+------------+------------+ > 2 rows selected (0.274 seconds) > 0: jdbc:drill:> select * from `mytdir/mysdir/myFile.json`; > +------------+------------+------------+ > | a | b | c | > +------------+------------+------------+ > | 1 | 2 | 3 | > | 4 | 5 | 6 | > +------------+------------+------------+ > 2 rows selected (0.152 seconds) > 0: jdbc:drill:> select * from `/mytdir/mysdir`; > +------------+------------+------------+ > | a | b | c | > +------------+------------+------------+ > | 1 | 2 | 3 | > | 4 | 5 | 6 | > +------------+------------+------------+ > 2 rows selected (0.157 seconds) > 0: jdbc:drill:> select * from `mytdir`; > +------------+------------+------------+------------+ > | dir0 | a | b | c | > +------------+------------+------------+------------+ > | mysdir | 1 | 2 | 3 | > | mysdir | 4 | 5 | 6 | > +------------+------------+------------+------------+ > > I don't know why in your example, you are getting a dir0 directory when > selecting a specific file. These directories should only be included when > the specified table is a directory which contains subdirectories. Any query > to a specific file or to a directory that only contains regular files > should not return dir* columns. > I think this is the correct behavior. > > The fact that `mytidir` and `mytdir/mysdir` have different columns is not a > problem, because they are different tables. > > I do think Daniel's idea of adding the file name as well makes sense. I'm > also open to Ted's idea for return a dir array instead of individual > columns. > > On Thu, Apr 23, 2015 at 6:36 PM, Julian Hyde <[email protected]> wrote: > >>> Ted wrote: >>> >>> For one thing, I can make a really slow version of [find] ! >> >> Why does it have to be slow? Seriously, so many of the tools we use >> daily have quasi-query facilities (find, git log, du, ps, netstat) and >> we cobble together queries using complex options and pipelines of unix >> commands. Relational algebra is a potentially MORE efficient. >> >> I find myself writing ' ... | sort | uniq -c | sort -nr' almost daily >> and wish I could write ' ... order by count(*) desc'. >> >>> On Thu, Apr 23, 2015 at 6:27 PM, Julian Hyde <[email protected]> wrote: >>> +1 to returning directories as context. Very useful feature. Could be >>> used to return context for other adapters (e.g. an adapter that >>> concatenates all versions of versioned logfiles). >>> >>> +1 making dir an array, per Ted's suggestion >>> >>> I think dir should not appear in *; thus you'd have to write >>> >>> select dir, * from `/mytdir/mysdir/myfile.json` >>> >>> This behavior is analogous to Oracle's ROWID. It is not a column as >>> such, but a system function that you can apply to a row. >>> >>> You need to allow qualifiers: >>> >>> select x.dir, x.*, y.dir, y.* from `/mytdir/mysdir/myfile.json` as >>> x, `/mytdir/mysdir/myfile2.json` as y >>> >>> and >>> >>> select dir from `/mytdir/mysdir/myfile.json` as x, >>> `/mytdir/mysdir/myfile2.json` as y >>> >>> would be illegal because dir is ambiguous. >>> >>> You should make dir a reserved word (like ROWID). >>> >>> On Thu, Apr 23, 2015 at 5:12 PM, Ted Dunning <[email protected]> >> wrote: >>>> Great point. >>>> >>>> Having the file name itself is very handy. >>>> >>>> >>>> For one thing, I can make a really slow version of [find] ! >>>> >>>> (seriously, I would love this) >>>> >>>> >>>> On Thu, Apr 23, 2015 at 7:48 PM, rahul challapalli < >>>> [email protected]> wrote: >>>> >>>>> I am also under the opinion that we should not assume knowledge on the >> user >>>>> front for data discovery. So we should either have 'dir' columns in >> 'select >>>>> *' or support a variation that Ted suggested. >>>>> Also the folder names compliment the actual data in some cases. >>>>> >>>>> - Rahul >>>>> >>>>> On Thu, Apr 23, 2015 at 4:38 PM, Daniel Barclay <[email protected] >>> >>>>> wrote: >>>>> >>>>>> Regarding the use case in which the user stores information in >> pathnames: >>>>>> >>>>>> Since Drill supports that use case partially, shouldn't it do so more >>>>>> completely? In particular, since Drill provides access to subtree >>>>>> pathname segments before the last one (the segments for directories), >>>>>> should Drill provide access to the last one too (the simple file >> name)? >>>>>> >>>>>> >>>>>> We support reading cases like this: >>>>>> - root/ >>>>>> - root/2015/ >>>>>> - root/2015/01/ >>>>>> - root/2015/01/01/ >>>>>> - root/2015/01/01/log.json >>>>>> - root/2015/02/ >>>>>> - root/2015/02/02/ >>>>>> - root/2015/02/02/log.json >>>>>> >>>>>> In particular, querying "select ... from `root` ..." includes the >>>>>> date-portion segments of the pathnames in the dir0, etc, columns. >>>>>> >>>>>> Note that the user might not redundantly store the dates inside the >>>>>> files themselves, since the dates are known to exist in the directory >>>>>> names. >>>>>> >>>>>> >>>>>> However, we don't support this variation of that case, right?: >>>>>> >>>>>> - root/ >>>>>> - root/2015 >>>>>> - root/2015/01/ >>>>>> - root/2015/01/log_01.json >>>>>> - root/2015/02/ >>>>>> - root/2015/02/log_02.json >>>>>> >>>>>> In particular, Drill includes several segments of the pathname after >>>>>> the root of the subtree, but does not include the last segment--which >>>>>> contains data just as the segments that _are_ included do. >>>>>> >>>>>> (Yes, the last segment usually contains artifacts besides the >> contained >>>>>> data (e.g., the file extension) and the user would have to specify >> how >>>>>> to interpret the file simple name segment as data, but the user has >> to >>>>>> specify the interpretation for the other segments anyway.) >>>>>> >>>>>> >>>>>> Daniel >>>>>> >>>>>> >>>>>> >>>>>> Ted Dunning wrote: >>>>>> >>>>>>> I would propose that dir be an array that contains all of the >>>>> directories >>>>>>> rather than having multiple values. >>>>>>> >>>>>>> The multiple names are particularly inconvenient if files are are >>>>>>> different >>>>>>> depths. >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau <[email protected] >>> >>>>>>> wrote: >>>>>>> >>>>>>> I'm specifically arguing that SELECT * doesn't return the columns. >>>>>>>> >>>>>>>> Here is current behavior: >>>>>>>> >>>>>>>> /mytdir/mysdir/myfile.json >>>>>>>> {a:1,b:2,c:3} >>>>>>>> {a:4,b:5,c:6} >>>>>>>> >>>>>>>> select * from `myfile.json` >>>>>>>> >>>>>>>> a, b, c >>>>>>>> 1, 2, 3 >>>>>>>> 4, 5, 6 >>>>>>>> >>>>>>>> select * from `/mysdir/myfile.json` >>>>>>>> >>>>>>>> dir0 a, b, c >>>>>>>> mysdir, 1, 2, 3 >>>>>>>> mysdir, 4, 5, 6 >>>>>>>> >>>>>>>> select * from `/mytdir/mysdir/myfile.json` >>>>>>>> >>>>>>>> dir0, dir1 a, b, c >>>>>>>> mytdir, mysdir, 1, 2, 3 >>>>>>>> mytdir, mysdir, 4, 5, 6 >>>>>>>> >>>>>>>> >>>>>>>> ==================================== >>>>>>>> My proposal: >>>>>>>> >>>>>>>> select * from `myfile.json` >>>>>>>> select * from `/mysdir/myfile.json` >>>>>>>> select * from `/mytdir/mysdir/myfile.json` >>>>>>>> ::all produce:: >>>>>>>> a, b, c >>>>>>>> 1, 2, 3 >>>>>>>> 4, 5, 6 >>>>>>>> >>>>>>>> select dir0, a, b, c from `/mysdir/myfile.json` >>>>>>>> >>>>>>>> dir0 a, b, c >>>>>>>> mysdir, 1, 2, 3 >>>>>>>> mysdir, 4, 5, 6 >>>>>>>> >>>>>>>> select dir0, a, b, c from `/mytdir/mysdir/myfile.json` >>>>>>>> >>>>>>>> dir0 a, b, c >>>>>>>> mytdir, 1, 2, 3 >>>>>>>> mytdir, 4, 5, 6 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha <[email protected]> >>>>> wrote: >>>>>>>> >>>>>>>> Seems reasonable, as long as SELECT * also returns the dir# >> columns. >>>>>>>>> >>>>>>>>> On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau < >> [email protected]> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Hey guys, >>>>>>>>>> >>>>>>>>>> I've been thinking that always showing dir# columns seems to >> alter >>>>> data >>>>>>>>>> returned from Drill depending on how you select the directory. >> I'd >>>>>>>>> propose >>>>>>>>> >>>>>>>>>> that we make it so that we only return dir# columns when they are >>>>>>>>>> explicitly requested. >>>>>>>>>> >>>>>>>>>> Thoughts? >>>>>> >>>>>> -- >>>>>> Daniel Barclay >>>>>> MapR Technologies > > > > -- > Steven Phillips > Software Engineer > > mapr.com
