Re: Should we make dir* columns only exist when requested?

Steven Phillips Thu, 23 Apr 2015 19:34:06 -0700

Allow me to summarize my previous response:

+1 adding filename
+0 dir array
-1 for not including in *


On Thu, Apr 23, 2015 at 7:28 PM, Tomer Shiran <[email protected]> wrote:

> +1 to adding the filename (needed this last week, I had <user_id>.json
> files and wanted to join with another table)
> +1 to using an array dirs[]
> +1 to not having it in * (but would "select dirs, *" work?)
>
>
>
> > On Apr 23, 2015, at 7:00 PM, Steven Phillips <[email protected]>
> wrote:
> >
> > What you are showing for the current behavior seems wrong to me:
> >
> > $ tree mytdir
> > mytdir
> > └── mysdir
> >    └── myFile.json
> >
> > $ cat mytdir/mysdir/myFile.json
> > {a:1,b:2,c:3}
> > {a:4,b:5,c:6}
> >
> > 0: jdbc:drill:> select * from `mytdir/mysdir/myFile.json`;
> > +------------+------------+------------+
> > |     a      |     b      |     c      |
> > +------------+------------+------------+
> > | 1          | 2          | 3          |
> > | 4          | 5          | 6          |
> > +------------+------------+------------+
> > 2 rows selected (0.274 seconds)
> > 0: jdbc:drill:> select * from `mytdir/mysdir/myFile.json`;
> > +------------+------------+------------+
> > |     a      |     b      |     c      |
> > +------------+------------+------------+
> > | 1          | 2          | 3          |
> > | 4          | 5          | 6          |
> > +------------+------------+------------+
> > 2 rows selected (0.152 seconds)
> > 0: jdbc:drill:> select * from `/mytdir/mysdir`;
> > +------------+------------+------------+
> > |     a      |     b      |     c      |
> > +------------+------------+------------+
> > | 1          | 2          | 3          |
> > | 4          | 5          | 6          |
> > +------------+------------+------------+
> > 2 rows selected (0.157 seconds)
> > 0: jdbc:drill:> select * from `mytdir`;
> > +------------+------------+------------+------------+
> > |    dir0    |     a      |     b      |     c      |
> > +------------+------------+------------+------------+
> > | mysdir     | 1          | 2          | 3          |
> > | mysdir     | 4          | 5          | 6          |
> > +------------+------------+------------+------------+
> >
> > I don't know why in your example, you are getting a dir0 directory when
> > selecting a specific file. These directories should only be included when
> > the specified table is a directory which contains subdirectories. Any
> query
> > to a specific file or to a directory that only contains regular files
> > should not return dir* columns.
> > I think this is the correct behavior.
> >
> > The fact that `mytidir` and `mytdir/mysdir` have different columns is
> not a
> > problem, because they are different tables.
> >
> > I do think Daniel's idea of adding the file name as well makes sense. I'm
> > also open to Ted's idea for return a dir array instead of individual
> > columns.
> >
> > On Thu, Apr 23, 2015 at 6:36 PM, Julian Hyde <[email protected]>
> wrote:
> >
> >>> Ted wrote:
> >>>
> >>> For one thing, I can make a really slow version of [find] !
> >>
> >> Why does it have to be slow? Seriously, so many of the tools we use
> >> daily have quasi-query facilities (find, git log, du, ps, netstat) and
> >> we cobble together queries using complex options and pipelines of unix
> >> commands. Relational algebra is a potentially MORE efficient.
> >>
> >> I find myself writing ' ... | sort | uniq -c | sort -nr' almost daily
> >> and wish I could write ' ... order by count(*) desc'.
> >>
> >>> On Thu, Apr 23, 2015 at 6:27 PM, Julian Hyde <[email protected]>
> wrote:
> >>> +1 to returning directories as context. Very useful feature. Could be
> >>> used to return context for other adapters (e.g. an adapter that
> >>> concatenates all versions of versioned logfiles).
> >>>
> >>> +1 making dir an array, per Ted's suggestion
> >>>
> >>> I think dir should not appear in *; thus you'd have to write
> >>>
> >>>  select dir, * from `/mytdir/mysdir/myfile.json`
> >>>
> >>> This behavior is analogous to Oracle's ROWID. It is not a column as
> >>> such, but a system function that you can apply to a row.
> >>>
> >>> You need to allow qualifiers:
> >>>
> >>>  select x.dir, x.*, y.dir, y.* from `/mytdir/mysdir/myfile.json` as
> >>> x, `/mytdir/mysdir/myfile2.json` as y
> >>>
> >>> and
> >>>
> >>>  select dir from `/mytdir/mysdir/myfile.json` as x,
> >>> `/mytdir/mysdir/myfile2.json` as y
> >>>
> >>> would be illegal because dir is ambiguous.
> >>>
> >>> You should make dir a reserved word (like ROWID).
> >>>
> >>> On Thu, Apr 23, 2015 at 5:12 PM, Ted Dunning <[email protected]>
> >> wrote:
> >>>> Great point.
> >>>>
> >>>> Having the file name itself is very handy.
> >>>>
> >>>>
> >>>> For one thing, I can make a really slow version of [find] !
> >>>>
> >>>> (seriously, I would love this)
> >>>>
> >>>>
> >>>> On Thu, Apr 23, 2015 at 7:48 PM, rahul challapalli <
> >>>> [email protected]> wrote:
> >>>>
> >>>>> I am also under the opinion that we should not assume knowledge on
> the
> >> user
> >>>>> front for data discovery. So we should either have 'dir' columns in
> >> 'select
> >>>>> *' or support a variation that Ted suggested.
> >>>>> Also the folder names compliment the actual data in some cases.
> >>>>>
> >>>>> - Rahul
> >>>>>
> >>>>> On Thu, Apr 23, 2015 at 4:38 PM, Daniel Barclay <
> [email protected]
> >>>
> >>>>> wrote:
> >>>>>
> >>>>>> Regarding the use case in which the user stores information in
> >> pathnames:
> >>>>>>
> >>>>>> Since Drill supports that use case partially, shouldn't it do so
> more
> >>>>>> completely?  In particular, since Drill provides access to subtree
> >>>>>> pathname segments before the last one (the segments for
> directories),
> >>>>>> should Drill provide access to the last one too (the simple file
> >> name)?
> >>>>>>
> >>>>>>
> >>>>>> We support reading cases like this:
> >>>>>> - root/
> >>>>>> - root/2015/
> >>>>>> - root/2015/01/
> >>>>>> - root/2015/01/01/
> >>>>>> - root/2015/01/01/log.json
> >>>>>> - root/2015/02/
> >>>>>> - root/2015/02/02/
> >>>>>> - root/2015/02/02/log.json
> >>>>>>
> >>>>>> In particular, querying "select ... from `root` ..." includes the
> >>>>>> date-portion segments of the pathnames in the dir0, etc, columns.
> >>>>>>
> >>>>>> Note that the user might not redundantly store the dates inside the
> >>>>>> files themselves, since the dates are known to exist in the
> directory
> >>>>>> names.
> >>>>>>
> >>>>>>
> >>>>>> However, we don't support this variation of that case, right?:
> >>>>>>
> >>>>>> - root/
> >>>>>> - root/2015
> >>>>>> - root/2015/01/
> >>>>>> - root/2015/01/log_01.json
> >>>>>> - root/2015/02/
> >>>>>> - root/2015/02/log_02.json
> >>>>>>
> >>>>>> In particular, Drill includes several segments of the pathname after
> >>>>>> the root of the subtree, but does not include the last
> segment--which
> >>>>>> contains data just as the segments that _are_ included do.
> >>>>>>
> >>>>>> (Yes, the last segment usually contains artifacts besides the
> >> contained
> >>>>>> data (e.g., the file extension) and the user would have to specify
> >> how
> >>>>>> to interpret the file simple name segment as data, but the user has
> >> to
> >>>>>> specify the interpretation for the other segments anyway.)
> >>>>>>
> >>>>>>
> >>>>>> Daniel
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Ted Dunning wrote:
> >>>>>>
> >>>>>>> I would propose that dir be an array that contains all of the
> >>>>> directories
> >>>>>>> rather than having multiple values.
> >>>>>>>
> >>>>>>> The multiple names are particularly inconvenient if files are are
> >>>>>>> different
> >>>>>>> depths.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau <
> [email protected]
> >>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>> I'm specifically arguing that SELECT * doesn't return the columns.
> >>>>>>>>
> >>>>>>>> Here is current behavior:
> >>>>>>>>
> >>>>>>>> /mytdir/mysdir/myfile.json
> >>>>>>>> {a:1,b:2,c:3}
> >>>>>>>> {a:4,b:5,c:6}
> >>>>>>>>
> >>>>>>>> select * from `myfile.json`
> >>>>>>>>
> >>>>>>>> a, b, c
> >>>>>>>> 1, 2, 3
> >>>>>>>> 4, 5, 6
> >>>>>>>>
> >>>>>>>> select * from `/mysdir/myfile.json`
> >>>>>>>>
> >>>>>>>> dir0 a, b, c
> >>>>>>>> mysdir, 1, 2, 3
> >>>>>>>> mysdir, 4, 5, 6
> >>>>>>>>
> >>>>>>>> select * from `/mytdir/mysdir/myfile.json`
> >>>>>>>>
> >>>>>>>> dir0, dir1 a, b, c
> >>>>>>>> mytdir, mysdir, 1, 2, 3
> >>>>>>>> mytdir, mysdir, 4, 5, 6
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ====================================
> >>>>>>>> My proposal:
> >>>>>>>>
> >>>>>>>> select * from `myfile.json`
> >>>>>>>> select * from `/mysdir/myfile.json`
> >>>>>>>> select * from `/mytdir/mysdir/myfile.json`
> >>>>>>>> ::all produce::
> >>>>>>>> a, b, c
> >>>>>>>> 1, 2, 3
> >>>>>>>> 4, 5, 6
> >>>>>>>>
> >>>>>>>> select dir0, a, b, c from `/mysdir/myfile.json`
> >>>>>>>>
> >>>>>>>> dir0 a, b, c
> >>>>>>>> mysdir, 1, 2, 3
> >>>>>>>> mysdir, 4, 5, 6
> >>>>>>>>
> >>>>>>>> select dir0, a, b, c from `/mytdir/mysdir/myfile.json`
> >>>>>>>>
> >>>>>>>> dir0 a, b, c
> >>>>>>>> mytdir, 1, 2, 3
> >>>>>>>> mytdir, 4, 5, 6
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha <[email protected]>
> >>>>> wrote:
> >>>>>>>>
> >>>>>>>> Seems reasonable, as long as SELECT * also returns the dir#
> >> columns.
> >>>>>>>>>
> >>>>>>>>> On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau <
> >> [email protected]>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> Hey guys,
> >>>>>>>>>>
> >>>>>>>>>> I've been thinking that always showing dir# columns seems to
> >> alter
> >>>>> data
> >>>>>>>>>> returned from Drill depending on how you select the directory.
> >> I'd
> >>>>>>>>> propose
> >>>>>>>>>
> >>>>>>>>>> that we make it so that we only return dir# columns when they
> are
> >>>>>>>>>> explicitly requested.
> >>>>>>>>>>
> >>>>>>>>>> Thoughts?
> >>>>>>
> >>>>>> --
> >>>>>> Daniel Barclay
> >>>>>> MapR Technologies
> >
> >
> >
> > --
> > Steven Phillips
> > Software Engineer
> >
> > mapr.com
>



-- 
 Steven Phillips
 Software Engineer

 mapr.com

Re: Should we make dir* columns only exist when requested?

Reply via email to