To answer Parth's question:

1) SELECT * FROM `*/subdir/data.json` WHERE `dir0` = '2015-04-03'

dir[i] would not be returned.

2) SELECT dir0, * FROM `*/subdir/data.json` WHERE `dir0` = '2015-04-03'

dir0 would be returned.

3) SELECT dir, * FROM `*/subdir/data.json` WHERE `dir0` = '2015-04-03'

The full dir array would be returned.

If the user does not explicitly ask for those special fields (dir), why do we
always include them in the result by default?  What if the user does not want
those fields?  Is there an easy way for the user to express that they do not
want them?

To me, it makes more sense for * to mean the regular fields in the
file/table, and for the dir fields to be special fields that are included in
the result only when the user explicitly asks for them.
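
To make the three cases above concrete, here is a rough Python sketch of the
projection rule I have in mind. It is only a model of the semantics, not Drill
code: project, record, and dirs are names made up for illustration, and it
assumes the directory names under the query root are already available as a
list.

```python
import re

def project(select_list, record, dirs):
    """Model of the proposed rule (hypothetical helper, not a Drill API).

    select_list: items from the SELECT clause, e.g. ["dir0", "*"]
    record:      regular fields read from the file
    dirs:        directory names between the query root and the file
    """
    out = {}
    for item in select_list:
        if item == "*":
            # '*' expands to the regular fields only, never to dir fields.
            out.update(record)
        elif item == "dir":
            # 'dir' returns the whole array of directory names.
            out["dir"] = list(dirs)
        elif (m := re.fullmatch(r"dir(\d+)", item)):
            # 'dirN' returns one directory name, or None if the path is shallower.
            i = int(m.group(1))
            out[item] = dirs[i] if i < len(dirs) else None
        else:
            out[item] = record.get(item)
    return out

# Sample data assumed for illustration: one row of 2015-04-03/subdir/data.json.
record = {"a": 1, "b": 2, "c": 3}
dirs = ["2015-04-03", "subdir"]

print(project(["*"], record, dirs))          # case 1: no dir fields
print(project(["dir0", "*"], record, dirs))  # case 2: dir0 plus regular fields
print(project(["dir", "*"], record, dirs))   # case 3: dir array plus regular fields
```

Note that the WHERE clause can still filter on dir0 in all three cases; the
sketch only covers which columns end up in the result.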




On Thu, Apr 23, 2015 at 5:01 PM, Parth Chandra <[email protected]>
wrote:

> A common use case (as Daniel's example pointed out) is to arrange data in
> directories by date and look for the newest date.
>
> Something like this:
>
> Directory structure -
>
>   2015-04-01/subdir/data.json
>   2015-04-02/subdir/data.json
>   2015-04-03/subdir/data.json
>   .
>   .
>
> Then query for the latest data available
>
> SELECT * FROM `*/subdir/data.json` WHERE `dir0` IN (SELECT MAX(`dir0`) FROM
> `*/subdir` )
>
> or even -
>
> SELECT * FROM `*/subdir/data.json` WHERE `dir0` = '2015-04-03'
>
>
> Would dir[i] be returned in this query?
>
>
>
>
>
>
> On Thu, Apr 23, 2015 at 4:48 PM, rahul challapalli <
> [email protected]> wrote:
>
> > I am also of the opinion that we should not assume knowledge on the
> > user front for data discovery. So we should either have 'dir' columns in
> > 'select *' or support a variation that Ted suggested.
> > Also the folder names complement the actual data in some cases.
> >
> > - Rahul
> >
> > On Thu, Apr 23, 2015 at 4:38 PM, Daniel Barclay <[email protected]>
> > wrote:
> >
> > > Regarding the use case in which the user stores information in
> > > pathnames:
> > >
> > > Since Drill supports that use case partially, shouldn't it do so more
> > > completely?  In particular, since Drill provides access to subtree
> > > pathname segments before the last one (the segments for directories),
> > > should Drill provide access to the last one too (the simple file name)?
> > >
> > >
> > > We support reading cases like this:
> > > - root/
> > > - root/2015/
> > > - root/2015/01/
> > > - root/2015/01/01/
> > > - root/2015/01/01/log.json
> > > - root/2015/02/
> > > - root/2015/02/02/
> > > - root/2015/02/02/log.json
> > >
> > > In particular, querying "select ... from `root` ..." includes the
> > > date-portion segments of the pathnames in the dir0, etc, columns.
> > >
> > > Note that the user might not redundantly store the dates inside the
> > > files themselves, since the dates are known to exist in the directory
> > > names.
> > >
> > >
> > > However, we don't support this variation of that case, right?:
> > >
> > > - root/
> > > - root/2015
> > > - root/2015/01/
> > > - root/2015/01/log_01.json
> > > - root/2015/02/
> > > - root/2015/02/log_02.json
> > >
> > > In particular, Drill includes several segments of the pathname after
> > > the root of the subtree, but does not include the last segment--which
> > > contains data just as the segments that _are_ included do.
> > >
> > > (Yes, the last segment usually contains artifacts besides the contained
> > > data (e.g., the file extension) and the user would have to specify how
> > > to interpret the file simple name segment as data, but the user has to
> > > specify the interpretation for the other segments anyway.)
> > >
> > >
> > > Daniel
> > >
> > >
> > >
> > > Ted Dunning wrote:
> > >
> > >> I would propose that dir be an array that contains all of the
> > >> directories rather than having multiple values.
> > >>
> > >> The multiple names are particularly inconvenient if files are at
> > >> different depths.
> > >>
> > >>
> > >>
> > >> On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau <[email protected]>
> > >> wrote:
> > >>
> > >>  I'm specifically arguing that SELECT * doesn't return the columns.
> > >>>
> > >>> Here is current behavior:
> > >>>
> > >>> /mytdir/mysdir/myfile.json
> > >>> {a:1,b:2,c:3}
> > >>> {a:4,b:5,c:6}
> > >>>
> > >>> select * from `myfile.json`
> > >>>
> > >>> a, b, c
> > >>> 1, 2, 3
> > >>> 4, 5, 6
> > >>>
> > >>> select * from `/mysdir/myfile.json`
> > >>>
> > >>> dir0, a, b, c
> > >>> mysdir, 1, 2, 3
> > >>> mysdir, 4, 5, 6
> > >>>
> > >>> select * from `/mytdir/mysdir/myfile.json`
> > >>>
> > >>> dir0, dir1, a, b, c
> > >>> mytdir, mysdir, 1, 2, 3
> > >>> mytdir, mysdir, 4, 5, 6
> > >>>
> > >>>
> > >>> ====================================
> > >>> My proposal:
> > >>>
> > >>> select * from `myfile.json`
> > >>> select * from `/mysdir/myfile.json`
> > >>> select * from `/mytdir/mysdir/myfile.json`
> > >>> ::all produce::
> > >>> a, b, c
> > >>> 1, 2, 3
> > >>> 4, 5, 6
> > >>>
> > >>> select dir0, a, b, c from `/mysdir/myfile.json`
> > >>>
> > >>> dir0, a, b, c
> > >>> mysdir, 1, 2, 3
> > >>> mysdir, 4, 5, 6
> > >>>
> > >>> select dir0, a, b, c from `/mytdir/mysdir/myfile.json`
> > >>>
> > >>> dir0, a, b, c
> > >>> mytdir, 1, 2, 3
> > >>> mytdir, 4, 5, 6
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha <[email protected]>
> > wrote:
> > >>>
> > >>>  Seems reasonable, as long as SELECT * also returns the dir# columns.
> > >>>>
> > >>>> On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau <[email protected]
> >
> > >>>> wrote:
> > >>>>
> > >>>>  Hey guys,
> > >>>>>
> > >>>>> I've been thinking that always showing dir# columns seems to alter
> > >>>>> data returned from Drill depending on how you select the directory.
> > >>>>> I'd propose that we make it so that we only return dir# columns when
> > >>>>> they are explicitly requested.
> > >>>>>
> > >>>>> Thoughts?
> > >>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> > >
> > > --
> > > Daniel Barclay
> > > MapR Technologies
> > >
> >
>
