A common use case (as Daniel's example pointed out) is to arrange data in
directories by date and look for the newest date.

Something like this:

Directory structure -

  2015-04-01/subdir/data.json
  2015-04-02/subdir/data.json
  2015-04-03/subdir/data.json
  .
  .

Then query for the latest data available:

SELECT * FROM `*/subdir/data.json`
WHERE `dir0` IN (SELECT MAX(`dir0`) FROM `*/subdir`)

or even -

SELECT * FROM `*/subdir/data.json` WHERE `dir0` = '2015-04-03'


Would dir[i] be returned in this query?
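
To make the use case concrete, here is a minimal Python sketch (not Drill code) of what I mean. It assumes dir0, dir1, ... are simply the directory segments between the queried root and the file, which matches the examples quoted below. The helper names (dir_columns, latest_files) and the "data" root are made up for illustration, and it assumes every file sits at least one directory below the root.

```python
from pathlib import PurePosixPath


def dir_columns(root, file_path):
    """Map the directory segments between root and the file to dir0, dir1, ...

    Assumption: this mirrors the dir# behaviour shown in the thread; it is
    not taken from Drill's implementation.
    """
    rel = PurePosixPath(file_path).relative_to(root)
    # rel.parts[:-1] drops the file name and keeps only directory segments.
    return {f"dir{i}": seg for i, seg in enumerate(rel.parts[:-1])}


def latest_files(root, file_paths):
    """Return the files whose dir0 is the maximum dir0 (the newest date).

    ISO dates (YYYY-MM-DD) sort correctly as plain strings, so max() works.
    """
    cols = [(dir_columns(root, p), p) for p in file_paths]
    newest = max(c["dir0"] for c, _ in cols)
    return [p for c, p in cols if c["dir0"] == newest]


# Hypothetical layout matching the directory structure above.
paths = [
    "data/2015-04-01/subdir/data.json",
    "data/2015-04-02/subdir/data.json",
    "data/2015-04-03/subdir/data.json",
]

print(dir_columns("data", paths[0]))
print(latest_files("data", paths))
```

The point is that finding the newest date only works if the directory segment is visible to the query in some form, whether as dir0 or as an element of a dir array.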






On Thu, Apr 23, 2015 at 4:48 PM, rahul challapalli <
[email protected]> wrote:

> I am also of the opinion that we should not assume knowledge on the user's
> part for data discovery. So we should either have 'dir' columns in 'select
> *' or support a variation that Ted suggested.
> Also, the folder names complement the actual data in some cases.
>
> - Rahul
>
> On Thu, Apr 23, 2015 at 4:38 PM, Daniel Barclay <[email protected]>
> wrote:
>
> > Regarding the use case in which the user stores information in pathnames:
> >
> > Since Drill supports that use case partially, shouldn't it do so more
> > completely?  In particular, since Drill provides access to subtree
> > pathname segments before the last one (the segments for directories),
> > should Drill provide access to the last one too (the simple file name)?
> >
> >
> > We support reading cases like this:
> > - root/
> > - root/2015/
> > - root/2015/01/
> > - root/2015/01/01/
> > - root/2015/01/01/log.json
> > - root/2015/02/
> > - root/2015/02/02/
> > - root/2015/02/02/log.json
> >
> > In particular, querying "select ... from `root` ..." includes the
> > date-portion segments of the pathnames in the dir0, etc, columns.
> >
> > Note that the user might not redundantly store the dates inside the
> > files themselves, since the dates are known to exist in the directory
> > names.
> >
> >
> > However, we don't support this variation of that case, right?:
> >
> > - root/
> > - root/2015/
> > - root/2015/01/
> > - root/2015/01/log_01.json
> > - root/2015/02/
> > - root/2015/02/log_02.json
> >
> > In particular, Drill includes several segments of the pathname after
> > the root of the subtree, but does not include the last segment--which
> > contains data just as the segments that _are_ included do.
> >
> > (Yes, the last segment usually contains artifacts besides the contained
> > data (e.g., the file extension) and the user would have to specify how
> > to interpret the file simple name segment as data, but the user has to
> > specify the interpretation for the other segments anyway.)
> >
> >
> > Daniel
> >
> >
> >
> > Ted Dunning wrote:
> >
> >> I would propose that dir be an array that contains all of the directories
> >> rather than having multiple values.
> >>
> >> The multiple names are particularly inconvenient if files are at
> >> different depths.
> >>
> >>
> >>
> >> On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau <[email protected]>
> >> wrote:
> >>
> >>> I'm specifically arguing that SELECT * doesn't return the columns.
> >>>
> >>> Here is current behavior:
> >>>
> >>> /mytdir/mysdir/myfile.json
> >>> {a:1,b:2,c:3}
> >>> {a:4,b:5,c:6}
> >>>
> >>> select * from `myfile.json`
> >>>
> >>> a, b, c
> >>> 1, 2, 3
> >>> 4, 5, 6
> >>>
> >>> select * from `/mysdir/myfile.json`
> >>>
> >>> dir0, a, b, c
> >>> mysdir, 1, 2, 3
> >>> mysdir, 4, 5, 6
> >>>
> >>> select * from `/mytdir/mysdir/myfile.json`
> >>>
> >>> dir0, dir1, a, b, c
> >>> mytdir, mysdir, 1, 2, 3
> >>> mytdir, mysdir, 4, 5, 6
> >>>
> >>>
> >>> ====================================
> >>> My proposal:
> >>>
> >>> select * from `myfile.json`
> >>> select * from `/mysdir/myfile.json`
> >>> select * from `/mytdir/mysdir/myfile.json`
> >>> ::all produce::
> >>> a, b, c
> >>> 1, 2, 3
> >>> 4, 5, 6
> >>>
> >>> select dir0, a, b, c from `/mysdir/myfile.json`
> >>>
> >>> dir0, a, b, c
> >>> mysdir, 1, 2, 3
> >>> mysdir, 4, 5, 6
> >>>
> >>> select dir0, a, b, c from `/mytdir/mysdir/myfile.json`
> >>>
> >>> dir0, a, b, c
> >>> mytdir, 1, 2, 3
> >>> mytdir, 4, 5, 6
> >>>
> >>>
> >>>
> >>>
> >>> On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha <[email protected]>
> >>> wrote:
> >>>
> >>>> Seems reasonable, as long as SELECT * also returns the dir# columns.
> >>>>
> >>>> On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Hey guys,
> >>>>>
> >>>>> I've been thinking that always showing dir# columns seems to alter the
> >>>>> data returned from Drill depending on how you select the directory.  I'd
> >>>>> propose that we make it so that we only return dir# columns when they
> >>>>> are explicitly requested.
> >>>>>
> >>>>> Thoughts?
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
> > --
> > Daniel Barclay
> > MapR Technologies
> >
>
