Re: Should we make dir* columns only exist when requested?

Julian Hyde Thu, 23 Apr 2015 18:28:35 -0700

+1 to returning directories as context. Very useful feature. Could be
used to return context for other adapters (e.g. an adapter that
concatenates all versions of versioned logfiles).


+1 to making dir an array, per Ted's suggestion.
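
To make the array idea concrete, here is a minimal Python sketch (not Drill code) of how a dir array might be derived from a file's path relative to the queried root. The function name dir_array and the choice to leave the file name itself out of the array are assumptions for illustration only.

```python
from pathlib import PurePosixPath
from typing import List


def dir_array(root: str, file_path: str) -> List[str]:
    """Return the directory segments between `root` and the file.

    Hypothetical helper: the file's own name (the last segment) is
    deliberately excluded, which is an assumption of this sketch.
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(root))
    # Drop the final segment (the file name); keep only directories.
    return list(relative.parts[:-1])


# Files at different depths yield arrays of different lengths.
print(dir_array("root", "root/2015/01/01/log.json"))
print(dir_array("root", "root/2015/01/log_01.json"))
```

Because the result is a single value, files at different depths do not change the set of columns; only the length of the array varies.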

I think dir should not appear in *; thus you'd have to write

  select dir, * from `/mytdir/mysdir/myfile.json`

This behavior is analogous to Oracle's ROWID. It is not a column as
such, but a system function that you can apply to a row.
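
As a rough illustration of that rule, the following Python sketch expands * over the data columns only and returns dir only when it is named explicitly. The names SYSTEM_COLUMNS and project are hypothetical, and a row is modelled as a plain dict; this is not how Drill implements projection.

```python
from typing import Any, Dict, List

# Assumption: names that behave like ROWID, i.e. never expanded by "*".
SYSTEM_COLUMNS = {"dir"}


def project(row: Dict[str, Any], select_list: List[str]) -> Dict[str, Any]:
    """Apply a select list to one row, hiding system columns from "*"."""
    out: Dict[str, Any] = {}
    for item in select_list:
        if item == "*":
            # "*" expands to ordinary data columns only.
            for name, value in row.items():
                if name not in SYSTEM_COLUMNS:
                    out[name] = value
        else:
            # An explicitly named item is always returned.
            out[item] = row[item]
    return out


row = {"dir": ["mytdir", "mysdir"], "a": 1, "b": 2, "c": 3}
print(project(row, ["*"]))          # data columns only
print(project(row, ["dir", "*"]))   # dir first, then data columns
```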

You need to allow qualifiers:

  select x.dir, x.*, y.dir, y.*
  from `/mytdir/mysdir/myfile.json` as x,
       `/mytdir/mysdir/myfile2.json` as y

and

  select dir
  from `/mytdir/mysdir/myfile.json` as x,
       `/mytdir/mysdir/myfile2.json` as y

would be illegal because dir is ambiguous.
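
The qualifier rule can be sketched in a few lines of Python. This is only an illustration of the intended behavior, assuming a validator that knows the table aliases in the FROM clause; resolve_dir and AmbiguousReference are made-up names, not part of Drill or Calcite.

```python
from typing import List, Optional


class AmbiguousReference(Exception):
    """Raised when an unqualified dir could refer to more than one table."""


def resolve_dir(qualifier: Optional[str], aliases: List[str]) -> str:
    """Return the alias whose dir is meant, or raise if it is unclear."""
    if qualifier is not None:
        if qualifier not in aliases:
            raise KeyError("unknown table alias: " + qualifier)
        return qualifier
    if len(aliases) == 1:
        # Only one table in scope, so an unqualified dir is unambiguous.
        return aliases[0]
    raise AmbiguousReference("dir is ambiguous; qualify it with a table alias")


print(resolve_dir("x", ["x", "y"]))   # qualified reference is accepted
print(resolve_dir(None, ["x"]))       # single table: no qualifier needed
try:
    resolve_dir(None, ["x", "y"])
except AmbiguousReference as err:
    print("error:", err)
```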

You should make dir a reserved word (like ROWID).

On Thu, Apr 23, 2015 at 5:12 PM, Ted Dunning <[email protected]> wrote:
> Great point.
>
> Having the file name itself is very handy.
>
>
> For one thing, I can make a really slow version of [find] !
>
> (seriously, I would love this)
>
>
> On Thu, Apr 23, 2015 at 7:48 PM, rahul challapalli <
> [email protected]> wrote:
>
>> I am also of the opinion that we should not assume knowledge on the user's
>> part for data discovery. So we should either have 'dir' columns in 'select
>> *' or support a variation that Ted suggested.
>> Also, the folder names complement the actual data in some cases.
>>
>> - Rahul
>>
>> On Thu, Apr 23, 2015 at 4:38 PM, Daniel Barclay <[email protected]>
>> wrote:
>>
>> > Regarding the use case in which the user stores information in pathnames:
>> >
>> > Since Drill supports that use case partially, shouldn't it do so more
>> > completely?  In particular, since Drill provides access to subtree
>> > pathname segments before the last one (the segments for directories),
>> > should Drill provide access to the last one too (the simple file name)?
>> >
>> >
>> > We support reading cases like this:
>> > - root/
>> > - root/2015/
>> > - root/2015/01/
>> > - root/2015/01/01/
>> > - root/2015/01/01/log.json
>> > - root/2015/02/
>> > - root/2015/02/02/
>> > - root/2015/02/02/log.json
>> >
>> > In particular, querying "select ... from `root` ..." includes the
>> > date-portion segments of the pathnames in the dir0, etc, columns.
>> >
>> > Note that the user might not redundantly store the dates inside the
>> > files themselves, since the dates are known to exist in the directory
>> > names.
>> >
>> >
>> > However, we don't support this variation of that case, right?:
>> >
>> > - root/
>> > - root/2015
>> > - root/2015/01/
>> > - root/2015/01/log_01.json
>> > - root/2015/02/
>> > - root/2015/02/log_02.json
>> >
>> > In particular, Drill includes several segments of the pathname after
>> > the root of the subtree, but does not include the last segment--which
>> > contains data just as the segments that _are_ included do.
>> >
>> > (Yes, the last segment usually contains artifacts besides the contained
>> > data (e.g., the file extension) and the user would have to specify how
>> > to interpret the file simple name segment as data, but the user has to
>> > specify the interpretation for the other segments anyway.)
>> >
>> >
>> > Daniel
>> >
>> >
>> >
>> > Ted Dunning wrote:
>> >
>> >> I would propose that dir be an array that contains all of the
>> >> directories rather than having multiple values.
>> >>
>> >> The multiple names are particularly inconvenient if files are at
>> >> different depths.
>> >>
>> >>
>> >>
>> >> On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau <[email protected]>
>> >> wrote:
>> >>
>> >>  I'm specifically arguing that SELECT * doesn't return the columns.
>> >>>
>> >>> Here is current behavior:
>> >>>
>> >>> /mytdir/mysdir/myfile.json
>> >>> {a:1,b:2,c:3}
>> >>> {a:4,b:5,c:6}
>> >>>
>> >>> select * from `myfile.json`
>> >>>
>> >>> a, b, c
>> >>> 1, 2, 3
>> >>> 4, 5, 6
>> >>>
>> >>> select * from `/mysdir/myfile.json`
>> >>>
>> >>> dir0, a, b, c
>> >>> mysdir, 1, 2, 3
>> >>> mysdir, 4, 5, 6
>> >>>
>> >>> select * from `/mytdir/mysdir/myfile.json`
>> >>>
>> >>> dir0, dir1, a, b, c
>> >>> mytdir, mysdir, 1, 2, 3
>> >>> mytdir, mysdir, 4, 5, 6
>> >>>
>> >>>
>> >>> ====================================
>> >>> My proposal:
>> >>>
>> >>> select * from `myfile.json`
>> >>> select * from `/mysdir/myfile.json`
>> >>> select * from `/mytdir/mysdir/myfile.json`
>> >>> ::all produce::
>> >>> a, b, c
>> >>> 1, 2, 3
>> >>> 4, 5, 6
>> >>>
>> >>> select dir0, a, b, c from `/mysdir/myfile.json`
>> >>>
>> >>> dir0, a, b, c
>> >>> mysdir, 1, 2, 3
>> >>> mysdir, 4, 5, 6
>> >>>
>> >>> select dir0, a, b, c from `/mytdir/mysdir/myfile.json`
>> >>>
>> >>> dir0, a, b, c
>> >>> mytdir, 1, 2, 3
>> >>> mytdir, 4, 5, 6
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha <[email protected]>
>> >>> wrote:
>> >>>
>> >>>  Seems reasonable, as long as SELECT * also returns the dir# columns.
>> >>>>
>> >>>> On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau <[email protected]>
>> >>>> wrote:
>> >>>>
>> >>>>  Hey guys,
>> >>>>>
>> >>>>> I've been thinking that always showing dir# columns seems to alter data
>> >>>>> returned from Drill depending on how you select the directory.  I'd propose
>> >>>>> that we make it so that we only return dir# columns when they are
>> >>>>> explicitly requested.
>> >>>>>
>> >>>>> Thoughts?
>> >>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >
>> > --
>> > Daniel Barclay
>> > MapR Technologies
>> >
>>
