> Ted wrote:
>
> For one thing, I can make a really slow version of [find] !

Why does it have to be slow? Seriously, so many of the tools we use
daily have quasi-query facilities (find, git log, du, ps, netstat) and
we cobble together queries using complex options and pipelines of unix
commands. Relational algebra is potentially MORE efficient.

I find myself writing ' ... | sort | uniq -c | sort -nr' almost daily
and wish I could write ' ... order by count(*) desc'.
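
To make that concrete, here is a rough sketch of the same counting done as a
single query. It uses Python's built-in sqlite3 rather than Drill, and the
table name `lines`, the column `line`, and the helper `count_desc` are all
made up for illustration. The secondary sort on `line` is my addition so that
ties come out in a predictable order; `sort -nr` does not promise that.

```python
import sqlite3


def count_desc(lines):
    """Emulate '... | sort | uniq -c | sort -nr' with one SQL query.

    `lines` is any iterable of strings. The table and column names
    are hypothetical and exist only for this example.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lines (line TEXT)")
    conn.executemany("INSERT INTO lines VALUES (?)", [(l,) for l in lines])
    # GROUP BY replaces 'sort | uniq -c'; ORDER BY replaces 'sort -nr'.
    rows = conn.execute(
        "SELECT count(*) AS n, line FROM lines "
        "GROUP BY line ORDER BY n DESC, line"
    ).fetchall()
    conn.close()
    return rows
```

Calling `count_desc` on a list of log lines returns `(count, line)` tuples,
most frequent first.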

On Thu, Apr 23, 2015 at 6:27 PM, Julian Hyde <[email protected]> wrote:
> +1 to returning directories as context. Very useful feature. Could be
> used to return context for other adapters (e.g. an adapter that
> concatenates all versions of versioned logfiles).
>
> +1 making dir an array, per Ted's suggestion
>
> I think dir should not appear in *; thus you'd have to write
>
>   select dir, * from `/mytdir/mysdir/myfile.json`
>
> This behavior is analogous to Oracle's ROWID. It is not a column as
> such, but a system function that you can apply to a row.
>
> You need to allow qualifiers:
>
>   select x.dir, x.*, y.dir, y.*
>   from `/mytdir/mysdir/myfile.json` as x,
>        `/mytdir/mysdir/myfile2.json` as y
>
> and
>
>   select dir
>   from `/mytdir/mysdir/myfile.json` as x,
>        `/mytdir/mysdir/myfile2.json` as y
>
> would be illegal because dir is ambiguous.
>
> You should make dir a reserved word (like ROWID).
>
> On Thu, Apr 23, 2015 at 5:12 PM, Ted Dunning <[email protected]> wrote:
>> Great point.
>>
>> Having the file name itself is very handy.
>>
>>
>> For one thing, I can make a really slow version of [find] !
>>
>> (seriously, I would love this)
>>
>>
>> On Thu, Apr 23, 2015 at 7:48 PM, rahul challapalli <
>> [email protected]> wrote:
>>
>>> I am also of the opinion that we should not assume knowledge on the user's
>>> part for data discovery. So we should either have 'dir' columns in 'select
>>> *' or support a variation that Ted suggested.
>>> Also, the folder names complement the actual data in some cases.
>>>
>>> - Rahul
>>>
>>> On Thu, Apr 23, 2015 at 4:38 PM, Daniel Barclay <[email protected]>
>>> wrote:
>>>
>>> > Regarding the use case in which the user stores information in pathnames:
>>> >
>>> > Since Drill supports that use case partially, shouldn't it do so more
>>> > completely?  In particular, since Drill provides access to subtree
>>> > pathname segments before the last one (the segments for directories),
>>> > should Drill provide access to the last one too (the simple file name)?
>>> >
>>> >
>>> > We support reading cases like this:
>>> > - root/
>>> > - root/2015/
>>> > - root/2015/01/
>>> > - root/2015/01/01/
>>> > - root/2015/01/01/log.json
>>> > - root/2015/02/
>>> > - root/2015/02/02/
>>> > - root/2015/02/02/log.json
>>> >
>>> > In particular, querying "select ... from `root` ..." includes the
>>> > date-portion segments of the pathnames in the dir0, etc, columns.
>>> >
>>> > Note that the user might not redundantly store the dates inside the
>>> > files themselves, since the dates are known to exist in the directory
>>> > names.
>>> >
>>> >
>>> > However, we don't support this variation of that case, right?:
>>> >
>>> > - root/
>>> > - root/2015/
>>> > - root/2015/01/
>>> > - root/2015/01/log_01.json
>>> > - root/2015/02/
>>> > - root/2015/02/log_02.json
>>> >
>>> > In particular, Drill includes several segments of the pathname after
>>> > the root of the subtree, but does not include the last segment--which
>>> > contains data just as the segments that _are_ included do.
>>> >
>>> > (Yes, the last segment usually contains artifacts besides the contained
>>> > data (e.g., the file extension) and the user would have to specify how
>>> > to interpret the file simple name segment as data, but the user has to
>>> > specify the interpretation for the other segments anyway.)
>>> >
>>> >
>>> > Daniel
>>> >
>>> >
>>> >
>>> > Ted Dunning wrote:
>>> >
>>> >> I would propose that dir be an array that contains all of the
>>> >> directories rather than having multiple values.
>>> >>
>>> >> The multiple names are particularly inconvenient if files are at
>>> >> different depths.
>>> >>
>>> >>
>>> >>
>>> >> On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau <[email protected]>
>>> >> wrote:
>>> >>
>>> >>> I'm specifically arguing that SELECT * shouldn't return the dir# columns.
>>> >>>
>>> >>> Here is current behavior:
>>> >>>
>>> >>> /mytdir/mysdir/myfile.json
>>> >>> {a:1,b:2,c:3}
>>> >>> {a:4,b:5,c:6}
>>> >>>
>>> >>> select * from `myfile.json`
>>> >>>
>>> >>> a, b, c
>>> >>> 1, 2, 3
>>> >>> 4, 5, 6
>>> >>>
>>> >>> select * from `/mysdir/myfile.json`
>>> >>>
>>> >>> dir0, a, b, c
>>> >>> mysdir, 1, 2, 3
>>> >>> mysdir, 4, 5, 6
>>> >>>
>>> >>> select * from `/mytdir/mysdir/myfile.json`
>>> >>>
>>> >>> dir0, dir1, a, b, c
>>> >>> mytdir, mysdir, 1, 2, 3
>>> >>> mytdir, mysdir, 4, 5, 6
>>> >>>
>>> >>>
>>> >>> ====================================
>>> >>> My proposal:
>>> >>>
>>> >>> select * from `myfile.json`
>>> >>> select * from `/mysdir/myfile.json`
>>> >>> select * from `/mytdir/mysdir/myfile.json`
>>> >>> ::all produce::
>>> >>> a, b, c
>>> >>> 1, 2, 3
>>> >>> 4, 5, 6
>>> >>>
>>> >>> select dir0, a, b, c from `/mysdir/myfile.json`
>>> >>>
>>> >>> dir0, a, b, c
>>> >>> mysdir, 1, 2, 3
>>> >>> mysdir, 4, 5, 6
>>> >>>
>>> >>> select dir0, a, b, c from `/mytdir/mysdir/myfile.json`
>>> >>>
>>> >>> dir0, a, b, c
>>> >>> mytdir, 1, 2, 3
>>> >>> mytdir, 4, 5, 6
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha <[email protected]>
>>> >>> wrote:
>>> >>>
>>> >>>> Seems reasonable, as long as SELECT * also returns the dir# columns.
>>> >>>>
>>> >>>> On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau <[email protected]>
>>> >>>> wrote:
>>> >>>>
>>> >>>>> Hey guys,
>>> >>>>>
>>> >>>>> I've been thinking that always showing dir# columns seems to alter
>>> >>>>> data returned from Drill depending on how you select the directory.
>>> >>>>> I'd propose that we make it so that we only return dir# columns
>>> >>>>> when they are explicitly requested.
>>> >>>>>
>>> >>>>> Thoughts?
>>> >>>>>
>>> >>>>>
>>> >>>>
>>> >>>
>>> >>
>>> >
>>> > --
>>> > Daniel Barclay
>>> > MapR Technologies
>>> >
>>>
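
For reference, the two ideas in the thread (Ted's dir-as-array and Daniel's
point about exposing the last path segment) can be sketched together. This is
only an illustration of the shape of the data, not how Drill does or would
implement it; the helper name `path_context` is hypothetical.

```python
from pathlib import PurePosixPath


def path_context(root, path):
    """Split a file path under `root` into (dirs, filename).

    `dirs` is a list of the directory segments below `root`, so files
    at different depths simply yield lists of different lengths.
    Hypothetical helper for illustration only.
    """
    parts = PurePosixPath(path).relative_to(root).parts
    return list(parts[:-1]), parts[-1]
```

With Daniel's two layouts, `root/2015/01/01/log.json` yields three directory
segments plus `log.json`, and `root/2015/01/log_01.json` yields two plus
`log_01.json`; interpreting the file name as data is still left to the user.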
