Regarding the use case in which the user stores information in pathnames:

Since Drill supports that use case partially, shouldn't it do so more
completely?  In particular, since Drill provides access to subtree
pathname segments before the last one (the segments for directories),
should Drill provide access to the last one too (the simple file name)?


We support reading cases like this:
- root/
- root/2015/
- root/2015/01/
- root/2015/01/01/
- root/2015/01/01/log.json
- root/2015/02/
- root/2015/02/02/
- root/2015/02/02/log.json

In particular, querying "select ... from `root` ..." exposes the
date segments of the pathnames in the dir0, dir1, etc. columns.
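
To make that mapping concrete, here is a minimal Python sketch of how the
directory segments under the queried root could map to dir0, dir1, ...
columns. It only illustrates the behavior described above and is not
Drill's implementation; the function name dir_columns is hypothetical.

```python
from pathlib import PurePosixPath

def dir_columns(root, file_path):
    """Map the directory segments between root and the file to dir0, dir1, ...

    Hypothetical helper for illustration only.
    """
    rel = PurePosixPath(file_path).relative_to(PurePosixPath(root))
    # rel.parts[:-1] drops the last segment (the simple file name),
    # which is exactly the segment that is not exposed today.
    return {f"dir{i}": seg for i, seg in enumerate(rel.parts[:-1])}

print(dir_columns("root", "root/2015/01/01/log.json"))
# prints {'dir0': '2015', 'dir1': '01', 'dir2': '01'}
```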

Note that the user might not redundantly store the dates inside the
files themselves, since the dates are known to exist in the directory
names.


However, we don't support this variation of that case, right?

- root/
- root/2015/
- root/2015/01/
- root/2015/01/log_01.json
- root/2015/02/
- root/2015/02/log_02.json

In particular, Drill includes several segments of the pathname after
the root of the subtree, but does not include the last segment--which
contains data just as the segments that _are_ included do.

(Yes, the last segment usually contains artifacts besides the embedded
data (e.g., the file extension), and the user would have to specify how
to interpret the simple file name as data, but the user has to specify
the interpretation for the other segments anyway.)
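
As a rough sketch of what "specifying the interpretation" of the last
segment might look like, the Python below pulls the day out of a name
such as log_01.json with a user-supplied pattern. The function name
file_name_value and the default pattern are assumptions made for this
example, not anything Drill provides.

```python
import re
from pathlib import PurePosixPath

def file_name_value(file_path, pattern=r"log_(\d+)\.json"):
    """Extract the data embedded in the simple file name, if any.

    Hypothetical helper; the pattern is the user's interpretation of
    the file name, just as the user interprets dir0, dir1, ... today.
    """
    name = PurePosixPath(file_path).name  # last pathname segment
    match = re.fullmatch(pattern, name)
    return match.group(1) if match else None

print(file_name_value("root/2015/01/log_01.json"))  # prints 01
```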


Daniel


Ted Dunning wrote:
I would propose that dir be an array that contains all of the directories
rather than having multiple values.

The multiple names are particularly inconvenient if files are at different
depths.
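
A small Python sketch of the array idea, assuming a single list-valued
column holds every directory segment below the queried root. The name
dir_array is hypothetical and this is not how Drill represents it today.

```python
from pathlib import PurePosixPath

def dir_array(root, file_path):
    """Return all directory segments below root as one list.

    Hypothetical helper illustrating a single array-valued column.
    """
    rel = PurePosixPath(file_path).relative_to(PurePosixPath(root))
    return list(rel.parts[:-1])  # everything except the file name

# Files at different depths give lists of different lengths, so no
# fixed set of dirN column names is needed.
for path in ["root/2015/log.json", "root/2015/01/01/log.json"]:
    print(path, dir_array("root", path))
```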



On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau <[email protected]> wrote:

I'm specifically arguing that SELECT * shouldn't return the dir# columns.

Here is current behavior:

/mytdir/mysdir/myfile.json
{a:1,b:2,c:3}
{a:4,b:5,c:6}

select * from `myfile.json`

a, b, c
1, 2, 3
4, 5, 6

select * from `/mysdir/myfile.json`

dir0, a, b, c
mysdir, 1, 2, 3
mysdir, 4, 5, 6

select * from `/mytdir/mysdir/myfile.json`

dir0, dir1, a, b, c
mytdir, mysdir, 1, 2, 3
mytdir, mysdir, 4, 5, 6


====================================
My proposal:

select * from `myfile.json`
select * from `/mysdir/myfile.json`
select * from `/mytdir/mysdir/myfile.json`
::all produce::
a, b, c
1, 2, 3
4, 5, 6

select dir0, a, b, c from `/mysdir/myfile.json`

dir0, a, b, c
mysdir, 1, 2, 3
mysdir, 4, 5, 6

select dir0, a, b, c from `/mytdir/mysdir/myfile.json`

dir0, a, b, c
mytdir, 1, 2, 3
mytdir, 4, 5, 6
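
The proposed semantics can be sketched in a few lines of Python: the
star projection returns only the columns stored in the file, while dir#
values appear only when named explicitly. The function project and the
rule that a file column shadows a same-named dir# column are assumptions
made for this illustration.

```python
def project(rows, dirs, columns="*"):
    """Apply the proposed projection rule to rows read from one file.

    rows:    list of dicts, one per record in the file
    dirs:    directory segments of the selected path, in order
    columns: "*" or an explicit list of column names
    """
    implicit = {f"dir{i}": d for i, d in enumerate(dirs)}
    result = []
    for row in rows:
        if columns == "*":
            # Star expands to the file's own columns only.
            result.append(dict(row))
        else:
            # Explicit names may refer to dir# columns; a column in the
            # file with the same name wins (assumption).
            merged = {**implicit, **row}
            result.append({c: merged.get(c) for c in columns})
    return result

rows = [{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6}]
print(project(rows, ["mytdir", "mysdir"]))
print(project(rows, ["mytdir", "mysdir"], ["dir0", "a", "b", "c"]))
```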




On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha <[email protected]> wrote:

Seems reasonable, as long as SELECT * also returns the dir# columns.

On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau <[email protected]>
wrote:

Hey guys,

I've been thinking that always showing dir# columns seems to alter data
returned from Drill depending on how you select the directory.  I'd
propose that we make it so that we only return dir# columns when they
are explicitly requested.

Thoughts?






--
Daniel Barclay
MapR Technologies
