I am also of the opinion that we should not assume any prior knowledge on the
user's part for data discovery. So we should either include the 'dir' columns
in 'select *' or support the variation that Ted suggested.
Also, the folder names complement the actual data in some cases.
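
To make the options in this thread concrete, here is a rough sketch (plain
Python, not Drill code) of how the pathname segments below the queried root
could map to columns. The function name path_columns, the "dir" array column
(Ted's variation), and the "filename" column (Daniel's last segment) are
hypothetical names used only for illustration; only the dir0, dir1, ...
columns exist today.

```python
from pathlib import PurePosixPath


def path_columns(root: str, file_path: str) -> dict:
    """Map the pathname segments of file_path below root to column values.

    Assumes file_path names a file somewhere under root.
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(root))
    # Every segment but the last is a directory; the last is the simple file name.
    *dirs, filename = relative.parts

    # Current style: one column per directory level (dir0, dir1, ...).
    columns = {f"dir{i}": segment for i, segment in enumerate(dirs)}
    # Hypothetical array variation: a single column holding all directory segments.
    columns["dir"] = list(dirs)
    # Hypothetical column for the last segment, which may also carry data.
    columns["filename"] = filename
    return columns


example = path_columns("root", "root/2015/01/log_01.json")
```

With the array form, files at different depths still yield the same set of
column names; only the length of "dir" changes.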

- Rahul

On Thu, Apr 23, 2015 at 4:38 PM, Daniel Barclay <[email protected]>
wrote:

> Regarding the use case in which the user stores information in pathnames:
>
> Since Drill supports that use case partially, shouldn't it do so more
> completely?  In particular, since Drill provides access to subtree
> pathname segments before the last one (the segments for directories),
> should Drill provide access to the last one too (the simple file name)?
>
>
> We support reading cases like this:
> - root/
> - root/2015/
> - root/2015/01/
> - root/2015/01/01/
> - root/2015/01/01/log.json
> - root/2015/02/
> - root/2015/02/02/
> - root/2015/02/02/log.json
>
> In particular, querying "select ... from `root` ..." includes the
> date-portion segments of the pathnames in the dir0, etc, columns.
>
> Note that the user might not redundantly store the dates inside the
> files themselves, since the dates are known to exist in the directory
> names.
>
>
> However, we don't support this variation of that case, right?:
>
> - root/
> - root/2015/
> - root/2015/01/
> - root/2015/01/log_01.json
> - root/2015/02/
> - root/2015/02/log_02.json
>
> In particular, Drill includes several segments of the pathname after
> the root of the subtree, but does not include the last segment--which
> contains data just as the segments that _are_ included do.
>
> (Yes, the last segment usually contains artifacts besides the contained
> data (e.g., the file extension) and the user would have to specify how
> to interpret the file simple name segment as data, but the user has to
> specify the interpretation for the other segments anyway.)
>
>
> Daniel
>
>
>
> Ted Dunning wrote:
>
>> I would propose that dir be an array that contains all of the directories
>> rather than having multiple values.
>>
>> The multiple names are particularly inconvenient if files are at different
>> depths.
>>
>>
>>
>> On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau <[email protected]>
>> wrote:
>>
>>  I'm specifically arguing that SELECT * shouldn't return the dir# columns.
>>>
>>> Here is current behavior:
>>>
>>> /mytdir/mysdir/myfile.json
>>> {a:1,b:2,c:3}
>>> {a:4,b:5,c:6}
>>>
>>> select * from `myfile.json`
>>>
>>> a, b, c
>>> 1, 2, 3
>>> 4, 5, 6
>>>
>>> select * from `/mysdir/myfile.json`
>>>
>>> dir0, a, b, c
>>> mysdir, 1, 2, 3
>>> mysdir, 4, 5, 6
>>>
>>> select * from `/mytdir/mysdir/myfile.json`
>>>
>>> dir0, dir1, a, b, c
>>> mytdir, mysdir, 1, 2, 3
>>> mytdir, mysdir, 4, 5, 6
>>>
>>>
>>> ====================================
>>> My proposal:
>>>
>>> select * from `myfile.json`
>>> select * from `/mysdir/myfile.json`
>>> select * from `/mytdir/mysdir/myfile.json`
>>> ::all produce::
>>> a, b, c
>>> 1, 2, 3
>>> 4, 5, 6
>>>
>>> select dir0, a, b, c from `/mysdir/myfile.json`
>>>
>>> dir0, a, b, c
>>> mysdir, 1, 2, 3
>>> mysdir, 4, 5, 6
>>>
>>> select dir0, a, b, c from `/mytdir/mysdir/myfile.json`
>>>
>>> dir0, a, b, c
>>> mytdir, 1, 2, 3
>>> mytdir, 4, 5, 6
>>>
>>>
>>>
>>>
>>> On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha <[email protected]> wrote:
>>>
>>>  Seems reasonable, as long as SELECT * also returns the dir# columns.
>>>>
>>>> On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau <[email protected]>
>>>> wrote:
>>>>
>>>>  Hey guys,
>>>>>
>>>>> I've been thinking that always showing dir# columns seems to alter data
>>>>> returned from Drill depending on how you select the directory.  I'd
>>>>> propose that we make it so that we only return dir# columns when they are
>>>>> explicitly requested.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>>
>>>>
>>>
>>
>
> --
> Daniel Barclay
> MapR Technologies
>
