Re: Should we make dir* columns only exist when requested?

Tomer Shiran Thu, 23 Apr 2015 19:30:17 -0700

+1 to adding the filename (needed this last week, I had <user_id>.json files 
and wanted to join with another table)
+1 to using an array dirs[]
+1 to not having it in * (but would "select dirs, *" work?)
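
To make the filename and dirs[] suggestions concrete, here is a minimal Python sketch (not Drill code) of how a file's path below the queried table root could be split into a dirs array plus a file name. The names path_context and user_id_from_filename are hypothetical, and the sketch assumes the file sits strictly below the table root.

```python
from pathlib import PurePosixPath


def path_context(table_root: str, file_path: str) -> dict:
    """Split the path below table_root into a dirs array and a file name.

    Hypothetical helper for illustration only; assumes file_path is
    strictly below table_root (relative_to raises ValueError otherwise).
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(table_root))
    # Every segment except the last is a directory; the last is the file.
    *dirs, filename = relative.parts
    return {"dirs": dirs, "filename": filename}


def user_id_from_filename(filename: str) -> str:
    """Recover a join key from a <user_id>.json style file name."""
    return PurePosixPath(filename).stem
```

With an array, files at different depths simply yield arrays of different lengths instead of different sets of columns, and the file name becomes available as a join key.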




> On Apr 23, 2015, at 7:00 PM, Steven Phillips <[email protected]> wrote:
> 
> What you are showing for the current behavior seems wrong to me:
> 
> $ tree mytdir
> mytdir
> └── mysdir
>    └── myFile.json
> 
> $ cat mytdir/mysdir/myFile.json
> {a:1,b:2,c:3}
> {a:4,b:5,c:6}
> 
> 0: jdbc:drill:> select * from `mytdir/mysdir/myFile.json`;
> +------------+------------+------------+
> |     a      |     b      |     c      |
> +------------+------------+------------+
> | 1          | 2          | 3          |
> | 4          | 5          | 6          |
> +------------+------------+------------+
> 2 rows selected (0.274 seconds)
> 0: jdbc:drill:> select * from `mytdir/mysdir/myFile.json`;
> +------------+------------+------------+
> |     a      |     b      |     c      |
> +------------+------------+------------+
> | 1          | 2          | 3          |
> | 4          | 5          | 6          |
> +------------+------------+------------+
> 2 rows selected (0.152 seconds)
> 0: jdbc:drill:> select * from `/mytdir/mysdir`;
> +------------+------------+------------+
> |     a      |     b      |     c      |
> +------------+------------+------------+
> | 1          | 2          | 3          |
> | 4          | 5          | 6          |
> +------------+------------+------------+
> 2 rows selected (0.157 seconds)
> 0: jdbc:drill:> select * from `mytdir`;
> +------------+------------+------------+------------+
> |    dir0    |     a      |     b      |     c      |
> +------------+------------+------------+------------+
> | mysdir     | 1          | 2          | 3          |
> | mysdir     | 4          | 5          | 6          |
> +------------+------------+------------+------------+
> 
> I don't know why, in your example, you are getting a dir0 column when
> selecting a specific file. The dir* columns should only be included when
> the specified table is a directory which contains subdirectories. Any query
> to a specific file or to a directory that only contains regular files
> should not return dir* columns.
> I think this is the correct behavior.
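
The rule described above can be sketched roughly as follows. This is an illustration of the described behavior only, not Drill's implementation; dir_columns is a hypothetical name, and paths are treated as plain POSIX-style strings.

```python
from pathlib import PurePosixPath


def dir_columns(table_root: str, file_path: str) -> dict:
    """Return the dirN columns for one file, relative to the queried table.

    Hypothetical sketch: only directory segments strictly between the
    table root and the file become dir0, dir1, ...
    """
    root = PurePosixPath(table_root)
    path = PurePosixPath(file_path)
    if path == root:
        # The table is the file itself, so there are no dir* columns.
        return {}
    # Drop the final segment (the file name); only directories count.
    segments = path.relative_to(root).parts[:-1]
    return {f"dir{i}": segment for i, segment in enumerate(segments)}
```

Under this sketch, querying `mytdir` yields a dir0 of mysdir for myFile.json, while querying `mytdir/mysdir` or the file itself yields no dir* columns.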
> 
> The fact that `mytdir` and `mytdir/mysdir` have different columns is not a
> problem, because they are different tables.
> 
> I do think Daniel's idea of adding the file name as well makes sense. I'm
> also open to Ted's idea to return a dir array instead of individual
> columns.
> 
> On Thu, Apr 23, 2015 at 6:36 PM, Julian Hyde <[email protected]> wrote:
> 
>>> Ted wrote:
>>> 
>>> For one thing, I can make a really slow version of [find] !
>> 
>> Why does it have to be slow? Seriously, so many of the tools we use
>> daily have quasi-query facilities (find, git log, du, ps, netstat) and
>> we cobble together queries using complex options and pipelines of unix
>> commands. Relational algebra is potentially MORE efficient.
>> 
>> I find myself writing ' ... | sort | uniq -c | sort -nr' almost daily
>> and wish I could write ' ... order by count(*) desc'.
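
For comparison, the same "count duplicates, most frequent first" idea that the pipeline and the `order by count(*) desc` clause express can be sketched in Python. The function name count_desc is made up for illustration; note that ties keep first-seen order here, which may differ from what `sort -nr` produces.

```python
from collections import Counter


def count_desc(lines):
    """Count distinct lines and list them most frequent first.

    Rough counterpart of `... | sort | uniq -c | sort -nr`, or of grouping
    by the line and ordering by count(*) descending.
    """
    return Counter(lines).most_common()
```

For example, count_desc(["a", "b", "a"]) pairs each distinct line with its count, starting with the most frequent one.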
>> 
>>> On Thu, Apr 23, 2015 at 6:27 PM, Julian Hyde <[email protected]> wrote:
>>> +1 to returning directories as context. Very useful feature. Could be
>>> used to return context for other adapters (e.g. an adapter that
>>> concatenates all versions of versioned logfiles).
>>> 
>>> +1 making dir an array, per Ted's suggestion
>>> 
>>> I think dir should not appear in *; thus you'd have to write
>>> 
>>>  select dir, * from `/mytdir/mysdir/myfile.json`
>>> 
>>> This behavior is analogous to Oracle's ROWID. It is not a column as
>>> such, but a system function that you can apply to a row.
>>> 
>>> You need to allow qualifiers:
>>> 
>>>  select x.dir, x.*, y.dir, y.* from `/mytdir/mysdir/myfile.json` as
>>> x, `/mytdir/mysdir/myfile2.json` as y
>>> 
>>> and
>>> 
>>>  select dir from `/mytdir/mysdir/myfile.json` as x,
>>> `/mytdir/mysdir/myfile2.json` as y
>>> 
>>> would be illegal because dir is ambiguous.
>>> 
>>> You should make dir a reserved word (like ROWID).
>>> 
>>> On Thu, Apr 23, 2015 at 5:12 PM, Ted Dunning <[email protected]> wrote:
>>>> Great point.
>>>> 
>>>> Having the file name itself is very handy.
>>>> 
>>>> 
>>>> For one thing, I can make a really slow version of [find] !
>>>> 
>>>> (seriously, I would love this)
>>>> 
>>>> 
>>>> On Thu, Apr 23, 2015 at 7:48 PM, rahul challapalli <
>>>> [email protected]> wrote:
>>>> 
>>>>> I am also of the opinion that we should not assume knowledge on the
>>>>> user's part for data discovery. So we should either have 'dir' columns
>>>>> in 'select *' or support the variation that Ted suggested.
>>>>> Also, the folder names complement the actual data in some cases.
>>>>> 
>>>>> - Rahul
>>>>> 
>>>>> On Thu, Apr 23, 2015 at 4:38 PM, Daniel Barclay <[email protected]> wrote:
>>>>> 
>>>>>> Regarding the use case in which the user stores information in pathnames:
>>>>>> 
>>>>>> Since Drill supports that use case partially, shouldn't it do so more
>>>>>> completely?  In particular, since Drill provides access to subtree
>>>>>> pathname segments before the last one (the segments for directories),
>>>>>> should Drill provide access to the last one too (the simple file name)?
>>>>>> 
>>>>>> 
>>>>>> We support reading cases like this:
>>>>>> - root/
>>>>>> - root/2015/
>>>>>> - root/2015/01/
>>>>>> - root/2015/01/01/
>>>>>> - root/2015/01/01/log.json
>>>>>> - root/2015/02/
>>>>>> - root/2015/02/02/
>>>>>> - root/2015/02/02/log.json
>>>>>> 
>>>>>> In particular, querying "select ... from `root` ..." includes the
>>>>>> date-portion segments of the pathnames in the dir0, etc, columns.
>>>>>> 
>>>>>> Note that the user might not redundantly store the dates inside the
>>>>>> files themselves, since the dates are known to exist in the directory
>>>>>> names.
>>>>>> 
>>>>>> 
>>>>>> However, we don't support this variation of that case, right?
>>>>>> 
>>>>>> - root/
>>>>>> - root/2015/
>>>>>> - root/2015/01/
>>>>>> - root/2015/01/log_01.json
>>>>>> - root/2015/02/
>>>>>> - root/2015/02/log_02.json
>>>>>> 
>>>>>> In particular, Drill includes several segments of the pathname after
>>>>>> the root of the subtree, but does not include the last segment--which
>>>>>> contains data just as the segments that _are_ included do.
>>>>>> 
>>>>>> (Yes, the last segment usually contains artifacts besides the contained
>>>>>> data (e.g., the file extension), and the user would have to specify how
>>>>>> to interpret the file's simple name segment as data, but the user has to
>>>>>> specify the interpretation for the other segments anyway.)
>>>>>> 
>>>>>> 
>>>>>> Daniel
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Ted Dunning wrote:
>>>>>> 
>>>>>>> I would propose that dir be an array that contains all of the
>>>>>>> directories rather than having multiple columns.
>>>>>>> 
>>>>>>> The multiple names are particularly inconvenient if files are at
>>>>>>> different depths.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau <[email protected]> wrote:
>>>>>>> 
>>>>>>>> I'm specifically arguing that SELECT * shouldn't return the columns.
>>>>>>>> 
>>>>>>>> Here is current behavior:
>>>>>>>> 
>>>>>>>> /mytdir/mysdir/myfile.json
>>>>>>>> {a:1,b:2,c:3}
>>>>>>>> {a:4,b:5,c:6}
>>>>>>>> 
>>>>>>>> select * from `myfile.json`
>>>>>>>> 
>>>>>>>> a, b, c
>>>>>>>> 1, 2, 3
>>>>>>>> 4, 5, 6
>>>>>>>> 
>>>>>>>> select * from `/mysdir/myfile.json`
>>>>>>>> 
>>>>>>>> dir0, a, b, c
>>>>>>>> mysdir, 1, 2, 3
>>>>>>>> mysdir, 4, 5, 6
>>>>>>>> 
>>>>>>>> select * from `/mytdir/mysdir/myfile.json`
>>>>>>>> 
>>>>>>>> dir0, dir1, a, b, c
>>>>>>>> mytdir, mysdir, 1, 2, 3
>>>>>>>> mytdir, mysdir, 4, 5, 6
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ====================================
>>>>>>>> My proposal:
>>>>>>>> 
>>>>>>>> select * from `myfile.json`
>>>>>>>> select * from `/mysdir/myfile.json`
>>>>>>>> select * from `/mytdir/mysdir/myfile.json`
>>>>>>>> ::all produce::
>>>>>>>> a, b, c
>>>>>>>> 1, 2, 3
>>>>>>>> 4, 5, 6
>>>>>>>> 
>>>>>>>> select dir0, a, b, c from `/mysdir/myfile.json`
>>>>>>>> 
>>>>>>>> dir0, a, b, c
>>>>>>>> mysdir, 1, 2, 3
>>>>>>>> mysdir, 4, 5, 6
>>>>>>>> 
>>>>>>>> select dir0, a, b, c from `/mytdir/mysdir/myfile.json`
>>>>>>>> 
>>>>>>>> dir0, a, b, c
>>>>>>>> mytdir, 1, 2, 3
>>>>>>>> mytdir, 4, 5, 6
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> Seems reasonable, as long as SELECT * also returns the dir# columns.
>>>>>>>>> 
>>>>>>>>> On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>>> Hey guys,
>>>>>>>>>> 
>>>>>>>>>> I've been thinking that always showing dir# columns seems to alter
>>>>>>>>>> data returned from Drill depending on how you select the directory.
>>>>>>>>>> I'd propose that we make it so that we only return dir# columns when
>>>>>>>>>> they are explicitly requested.
>>>>>>>>>> 
>>>>>>>>>> Thoughts?
>>>>>> 
>>>>>> --
>>>>>> Daniel Barclay
>>>>>> MapR Technologies
> 
> 
> 
> -- 
> Steven Phillips
> Software Engineer
> 
> mapr.com
