[GitHub] [drill] paul-rogers commented on issue #1829: DRILL-7096: Develop vector for canonical Map

GitBox Sun, 28 Jul 2019 20:10:35 -0700

paul-rogers commented on issue #1829: DRILL-7096: Develop vector for canonical 
Map<K,V>
URL: https://github.com/apache/drill/pull/1829#issuecomment-515828830
 
 
   @arina-ielchiieva, checked the dev archives: looks the previously suggested 
name was DICT. We could also allow DICTIONARY (similar to how we allow both 
VARCHAR and CHARACTER VARYING). The DICT name might be better since a MAP<K,V> 
is not like a generalized Python or JSON map (as we discussed.) We are actually 
defining a dictionary of keys and values in which the values must all be of the 
same type. I don't feel strongly about the name, but I did think it was a good 
idea when Ted suggested it.
   
   As we can see, a huge amount of work went into this feature. (@KazydubB 
clearly did a fantastic job understanding all the bits and pieces.) I was kind 
of hoping we could handle the DICT type as just a bit of metadata sugar on top 
of the existing REPEATED MAP vector. This PR does do that, of course. Can we 
make the change with less impact?
   
   The rough idea might be to create a new base class for both `RepeatedMap` 
and DICT. (Maybe we call it `AbstractRepeatedMap`). The structure would be 
identical for both, only the metadata bits would vary. A repeated map would 
allow adding any number of vectors, the DICT would take a `MaterializedField` 
that has the required key and value children in place, then create those two 
child vectors.
   
   With this, any code that currently references `RepeatedMapVector` could 
change to reference `AbstractRepeatedMap` instead. Immediately, all existing 
code that works with repeated maps (including EVF) would work with `DICT`, 
especially when reading the vector. That is, if code does not really care about 
the meaning of the children, then it should not care if the vector is a 
`RepeatedMap`, or the more restricted form, a `DICT`. Flatten, for example, 
should work just as well for either case.
   
    The one part we'd want to add is the step to create the `DICT` vector and 
its `MaterializedField`. Presumably that code is in this PR for some cases.  
For the RowSet and EVF stuff, we'd want to specify the `DICT<K,V>` info in 
metadata. We'd prohibit adding new children to a `DICT`. `RepeatedMap` would 
continue to allow adding new children. It may be that this is really the only 
difference.
   
   Suppose we used the old-style complex writer or the newer column writer. The 
client should not care if it is writing a MAP or DICT: it will just work with 
fields. Here's the RowSet version treating a `DICT` as a repeated `MAP`:
   
   ```
   TupleMetadata schema = ...
      // Convenience short-hand to assume the key is a VARCHAR, we specify the 
value type.
      .addDict("contacts", MinorType.VARCHAR) ...
   // Create row set and writer here...
   ArrayWriter dictWriter = rowSetWriter.array("contacts");
   TupleWriter kvWriter = dictWriter.tuple();
   kvWriter.scalar("key").setString("fred");
   kvWriter.scalar("value").setString("555-1212");
   dictWriter.save()
   ```
   
   Similarly, for read:
   
   ```
   RowSetReader reader = rowSetWithDict.reader();
   ArrayReader dictReader = reader.array("contacts");
   TupleReader kvReader = dictReader.tuple();
   ...
   // On each row
   while (dictReader.next()) {
     System.println(dictReader.scalar("key").getString() + ", " +
      dictReader.scalar("value").getString());
   ```
   
   The above cleanly handles the case where the key and value types can be 
anything. The above works for values that are unions, maps, etc.
   
   Also, the `RowSetBuilder` should just work to build a dictionary, and the 
verifier should also work to compare two row sets.
   
   Let's compare the above idea with the current PR. In this PR, we do create a 
new vector, but we add a bit too much specialized semantics. Doing so then 
forces us to create new complex readers and writers. For the RowSet framework, 
we'd have to add new column readers and writers. And, since we are creating all 
this new code, we need unit tests for it.
   
   That is, we are not exploiting the fact that a DICT really is a repeated 
MAP, just one with some restrictions. If, instead, we could add DICT as just a 
new way to create a kind of repeated MAP, and treat it as a repeated map for 
reads and writes, then we can leverage all the existing repeated map tests; 
we'd just add tests to ensure that the DICT creation code works.
   
   To summarize, can we implement a DICT as mostly new metadata, and new 
mechanisms for vector creation, but leave vector writing and reading pretty 
much the same as a generic repeated map?
   
   These are all just suggestions and brainstorming; perhaps @KazydubB already 
looked into this approach and found issues. Would be helpful to know what issue 
came up.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

[GitHub] [drill] paul-rogers commented on issue #1829: DRILL-7096: Develop vector for canonical Map

Reply via email to