paul-rogers commented on issue #1829: DRILL-7096: Develop vector for canonical Map<K,V> URL: https://github.com/apache/drill/pull/1829#issuecomment-515828830 @arina-ielchiieva, checked the dev archives: looks the previously suggested name was DICT. We could also allow DICTIONARY (similar to how we allow both VARCHAR and CHARACTER VARYING). The DICT name might be better since a MAP<K,V> is not like a generalized Python or JSON map (as we discussed.) We are actually defining a dictionary of keys and values in which the values must all be of the same type. I don't feel strongly about the name, but I did think it was a good idea when Ted suggested it. As we can see, a huge amount of work went into this feature. (@KazydubB clearly did a fantastic job understanding all the bits and pieces.) I was kind of hoping we could handle the DICT type as just a bit of metadata sugar on top of the existing REPEATED MAP vector. This PR does do that, of course. Can we make the change with less impact? The rough idea might be to create a new base class for both `RepeatedMap` and DICT. (Maybe we call it `AbstractRepeatedMap`). The structure would be identical for both, only the metadata bits would vary. A repeated map would allow adding any number of vectors, the DICT would take a `MaterializedField` that has the required key and value children in place, then create those two child vectors. With this, any code that currently references `RepeatedMapVector` could change to reference `AbstractRepeatedMap` instead. Immediately, all existing code that works with repeated maps (including EVF) would work with `DICT`, especially when reading the vector. That is, if code does not really care about the meaning of the children, then it should not care if the vector is a `RepeatedMap`, or the more restricted form, a `DICT`. Flatten, for example, should work just as well for either case. The one part we'd want to add is the step to create the `DICT` vector and its `MaterializedField`. Presumably that code is in this PR for some cases. For the RowSet and EVF stuff, we'd want to specify the `DICT<K,V>` info in metadata. We'd prohibit adding new children to a `DICT`. `RepeatedMap` would continue to allow adding new children. It may be that this is really the only difference. Suppose we used the old-style complex writer or the newer column writer. The client should not care if it is writing a MAP or DICT: it will just work with fields. Here's the RowSet version treating a `DICT` as a repeated `MAP`: ``` TupleMetadata schema = ... // Convenience short-hand to assume the key is a VARCHAR, we specify the value type. .addDict("contacts", MinorType.VARCHAR) ... // Create row set and writer here... ArrayWriter dictWriter = rowSetWriter.array("contacts"); TupleWriter kvWriter = dictWriter.tuple(); kvWriter.scalar("key").setString("fred"); kvWriter.scalar("value").setString("555-1212"); dictWriter.save() ``` Similarly, for read: ``` RowSetReader reader = rowSetWithDict.reader(); ArrayReader dictReader = reader.array("contacts"); TupleReader kvReader = dictReader.tuple(); ... // On each row while (dictReader.next()) { System.println(dictReader.scalar("key").getString() + ", " + dictReader.scalar("value").getString()); ``` The above cleanly handles the case where the key and value types can be anything. The above works for values that are unions, maps, etc. Also, the `RowSetBuilder` should just work to build a dictionary, and the verifier should also work to compare two row sets. Let's compare the above idea with the current PR. In this PR, we do create a new vector, but we add a bit too much specialized semantics. Doing so then forces us to create new complex readers and writers. For the RowSet framework, we'd have to add new column readers and writers. And, since we are creating all this new code, we need unit tests for it. That is, we are not exploiting the fact that a DICT really is a repeated MAP, just one with some restrictions. If, instead, we could add DICT as just a new way to create a kind of repeated MAP, and treat it as a repeated map for reads and writes, then we can leverage all the existing repeated map tests; we'd just add tests to ensure that the DICT creation code works. To summarize, can we implement a DICT as mostly new metadata, and new mechanisms for vector creation, but leave vector writing and reading pretty much the same as a generic repeated map? These are all just suggestions and brainstorming; perhaps @KazydubB already looked into this approach and found issues. Would be helpful to know what issue came up.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
