[
https://issues.apache.org/jira/browse/DRILL-7096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16894892#comment-16894892
]
ASF GitHub Bot commented on DRILL-7096:
---------------------------------------
paul-rogers commented on issue #1829: DRILL-7096: Develop vector for canonical
Map<K,V>
URL: https://github.com/apache/drill/pull/1829#issuecomment-515828830
@arina-ielchiieva, checked the dev archives: looks the previously suggested
name was DICT. We could also allow DICTIONARY (similar to how we allow both
VARCHAR and CHARACTER VARYING). The DICT name might be better since a MAP<K,V>
is not like a generalized Python or JSON map (as we discussed.) We are actually
defining a dictionary of keys and values in which the values must all be of the
same type. I don't feel strongly about the name, but I did think it was a good
idea when Ted suggested it.
As we can see, a huge amount of work went into this feature. (@KazydubB
clearly did a fantastic job understanding all the bits and pieces.) I was kind
of hoping we could handle the DICT type as just a bit of metadata sugar on top
of the existing REPEATED MAP vector. This PR does do that, of course. Can we
make the change with less impact?
The rough idea might be to create a new base class for both `RepeatedMap`
and DICT. (Maybe we call it `AbstractRepeatedMap`). The structure would be
identical for both, only the metadata bits would vary. A repeated map would
allow adding any number of vectors, the DICT would take a `MaterializedField`
that has the required key and value children in place, then create those two
child vectors.
With this, any code that currently references `RepeatedMapVector` could
change to reference `AbstractRepeatedMap` instead. Immediately, all existing
code that works with repeated maps (including EVF) would work with `DICT`,
especially when reading the vector. That is, if code does not really care about
the meaning of the children, then it should not care if the vector is a
`RepeatedMap`, or the more restricted form, a `DICT`. Flatten, for example,
should work just as well for either case.
The one part we'd want to add is the step to create the `DICT` vector and
its `MaterializedField`. Presumably that code is in this PR for some cases.
For the RowSet and EVF stuff, we'd want to specify the `DICT<K,V>` info in
metadata. We'd prohibit adding new children to a `DICT`. `RepeatedMap` would
continue to allow adding new children. It may be that this is really the only
difference.
Suppose we used the old-style complex writer or the newer column writer. The
client should not care if it is writing a MAP or DICT: it will just work with
fields. Here's the RowSet version treating a `DICT` as a repeated `MAP`:
```
TupleMetadata schema = ...
// Convenience short-hand to assume the key is a VARCHAR, we specify the
value type.
.addDict("contacts", MinorType.VARCHAR) ...
// Create row set and writer here...
ArrayWriter dictWriter = rowSetWriter.array("contacts");
TupleWriter kvWriter = dictWriter.tuple();
kvWriter.scalar("key").setString("fred");
kvWriter.scalar("value").setString("555-1212");
dictWriter.save()
```
Similarly, for read:
```
RowSetReader reader = rowSetWithDict.reader();
ArrayReader dictReader = reader.array("contacts");
TupleReader kvReader = dictReader.tuple();
...
// On each row
while (dictReader.next()) {
System.println(dictReader.scalar("key").getString() + ", " +
dictReader.scalar("value").getString());
```
The above cleanly handles the case where the key and value types can be
anything. The above works for values that are unions, maps, etc.
Also, the `RowSetBuilder` should just work to build a dictionary, and the
verifier should also work to compare two row sets.
Let's compare the above idea with the current PR. In this PR, we do create a
new vector, but we add a bit too much specialized semantics. Doing so then
forces us to create new complex readers and writers. For the RowSet framework,
we'd have to add new column readers and writers. And, since we are creating all
this new code, we need unit tests for it.
That is, we are not exploiting the fact that a DICT really is a repeated
MAP, just one with some restrictions. If, instead, we could add DICT as just a
new way to create a kind of repeated MAP, and treat it as a repeated map for
reads and writes, then we can leverage all the existing repeated map tests;
we'd just add tests to ensure that the DICT creation code works.
To summarize, can we implement a DICT as mostly new metadata, and new
mechanisms for vector creation, but leave vector writing and reading pretty
much the same as a generic repeated map?
These are all just suggestions and brainstorming; perhaps @KazydubB already
looked into this approach and found issues. Would be helpful to know what issue
came up.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Develop vector for canonical Map<K,V>
> -------------------------------------
>
> Key: DRILL-7096
> URL: https://issues.apache.org/jira/browse/DRILL-7096
> Project: Apache Drill
> Issue Type: Sub-task
> Reporter: Igor Guzenko
> Assignee: Bohdan Kazydub
> Priority: Major
> Fix For: 1.17.0
>
>
> Canonical Map<K,V> datatype can be represented using combination of three
> value vectors:
> keysVector - vector for storing keys of each map
> valuesVector - vector for storing values of each map
> offsetsVector - vector for storing of start indexes of next each map
> So it's not very hard to create such Map vector, but there is a major issue
> with such map representation. It's hard to search maps values by key in such
> vector, need to investigate some advanced techniques to make such search
> efficient. Or find other more suitable options to represent map datatype in
> world of vectors.
> After question about maps, Apache Arrow developers responded that for Java
> they don't have real Map vector, for now they just have logical Map type
> definition where they define Map like: List< Struct<key:key_type,
> value:value_type> >. So implementation of value vector would be useful for
> Arrow too.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)