Re: Problem of using ListVector for representing Hive arrays

Paul Rogers Wed, 06 Feb 2019 10:35:23 -0800

Hi Igor,

Hive complex type integration will be a valuable addition to Drill. You 
mentioned running into issues with the List vector. I believe you will 
encounter four separate issues.


First, the List vector is "experimental": the core functionality exists, but 
there are holes. List vector is semi-supported by the JSON reader, but not by 
downstream operators. Thus, even if you can create a list, your query may fail 
depending on the complexity of the query and the particular operators added to 
the plan.

Second, the List vector is very complex. It starts as a list of nulls, then 
morphs to a list of a single type, then morphs again to a list of Union type, 
with multiple type vectors hanging off of it. The Union type itself is also 
"experimental" and not well supported in Drill; again some operators don't 
support it.

Third, working with the List vector is difficult because it changes structure 
and uses Union. Getting memory management correct is a challenge.

Fourth, every reader in Drill should control the size of its output record 
batch to avoid a number of memory-related issues. For example, you'll want to 
ensure that your batches are limited to, say, 10-20 MB in size and that no one 
vector is larger than 16 MB. This is quite hard when the only tool you have is 
batch record count. Fortunately, in your case, you have Hive stats such as 
average column width so that you can create a good per-batch-row-count estimate.
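As a rough illustration of that estimate, here is a minimal sketch in plain 
Java. It is not Drill code: the class and method names, the 65,536-row cap, and 
the idea of passing one average width per column are all assumptions made for 
the example, and it ignores per-vector overhead such as offset and null-bit 
vectors.

```java
// Hypothetical helper, not part of Drill: estimates how many rows fit in one
// batch given average column widths in bytes (for example, from Hive stats).
final class BatchRowEstimator {
    // No single vector should exceed 16 MB (figure taken from the text above).
    static final long MAX_VECTOR_BYTES = 16L * 1024 * 1024;
    // Assumed upper bound on rows per batch; adjust to the real limit in use.
    static final int MAX_ROWS = 65_536;

    static int estimateRows(long[] avgColumnWidths, long maxBatchBytes) {
        long rowWidth = 0;
        long widestColumn = 0;
        for (long w : avgColumnWidths) {
            long width = Math.max(1, w);       // treat unknown/zero widths as 1 byte
            rowWidth += width;
            widestColumn = Math.max(widestColumn, width);
        }
        if (rowWidth == 0) {
            return MAX_ROWS;                   // no columns: only the row cap applies
        }
        long byBatch = maxBatchBytes / rowWidth;          // whole-batch budget
        long byVector = MAX_VECTOR_BYTES / widestColumn;  // widest single vector
        long rows = Math.min(Math.min(byBatch, byVector), MAX_ROWS);
        return (int) Math.max(1, rows);        // always emit at least one row
    }
}
```

For example, with a 20 MB batch budget and two columns averaging 1000 and 100 
bytes, the 16 MB per-vector limit on the wide column is the binding constraint, 
not the batch budget.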

Now that we've explained the challenges you face, let's discuss some solutions. 
First, do you actually need LIST (and UNION)? The Hive complex types are listed 
in [1] as follows:

Complex Types
* arrays: ARRAY<data_type> (Note: negative values and non-constant expressions 
are allowed as of Hive 0.14.)
* maps: MAP<primitive_type, data_type> (Note: negative values and non-constant 
expressions are allowed as of Hive 0.14.)
* structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>
* union: UNIONTYPE<data_type, data_type, ...> (Note: Only available starting 
with Hive 0.7.0.)

Drill's types were, it seems, designed to somewhat follow Hive. As you noted, 
Drill's REPEATED mode mimics Hive's ARRAY type: you can have repeated scalars 
(INT, VARCHAR) as well as REPEATED MAP.

Confusingly, Drill's MAP type is really a Hive STRUCT. Drill does not have a 
true MAP type, but you can simulate one using a Drill MAP if you know the full 
set of keys and value types ahead of time, or can discover them in the first 
record batch during read. (That is, you can use Drill's MAP type if you can 
treat the Hive MAP as if it were a STRUCT.)
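To make the "treat the MAP as a STRUCT" idea concrete, here is a small sketch 
in plain Java. It uses no Drill or Hive classes; MapAsStruct, discoverKeys and 
toStructRow are hypothetical names, and the values are fixed to Integer only to 
keep the example short. The assumption is that the key set seen in the first 
batch is the complete set.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration, not Drill code: freeze the key set found in the
// first batch, then lay out every map as a fixed list of struct members.
final class MapAsStruct {
    // Collect the distinct keys of the first batch in first-seen order.
    static List<String> discoverKeys(List<Map<String, Integer>> firstBatch) {
        Set<String> keys = new LinkedHashSet<>();
        for (Map<String, Integer> row : firstBatch) {
            if (row != null) {
                keys.addAll(row.keySet());
            }
        }
        return new ArrayList<>(keys);
    }

    // Convert one map into struct form: one slot per known key, null if absent.
    static Integer[] toStructRow(List<String> keys, Map<String, Integer> row) {
        Integer[] members = new Integer[keys.size()];
        if (row == null) {
            return members;                    // null map: every member is null
        }
        for (String key : row.keySet()) {
            if (!keys.contains(key)) {
                // A key outside the frozen set breaks the STRUCT assumption.
                throw new IllegalStateException("Unexpected map key: " + key);
            }
        }
        for (int i = 0; i < keys.size(); i++) {
            members[i] = row.get(keys.get(i)); // null when the key is missing
        }
        return members;
    }
}
```

The exception on an unknown key marks where the approach stops working: a key 
that first appears in a later batch would require a schema change.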

This leaves the UNION type. I wonder, how often is it actually used in Hive? Is 
it supported by, say, Impala (no), Spark (?), LLAP (?), Hive-on-Spark (?), or 
Hive-on-MapReduce (?)? That is, is the UNION type from Hive actually needed? If 
not, you can save yourself a world of hurt by avoiding the Drill UNION and LIST 
types.

Finally, there is a solution that might help with your project. Over the last 
year we've been adding the oddly-named "result set loader" functionality. It 
provides a set of simple "column writers" that manage vectors for you, along 
with memory management, LIST type evolution, and so on. The ResultSetLoader 
itself is in the code today and available for use. (Its simpler sibling, 
RowSetWriter, is used in many tests.)

Moving forward, we are adding a complete scan framework that will translate 
from the schema you provide into the required internal schemas, handling 
projection for you (dropping unwanted columns, adding null columns, etc.).

If you are modifying an existing reader, you may want to stick with the form 
already in the code. You can get all types to work that way except List, which 
has the major issues listed above.

You specifically asked about List nulls. Lists have three distinct "is set" 
indicators (null bits, AKA "bits vector"): one for each List entry, one for each 
Union entry within the list, and one for each type vector within the union. 
These must all be in agreement. The ResultSetLoader handles all this (it took 
forever to figure out and to fix the bugs).
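To show what "in agreement" could mean, here is a deliberately simplified model 
in plain Java. It is not how Drill's vectors are implemented: ListNullModel and 
the flat boolean arrays are hypothetical stand-ins, only a single INT member is 
modeled, and the two consistency rules are my assumptions for the sketch (a 
null list owns no entries, and a union entry is set exactly when its slot in 
the type vector is set).

```java
// Simplified, hypothetical model of the three "is set" levels; not Drill code.
final class ListNullModel {
    // listIsSet[row]   : is the list in this row non-null?
    // offsets          : entries of row i are offsets[i] .. offsets[i + 1] - 1
    // unionIsSet[entry]: is this union entry non-null?
    // intIsSet[entry]  : is the INT member's slot for this entry non-null?
    static boolean isConsistent(boolean[] listIsSet, int[] offsets,
                                boolean[] unionIsSet, boolean[] intIsSet) {
        if (offsets.length != listIsSet.length + 1) {
            return false;                      // need one offset per row, plus one
        }
        if (unionIsSet.length != intIsSet.length) {
            return false;                      // both levels cover the same entries
        }
        if (offsets[0] != 0 || offsets[offsets.length - 1] != unionIsSet.length) {
            return false;                      // offsets must span all entries
        }
        for (int row = 0; row < listIsSet.length; row++) {
            int start = offsets[row];
            int end = offsets[row + 1];
            if (end < start) {
                return false;                  // offsets must not decrease
            }
            if (!listIsSet[row] && end != start) {
                return false;                  // assumed rule: null list owns no entries
            }
        }
        for (int entry = 0; entry < unionIsSet.length; entry++) {
            if (unionIsSet[entry] != intIsSet[entry]) {
                return false;                  // assumed rule: union and member agree
            }
        }
        return true;
    }
}
```

In this model the three rows [1, null, 3], null, and [] would be encoded as 
listIsSet = {true, false, true}, offsets = {0, 3, 3, 3}, and {true, false, true} 
for both entry-level arrays. Flipping any single flag makes the check fail, 
which is the kind of mismatch that has to be avoided when writing the vectors 
by hand.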

Finally, please consult the JSON reader. It does attempt to use the UNION and 
LIST types. I used it as my reference when doing the work described above. I 
believe it uses an older set of writers modeled on JSON writers. They don't 
handle some of the issues above, but might be good enough to get you going.

Thanks,
- Paul

[1] 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-ComplexTypes


 

    On Wednesday, February 6, 2019, 9:46:53 AM PST, Ihor Huzenko 
<[email protected]> wrote:  
 
 Hello Drillers,

I'm currently working on integration with Hive complex types
(DRILL-3290 [1]) and trying to make a small POC before publishing a
design document. I started with a Hive array of INT type and was able
to use a repeated value vector successfully. But later I realized that
our repeated vectors are not suitable for storing null as an array or
as an element in an array.

So then I found ListVector, but all my attempts to write a null inside
an array row were unsuccessful. First, I tried to do so using Mutator,
but it doesn't contain any methods for writing array elements.
After that I tried to use the writer obtained
via listVector.getWriter(). Here is the code called for each table row:

      IntObjectInspector eoi = (IntObjectInspector) oi.getListElementObjectInspector();
      List<?> nullableIntList = oi.getList(hiveFieldValue);
      UnionListWriter listWriter = outputVV.getWriter();
      listWriter.startList();
      IntStream.range(0, nullableIntList.size()).forEach(innerIndex -> {
        Object val = nullableIntList.get(innerIndex);
        if (val == null) {
          listWriter.writeNull();
        } else {
          listWriter.writeInt(eoi.get(val));
        }
      });
      listWriter.endList();

I ran into two problems here: 1) listWriter.writeNull() throws an exception
immediately; 2) listWriter.writeInt(eoi.get(val)) writes the arrays of all
rows into the array of the first row.
I'd appreciate it if you could suggest how to use ListVector, or where to
look for correct usages of it.

[1] https://issues.apache.org/jira/browse/DRILL-3290

Thank you in advance,
Igor Guzenko
