Hello Paul,

Thank you very much for such an informative response. I haven't yet investigated what we have to offer for the UNION type, or whether it is supported by the tools you mentioned. I'm glad I have the option to use the result set framework and I'll definitely consider it; I had seen it in the code before but wasn't sure it was finished and ready to use. For ARRAY I will most probably have to enter the world of pain that is ListVector, because nulls may appear everywhere. Integration with MAP may again become a problem when LIST is used for values. I think the JSON reader is a good starting point, and before making any changes I'll post a link to the design document in DRILL-3290 and ask the community to join the discussion.
Thanks,
Igor

On Wed, Feb 6, 2019 at 8:34 PM Paul Rogers <[email protected]> wrote:
>
> Hi Igor,
>
> Hive complex type integration will be a valuable addition to Drill. You mentioned running into issues with the List vector. I believe you'll encounter four separate issues.
>
> First, the List vector is "experimental": the core functionality exists, but there are holes. The List vector is semi-supported by the JSON reader, but not by downstream operators. Thus, even if you can create a list, your query may fail depending on the complexity of the query and the particular operators added to the plan.
>
> Second, the List vector is very complex. It starts as a list of nulls, then morphs to a list of a single type, then morphs again to a list of Union type, with multiple type vectors hanging off of it. The Union type itself is also "experimental" and not well supported in Drill; again, some operators don't support it.
>
> Third, working with the List vector is difficult because it changes structure and uses Union. Getting memory management correct is a challenge.
>
> Fourth, every reader in Drill should control the size of its output record batch to avoid a number of memory-related issues. For example, you'll want to ensure that your batches are limited to, say, 10-20 MB in size and that no one vector is larger than 16 MB. This is quite hard when the only tool you have is the batch record count. Fortunately, in your case you have Hive stats, such as average column width, from which you can create a good per-batch row-count estimate.
>
> Now that we've covered the challenges, let's discuss some solutions. First, do you actually need LIST (and UNION)? The Hive complex types are listed in [1] as follows:
>
> Complex Types
> * arrays: ARRAY<data_type> (Note: negative values and non-constant expressions are allowed as of Hive 0.14.)
> * maps: MAP<primitive_type, data_type> (Note: negative values and non-constant expressions are allowed as of Hive 0.14.)
> * structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>
> * union: UNIONTYPE<data_type, data_type, ...> (Note: Only available starting with Hive 0.7.0.)
>
> Drill's types were, it seems, designed to roughly follow Hive. As you noted, Drill's REPEATED mode mimics Hive's ARRAY type: you can have repeated scalars (INT, VARCHAR) as well as REPEATED MAP.
>
> Confusingly, Drill's MAP type is really a Hive STRUCT. Drill does not have a true MAP type, but you can simulate one using a Drill MAP if you know the full set of keys and value types ahead of time, or can discover them in the first record batch during read. (That is, you can use Drill's MAP type if you can treat the Hive MAP as if it were a STRUCT.)
>
> This leaves the UNION type. I wonder how often it is actually used in Hive. Is it supported by, say, Impala (no), Spark (?), LLAP (?), Hive-on-Spark (?), or Hive-on-MapReduce (?)? That is, is the UNION type from Hive actually needed? If not, you can save yourself a world of hurt by avoiding the Drill UNION and LIST types.
>
> Finally, there is a solution that might help with your project. Over the last year we've been adding the oddly-named "result set loader" functionality. It provides a set of simple "column writers" that manage vectors for you, along with memory management, LIST type evolution, and so on. The ResultSetLoader itself is in the code today and available for use.
> (Its simpler sibling, RowSetWriter, is used in many tests.)
>
> Moving forward, we are adding a complete scan framework that will translate from the schema you provide into the required internal schemas, handling projection for you (dropping unwanted columns, adding null columns, etc.).
>
> If you are modifying an existing reader, you may want to stick with the form already in the code. You can get all types to work that way except List, as List has the major issues listed above.
>
> You specifically asked about List nulls. Lists have three distinct "is set" indicators (null bits, AKA the "bits vector"): one for each List entry, one for each Union entry within the list, and one for each type vector within the union. These must all be in agreement. The ResultSetLoader handles all this (it took forever to figure out and fix the bugs).
>
> Finally, please consult the JSON reader. It does attempt to use the UNION and LIST types. I used it as my reference when doing the work described above. I believe it uses an older set of writers modeled on JSON writers. They don't handle some of the issues above, but might be good enough to get you going.
>
> Thanks,
> - Paul
>
> [1] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-ComplexTypes
>
>
> On Wednesday, February 6, 2019, 9:46:53 AM PST, Ihor Huzenko <[email protected]> wrote:
>
> Hello Drillers,
>
> I'm currently working on integration with Hive complex types (DRILL-3290 [1]) and trying to make a small POC before publishing the design document. I started with a Hive array of INT type and was able to use the repeated value vector successfully. But later I realized that our repeated vectors are not suitable for storing null either as the whole array or as an element within the array.
>
> So then I found ListVector, but all my attempts to write null inside an array row were unsuccessful. First, I tried to do so using the Mutator, but it doesn't contain any methods for writing elements of an array. After that I tried to use the writer obtained via listVector.getWriter(). Here is the code called for each table row:
>
>     IntObjectInspector eoi = (IntObjectInspector) oi.getListElementObjectInspector();
>     List<?> nullableIntList = oi.getList(hiveFieldValue);
>     UnionListWriter listWriter = outputVV.getWriter();
>     listWriter.startList();
>     IntStream.range(0, nullableIntList.size()).forEach(innerIndex -> {
>       Object val = nullableIntList.get(innerIndex);
>       if (val == null) {
>         listWriter.writeNull();
>       } else {
>         listWriter.writeInt(eoi.get(val));
>       }
>     });
>     listWriter.endList();
>
> I hit two problems here: 1) listWriter.writeNull() throws an exception immediately; 2) listWriter.writeInt(eoi.get(val)) writes all rows' arrays into the first row's array. I'd appreciate it if you could give me a suggestion on how to use the ListVector, or where to look for correct usages of it.
>
> [1] https://issues.apache.org/jira/browse/DRILL-3290
>
> Thank you in advance,
> Igor Guzenko
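
A note on the second problem in the quoted snippet: the loop never advances the writer to the current output row, so every row's values land at the writer's initial position (row 0). Below is a minimal sketch of per-row positioning, assuming UnionListWriter exposes setPosition(int) the way other Drill complex writers do; the hiveValues array and rowCount are hypothetical stand-ins for the per-row Hive values, and the writeNull() exception from the first problem is a separate issue that this sketch does not solve.

    // Hypothetical sketch: the writer is obtained once per batch and repositioned
    // before each output row is written. setPosition(int) is assumed to be available
    // on the list writer, as on other Drill complex writers; this is an illustration,
    // not a verified ListVector fix.
    UnionListWriter listWriter = outputVV.getWriter();
    for (int rowIndex = 0; rowIndex < rowCount; rowIndex++) {
      List<?> nullableIntList = oi.getList(hiveValues[rowIndex]);  // hypothetical per-row Hive value
      listWriter.setPosition(rowIndex);   // advance the writer to the current output row
      listWriter.startList();
      for (Object val : nullableIntList) {
        if (val != null) {
          listWriter.writeInt(eoi.get(val));
        }
        // Null elements remain the open question: writeNull() still throws here.
      }
      listWriter.endList();
    }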

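On the batch-sizing point in Paul's reply: once Hive statistics give an average row width, the per-batch row count is simple arithmetic. The sketch below is only illustrative; the size caps mirror the ~10-20 MB batch and 16 MB vector guidance above, and the stats inputs are assumed to come from the Hive metastore rather than from any specific Drill API.

    // Illustrative row-count estimate derived from Hive table statistics.
    class BatchSizeEstimator {
      // avgRowWidthBytes and maxColumnWidthBytes are assumed to come from Hive stats.
      static int estimateRowCount(long avgRowWidthBytes, long maxColumnWidthBytes) {
        final long targetBatchBytes = 16L << 20;   // aim for roughly 16 MB per batch
        final long maxVectorBytes = 16L << 20;     // keep any single vector at or below 16 MB
        final int maxRowCount = 64 * 1024;         // Drill's nominal per-batch record limit
        long byBatchSize = targetBatchBytes / Math.max(1, avgRowWidthBytes);
        long byVectorSize = maxVectorBytes / Math.max(1, maxColumnWidthBytes);
        return (int) Math.min(maxRowCount, Math.min(byBatchSize, byVectorSize));
      }
    }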