Hi Igor,

Hive complex type integration will be a valuable addition to Drill. You mentioned running into issues with the List vector. I believe you'll encounter four separate issues.
First, the List vector is "experimental": the core functionality exists, but there are holes. The List vector is semi-supported by the JSON reader, but not by downstream operators. Thus, even if you can create a list, your query may fail depending on its complexity and the particular operators added to the plan.

Second, the List vector is very complex. It starts as a list of nulls, then morphs into a list of a single type, then morphs again into a list of Union type, with multiple type vectors hanging off of it. The Union type itself is also "experimental" and not well supported in Drill; again, some operators don't support it.

Third, working with the List vector is difficult precisely because it changes structure and uses Union. Getting memory management correct is a challenge.

Fourth, every reader in Drill should control the size of its output record batch to avoid a number of memory-related issues. For example, you'll want to ensure that your batches are limited to, say, 10-20 MB in size and that no single vector is larger than 16 MB. This is quite hard when the only tool you have is the batch record count. Fortunately, in your case you have Hive stats, such as average column width, from which you can compute a good per-batch row-count estimate.

Now that we've covered the challenges you face, let's discuss some solutions. First, do you actually need LIST (and UNION)? The Hive complex types are listed in [1] as follows:

Complex Types
* arrays: ARRAY<data_type> (Note: negative values and non-constant expressions are allowed as of Hive 0.14.)
* maps: MAP<primitive_type, data_type> (Note: negative values and non-constant expressions are allowed as of Hive 0.14.)
* structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>
* union: UNIONTYPE<data_type, data_type, ...> (Note: Only available starting with Hive 0.7.0.)

Drill's types were, it seems, designed to roughly follow Hive's.
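To make the batch-sizing point concrete, here is a rough sketch of the arithmetic in plain Java. This is not Drill code: the class and method names are invented, the column widths are made-up examples, and the 64K row cap reflects Drill's per-batch record limit.

```java
// Toy sketch (not Drill code): derive a per-batch row count from average
// column widths (e.g. from Hive table statistics), keeping the whole batch
// and every individual vector under a byte budget.
public class BatchSizeEstimator {

    static final long MAX_BATCH_BYTES  = 16 * 1024 * 1024; // whole-batch budget (~10-20 MB target)
    static final long MAX_VECTOR_BYTES = 16 * 1024 * 1024; // per-vector limit

    /**
     * @param avgColWidths average bytes per value for each column
     * @return a row count that keeps the batch and every vector under the caps
     */
    static int estimateRowsPerBatch(long[] avgColWidths) {
        long rowWidth = 0;
        long widest = 1;
        for (long w : avgColWidths) {
            rowWidth += w;
            widest = Math.max(widest, w);
        }
        // Rows allowed by the whole-batch budget, and by the widest vector.
        long byBatch  = MAX_BATCH_BYTES / Math.max(rowWidth, 1);
        long byVector = MAX_VECTOR_BYTES / widest;
        long rows = Math.min(byBatch, byVector);
        // Drill batches are also capped at 64K records.
        return (int) Math.min(rows, 65536);
    }

    public static void main(String[] args) {
        // Hypothetical table: an INT (4 bytes) and a VARCHAR averaging 40 bytes.
        System.out.println(estimateRowsPerBatch(new long[] { 4, 40 }));   // prints 65536
        // A very wide column (4000 bytes avg) drives the row count down.
        System.out.println(estimateRowsPerBatch(new long[] { 4, 4000 })); // prints 4190
    }
}
```

The key idea is that both limits apply at once: narrow tables hit the 64K record cap first, while a single wide column can force a much smaller batch.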
As you noted, Drill's REPEATED mode mimics Hive's ARRAY type: you can have repeated scalars (INT, VARCHAR) as well as REPEATED MAP. Confusingly, Drill's MAP type is really a Hive STRUCT. Drill does not have a true MAP type, but you can simulate one using a Drill MAP if you know the full set of keys and value types ahead of time, or can discover them in the first record batch during read. (That is, you can use Drill's MAP type if you can treat the Hive MAP as if it were a STRUCT.)

This leaves the UNION type. I wonder, how often is it actually used in Hive? Is it supported by, say, Impala (no), Spark (?), LLAP (?), Hive-on-Spark (?), or Hive-on-MapReduce (?)? That is, is Hive's UNION type actually needed? If not, you can save yourself a world of hurt by avoiding the Drill UNION and LIST types.

Finally, there is a solution that might help with your project. Over the last year we've been adding the oddly-named "result set loader" functionality. It provides a set of simple "column writers" that manage vectors for you, along with memory management, LIST type evolution, and so on. The ResultSetLoader itself is in the code today and available for use. (Its simpler sibling, RowSetWriter, is used in many tests.) Moving forward, we are adding a complete scan framework that will translate the schema you provide into the required internal schemas, handling projection for you (dropping unwanted columns, adding null columns, etc.).

If you are modifying an existing reader, you may want to stick with the form already in the code. You can get all types to work that way except List, as List has the major issues listed above.

You specifically asked about List nulls. Lists have three distinct "is set" indicators (null bits, AKA the "bits vector"): one for each List entry, one for each Union entry within the list, and one for each type vector within the union. These must all be in agreement.
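The three-level agreement can be pictured with a toy model in plain Java. This is only a conceptual sketch, not Drill's vector code: all class and field names are invented, and Drill's real bits vectors are off-heap buffers, not Java lists. It models a list of INT (one type vector) just to show the invariant that a null element must be marked null at both the union layer and the type-vector layer.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (not Drill code) of the three "is set" layers described above:
// one null bit per list entry, one per union entry within the list, and
// one per slot in each type vector hanging off the union.
public class ListNullBitsModel {

    // Layer 1: is the list in this row non-null?
    final List<Boolean> listIsSet = new ArrayList<>();
    // Layer 2: is each flattened union entry non-null?
    final List<Boolean> unionIsSet = new ArrayList<>();
    // Layer 3: is each slot of the (single) INT type vector non-null?
    final List<Boolean> intIsSet = new ArrayList<>();
    final List<Integer> intValues = new ArrayList<>();

    void writeRow(Integer[] row) {          // a null row means a null list
        if (row == null) {
            listIsSet.add(false);
            return;
        }
        listIsSet.add(true);
        for (Integer v : row) {
            boolean set = (v != null);
            // The layers must agree: a null element is marked null in the
            // union layer AND in its type vector.
            unionIsSet.add(set);
            intIsSet.add(set);
            intValues.add(set ? v : 0);     // placeholder value for a null slot
        }
    }

    boolean layersAgree() {
        if (unionIsSet.size() != intIsSet.size()) return false;
        for (int i = 0; i < unionIsSet.size(); i++) {
            if (!unionIsSet.get(i).equals(intIsSet.get(i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        ListNullBitsModel m = new ListNullBitsModel();
        m.writeRow(new Integer[] { 1, null, 3 }); // list containing a null element
        m.writeRow(null);                         // null list
        m.writeRow(new Integer[] {});             // empty, but non-null, list
        System.out.println(m.layersAgree());      // prints true
    }
}
```

Bugs of the kind you hit typically come from updating one of these layers without the others; the ResultSetLoader's column writers keep them in sync for you.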
The ResultSetLoader handles all this (it took forever to figure out and fix the bugs).

Finally, please consult the JSON reader. It does attempt to use the UNION and LIST types, and I used it as my reference when doing the work described above. I believe it uses an older set of writers modeled on JSON writers. They don't handle some of the issues above, but might be good enough to get you going.

Thanks,
- Paul

[1] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-ComplexTypes

On Wednesday, February 6, 2019, 9:46:53 AM PST, Ihor Huzenko <[email protected]> wrote:

Hello Drillers,

I'm currently working on integration with Hive complex types (DRILL-3290 [1]) and trying to make a small POC before publishing a design document. I started with a Hive array of INT type, and was able to use a repeated value vector successfully. But later I realized that our repeated vectors are not suitable for storing null as an array, nor as an element within an array. So then I found ListVector, but all my attempts to write null inside an array row were unsuccessful. First, I tried to do so using the Mutator, but it doesn't contain any methods for writing elements of an array. After that I tried to use the writer obtained via listVector.getWriter(). Here is the code called for each table row:

    IntObjectInspector eoi = (IntObjectInspector) oi.getListElementObjectInspector();
    List<?> nullableIntList = oi.getList(hiveFieldValue);
    UnionListWriter listWriter = outputVV.getWriter();
    listWriter.startList();
    IntStream.range(0, nullableIntList.size()).forEach(innerIndex -> {
      Object val = nullableIntList.get(innerIndex);
      if (val == null) {
        listWriter.writeNull();
      } else {
        listWriter.writeInt(eoi.get(val));
      }
    });
    listWriter.endList();

And I hit two problems here:
1) listWriter.writeNull() throws an exception immediately.
2) listWriter.writeInt(eoi.get(val)) writes all rows' arrays into the first row's array.
I'd appreciate it if you could give me a suggestion on how to use the ListVector, or where to look for correct usages of it.

[1] https://issues.apache.org/jira/browse/DRILL-3290

Thank you in advance,
Igor Guzenko
