Hello Paul,

Thank you very much for such an informative response. I haven't yet investigated what we have to offer for the UNION type, or whether it is supported by the tools you mentioned. I'm glad I have the option to use the result set framework and I'll definitely consider it; I had seen it in the code before but wasn't sure it was finished and ready to use. For ARRAY I will most probably have to enter the world of pain that is ListVector, because nulls may appear everywhere. Integration with MAP may again become a problem when LIST is used for values. I think the JSON reader is a good starting point, and before making any changes I'll post a link to the design document in DRILL-3290 and ask the community to join the discussion.
Thanks,
Igor

On Wed, Feb 6, 2019 at 8:34 PM Paul Rogers <[email protected]> wrote:
>
> Hi Igor,
>
> Hive complex type integration will be a valuable addition to Drill. You mentioned running into issues with the List vector. I believe you'll encounter four separate issues.
>
> First, the List vector is "experimental": the core functionality exists, but there are holes. The List vector is semi-supported by the JSON reader, but not by downstream operators. Thus, even if you can create a list, your query may fail depending on the complexity of the query and the particular operators added to the plan.
>
> Second, the List vector is very complex. It starts as a list of nulls, then morphs to a list of a single type, then morphs again to a list of Union type, with multiple type vectors hanging off of it. The Union type itself is also "experimental" and not well supported in Drill; again, some operators don't support it.
>
> Third, working with the List vector is difficult because it changes structure and uses Union. Getting memory management correct is a challenge.
>
> Fourth, every reader in Drill should control the size of its output record batch to avoid a number of memory-related issues. For example, you'll want to ensure that your batches are limited to, say, 10-20 MB in size and that no one vector is larger than 16 MB. This is quite hard when the only tool you have is the batch record count. Fortunately, in your case you have Hive stats, such as average column width, from which you can create a good per-batch row-count estimate.
>
> Now that we've covered the challenges, let's discuss some solutions. First, do you actually need LIST (and UNION)? The Hive complex types are listed in [1] as follows:
>
> Complex Types
> * arrays: ARRAY<data_type> (Note: negative values and non-constant expressions are allowed as of Hive 0.14.)
> * maps: MAP<primitive_type, data_type> (Note: negative values and non-constant expressions are allowed as of Hive 0.14.)
> * structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>
> * union: UNIONTYPE<data_type, data_type, ...> (Note: Only available starting with Hive 0.7.0.)
>
> Drill's types were, it seems, designed to roughly follow Hive. As you noted, Drill's REPEATED mode mimics Hive's ARRAY type: you can have repeated scalars (INT, VARCHAR) as well as REPEATED MAP.
>
> Confusingly, Drill's MAP type is really a Hive STRUCT. Drill does not have a true MAP type, but you can simulate one using a Drill MAP if you know the full set of keys and value types ahead of time, or can discover them in the first record batch during read. (That is, you can use Drill's MAP type if you can treat the Hive MAP as if it were a STRUCT.)
>
> This leaves the UNION type. I wonder how often it is actually used in Hive. Is it supported by, say, Impala (no), Spark (?), LLAP (?), Hive-on-Spark (?), or Hive-on-MapReduce (?)? That is, is the UNION type from Hive actually needed? If not, you can save yourself a world of hurt by avoiding the Drill UNION and LIST types.
>
> Finally, there is a solution that might help with your project. Over the last year we've been adding the oddly-named "result set loader" functionality. It provides a set of simple "column writers" that manage vectors for you, along with memory management, LIST type evolution, and so on. The ResultSetLoader itself is in the code today and available for use.
> (Its simpler sibling, RowSetWriter, is used in many tests.)
>
> Moving forward, we are adding a complete scan framework that will translate from the schema you provide into the required internal schemas, handling projection for you (dropping unwanted columns, adding null columns, etc.).
>
> If you are modifying an existing reader, you may want to stick with the form already in the code. You can get all types to work that way except List, as List has the major issues listed above.
>
> You specifically asked about List nulls. Lists have three distinct "is set" indicators (null bits, AKA the "bits vector"): one for each List entry, one for each Union entry within the list, and one for each type vector within the union. These must all be in agreement. The ResultSetLoader handles all this (it took forever to figure out and fix the bugs).
>
> Finally, please consult the JSON reader. It does attempt to use the UNION and LIST types. I used it as my reference when doing the work described above. I believe it uses an older set of writers modeled on JSON writers. They don't handle some of the issues above, but might be good enough to get you going.
>
> Thanks,
> - Paul
>
> [1] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-ComplexTypes
>
>
> On Wednesday, February 6, 2019, 9:46:53 AM PST, Ihor Huzenko <[email protected]> wrote:
>
> Hello Drillers,
>
> I'm currently working on integration with Hive complex types (DRILL-3290 [1]) and trying to make a small POC before publishing the design document. I started with a Hive array of INT type and was able to use the repeated value vector successfully. But later I realized that our repeated vectors are not suitable for storing null either as the whole array or as an element within the array.
>
> So then I found ListVector, but all my attempts to write null inside an array row were unsuccessful. First, I tried to do so using the Mutator, but it doesn't contain any methods for writing elements of an array. After that I tried to use the writer obtained via listVector.getWriter(). Here is the code called for each table row:
>
>     IntObjectInspector eoi = (IntObjectInspector) oi.getListElementObjectInspector();
>     List<?> nullableIntList = oi.getList(hiveFieldValue);
>     UnionListWriter listWriter = outputVV.getWriter();
>     listWriter.startList();
>     IntStream.range(0, nullableIntList.size()).forEach(innerIndex -> {
>       Object val = nullableIntList.get(innerIndex);
>       if (val == null) {
>         listWriter.writeNull();
>       } else {
>         listWriter.writeInt(eoi.get(val));
>       }
>     });
>     listWriter.endList();
>
> I hit two problems here: 1) listWriter.writeNull() throws an exception immediately; 2) listWriter.writeInt(eoi.get(val)) writes all rows' arrays into the first row's array. I'd appreciate it if you could give me a suggestion on how to use the ListVector, or where to look for correct usages of it.
>
> [1] https://issues.apache.org/jira/browse/DRILL-3290
>
> Thank you in advance,
> Igor Guzenko
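
A note on the second problem in the quoted snippet: the loop never advances the writer to the current output row, so every row's values land at the writer's initial position (row 0). Below is a minimal sketch of per-row positioning, assuming UnionListWriter exposes setPosition(int) the way other Drill complex writers do; the hiveValues array and rowCount are hypothetical stand-ins for the per-row Hive values, and the writeNull() exception from the first problem is a separate issue that this sketch does not solve.

    // Hypothetical sketch: the writer is obtained once per batch and repositioned
    // before each output row is written. setPosition(int) is assumed to be available
    // on the list writer, as on other Drill complex writers; this is an illustration,
    // not a verified ListVector fix.
    UnionListWriter listWriter = outputVV.getWriter();
    for (int rowIndex = 0; rowIndex < rowCount; rowIndex++) {
      List<?> nullableIntList = oi.getList(hiveValues[rowIndex]);  // hypothetical per-row Hive value
      listWriter.setPosition(rowIndex);   // advance the writer to the current output row
      listWriter.startList();
      for (Object val : nullableIntList) {
        if (val != null) {
          listWriter.writeInt(eoi.get(val));
        }
        // Null elements remain the open question: writeNull() still throws here.
      }
      listWriter.endList();
    }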

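On the batch-sizing point in Paul's reply: once Hive statistics give an average row width, the per-batch row count is simple arithmetic. The sketch below is only illustrative; the size caps mirror the ~10-20 MB batch and 16 MB vector guidance above, and the stats inputs are assumed to come from the Hive metastore rather than from any specific Drill API.

    // Illustrative row-count estimate derived from Hive table statistics.
    class BatchSizeEstimator {
      // avgRowWidthBytes and maxColumnWidthBytes are assumed to come from Hive stats.
      static int estimateRowCount(long avgRowWidthBytes, long maxColumnWidthBytes) {
        final long targetBatchBytes = 16L << 20;   // aim for roughly 16 MB per batch
        final long maxVectorBytes = 16L << 20;     // keep any single vector at or below 16 MB
        final int maxRowCount = 64 * 1024;         // Drill's nominal per-batch record limit
        long byBatchSize = targetBatchBytes / Math.max(1, avgRowWidthBytes);
        long byVectorSize = maxVectorBytes / Math.max(1, maxColumnWidthBytes);
        return (int) Math.min(maxRowCount, Math.min(byBatchSize, byVectorSize));
      }
    }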