[ https://issues.apache.org/jira/browse/DRILL-5955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534613#comment-17534613 ]
ASF GitHub Bot commented on DRILL-5955:
---------------------------------------

cgivre commented on PR #2543:
URL: https://github.com/apache/drill/pull/2543#issuecomment-1123010372

@vdiravka This is looking good. Quick question: can we add a configuration variable to the JSON format to enable/disable this at the plugin instance level?

Revisit Union Vectors
---------------------

Key: DRILL-5955
URL: https://issues.apache.org/jira/browse/DRILL-5955
Project: Apache Drill
Issue Type: Improvement
Affects Versions: 1.11.0
Reporter: Paul Rogers
Priority: Major

Drill supports a “Union Vector” type that allows a single column to hold values of multiple types. Conceptually, each column value is a (type, value) pair. For example, row 0 might be an Int, row 1 a Varchar, and row 2 a NULL value.

The name refers to a C “union”, in which the same bit of memory is used to represent one of a set of defined types.

Drill implements the union vector a bit like a map: as a collection of typed vectors, with each value keyed by type. The result is that a union vector is more like a C “struct” than a C “union”: every vector takes space, but only one of the vectors is used for each row. For the example above, the union vector contains an Int vector, a Varchar vector, and a type vector. For each row, either the Int or the Varchar is used. For NULL values, neither vector is used.

h4. Memory Footprint Concerns

The current representation, despite its name, makes very inefficient use of memory because it requires the sum of the storage for each included type. (That is, if we store 1000 rows, we need 1000 slots for integers, another 1000 for Varchar, and yet another 1000 for the type vector.)

Drill poorly supports the union type. One operator that does support it is the sort. If the union type is enabled and the sort sees a schema change, the sort will create a new union vector that combines the two types.
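As a back-of-envelope illustration of the footprint arithmetic, here is a hypothetical Java sketch; the slot sizes and the 20-byte average Varchar length are assumptions for illustration, not Drill's actual allocation logic:

```java
// Hypothetical footprint arithmetic for a union of Int and Varchar.
// Sizes are illustrative assumptions, not Drill's actual allocations.
public class UnionFootprint {
    // Illustrative byte cost of a 'rows'-row union of Int and Varchar.
    static long unionBytes(int rows, int avgVarcharLen) {
        long intVector = rows * 4L;            // fixed 4-byte Int slot per row
        long varcharOffsets = (rows + 1) * 4L; // 4-byte offset per row, plus one
        long varcharData = (long) rows * avgVarcharLen;
        long typeVector = rows;                // one type byte per row
        return intVector + varcharOffsets + varcharData + typeVector;
    }

    public static void main(String[] args) {
        // Every row pays for BOTH member types plus the type vector,
        // even though only one member is populated per row.
        System.out.println(unionBytes(1000, 20)); // prints 29004
    }
}
```

Under these assumptions, 1000 rows cost roughly 29 KB even though each row populates only one of the two member types.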
The result is a sudden, unplanned increase in memory usage. Since the sort can buffer many hundreds of batches, this unplanned memory increase can cause the sort to run out of memory.

h4. Muddy Semantics

The union vector is closely tied to the List vector: a list vector is, essentially, an array of unions. (See DRILL-5958.) The list type is used to model JSON, in which a list can hold anything: another list, an object, or scalars. For this reason, the union vector also can hold any type. And, indeed, it can hold a union of any of these types: a Map and an Int, or a List and a Map.

Drill is a relational, SQL-based tool. Work is required to bring non-relational structures into Drill. As discussed below, a union of scalars can be made to work. But a union of structured types (lists, arrays, or Maps) makes no sense.

h4. High Complexity

The union vector, as implemented, is quite complex. It contains member variables for every other vector type (except, strangely, the decimal types). Access to typed members is by type-specific methods, meaning that the client code must include a separate call for every type, resulting in very complex client code.

This complexity allowed the union type to be made to work, but causes this one type to consume a disproportionate amount of the vector and client code.

h4. Proposed Revision to Structure: The Variant Type

Given the above, we can now present the proposed changes. First, let us recognize that a union vector need not hold structured types; there are other solutions, as discussed in DRILL-xxxx. This leaves the union vector to hold just scalars.

h4. Proposed Revision to Storage

This in turn lets us adopt the [Variant type|https://en.wikipedia.org/wiki/Variant_type] originally introduced in Visual Basic. Variant “is a tagged union that can be used to represent any other data type”. The Variant type was designed to be compact by building on the idea of a tagged union in C.
{code}
struct {
  int tag;          // type
  union {
    int intValue;
    long longValue;
    …
  }
}
{code}

When implemented as a vector, the format could consume just a single variable-width vector, with each entry of the form {{\[type value]}}. The vector is simply a sequence of these (type, value) pairs.

The type is a single byte that encodes the MinorType that describes the value. That is, the type byte is like the existing type vector, but stored in the same location as the data. The data is simply the serialized form of the value: four bytes for an Int, eight bytes for a Float8, and so on.

Variable-width types require an extra field, the length field: {{\[type length value]}}. For example, a Varchar would be encoded as {{\[Varchar 27 byte0-26]}}.

A writer uses the type to drive serialization. A reader uses the type to drive deserialization.

Note that the type field must include a special marker for nulls. Today, the union type uses 0 to indicate a null value. (Note that, in a union or variant, a null value is not a null of some type; both the type and the value are null.) That form should be used in the variant representation as well. But note that the 0 value in the MinorType enum is not Null but is instead Late. This is an unpleasant messiness that the union (and variant) encoding must handle.

An offset vector provides the location of each entry, as is done with variable-length vectors today.

The result is a huge compaction of space requirements: from multiple vectors per type to just two vectors (offsets and data).

Such a change would be daunting if clients worked directly with vectors. However, with the introduction of the “result set loader” and “reader” abstractions, this change in format would be completely hidden from client code. The “result set” abstractions provide high-level APIs that isolate clients from the implementation, allowing changes such as this.

h4. Arrow Union Types

[Arrow|https://arrow.apache.org/docs/memory_layout.html] (see “Dense union type”) has retained Drill’s union vector design. It contains:

* One child array for each relative type
* Types buffer…
* Offsets buffer…

Unlike Drill, Arrow also has a “Sparse union type” that omits the offsets buffer if the child arrays are all of the same length.

The variant type is an opportunity for Drill to lead, based on our extensive experience with vectors in production systems. Once the variant type is proven in production, we can offer it to Arrow as part of the Drill/Arrow integration.

h4. Backward Compatibility

While the actual code change is quite straightforward, the far larger challenge is backward compatibility. Drill offers both JDBC and ODBC drivers. These drivers make use of Drill’s internal vector storage format. Thus, any change to the vector format will appear on the wire and must be understood by these clients.

Drill does not, unfortunately, provide a versioned API to deal with these issues. See DRILL-5957 for a proposed solution to the version negotiation problem.

For the union vector, let’s say the variant alternative is introduced in version Y. If a version X (older) client connects, the server converts the variant type to union format before sending to the client.

Thus, before we can change the union vector (or, for that matter, any vector), we must release clients that understand the version handshake protocol. Then, once those clients are deployed, a following server version can make the vector changes.

Note that this same issue will arise (only in much more complex form) if Drill were to adopt Arrow.

h4. Seed of a Row-Based Storage Format

Drill is a columnar engine. However, there are a few situations in which a row-based storage format would improve Drill performance and/or simplicity:

* JDBC and ODBC clients work with vectors today, but would prefer to work with rows. (The drivers contain complex code to do the column-to-row rotation on the client.)
* Hash exchanges broadcast each row to a different host, but today do that by buffering rows until gathering a large enough batch to send, causing severe memory pressure. Row-by-row sending would be faster and more memory efficient.

If the variant format were available, a simple extension would be to use the same encoding for a row format:

* An offset vector, indexed by column, gives the start location of each column.
* The row buffer is a sequence of (type, value) pairs (for fixed-width types) or (type, length, value) triples (for variable-width types).

The same encoder/decoder that handles a column of heterogeneous values could also handle the same structure representing a row of such values.

h4. SQL-level Variant Semantics

The union vector (and the proposed new “variant” vector) exist to hold a variety of types. However, SQL is designed to work with just a single type. Therefore, we must consider not just the storage representation, but also query semantics.

A challenge is that neither JDBC nor ODBC was designed for variants, nor do most analytic tools know how to interpret varying data types. Indeed, since these APIs and tools are designed for relational data (in which the type of each column is known and fixed), it is the job of the query tool to determine the column type.

This means that, when using JDBC and ODBC, all union/variant processing must be done within Drill itself, with the client seeing a single, combined output type after some internal operation to produce that combined type.

One simple use case is to handle type schema changes within an input. For example, in JSON, a value might first present as an Integer, later as a Float. Or, a value might start small enough for a Float, but later present values that require a BigDecimal.
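The proposed tagged encoding can be sketched for exactly such a mixed-type column. The following is a hypothetical Java illustration of the {{\[type value]}} / {{\[type length value]}} format described earlier, not Drill code; the tag values are invented for the example:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed variant encoding: each entry is a
// one-byte type tag followed by the serialized value (plus a length field
// for variable-width types). Tag values are illustrative, not Drill's.
public class VariantSketch {
    static final byte NULL = 0, INT = 1, VARCHAR = 2;

    // The tag drives deserialization, exactly as described in the proposal.
    static List<String> decode(ByteBuffer buf) {
        List<String> out = new ArrayList<>();
        while (buf.hasRemaining()) {
            byte tag = buf.get();
            if (tag == INT) {
                out.add("Int: " + buf.getInt());
            } else if (tag == VARCHAR) {
                byte[] v = new byte[buf.getInt()];
                buf.get(v);
                out.add("Varchar: " + new String(v, StandardCharsets.UTF_8));
            } else {
                out.add("NULL"); // null entry: type tag only, no value bytes
            }
        }
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        buf.put(INT).putInt(42);                  // [INT 42]
        byte[] s = "hello".getBytes(StandardCharsets.UTF_8);
        buf.put(VARCHAR).putInt(s.length).put(s); // [VARCHAR 5 hello]
        buf.put(NULL);                            // [NULL]
        buf.flip();
        System.out.println(decode(buf));          // [Int: 42, Varchar: hello, NULL]
    }
}
```

Note that, as in the proposal, a null entry carries only the type tag, and the tag alone drives both serialization and deserialization.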
In such cases, a variant type allows Drill to hold values that correspond to how the JSON parser retrieved them.

To use those values in SQL, however, the user must unify them, perhaps with a Cast. For example, in the mixed-number case above, the user might cast the column to a decimal.

h4. Alternatives to the Union/Variant Types

Here, however, we can take a step back and ask a larger question. If the union/variant vector is to handle schema changes, might it be better to simply push the final schema down to the reader, and interpret the data as the final type at read time? That is, if we could tell the JSON reader (say) that column “x” is a Decimal, then the reader can do the conversion, saving all the complexity of a union (or variant) vector and casting.

One way to do this is to “push” cast operations into the reader by providing the reader not just column names, but the types as well. That is, projected columns are not just names; they are (name, type) pairs.

The above cannot solve the {{SELECT *}} case, however, as the user has chosen not to specify names (let alone types).

A more general solution is to allow the user to specify the column types as metadata (as is already done in all other query tools, perhaps via Hive). Then the user need not specify the types via casts in each query. Because the types are known at read time, {{SELECT *}} works fine. As a result, the need for a union/variant never arises.

Here it is worth pointing out that Drill must still be able to query data without a schema. But type conflicts may appear, since Drill can’t predict the future. The user then makes a decision about the easiest path forward for their own use case: 1) live with the issue, 2) add casts to each query, 3) add casts to a per-file view, or 4) provide metadata that solves the problem once and for all.

Given this, are there other cases where we actually do need the union type?
Do we have compelling use cases? If not, then the best path forward to fix the union type is simply to retire it in favor of the type hints described above.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)