[
https://issues.apache.org/jira/browse/DRILL-5955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Charles Givre reassigned DRILL-5955:
------------------------------------
Assignee: Vitalii Diravka
> Revisit Union Vectors
> ---------------------
>
> Key: DRILL-5955
> URL: https://issues.apache.org/jira/browse/DRILL-5955
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.11.0
> Reporter: Paul Rogers
> Assignee: Vitalii Diravka
> Priority: Major
>
> Drill supports a “Union Vector” type that allows a single column to hold
> values of multiple types. Conceptually, each column value is a (type, value)
> pair. For example, row 0 might be an Int, row 1 a Varchar and row 2 a NULL
> value.
> The name refers to a C “union” in which the same bit of memory is used to
> represent one of a set of defined types.
> Drill implements the union vector a bit like a map: as a collection of typed
> vectors. Each value is keyed by type. The result is that a union vector is
> more like a C “struct” than a C “union”: every vector takes space, but only
> one of the vectors is used for each row. For the example above, the union
> vector contains an Int vector, a Varchar vector and a type vector. For each
> row, either the Int or the Varchar is used. For NULL values, neither vector
> is used.
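> The layout just described can be sketched as follows. This is a deliberately simplified model with invented names (ToyUnionVector, hand-picked type codes), not Drill's actual UnionVector code:

```java
// Simplified model of the current union-vector layout: one member vector per
// type plus a type vector, even though each row uses only one member.
public class ToyUnionVector {
    static final byte NULL = 0, INT = 1, VARCHAR = 2;

    final byte[] types;      // "type vector": one tag per row
    final int[] ints;        // Int member vector: a slot for every row
    final String[] varchars; // Varchar member vector: likewise

    ToyUnionVector(int rowCount) {
        types = new byte[rowCount];   // zero-filled, so rows start as NULL
        ints = new int[rowCount];
        varchars = new String[rowCount];
    }

    void setInt(int row, int v)        { types[row] = INT; ints[row] = v; }
    void setVarchar(int row, String v) { types[row] = VARCHAR; varchars[row] = v; }

    Object get(int row) {
        switch (types[row]) {
            case INT:     return ints[row];
            case VARCHAR: return varchars[row];
            default:      return null;  // NULL: neither member vector is used
        }
    }
}
```

> Note that every row pays for a slot in every member vector, which is the "C struct" behavior described above.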
> h4. Memory Footprint Concerns
> The current representation, despite its name, makes very inefficient use of
> memory because it requires the sum of the storage for each included type.
> (That is, if we store 1000 rows, we need 1000 slots for integers, another
> 1000 for Varchar and yet another 1000 for the type vector.)
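> A back-of-envelope calculation makes the cost concrete. The sizes below are illustrative assumptions (4-byte ints, an average of 20 bytes of Varchar data per row, 4-byte offsets, a 1-byte type tag); real vectors round allocations up to powers of two:

```java
// Rough storage for N rows in a union of Int and Varchar under the current
// layout: every row pays for a slot in every member vector plus a type byte.
public class UnionFootprint {
    public static long unionBytes(long rows, long avgVarcharLen) {
        long intVector      = rows * 4;             // an Int slot per row
        long varcharData    = rows * avgVarcharLen; // Varchar space per row
        long varcharOffsets = (rows + 1) * 4;       // Varchar offset vector
        long typeVector     = rows;                 // one type byte per row
        return intVector + varcharData + varcharOffsets + typeVector;
    }
}
```

> For 1000 rows this comes to roughly 29 KB, even though each row actually uses only one of the two member vectors.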
> Drill's support for the union type is limited. One operator that does support it is
> the sort. If the union type is enabled, and the sort sees a schema change,
> the sort will create a new union vector that combines the two types. The
> result is a sudden, unplanned increase in memory usage. Since the sort can
> buffer many hundreds of batches, this unplanned memory increase can cause the
> sort to run out of memory.
> h4. Muddy Semantics
> The union vector is closely tied with the List vector: a list vector is,
> essentially, an array of unions. (See DRILL-5958). The list type is used to
> model JSON in which a list can hold anything: another list, an object or
> scalars. For this reason, the union vector also can hold any type. And,
> indeed, it can hold a union of any of these types: a Map and an Int, or a
> List and a Map.
> Drill is a relational, SQL-based tool. Work is required to bring
> non-relational structures into Drill. As discussed below, a union of scalars
> can be made to work. But, a union of structured types (lists, arrays or Maps)
> makes no sense.
> h4. High Complexity
> The union vector, as implemented, is quite complex. It contains member
> variables for every other vector type (except, strangely, the decimal types).
> Access to typed members is by type-specific methods, meaning that the client
> code must include a separate call for every type, resulting in very complex
> client code.
> This complexity allowed the union type to be made to work, but it causes this
> one type to consume a disproportionate amount of the vector and client code.
> h4. Proposed Revision to Structure: The Variant Type
> Given the above, we can now present the proposed changes. First let us
> recognize that a union vector need not hold structured types; there are other
> solutions as discussed in DRILL-xxxx. This leaves the union vector to hold
> just scalars.
> h4. Proposed Revision to Storage
> This in turn lets us adopt the [Variant
> type|https://en.wikipedia.org/wiki/Variant_type] originally introduced in
> Visual Basic. Variant “is a tagged union that can be used to represent any
> other data type”. The Variant type was designed to be compact by building on
> the idea of a tagged union in C.
> {code}
> struct {
>     int tag;          // type
>     union {
>         int intValue;
>         long longValue;
>         …
>     };
> };
> {code}
> When implemented as a vector, the format could consume just a single
> variable-width vector with each entry of the form: {{\[type value]}}. The
> vector is simply a sequence of these (type, value) pairs.
> The type is a single byte that encodes the MinorType that describes the
> value. That is, the type byte is like the existing type vector, but stored in
> the same location as the data. The data is simply the serialized format of
> data. (Four bytes for an Int, 8 bytes for a Float8 and so on.)
> Variable-width types require an extra field, the length: {{\[type length
> value]}}. For example, a Varchar would be encoded as {{\[Varchar 27
> byte0-26]}}.
> A writer uses the type to drive the serialization. A reader uses the type to
> drive deserialization.
> Note that the type field must include a special marker for nulls. Today, the
> union type uses 0 to indicate a null value. (Note that, in a union and
> variant, a null value is not a null of some type: both the type and value are
> null.) That form should be used in the variant representation as well. But
> note that the 0 value in the MinorType enum is not Null but is instead Late.
> This is an unpleasant messiness that the union (and variant) encoding must
> handle.
> An offset vector provides the location of each field, as is done with
> variable-length vectors today.
> The result is a large reduction in space: from one vector per member type
> (plus the type vector) down to just two vectors (offsets and data).
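> The encoding described above can be sketched with a writer and reader driven by the type byte. Everything here is illustrative (the type codes, the heap ByteBuffer, the class name); a real implementation would live in Drill's vector code and use MinorType ordinals:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed variant encoding: each entry is [type value] for
// fixed-width types, [type length value] for variable-width types, and a bare
// type byte for nulls; an offset list locates each entry.
public class VariantBuffer {
    static final byte NULL = 0, INT = 1, VARCHAR = 2;  // illustrative codes

    private final ByteBuffer data = ByteBuffer.allocate(1 << 16);
    private final List<Integer> offsets = new ArrayList<>();

    public void writeNull()     { offsets.add(data.position()); data.put(NULL); }
    public void writeInt(int v) { offsets.add(data.position()); data.put(INT).putInt(v); }
    public void writeVarchar(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        offsets.add(data.position());
        data.put(VARCHAR).putInt(bytes.length).put(bytes);  // [type length value]
    }

    // The reader dispatches on the type byte to drive deserialization.
    public Object read(int row) {
        ByteBuffer view = data.duplicate();
        view.position(offsets.get(row));
        byte type = view.get();
        switch (type) {
            case INT:
                return view.getInt();
            case VARCHAR:
                byte[] bytes = new byte[view.getInt()];
                view.get(bytes);
                return new String(bytes, StandardCharsets.UTF_8);
            default:
                return null;  // type 0: both type and value are null
        }
    }
}
```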
> Such a change would be daunting if clients work directly with vectors.
> However, with the introduction of the “result set loader” and “reader”
> abstractions, this change in format would be completely hidden from client
> code. The “result set” abstractions provide high level APIs that isolate
> clients from implementation, allowing changes such as this.
> h4. Arrow Union Types
> [Arrow|https://arrow.apache.org/docs/memory_layout.html] (see “Dense union
> type”) has retained Drill’s union vector design: it contains:
> * One child array for each relative type
> * Types buffer…
> * Offsets buffer…
> Unlike Drill, Arrow also has a “Sparse union type” that omits the offsets
> buffer if the child types are all of the same length.
> The variant type is an opportunity for Drill to lead based on our extensive
> experience with vectors in production systems. Once the variant type is
> proven in production, we can offer it to Arrow as part of the Drill/Arrow
> integration.
> h4. Backward Compatibility
> While the actual code change is quite straightforward, the far larger
> challenge is backward compatibility. Drill offers both JDBC and ODBC drivers.
> These drivers make use of Drill's internal vector storage format. Thus, any
> change to the vector format will appear on the wire and must be understood by
> these clients.
> Drill does not, unfortunately, provide a versioned API to deal with these
> issues. See DRILL-5957 for a proposed solution to the version negotiation
> problem.
> For the union vector, let’s say the variant alternative is introduced in
> version Y. If a version X (older) client connects, the server converts the
> variant type to union format before sending to the client.
> Thus, before we can change the union vector (or, for that matter, any
> vector), we must release clients that understand the version handshake
> protocol. Then, once those clients are deployed, a following server version
> can make the vector changes.
> Note that this same issue will arise (only in much more complex form) if
> Drill were to adopt Arrow.
> h4. Seed of a Row-Based Storage Format
> Drill is a columnar engine. However, there are a few situations in which a
> row-based storage format would improve Drill performance and/or simplicity:
> * JDBC and ODBC clients work with vectors today, but would prefer to work
> with rows. (The drivers contain complex code to do the column-to-row rotation
> on the client.)
> * Hash exchanges send each row to a different host, but today do that by
> buffering rows until gathering a large enough batch to send, causing severe
> memory pressure. Row-by-row sending would be faster and more memory efficient.
> If the variant format were to be available, a simple extension is to use the
> same encoding for a row format.
> * An offset vector, indexed by column, gives the start location of each
> column.
> * The row buffer is a sequence of (type, value) pairs (for fixed-width) or
> (type, length, value) triples (for variable-width types.)
> The same encoder/decoder that handles a column of heterogeneous values could
> also handle the same structure that represents a row of such values.
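> A row could be encoded the same way; the sketch below (hypothetical type codes and helpers, assuming a two-column row of Int and Varchar) shows the per-column offset table plus the (type, value) entries:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch of a row in the proposed format: offsetsOut[i] is the start of
// column i, and each entry is [type value] or [type length value].
public class RowCodec {
    static final byte INT = 1, VARCHAR = 2;  // illustrative codes

    // Encodes a two-column row (id Int, name Varchar) into one buffer.
    public static byte[] encode(int id, String name, int[] offsetsOut) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(1 + 4 + 1 + 4 + nameBytes.length);
        offsetsOut[0] = buf.position();   // start of column 0
        buf.put(INT).putInt(id);          // [type value]
        offsetsOut[1] = buf.position();   // start of column 1
        buf.put(VARCHAR).putInt(nameBytes.length).put(nameBytes); // [type length value]
        return buf.array();
    }

    // A receiver jumps straight to any column via its offset.
    public static Object decodeColumn(byte[] row, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(row);
        buf.position(offset);
        switch (buf.get()) {
            case INT:
                return buf.getInt();
            case VARCHAR:
                byte[] b = new byte[buf.getInt()];
                buf.get(b);
                return new String(b, StandardCharsets.UTF_8);
            default:
                return null;
        }
    }
}
```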
> h4. SQL-level Variant Semantics
> The union vector (and the proposed new “variant” vector) exist to hold a
> variety of types. However, SQL is designed to work with just a single type.
> Therefore, we must consider not just storage representation, but also query
> semantics.
> A challenge is that neither JDBC nor ODBC were designed for variants, nor do
> most analytic tools know how to interpret varying data types. Indeed, since
> these APIs and tools are designed for relational data (in which the type of
> each column is known and fixed), then it is the job of the query tool to
> determine the column type.
> This means that, when using JDBC and ODBC, all union/variant processing must
> be done within Drill itself, with the client seeing a single, combined
> output type after some internal operation to produce that combined type.
> One simple use case is to handle type changes within an input. For example,
> in JSON, a value might first appear as an Integer and later as a Float. Or, a
> value might start small enough for a Float, but later present values that
> require a BigDecimal.
> In such cases, a variant type allows Drill to hold values that correspond to
> how the JSON parser retrieved the values.
> To use those values in SQL, however, the user must unify them, perhaps with a
> Cast. For example, in the mixed-number case above, the user might cast the
> column to a decimal.
> h4. Alternatives to the Union/Variant Types
> Here, however, we can take a step back and ask a larger question. If the
> union/variant vector is to handle schema changes, might it be better to
> simply push the final schema down to the reader, and simply interpret the
> data as the final value at read time? That is, if we could tell the JSON
> reader (say) that column “x” is a Decimal, then the reader can do the
> conversion, saving all the complexity of a union (or variant) vector and
> casting.
> One way to do this is to “push” cast operations into the reader by providing
> the reader not just column names, but with the types as well. That is,
> projected columns are not just names, they are (name, type) pairs.
> The above cannot solve the {{SELECT *}} case, however, as the user has chosen
> not to specify names (let alone types.)
> A more general solution is to allow the user to specify the column types as
> metadata (as is already done in all other query tools, perhaps via Hive.)
> Then, the user need not specify the types via casts in each query. Because
> the types are known at read time, {{SELECT *}} works fine. As a result, the
> need for a union/variant never arises.
> Here it is worth pointing out that Drill must still be able to query data
> without a schema. But, type conflicts may appear since Drill can’t predict
> the future. The user then makes a decision that the easiest path forward for
> their own use case is to 1) live with the issue, 2) add casts to each query,
> 3) add casts to a per-file view, or 4) provide metadata that solves the
> problem once and for all.
> Given this, are there other cases where we actually do need the union type?
> Do we have compelling use cases? If not, then the best path forward to fix
> the union type is simply to retire it in favor of the type hints described
> above.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)