[ 
https://issues.apache.org/jira/browse/DRILL-5955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534613#comment-17534613
 ] 

ASF GitHub Bot commented on DRILL-5955:
---------------------------------------

cgivre commented on PR #2543:
URL: https://github.com/apache/drill/pull/2543#issuecomment-1123010372

   @vdiravka This is looking good.  Quick question, can we add a configuration 
variable to the JSON Format to enable/disable this at the plugin instance level?




> Revisit Union Vectors
> ---------------------
>
>                 Key: DRILL-5955
>                 URL: https://issues.apache.org/jira/browse/DRILL-5955
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.11.0
>            Reporter: Paul Rogers
>            Priority: Major
>
> Drill supports a “Union Vector” type that allows a single column to hold 
> values of multiple types. Conceptually, each column value is a (type, value) 
> pair. For example, row 0 might be an Int, row 1 a Varchar and row 2 a NULL 
> value.
> The name refers to a C “union” in which the same bit of memory is used to 
> represent one of a set of defined types.
> Drill implements the union vector a bit like a map: as a collection of typed 
> vectors. Each value is keyed by type. The result is that a union vector is 
> more like a C “struct” than a C “union”: every vector takes space, but only 
> one of the vectors is used for each row. For the example above, the union 
> vector contains an Int vector, a Varchar vector and a type vector. For each 
> row, either the Int or the Varchar is used. For NULL values, neither vector 
> is used.
> h4. Memory Footprint Concerns
> The current representation, despite its name, makes very inefficient use of 
> memory because it requires the sum of the storage for each included type. 
> (That is, if we store 1000 rows, we need 1000 slots for integers, another 
> 1000 for Varchar and yet another 1000 for the type vector.)
> Drill poorly supports the union type. One operator that does support it is 
> the sort. If the union type is enabled, and the sort sees a schema change, 
> the sort will create a new union vector that combines the two types. The 
> result is a sudden, unplanned increase in memory usage. Since the sort can 
> buffer many hundreds of batches, this unplanned memory increase can cause the 
> sort to run out of memory.
> h4. Muddy Semantics
> The union vector is closely tied with the List vector: a list vector is, 
> essentially, an array of unions. (See DRILL-5958). The list type is used to 
> model JSON in which a list can hold anything: another list, an object or 
> scalars. For this reason, the union vector also can hold any type. And, 
> indeed, it can hold a union of any of these types: a Map and an Int, or a 
> List and a Map.
> Drill is a relational, SQL-based tool. Work is required to bring 
> non-relational structures into Drill. As discussed below, a union of scalars 
> can be made to work. But, a union of structured types (lists, arrays or Maps) 
> makes no sense.
> h4. High Complexity
> The union vector, as implemented is quite complex. It contains member 
> variables for every other vector type (except, strangely, the decimal types.) 
> Access to typed members is by type-specific methods, meaning that the client 
> code must include a separate call for every type, resulting in very complex 
> client code.
> The complexity allowed the union type to be made to work, but causes this one 
> type to consume a disproportionate amount of the vector and client code.
> h4. Proposed Revision to Structure: The Variant Type
> Given the above, we can now present the proposed changes. First let us 
> recognize that a union vector need not hold structured types; there are other 
> solutions as discussed in DRILL-xxxx. This leaves the union vector to hold 
> just scalars.
> h4. Proposed Revision to Storage
> This in turn lets us adopt the [Variant 
> type|https://en.wikipedia.org/wiki/Variant_type] originally introduced in 
> Visual Basic. Variant “is a tagged union that can be used to represent any 
> other data type”. The Variant type was designed to be compact by building on 
> the idea of a tagged union in C.
> {code}
> struct {
>   int tag; // type
>   union {
>     int intValue;
>     long longValue;
>     …
>   }
> }
> {code}
> When implemented as a vector, the format could consume just a single 
> variable-width vector with each entry of the form: {{\[type value]}}. The 
> vector is simply a sequence of these (type, value) pairs.
> The type is a single-byte that encodes the MinorType that describes the 
> value. That is, the type byte is like the existing type vector, but stored in 
> the same location as the data. The data is simply the serialized format of 
> data. (Four bytes for an Int, 8 bytes for a Float8 and so on.)
> Variable-width types require an extra field: the type field: {{\[type length 
> value]}}. For example, a Varchar would be encoded as {{\[Varchar 27 
> byte0-26]}}.
> A writer uses the type to drive the serialization. A reader uses the type to 
> drive deserialization.
> Note that the type field must include a special marker for nulls. Today, the 
> union type uses 0 to indicate a null value. (Note that, in a union and 
> variant, a null value is not a null of some type, both the type and value are 
> null.) That form should be used in the variant representation as well. But, 
> note that the 0 value in the MajorType enum is not Null but is instead Late. 
> This is an unpleasant messiness that the union (and variant )encoding must 
> handle.
> An offset vector provides the location of each field, as is done with 
> variable-length vectors today.
> The result is huge compaction of space requirements from multiple vectors per 
> type to just two vectors (offsets and data.)
> Such a change would be daunting if clients work directly with vectors. 
> However, with the introduction of the “result set loader” and “reader” 
> abstractions, this change in format would be completely hidden from client 
> code. The “result set” abstractions provide high level APIs that isolate 
> clients from implementation, allowing changes such as this.
> h4. Arrow Union Types
> [Arrow|https://arrow.apache.org/docs/memory_layout.html] (see “Dense union 
> type”) has retained Drill’s union vector design: it contains:
> * One child array for each relative type
> * Types buffer…
> * Offsets buffer…
> Unlike Drill, Arrow also has a “Sparse union type” that omits the offsets 
> buffer if the child types are all of the same length.
> The variant type is an opportunity for Drill to lead based on our extensive 
> experience with vectors in production systems. Once the variant type is 
> proven in production, we can offer it to Arrow as part of the Drill/Arrow 
> integration.
> h4. Backward Compatibility
> While the actual code change is quite straightforward, the far larger 
> challenge is backward compatibility. Drill offers both JDBC and ODBC drivers. 
> These drivers make use of Drills internal vector storage format. Thus, any 
> change to the vector format will appear on the wire and must be understood by 
> these clients.
> Drill does not, unfortunately, provide a versioned API to deal with these 
> issues. See DRILL-5957 for a proposed solution to the version negotiation 
> problem.
> For the union vector, let’s say the variant alternative is introduced in 
> version Y. If a version X (older) client connects, the server converts the 
> variant type to union format before sending to the client.
> Thus, before we can change the union vector (or, for that matter, any 
> vector), we must release clients that understand the version handshake 
> protocol. Then, once those clients are deployed, a following server version 
> can make the vector changes.
> Note that this same issue will arise (only in much more complex form) if 
> Drill were to adopt Arrow.
> h4. Seed of a Row-Based Storage Format
> Drill is a columnar engine. However, there are a few situations in which a 
> row-based storage format would improve Drill performance and/or simplicity:
> * JDBC and ODBC clients work with vectors today, but would prefer to work 
> with rows. (The drivers contain complex code to do the column-to-row rotation 
> on the client.)
> * Hash exchanges broadcast each row to a different host, but today do that by 
> buffering rows until gathering a large enough batch to send, causing severe 
> memory pressure. Row-by-row sending would be faster and more memory efficient.
> If the variant format were to be available, a simple extension is to use the 
> same encoding for a row format.
> * An offset vector, indexed by column, gives the start location of each 
> column.
> * The row buffer is a sequence of (type, value) pairs (for fixed-width) or 
> (type, length, value) triples (for variable-width types.)
> The same encoder/decoder that handles a column of heterogeneous values could 
> also handle the same structure that represents a row of such values.
> h4. SQL-level Variant Semantics
> The union vector (and the proposed new “variant” vector) exist to hold a 
> variety of types. However, SQL is designed to work with just a single type. 
> Therefore, we must consider not just storage representation, but also query 
> semantics.
> A challenge is that neither JDBC nor ODBC were designed for variants, nor do 
> most analytic tools know how to interpret varying data types. Indeed, since 
> these APIs and tools are designed for relational data (in which the type of 
> each column is known and fixed), then it is the job of the query tool to 
> determine the column type.
> This means that, when using JDBC and ODBC, all union/variant processing must 
> be done within Drill itself, with the client seining a single, combined 
> output type after some internal operation to produce that combined type.
> One simple use case is to handle type schema changes within an input. For 
> example, in JSON, a value might first present as an Integer, later as a 
> Float. Or, a value might start small enough for a Float, but later present 
> values that require a BigDecimal.
> In such cases, a variant type allows Drill to hold values that correspond to 
> how the JSON parser retrieved the values.
> To use those values in SQL, however, the user must unify them, perhaps with a 
> Cast. For example, in the mixed-number case above, the user might cast the 
> column to a decimal.
> h4. Alternatives to the Union/Variant Types
> Here, however, we can take a step back and ask a larger question. If the 
> union/variant vector is to handle schema changes, might it be better to 
> simply push the final schema down to the reader, and simply interpret the 
> data as the final value at read time? That is, if we could tell the JSON 
> reader (say) that column “x” is a Decimal, then the reader can do the 
> conversion, saving all the complexity of a union (or variant) vector and 
> casting.
> One way to do this is to “push” cast operations into the reader by providing 
> the reader not just column names, but with the types as well. That is, 
> projected columns are not just names, they are (name, type) pairs.
> The above cannot solve the {{SELECT *}} case, however, as the user has chosen 
> not to specify names (let alone types.)
> A more general solution is to allow the user to specify the column types as 
> metadata (as is already done in all other query tools, perhaps via Hive.) 
> Then, the user need not specify the types via casts in each query. Because 
> the types are known at read time, {{SELECT *}} works fine. As a result, the 
> need for a union/variant never arises.
> Here it is worth pointing out that Drill must still be able to query data 
> without a schema. But, type conflicts may appear since Drill can’t predict 
> the future. The user than makes a decision that the easiest path forward for 
> their own use case is to 1) live with the issue, 2) add casts to each query, 
> 3) add casts to a per-file view, or 4) provide metadata that solves the 
> problem once and for all.
> Given this there other cases where we actually do need the union type? Do we 
> have compelling use cases? If not, then the best path forward to fix the 
> union type is simply to retire it in favor of the type hints described above.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

Reply via email to