[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/914 Addressed Karthik's comments, rebased on master, resolved conflicts, and squashed all but the conflict resolution commit. ---
[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/914 Thanks, Parth! Will make another commit to address small issues that Karthik pointed out. Let's hold off the actual commit until Drill 1.12 ships. I'll then commit the changes when we open things up again for Drill 1.13 changes. ---
[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure
Github user parthchandra commented on the issue: https://github.com/apache/drill/pull/914 +1. Ship it! ---
[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure
Github user parthchandra commented on the issue: https://github.com/apache/drill/pull/914 I was hoping that we would just get better encapsulation without losing performance. This performance boost is rather serendipitous. One possible explanation might be in the JVM optimizing away code paths in the benchmark itself. This means that we might not see the same performance gains in real life. @bitblender knows more about this than I do. ---
[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/914 There has been discussion recently on Drill's goal of integrating with Arrow. The work to use the `DrillBuf` highlights how we can integrate with Arrow. Simply replace the `DrillBuf` usage with `ArrowBuf`, make the required changes in vector names, add a few methods to the Arrow API, and this entire mechanism can be easily ported over to use Arrow. ---
[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/914 Regarding the use of memory addresses: the only reason to use them is performance. To show the benefit of using addresses, I reran the `PerformanceTool` class to test the original code, the code using addresses, and a version that uses `DrillBuf` as @parthchandra suggested. I expected to see that using addresses was the winner. That's not at all what happened. The code contains a class, `PerformanceTool`, that compares the column writers with the original vector mutators. It loads a vector to 16 MB in size, repeated 300 times. The following are the run times, in ms:

| Vector Type | Original | New w/Address | New w/DrillBuf |
| ----------- | -------- | ------------- | -------------- |
| Required    | 5703     | 4034          | 1461           |
| Nullable    | 12743    | 3645          | 3411           |
| Repeated    | 20430    | 7226          | 2669           |

Here:

* "Original" uses the original int vector mutator class.
* "New w/Address" shows the same exercise, using the version of the vector writers based on a direct memory address.
* "New w/DrillBuf" shows the vector writers, but using the technique Parth suggested of creating "unsafe" methods on the `DrillBuf` class.

The test is run with a pre-allocated vector (no double-and-copy operations). See `PerformanceTool` for details. I have no explanation for why the `DrillBuf` version should be faster at all, let alone far faster; but I'll take it. The latest commit contains the code after this revision. So, thank you Parth, you were right again with what turned out to be an outstanding performance boost. ---
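For readers without the Drill source handy, here is a rough, JDK-only sketch of the kind of harness `PerformanceTool` implements: repeatedly filling a pre-allocated 16 MB direct buffer with ints and timing each pass. The class name and constants are illustrative only; this is not the actual benchmark and will not reproduce the figures above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class FillBench {
    static final int VALUE_WIDTH = 4;                              // bytes per int
    static final int VALUE_COUNT = 16 * 1024 * 1024 / VALUE_WIDTH; // fill 16 MB

    // Time one pass of writing VALUE_COUNT ints into a pre-allocated buffer.
    static long fillMillis(ByteBuffer buf) {
        long start = System.nanoTime();
        for (int i = 0; i < VALUE_COUNT; i++) {
            buf.putInt(i * VALUE_WIDTH, i);   // absolute put: no position bookkeeping
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(VALUE_COUNT * VALUE_WIDTH)
                                      .order(ByteOrder.nativeOrder());
        // Warm up so the JIT compiles the loop before the timed pass.
        for (int pass = 0; pass < 5; pass++) {
            fillMillis(direct);
        }
        System.out.println("one 16 MB fill: " + fillMillis(direct) + " ms");
    }
}
```

As Parth notes above, micro-benchmarks like this are vulnerable to the JVM optimizing away work, so any timings should be treated with suspicion.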
[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure
Github user parthchandra commented on the issue: https://github.com/apache/drill/pull/914 Just FYI, this (https://github.com/d-t-w/netty-buffers) suggests that some of Netty's internal fragmentation issues have been addressed since 4.0.37. ---
[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/914 Rebased on master. Squashed all previous commits; left code review revisions separate for now. ---
[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/914 Pushed a commit with the changes suggested by Parth's review comments. @parthchandra, please review. ---
[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/914 Finally, a note on the fragmentation issue. As you noted, this is a subtle issue. It is true that Netty maintains a memory pool, based on binary allocations, that minimizes the normal kind of fragmentation that results from random-sized allocations from a common pool. The cost of the binary structure is _internal_ fragmentation. Today, Drill vectors have, on average, 25% internal fragmentation. This PR does not address this issue per se, but sets us on the road toward a solution.

The key fragmentation issue that this PR _does_ deal with is the kind that occurs when allocations exceed the 16 MB (default) Netty block size. In that case, Netty does, in fact, go to the OS. The OS does a fine job of coalescing large blocks to prevent fragmentation. The problem, however, is that, over time, more and more memory resides in the Netty free list. Eventually, there simply is not enough memory left outside of Netty to service a jumbo (> 16 MB) block. Drill gets an OOM error even though Netty has many GB of memory free; just none available in the 32+ MB size we want.

We could force Netty to release unused memory. In fact, the original [JE-Malloc paper](https://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf) (that you provided way back when, thanks) points out that the allocator should monitor its pools and release memory back to the system when a pool's usage drops to zero. It does not appear that `PooledByteBufAllocatorL` implements this feature, so the allocator never releases memory once it lands in the allocator's free list. We could certainly fix this; the JE-Malloc paper provides suggestions. Still, we could end up with usage patterns in which some slice of memory is used from each chunk, blocking any chunk from being released to the OS, and thereby blocking a "jumbo" block allocation, again even though much memory sits on the free list.
This is yet another form of fragmentation. Finally, as you point out, all of this assumes that we want to continue to allocate "jumbo" blocks. But, as we discovered in the managed sort work and the hash agg spill work, Drill has two conflicting tendencies. On the one hand, "managed" operators wish to operate within a constrained memory footprint. (Which often ends up being on the order of 30 MB for the sort, for various reasons.) If the scan operator, say, decides to allocate a batch that contains 32 MB vectors, then the sort can't accept even one of those batches, and an OOM ensues. So, rather than solve our memory fragmentation issues by mucking with Netty (forced free of unused chunks, increased chunk size, etc.), the preferred solution is to live within a budget: both the constraints of the Netty chunk size *and* the constraints placed on Drill operator memory usage. In short, we started by wanting to solve the fragmentation issue, but we realized that the best solution is to also solve the unlimited-batch-size issue, hence this PR. ---
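As an aside, the roughly 25% internal fragmentation figure cited above falls out of the arithmetic of buddy-style allocation. This stand-alone sketch (not Drill code; class and names are illustrative) estimates it by rounding uniformly distributed request sizes up to the next power of two, as a binary buddy allocator does:

```java
public class FragmentationEstimate {
    // Round a request up to the next power of two, as a buddy allocator does.
    static int roundUp(int size) {
        int p = Integer.highestOneBit(size);
        return (p == size) ? size : p << 1;
    }

    public static void main(String[] args) {
        long requested = 0, allocated = 0;
        // Sample request sizes spread between 1 byte and 1 MB.
        for (int size = 1; size <= 1 << 20; size += 997) {
            requested += size;
            allocated += roundUp(size);
        }
        // For sizes uniform within each power-of-two octave, the average
        // request is 1.5x the lower bound while the allocation is 2x,
        // giving (2 - 1.5) / 2 = 25% wasted space on average.
        double waste = 100.0 * (allocated - requested) / allocated;
        System.out.printf("average internal fragmentation: %.1f%%%n", waste);
    }
}
```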
[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/914 @parthchandra, thank you for taking the time to review this large chunk of code. Thanks for noticing the tests; I see those as the only way to ensure that a change of this size actually works; expecting a human to "mentally execute" the code is not practical at this scale. I did incorporate the change you suggested for grabbing the chunk size from `PooledByteBufAllocatorL` via a new static method in `AllocationManager`. Very good suggestion; thanks. ---
[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/914 Thank you, @bitblender, for your in-person review. I've pushed a commit that reflects your comments. ---
[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/914 Rebased on to master with conflicts resolved. ---
[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/914 Added final development stage to this PR. This is a minor implementation tweak. The result set loader internals for maps held redundant map vectors. In the final vector container, each map must be represented by a map (or repeated map) vector. But, because the result set loader handles projection and overflow, the set of columns that a map writer works with is a superset of those that appear in the output container. For this reason, there turns out to be no reason to maintain these "redundant" map vectors. For repeated map vectors, we must maintain the offset vector. The code already did this; we just strip off the enclosing repeated map vector. ---
[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/914 Rebased onto master to resolve conflict in `exec/jdbc-all/pom.xml`. ---
[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/914 This next commit reintroduces a projection feature. With this change, a client can:

* Define the set of columns to project
* Define the schema of the data source (table, file, etc.)
* Write columns according to the schema
* Harvest only the projected columns.

### Example

Here is a simple example adapted from `TestResultSetLoaderProjection`. First, declare the projection:

```
List<SchemaPath> selection = Lists.newArrayList(
    SchemaPath.getSimplePath("c"),
    SchemaPath.getSimplePath("b"),
    SchemaPath.getSimplePath("e"));
```

Then, declare the schema. (Here, we declare the schema up-front. Projection also works if the schema is defined as columns are discovered while creating a batch.)

```
TupleMetadata schema = new SchemaBuilder()
    .add("a", MinorType.INT)
    .add("b", MinorType.INT)
    .add("c", MinorType.INT)
    .add("d", MinorType.INT)
    .buildSchema();
```

Then, use the options mechanism to pass the information to the result set loader:

```
ResultSetOptions options = new OptionBuilder()
    .setProjection(selection)
    .setSchema(schema)
    .build();
ResultSetLoader rsLoader = new ResultSetLoaderImpl(fixture.allocator(), options);
```

Now, we can write the four columns in the data source:

```
RowSetLoader rootWriter = rsLoader.writer();
rsLoader.startBatch();
...
rootWriter.start();
rootWriter.scalar("a").setInt(10);
rootWriter.scalar("b").setInt(20);
rootWriter.scalar("c").setInt(30);
rootWriter.scalar("d").setInt(40);
rootWriter.save();
```

But, when we harvest the results, we get only the projected columns. Notice that "e" is projected, but does not exist in the table, and so is not projected to the output. A higher level of code will handle this case.

```
#: b, c
0: 20, 30
```

### Maps

Although the above example does not show the feature, the mechanism also handles maps and arrays of maps. The rules are:

* If the projection list includes specific map members (such as "m.b"), then project only those map members.
* If the projection list includes just the map name (such as "m"), then project all map members (such as "m.a" and "m.b").
* If the projection list does not include the map at all, then project neither the map nor any of its members.

### Implementation

The implementation builds on previous commits. The idea is that we create a "dummy" column and writer, but we do not create the underlying value vector. This allows the client to be blissfully ignorant of whether the column is projected or not. On the other hand, if the client wants to know whether a column is projected (perhaps to optimize away certain operations), the projection status is available in the column metadata.

#### Projection Set

Projection starts with a `ProjectionSet` abstraction. Each tuple (row, map) has a projection set. The projection set can be a set of names (`ProjectionSetImpl`) or a default (`NullProjectionSet`).

#### Result Set Loader

The result set loader is modified to check whether a column is projected. If so, the code flow is the same as previously. If not, the code creates the dummy vector state and dummy writers described above. Adding support for non-projected columns involved the usual amount of refactoring and moving bits of code around to reach a simple solution.

#### Accessor Factories

Prior versions had a `ColumnAccessorFactory` class that created both readers and writers. This commit splits the class into separate reader and writer factories. The writer factory now creates dummy writers if asked to create a writer when the backing vector is null. To make this easier, factory code that previously appeared in each writer has moved into the writer factory. (Note that readers don't support projection: there is no need.)

#### Dummy Writers

The accessor layer is modified to create a set of dummy writers. Scalar writers have a wide (internal) interface. Dummy scalar writers simply ignore the unsupported operations. Dummy array and tuple writers are also provided.

#### Unit Tests

The new `TestResultSetLoaderProjection` test exercises the new code. The new `DummyWriterTest` exercises the dummy writers. ---
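The dummy-writer technique described above can be sketched in a few lines of stand-alone Java. The class and method names here are hypothetical illustrations, not the actual accessor classes: the point is only that a factory hands back a no-op writer when a column is not in the projection set, so client code writes every column identically.

```java
import java.util.Set;

public class ProjectionSketch {
    public interface ScalarWriter {
        void setInt(int v);
    }

    // "Real" writer: backed by storage (a value vector, in Drill's case).
    public static class VectorWriter implements ScalarWriter {
        public int last;
        @Override public void setInt(int v) { last = v; }
    }

    // Dummy writer for unprojected columns: silently discards every write.
    public static class DummyWriter implements ScalarWriter {
        @Override public void setInt(int v) { }
    }

    // Factory: the client never needs to know whether the column is projected.
    public static ScalarWriter writerFor(String col, Set<String> projection) {
        return projection.contains(col) ? new VectorWriter() : new DummyWriter();
    }
}
```

The design choice mirrors the comment above: the writer interface stays the same for every column, and a caller that does care can consult metadata rather than the writer itself.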
[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/914 This commit introduces a feature to limit the memory consumed by a batch.

### Batch Size Limits

With this change, the code now has three overlapping limits:

* The traditional row-count limit.
* A maximum limit of 16 MB per vector.
* The new memory-per-batch limit.

### Overall Flow for Limiting Batch Memory Usage

The batch size limit builds on the work already done for overflow:

* The column metadata allows the client to specify allocation hints such as expected Varchar width and array cardinality.
* The result set loader allocates a batch using the hints and target row count.
* The result set loader measures the memory allocated above. This is the initial batch size.
* As the writers find the need to extend a vector, the writer calls a listener to ask if the extension is allowed, passing in the amount of growth expected.
* The result set loader adds the delta to the accumulated total, compares this against the size limit, and returns whether the resize is allowed.
* If the resize is not allowed, an overflow is triggered.

Note that the above reuses the overflow mechanism, allowing the size limit to be honored even if it is reached in the middle of a row.

### Implementation Details

To make the above work:

* A new batch size limit is added to the result set loader options.
* Batch size tracking code is added. This required a new method in the value vectors to report actual allocated memory.
* The scalar accessors are refactored to add the batch size limitation without introducing duplicated code. Code moved from the template to base classes to factor out redundancy.
* General code clean-up in the vector limit code, found while doing the above work.
* Unit tests for the new mechanism. ---
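The listener handshake in the flow above can be sketched as stand-alone Java. The names are illustrative, not the actual result set loader internals: a writer asks before growing a buffer, and the loader either charges the growth against the budget or vetoes it, which would trigger overflow.

```java
public class BatchSizeTracker {
    // Contract a column writer uses before doubling a vector's buffer.
    public interface AllocationListener {
        boolean canExpand(int delta);
    }

    // Charges each approved expansion against a fixed per-batch budget.
    public static class SizeLimitListener implements AllocationListener {
        private final long limit;
        private long accumulated;

        public SizeLimitListener(long initialAllocation, long limit) {
            this.accumulated = initialAllocation;  // memory allocated up-front
            this.limit = limit;                    // memory-per-batch limit
        }

        @Override
        public boolean canExpand(int delta) {
            if (accumulated + delta > limit) {
                return false;  // caller triggers overflow instead of growing
            }
            accumulated += delta;
            return true;
        }

        public long accumulated() { return accumulated; }
    }
}
```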
[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/914 Rebased onto latest master. ---
[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/914 The second group of five commits refines the result set loader mechanism.

### Model Layer Revision

Drill is unique for a query engine in that it handles structured types as first-class data types. For example, Drill supports maps (called "structs" by Hive and Impala), but also supports arrays of maps. Drill supports simple scalars, and also arrays of scalars. Put these together and we have maps that contain arrays of maps that contain arrays of scalars. The result is a tree structure described in the Drill documentation as modeled on JSON.

The prior version used a "model" layer to construct internal structures that model the tree structure of Drill's data types. The model "reified" the structure into a set of objects. While this worked well, it added more complexity than necessary, especially when dynamically evolving a schema or working out how to handle scan-time projection. This version retains the model layer, but as a series of algorithms that walk the vector container structure rather than as a separate data structure. The model layer still provides tools to build readers for "single" and "hyper" vectors, to extract a metadata schema from a set of vectors, to create writers for a "single" vector container, and so on.

Replacing the object structure with algorithms required changes to both the row set abstractions and the result set loader. The key loss in this change is the set of "visitors" from the previous revision. The reified model allowed all visitors to use a common structure. The new solution still has visitors, but now they are ad hoc, walking the container tree in different ways depending on whether the code can work with columns generically, or needs to deal with individual (single) vectors or hyper vectors.

### Revised Result Set Loader: Column and Vector State

To understand the need for many of the changes in this commit, it helps to take a step back and remember what we're trying to do. We want to write to vectors, but control the resulting batch size.

#### Background

Writing to vectors is easy if we deal only with flat rows and don't worry about batch size:

* Vectors provide `Mutator` classes that write to single vectors.
* A set of "legacy" vector writers are available, and are used by some readers.
* Generated code uses `Mutator` and `Accessor` classes to work with vectors.
* The "Row Set" classes, used for testing, provide a refined column writer to populate batches.

The above are the easy parts. Some challenges include:

* None of the above limit batch memory; they only limit row count.
* Writing directly to vectors requires that the client deal with the complexity of tracking a common position across vectors.
* Drill's tree structure makes everything more complex, as positions must be tracked across multiple repetition levels (see below).

The "result set loader" (along with the column writers) provides the next level of completeness by tackling the vector memory size problem for the entire set of Drill structured types.

#### Overflow and Rollover

The key trick is to handle vector "overflow" by seamlessly shifting writes, mid-row, from a full batch to a new "look-ahead" batch. The process of shifting data is called "rollover." To implement rollover, we need to work with two sets of vectors:

* The "active" set: the vectors "under" the writers and returned downstream.
* The "backup" set: holds the buffers not currently in use.

During an overflow event, buffers are shifted between the active and backup vectors:

* On overflow, the full buffers reside in the active vectors.
* After rollover, the full buffers reside in the backup vectors, and new buffers, now holding the in-flight row (called the "look-ahead" row), reside in the active vectors.
* When "harvesting" the full batch, the full buffers and look-ahead buffers are exchanged, so the full buffers are back in the active vectors.
* When starting the next batch for writing, the look-ahead row is shifted from the backup vectors into the active vectors and the cycle repeats.

#### Column and Vector State

When writing without overflow, we need only one vector, and so the usual Drill vector container is sufficient. With overflow, we have two sets of vectors, and must perform operations on them, so we need a place to store the data. This is the purpose of the "column state" and "vector state" classes. Think of the overall result set loader structure as having three key parts:

* Result set loader: manages the entire batch
* Column writers: accept writes to vectors
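The active/backup exchange described above can be sketched with plain int arrays standing in for vector buffers. This is a heavily simplified, hypothetical illustration of the rollover and harvest steps, not the actual column/vector state code:

```java
public class RolloverSketch {
    // "Active" buffers sit under the writers and go into the harvested batch;
    // "backup" buffers hold whichever set is not currently being written.
    int[] active = new int[4];
    int[] backup = new int[4];

    // Overflow: park the full buffers in backup, give the writers fresh
    // buffers, and copy the in-flight ("look-ahead") row into the new set.
    void rollover(int lookAheadValue) {
        backup = active;
        active = new int[backup.length];
        active[0] = lookAheadValue;
    }

    // Harvest: exchange the sets so the full buffers are back under the
    // outgoing container; the look-ahead row waits in backup.
    void harvest() {
        int[] tmp = active;
        active = backup;
        backup = tmp;
    }

    // Next batch: shift the look-ahead row back under the writers.
    void startNextBatch() {
        int[] tmp = active;
        active = backup;
        backup = tmp;
    }
}
```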
[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/914 Merged a set of enhancements and bug fixes. Compared to the first set of commits:

* Provides tools to print the sometimes-complex tuple model, schema, and writer objects used for complex data types.
* Sharpens up the specification of the client API used to write data, especially for obscure cases such as abandoning rows, vector overflow inside deep structures, etc.
* Revises the internal "WriterEvents" API used to pass events down the writer hierarchy in response to the API revisions.
* Fully handles the case where a client chooses to "abandon" and rewrite a row (perhaps based on a filter condition).
* A couple of minor vector revisions. `RepeatedMapVector` allows building an instance from an existing offset vector (needed when harvesting batches with overflow).
* Adds more Javadoc.
* Adds many unit tests.
* Fixes in response to unit tests.
* Moves the vector cache into the result set loader layer.

Remaining open issues:

* Column projection
* Support for lists and repeated lists ---