[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure

2017-11-22 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/914
  
Addressed Karthik's comments, rebased on master, resolved conflicts, and 
squashed all but the conflict resolution commit.


---


[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure

2017-11-21 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/914
  
Thanks, Parth! Will make another commit to address the small issues that 
Karthik pointed out. Let's hold off on the actual commit until Drill 1.12 ships. 
I'll then commit the changes when we open things up again for Drill 1.13 
changes.


---


[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure

2017-11-17 Thread parthchandra
Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/914
  
+1. Ship it!


---


[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure

2017-11-17 Thread parthchandra
Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/914
  
I was hoping that we would just get better encapsulation without losing 
performance; this performance boost is rather serendipitous.
One possible explanation is that the JVM optimizes away code paths in 
the benchmark itself, which would mean we might not see the same performance 
gains in real life. @bitblender knows more about this than I do. 
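
As a general guard against that kind of benchmark dead-code elimination, one can consume every value and publish the result. A minimal sketch (names such as `IntColumnReader` are hypothetical, and this is not necessarily what `PerformanceTool` does):

```
// Consume every value and return the sum so the JIT cannot prove the
// benchmarked writes/reads are dead code and eliminate them.
static long checksum(IntColumnReader reader, int rowCount) {
  long sum = 0;
  for (int i = 0; i < rowCount; i++) {
    sum += reader.getInt(i);  // forces each read to be observable
  }
  return sum;                 // caller prints or asserts on this value
}
```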



---


[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure

2017-11-17 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/914
  
There has been discussion recently about Drill's goal of integrating with 
Arrow. The work to use `DrillBuf` highlights how we can integrate: simply 
replace the `DrillBuf` usage with `ArrowBuf`, make the required changes in 
vector names, add a few methods to the Arrow API, and this entire mechanism 
can be easily ported over to Arrow.


---


[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure

2017-11-16 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/914
  
Regarding the use of memory addresses: the only reason to use them is 
performance. To show the benefit of using addresses, I reran the 
`PerformanceTool` class to test the original code, the code using addresses, 
and a version that uses `DrillBuf` as @parthchandra suggested. I expected to see 
that using addresses was a winner. That's not at all what happened.

The code contains a class, `PerformanceTool`, that compares the column 
writers with the original vector mutators. It loads a vector to 16 MB in size, 
repeating the test 300 times. The run times, in ms:

Vector Type | Original | New w/Address | New w/DrillBuf
----------- | -------- | ------------- | --------------
Required | 5703 | 4034 | 1461
Nullable | 12743 | 3645 | 3411
Repeated | 20430 | 7226 | 2669

Here:

* "Original" column uses the original int vector mutator class.
* "New w/Address" shows the same exercise, using the version of the vector 
writers based on a direct memory address.
* "New w/Drillbuf" shows the vector writers, but using the technique Parth 
suggested to create "unsafe" methods on the `Drillbuf` class.

The test is run with a pre-allocated vector (no double-and-copy 
operations). See `PerformanceTool` for details.
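
For context, a rough sketch of the two new write paths being compared (only the Netty calls are real APIs; `unsafePutInt` is an assumed name, see the PR for the actual methods added to `DrillBuf`):

```
import io.netty.buffer.ByteBuf;
import io.netty.util.internal.PlatformDependent;

public class WritePathSketch {

  // "New w/Address" path: compute a raw memory address, then write with
  // no per-call bounds checks.
  static void writeViaAddress(ByteBuf buf, int index, int value) {
    PlatformDependent.putInt(buf.memoryAddress() + index * 4L, value);
  }

  // "New w/DrillBuf" path: call a bounds-check-free setter on the buffer
  // itself, e.g. buf.unsafePutInt(index * 4, value), keeping the buffer
  // abstraction intact. (Method name hypothetical.)
}
```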

I have no explanation for why the `DrillBuf` version should be faster at 
all, let alone far faster; but I'll take it. The latest commit contains the 
code after this revision.

So, thank you Parth, you were right again with what turned out to be an 
outstanding performance boost.


---


[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure

2017-11-16 Thread parthchandra
Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/914
  
Just FYI, this (https://github.com/d-t-w/netty-buffers) seems to indicate 
that some of Netty's internal fragmentation issues have been addressed 
since 4.0.37.


---


[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure

2017-11-14 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/914
  
Rebased on master. Squashed all previous commits; left code review 
revisions separate for now.


---


[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure

2017-11-14 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/914
  
Pushed a commit with the changes suggested by Parth's review comments. 
@parthchandra, please review.


---


[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure

2017-11-14 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/914
  
Finally, a note on the fragmentation issue. As you noted, this is a subtle 
issue. It is true that Netty maintains a memory pool, based on binary 
allocations, that minimizes the normal kind of fragmentation that results from 
random-sized allocations from a common pool.

The cost of the binary structure is _internal_ fragmentation. Today, Drill 
vectors have, on average, 25% internal fragmentation. (Rounding each allocation 
up to a power of two wastes anywhere from 0% to 50% of a buffer, 25% on 
average.) This PR does not address this issue per se, but it sets us on the 
road toward a solution.

The key fragmentation issue that this PR _does_ deal with is the one that 
occurs when allocations exceed the 16 MB (default) Netty block size. In that 
case, Netty does, in fact, go to the OS. The OS does a fine job of coalescing 
large blocks to prevent fragmentation. The problem, however, is that, over 
time, more and more memory resides in the Netty free list. Eventually, there 
simply is not enough memory left outside of Netty to service a jumbo (> 16 MB) 
block. Drill gets an OOM error even though Netty has many GB of memory free; 
just none of it is available in the 32+ MB size we want.

We could force Netty to release unused memory. In fact, the original 
[JE-Malloc 
paper](https://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf) 
(which you provided way back when, thanks) points out that the allocator should 
monitor its pools and release memory back to the system when a pool's usage 
drops to zero. It does not appear that `PooledByteBufAllocatorL` implements 
this feature, so the allocator never releases memory once it lands in the 
allocator's free list. We could certainly fix this; the JE-Malloc paper 
provides suggestions.

Still, we could end up with usage patterns in which some slice of 
memory is in use from each chunk, preventing any chunk from being released to 
the OS and thereby blocking a "jumbo" block allocation, again even though much 
memory sits on the free list. This is yet another form of fragmentation.

Finally, as you point out, all of this assumes that we want to continue to 
allocate "jumbo" blocks. But, as we discovered in the managed sort work and 
the hash agg spill work, Drill has two conflicting tendencies. On the one hand, 
"managed" operators wish to operate within a constrained memory footprint. 
(Which, for various reasons, often ends up being on the order of 30 MB for the 
sort.) If the scan operator, say, decides to allocate a batch that contains 
32 MB vectors, then the sort can't accept even one of those batches, and an 
OOM ensues.

So, rather than solve our memory fragmentation issues by mucking with Netty 
(forcing the release of unused chunks, increasing the chunk size, etc.), the 
preferred solution is to live within a budget: both the constraints of the 
Netty chunk size *and* the constraints placed on Drill operator memory usage.

In short, we started by wanting to solve the fragmentation issue, but we 
realized that the best solution is to also solve the unlimited-batch-size 
issue; hence this PR.


---


[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure

2017-11-14 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/914
  
@parthchandra, thank you for taking the time to review this large chunk of 
code. Thanks for noticing the tests; I see those as the only way to ensure that 
a change of this size actually works; expecting a human to "mentally execute" 
the code is not practical at this scale.

I did incorporate the change you suggested for grabbing the chunk size from 
`PooledByteBufAllocatorL` via a new static method in `AllocationManager`. Very 
good suggestion; thanks.
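
A minimal sketch of such an accessor, assuming it simply exposes Netty's computed chunk size (the actual names in `AllocationManager` and `PooledByteBufAllocatorL` may differ):

```
import io.netty.buffer.PooledByteBufAllocator;

public class ChunkSizeSketch {
  // Netty's chunk size is pageSize << maxOrder; with the defaults
  // (8 KB pages, maxOrder 11) this yields the 16 MB limit discussed
  // elsewhere in this PR.
  public static int nettyChunkSize() {
    return PooledByteBufAllocator.defaultPageSize()
        << PooledByteBufAllocator.defaultMaxOrder();
  }
}
```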


---


[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure

2017-11-13 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/914
  
Thank you, @bitblender, for your in-person review. I've pushed a commit 
that reflects your comments. 


---


[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure

2017-10-02 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/914
  
Rebased onto master with conflicts resolved.


---


[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure

2017-09-26 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/914
  
Added the final development stage to this PR. This is a minor implementation 
tweak. The result set loader internals for maps held redundant map vectors. In 
the final vector container, each map must be represented by a map (or repeated 
map) vector. But, because the result set loader handles projection and 
overflow, the set of columns that a map writer works with is a superset of 
those that appear in the output container. For this reason, it turns out there 
is no reason to maintain these "redundant" map vectors.

For repeated map vectors, we must maintain the offset vector. The code 
already did this; we just strip off the enclosing repeated map vector.


---


[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure

2017-09-21 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/914
  
Rebased onto master to resolve conflict in `exec/jdbc-all/pom.xml`.


---


[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure

2017-09-21 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/914
  
This next commit reintroduces a projection feature. With this change, a 
client can:

* Define the set of columns to project
* Define the schema of the data source (table, file, etc.)
* Write columns according to the schema
* Harvest only the projected columns.

### Example

Here is a simple example adapted from `TestResultSetLoaderProjection`. 
First, declare the projection:

```
List<SchemaPath> selection = Lists.newArrayList(
    SchemaPath.getSimplePath("c"),
    SchemaPath.getSimplePath("b"),
    SchemaPath.getSimplePath("e"));
```

Then, declare the schema. (Here, we declare the schema up-front. Projection 
also works if the schema is defined as columns are discovered while creating a 
batch.)

```
TupleMetadata schema = new SchemaBuilder()
    .add("a", MinorType.INT)
    .add("b", MinorType.INT)
    .add("c", MinorType.INT)
    .add("d", MinorType.INT)
    .buildSchema();
```

Then, use the options mechanism to pass the information to the result set 
loader:

```
ResultSetOptions options = new OptionBuilder()
    .setProjection(selection)
    .setSchema(schema)
    .build();
ResultSetLoader rsLoader = new ResultSetLoaderImpl(fixture.allocator(), options);
```

Now, we can write the four columns in the data source:

```
RowSetLoader rootWriter = rsLoader.writer();
rsLoader.startBatch();
…
rootWriter.start();
rootWriter.scalar("a").setInt(10);
rootWriter.scalar("b").setInt(20);
rootWriter.scalar("c").setInt(30);
rootWriter.scalar("d").setInt(40);
rootWriter.save();
```

But, when we harvest the results, we get only the projected columns. Notice 
that “e” is in the projection list but does not exist in the table, and so 
does not appear in the output. A higher level of code will handle this case.

```
#: b, c
0: 20, 30
```

### Maps

Although the above example does not show the feature, the mechanism also 
handles maps and arrays of maps. The rules are:

* If the projection list includes specific map members (such as “m.b”), 
then project only those map members.
* If the projection list includes just the map name (such as “m”), then 
project all map members (such as “m.a” and “m.b”).
* If the projection list does not include the map at all, then project 
neither the map nor any of its members.
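
For example, building on the `SchemaPath` usage shown earlier, the first two rules might be exercised like this (`getCompoundPath` builds the dotted “m.b” form):

```
List<SchemaPath> selection = Lists.newArrayList(
    SchemaPath.getCompoundPath("m", "b"), // rule 1: only member b of map m
    SchemaPath.getSimplePath("m2"));      // rule 2: all members of map m2
```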

### Implementation

The implementation builds on previous commits. The idea is that we create a 
“dummy” column and writer, but do not create the underlying value vector. 
This allows the client to be blissfully ignorant of whether the column is 
projected. On the other hand, if the client wants to know whether a column is 
projected (perhaps to optimize away certain operations), the projection status 
is available in the column metadata.

#### Projection Set

Projection starts with a `ProjectionSet` abstraction. Each tuple (row, map) 
has a projection set. The projection set can be a set of names 
(`ProjectionSetImpl`) or a default (`NullProjectionSet`).
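
A sketch of the shape of this abstraction (method names are assumptions; see the PR for the actual interface):

```
interface ProjectionSet {
  // Is the given column projected into the output container?
  boolean isProjected(String colName);

  // The projection set for the members of a map column, supporting
  // member-level projection such as "m.b".
  ProjectionSet mapProjection(String colName);
}
```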

#### Result Set Loader

The result set loader is modified to check whether a column is projected. If 
so, the code flow is the same as before. If not, the code creates the dummy 
vector state and dummy writers described above.

Adding support for non-projected columns involved the usual amount of 
refactoring and moving bits of code around to arrive at a simple solution.

#### Accessor Factories

Prior versions had a `ColumnAccessorFactory` class that created both 
readers and writers. This commit splits the class into separate reader and 
writer factories. The writer factory now creates dummy writers when asked to 
create a writer whose backing vector is null. To make this easier, factory 
code that previously appeared in each writer has moved into the writer factory. 
(Note that readers don’t support projection; there is no need.)

#### Dummy Writers

The accessor layer is modified to create a set of dummy writers. Scalar 
writers have a wide (internal) interface. Dummy scalar writers simply ignore 
the unsupported operations. Dummy array and tuple writers are also provided.
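
As a sketch of the idea (class and method names assumed, not the actual ones in the PR), a dummy scalar writer simply swallows each write:

```
// No backing vector: every setter is a no-op, so client code can write
// to a non-projected column without checking projection status.
class DummyScalarWriter extends AbstractScalarWriter {
  @Override public void setInt(int value) { }       // ignore the value
  @Override public void setString(String value) { } // ignore the value
  // ...and so on for the remaining setters
}
```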

#### Unit Test

The new `TestResultSetLoaderProjection` test exercises the new code. The 
new `DummyWriterTest` exercises the dummy writers.


---


[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure

2017-09-19 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/914
  
This commit introduces a feature to limit the memory consumed by a batch.

### Batch Size Limits

With this change, the code now has three overlapping limits:

* The traditional row-count limit.
* A maximum limit of 16 MB per vector.
* The new memory-per-batch limit.
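
In terms of the `OptionBuilder` shown in an earlier comment, the limits might be configured roughly as follows (the limit-setter names are assumptions; the 16 MB per-vector cap comes from the Netty chunk size rather than from an option):

```
ResultSetOptions options = new OptionBuilder()
    .setRowCountLimit(64 * 1024)        // traditional row-count limit
    .setBatchSizeLimit(8 * 1024 * 1024) // new memory-per-batch limit, bytes
    .build();
```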

### Overall Flow for Limiting Batch Memory Usage

The batch size limit builds on the work already done for overflow.

* The column metadata allows the client to specify allocation hints such as 
expected Varchar width and array cardinality.
* The result set loader allocates a batch using the hints and target row 
count.
* The result set loader measures the memory allocated above. This is the 
initial batch size.
* As a writer finds the need to extend a vector, it calls a listener to ask 
whether the extension is allowed, passing in the expected amount of growth.
* The result set loader adds the delta to the accumulated total, compares 
this against the size limit, and returns whether the resize is allowed.
* If the resize is not allowed, an overflow is triggered.

Note that the above reuses the overflow mechanism, allowing the size limit 
to be handled even if reached in the middle of a row.
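
A minimal sketch of that check (field and method names are assumptions):

```
// Called by a column writer before it grows a vector by "delta" bytes.
// Returning false causes the writer to trigger the overflow mechanism.
boolean canExpand(int delta) {
  if (accumulatedBatchSize + delta > maxBatchSizeBytes) {
    return false;                // over budget: overflow instead of grow
  }
  accumulatedBatchSize += delta; // within budget: track the growth
  return true;
}
```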

### Implementation Details

To make the above work:

* A new batch size limit is added to the result set loader options.
* Batch size tracking code is added. This required a new method on the 
value vectors to report actual allocated memory.
* The scalar accessors are refactored to add the batch size limitation 
without introducing duplicated code: code moved from the templates into base 
classes to factor out redundancy.
* General code clean-up in the vector code, found while doing the above 
work.
* Unit tests for the new mechanism.


---


[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure

2017-09-17 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/914
  
Rebased onto latest master.


---


[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure

2017-09-16 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/914
  
The second group of five commits refines the result set loader mechanism.

### Model Layer Revision

Drill is unique among query engines in that it handles structured types as 
first-class data types. For example, Drill supports maps (called “structs” 
by Hive and Impala), and also arrays of maps. Drill supports simple 
scalars, and also arrays of scalars. Put these together and we have maps that 
contain arrays of maps that contain arrays of scalars. The result is a tree 
structure, described in the Drill documentation as modeled on JSON.

The prior version used a “model” layer to construct internal structures 
that model the tree structure of Drill’s data types. The model “reified” 
the structure into a set of objects. While this worked well, it added more 
complexity than necessary, especially when dynamically evolving a schema or 
working out how to handle scan-time projection.

This version retains the model layer, but as a series of algorithms that 
walk the vector container structure rather than as a separate data structure. 
The model layer still provides tools to build readers for “single” and 
“hyper” vectors, to extract a metadata schema from a set of vectors, to 
create writers for a “single” vector container, and so on.

Replacing the object structure with algorithms required changes to both the 
row set abstractions and the result set loader.

The key loss in this change is the set of “visitors” from the previous 
revision. The reified model allowed all visitors to use a common structure. The 
new solution still has visitors, but now they are ad hoc, walking the container 
tree in different ways depending on whether the code can work with columns 
generically or needs to deal with individual (single) vectors or hyper vectors.

### Revised Result Set Loader Column and Vector State

To understand the need for many of the changes in this commit, it helps to 
take a step back and remember what we’re trying to do. We want to write to 
vectors, but control the resulting batch size.

#### Background

Writing to vectors is easy if we deal only with flat rows and don’t worry 
about batch size:

* Vectors provide `Mutator` classes that write to single vectors.
* A set of “legacy” vector writers is available and is used by some 
readers.
* Generated code uses `Mutator` and `Accessor` classes to work with vectors.
* The “Row Set” classes, used for testing, provide a refined column 
writer to populate batches.

The above are the easy parts. Some challenges include:

* None of the above limits batch memory; they limit only row count.
* Writing directly to vectors requires that the client deal with the 
complexity of tracking a common position across vectors.
* Drill’s tree structure makes everything more complex, as positions must 
be tracked across multiple repetition levels (see below).

The “result set loader” (along with the column writers) provides the next 
level of completeness by tackling the vector memory size problem for the 
entire set of Drill structured types.

#### Overflow and Rollover

The key trick is to handle vector “overflow” by seamlessly shifting 
writes, mid-row, from a full batch to a new “look-ahead” batch. The 
process of shifting data is called “rollover.”

To implement rollover, we need to work with two sets of vectors:

* The “active” set: the vectors “under” the writers and returned 
downstream.
* The “backup” set: holds the buffers not currently in use.

During an overflow event, buffers are shifted between the active and backup 
vectors:

* On overflow, the full buffers reside in the active vectors.
* After rollover, the full buffers reside in the backup vectors, and new 
buffers, now holding the in-flight row (called the “look-ahead” row), 
reside in the active vectors.
* When “harvesting” the full batch, the full buffers and the look-ahead 
buffers are exchanged, so the full buffers are back in the active vectors.
* When starting the next batch for writing, the look-ahead row is shifted 
from the backup vectors into the active vectors and the cycle repeats.
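
Conceptually, both the rollover and harvest steps reduce to exchanging buffers between the two vector sets, something like the following (illustrative only, not the actual Drill code):

```
import io.netty.buffer.DrillBuf;

public class RolloverSketch {
  // Swap the data buffers of the active and backup vectors so the full
  // batch can be harvested while the in-flight row remains writable.
  static void exchangeBuffers(DrillBuf[] active, DrillBuf[] backup) {
    for (int i = 0; i < active.length; i++) {
      DrillBuf temp = active[i];
      active[i] = backup[i];
      backup[i] = temp;
    }
  }
}
```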

#### Column and Vector State

When writing without overflow, we need only one set of vectors, and so the 
usual Drill vector container is sufficient. With overflow, we have two sets of 
vectors and must perform operations on them, so we need a place to store the 
data. This is the purpose of the “column state” and “vector state” 
classes.

Think of the overall result set loader structure as having three key 
parts:

* Result set loader: manages the entire batch
* Column writers: accept writes to vectors
* Column and vector states: manage the active and backup vector sets 
described above


---

[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure

2017-08-25 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/914
  
Merged a set of enhancements & bug fixes. Compared to the first set of 
commits:

* Provides tools to print the sometimes-complex tuple model, schema, and 
writer objects used for complex data types.
* Sharpens up the specification of the client API used to write data, 
especially for obscure cases such as abandoning rows, vector overflow inside 
deep structures, etc.
* Revises the internal “WriterEvents” API used to pass events down the 
writer hierarchy in response to the API revisions.
* Fully handles the case where a client chooses to “abandon” and 
rewrite a row (perhaps based on a filter condition).
* Makes a couple of minor vector revisions; `RepeatedMapVector` now allows 
building an instance from an existing offset vector (needed when harvesting 
batches with overflow).
* Adds more Javadoc.
* Adds many unit tests.
* Fixes issues found by the unit tests.
* Moves the vector cache into the result set loader layer.

Remaining open issues:

* Column projection
* Support for lists and repeated lists


---