Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/914#discussion_r139296614 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/package-info.java --- @@ -0,0 +1,295 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +/** + * Handles the details of the result set loader implementation. + * <p> + * The primary purpose of this loader, and the most complex to understand and + * maintain, is overflow handling. + * + * <h4>Detailed Use Cases</h4> + * + * Let's examine it by considering a number of + * use cases. + * <table style="border: 1px solid; border-collapse: collapse;"> + * <tr><th>Row</th><th>a</th><th>b</th><th>c</th><th>d</th><th>e</th><th>f</th><th>g</th><th>h</th></tr> + * <tr><td>n-2</td><td>X</td><td>X</td><td>X</td><td>X</td><td>X</td><td>X</td><td>-</td><td>-</td></tr> + * <tr><td>n-1</td><td>X</td><td>X</td><td>X</td><td>X</td><td> </td><td> </td><td>-</td><td>-</td></tr> + * <tr><td>n </td><td>X</td><td>!</td><td>O</td><td> </td><td>O</td><td> </td><td>O</td><td> </td></tr> + * </table> + * Here: + * <ul> + * <li>n-2, n-1, and n are rows. 
n is the overflow row.</li> + * <li>X indicates a value was written before overflow.</li> + * <li>Blank indicates no value was written in that row.</li> + * <li>! indicates the value that triggered overflow.</li> + * <li>- indicates a column that did not exist prior to overflow.</li> + * </ul> + * Column a is written before overflow occurs, b causes overflow, and all other + * columns either are not written, or written after overflow. + * <p> + * The scenarios, identified by column names above, are: + * <dl> + * <dt>a</dt> + * <dd>a contains values for all three rows. + * <ul> + * <li>Two values were written in the "main" batch, while a third was written to + * what becomes the overflow row.</li> + * <li>When overflow occurs, the last write position is at n. It must be moved + * back to n-1.</li> + * <li>Since data was written to the overflow row, it is copied to the + * lookahead batch.</li> + * <li>The last write position in the lookahead batch is 0 (since data was + * copied into the 0th row).</li> + * <li>When harvesting, no empty-filling is needed.</li> + * <li>When starting the next batch, the last write position must be set to 0 to + * reflect the presence of the value for row n.</li> + * </ul> + * </dd> + * <dt>b</dt> + * <dd>b contains values for all three rows. The value for row n triggers + * overflow. + * <ul> + * <li>The last write position is at n-1, which is kept for the "main" + * vector.</li> + * <li>A new overflow vector is created and starts empty, with the last write + * position at -1.</li> + * <li>Once created, b is immediately written to the overflow vector, advancing + * the last write position to 0.</li> + * <li>Harvesting and starting the next batch for column b work the same as for + * column a.</li> + * </ul> + * </dd> + * <dt>c</dt> + * <dd>Column c has values for all rows.
+ * <ul> + * <li>The value for row n is written after overflow.</li> + * <li>At overflow, the last write position is at n-1.</li> + * <li>At overflow, a new lookahead vector is created with the last write + * position at -1.</li> + * <li>The value of c is written to the lookahead vector, advancing the last + * write position to 0.</li> + * <li>Harvesting and starting the next batch for column c work the same as for + * column a.</li> + * </ul> + * </dd> + * <dt>d</dt> + * <dd>Column d writes values to the last two rows before overflow, but not to + * the overflow row. + * <ul> + * <li>The last write position for the main batch is at n-1.</li> + * <li>The last write position in the lookahead batch remains at -1.</li> + * <li>Harvesting for column d requires filling an empty value for row n-1.</li> + * <li>When starting the next batch, the last write position must be set to -1, + * indicating no data yet written.</li> + * </ul> + * </dd> + * <dt>f</dt> + * <dd>Column f has no data in the last position of the main batch, and no data + * in the overflow row. + * <ul> + * <li>The last write position is at n-2.</li> + * <li>An empty value must be written into position n-1 during harvest.</li> + * <li>On start of the next batch, the last write position starts at -1.</li> + * </ul> + * </dd> + * <dt>g</dt> + * <dd>Column g is added after overflow, and has a value written to the overflow + * row. + * <ul> + * <li>On harvest, column g is simply skipped.</li> + * <li>On start of the next row, the last write position can be left unchanged + * since no "exchange" was done.</li> + * </ul> + * </dd> + * <dt>h</dt> + * <dd>Column h is added after overflow, but does not have data written to it + * during the overflow row.
Similar to column g, but the last write position + * starts at -1 for the next batch.</dd> + * </dl> + * + * <h4>General Rules</h4> + * + * The above can be summarized into a smaller set of rules: + * <p> + * At the time of overflow on row n: + * <ul> + * <li>Create or clear the lookahead vector.</li> + * <li>Copy (last write position - n) values from row n in the old vector to 0 --- End diff -- Fixed.
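For readers following the quoted Javadoc, the last-write-position bookkeeping for each scenario can be sketched roughly as below. This is only an illustration under stated assumptions: `OverflowSketch`, `ColumnState`, `rollOver`, `writeOverflowRow`, and `simulate` are hypothetical names made up for this example and are not classes or methods from the Drill code base. Column e is left out because the quoted text does not describe it, empty-filling at harvest is not modeled, and a column with no main-batch data is represented here by a main position of -1.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class OverflowSketch {

  /** Hypothetical per-column bookkeeping; not a Drill class. */
  static final class ColumnState {
    int mainLastWrite;            // last write position in the main batch
    int lookaheadLastWrite = -1;  // -1: nothing written to the lookahead vector

    ColumnState(int mainLastWrite) {
      this.mainLastWrite = mainLastWrite;
    }

    /** Applies the rollover step when overflow occurs on row n. */
    void rollOver(int n) {
      lookaheadLastWrite = -1;    // create or clear the lookahead vector
      if (mainLastWrite == n) {
        // Row n was already written (column a): the value moves to
        // lookahead row 0 and the main position backs up to n-1.
        mainLastWrite = n - 1;
        lookaheadLastWrite = 0;
      }
    }

    /** Writes this column's value for the overflow row into lookahead row 0. */
    void writeOverflowRow() {
      lookaheadLastWrite = 0;
    }
  }

  /** Replays columns a, b, c, d, f, g, h; returns {main, lookahead} positions. */
  static Map<String, int[]> simulate(int n) {
    Map<String, ColumnState> cols = new LinkedHashMap<>();
    cols.put("a", new ColumnState(n));      // wrote row n before overflow
    cols.put("b", new ColumnState(n - 1));  // its write to row n triggers overflow
    cols.put("c", new ColumnState(n - 1));  // writes row n after overflow
    cols.put("d", new ColumnState(n - 1));  // never writes row n
    cols.put("f", new ColumnState(n - 2));  // skipped row n-1, never writes row n

    for (ColumnState c : cols.values()) {
      c.rollOver(n);
    }
    cols.get("b").writeOverflowRow();
    cols.get("c").writeOverflowRow();

    // g and h are created after overflow, so they have no main-batch data
    // (modeled here as -1) and exist only in the lookahead batch.
    ColumnState g = new ColumnState(-1);
    g.writeOverflowRow();
    cols.put("g", g);
    cols.put("h", new ColumnState(-1));

    Map<String, int[]> result = new LinkedHashMap<>();
    for (Map.Entry<String, ColumnState> e : cols.entrySet()) {
      ColumnState c = e.getValue();
      result.put(e.getKey(), new int[] {c.mainLastWrite, c.lookaheadLastWrite});
    }
    return result;
  }

  public static void main(String[] args) {
    simulate(10).forEach((name, p) ->
        System.out.println(name + ": main=" + p[0] + ", lookahead=" + p[1]));
  }
}
```

In this sketch, columns a, b, and c all end with a lookahead position of 0, while d, f, and h stay at -1, which matches the positions listed for the next batch in the scenarios above.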
---