[GitHub] drill issue #1238: DRILL-6281: Refactor TimedRunnable
Github user vrozov commented on the issue: https://github.com/apache/drill/pull/1238 @parthchandra Please review. Note to a committer: please do *not* squash commits. ---
[GitHub] drill pull request #1238: DRILL-6281: Refactor TimedRunnable
GitHub user vrozov opened a pull request: https://github.com/apache/drill/pull/1238 DRILL-6281: Refactor TimedRunnable You can merge this pull request into a Git repository by running: $ git pull https://github.com/vrozov/drill DRILL-6281 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1238.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1238 commit c523734a563c62f58ea1e7161ad366777ba62035 Author: Vlad Rozov Date: 2018-04-21T01:00:20Z DRILL-6281: Refactor TimedRunnable (rename TimedRunnable to TimedCallable) commit 63a483b393450e95d09911d756caf670f0a1fdb6 Author: Vlad Rozov Date: 2018-04-21T15:07:42Z DRILL-6281: Refactor TimedRunnable ---
[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1228#discussion_r183264629 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java --- @@ -277,18 +286,29 @@ public boolean isRepeatedList() { /** * This is the average per entry width, used for vector allocation. */ -public int getEntryWidth() { +private int getEntryWidthForAlloc() { int width = 0; if (isVariableWidth) { -width = getNetSizePerEntry() - OFFSET_VECTOR_WIDTH; +width = getAllocSizePerEntry() - OFFSET_VECTOR_WIDTH; // Subtract out the bits (is-set) vector width -if (metadata.getDataMode() == DataMode.OPTIONAL) { +if (isOptional) { width -= BIT_VECTOR_WIDTH; } + +if (isRepeated && getValueCount() == 0) { + return (safeDivide(width, STD_REPETITION_FACTOR)); --- End diff -- If the value count is zero, but the row count is non-zero, then a very low repetition rate is more realistic than 10. In earlier drafts, I found the repetition rate had to be a float since, in some data, the rate is something like 0.7 or 1.4. Rounding to an integer caused quite an error when multiplying by, say, 50K rows. Any reason we can't use the actual computed amount here? If we really have 1 or 2 rows, then a guess of 10 is fine. But, if we have 60K rows, with an actual estimate of 0, then guessing 10 will allocate 600K values when we probably needed close to 0. (Unless I'm missing something.) ---
[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1228#discussion_r183264768 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java --- @@ -277,18 +286,29 @@ public boolean isRepeatedList() { /** * This is the average per entry width, used for vector allocation. */ -public int getEntryWidth() { +private int getEntryWidthForAlloc() { int width = 0; if (isVariableWidth) { -width = getNetSizePerEntry() - OFFSET_VECTOR_WIDTH; +width = getAllocSizePerEntry() - OFFSET_VECTOR_WIDTH; // Subtract out the bits (is-set) vector width -if (metadata.getDataMode() == DataMode.OPTIONAL) { +if (isOptional) { width -= BIT_VECTOR_WIDTH; } + +if (isRepeated && getValueCount() == 0) { + return (safeDivide(width, STD_REPETITION_FACTOR)); +} } - return (safeDivide(width, cardinality)); + return (safeDivide(width, getEntryCardinalityForAlloc())); +} + +/** + * This is the average per entry cardinality, used for vector allocation. + */ +private float getEntryCardinalityForAlloc() { + return getCardinality() == 0 ? (isRepeated ? STD_REPETITION_FACTOR : 1) :getCardinality(); --- End diff -- I'm a bit curious: under what scenario do we want to allocate vectors given no input rows? Can we just wait to see data before we do our allocations? It is very hard to make reasonable estimates of future size based on no data... ---
[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1228#discussion_r183265234 --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/record/TestRecordBatchSizer.java --- @@ -45,7 +45,7 @@ private void verifyColumnValues(ColumnSize column, int stdDataSizePerEntry, int stdNetSizePerEntry, int dataSizePerEntry, int netSizePerEntry, int totalDataSize, int totalNetSize, int valueCount, int elementCount, - int estElementCountPerArray, boolean isVariableWidth) { + int cardinality, boolean isVariableWidth) { --- End diff -- Maybe explain the cardinality here since I've seen it (or have used it) in three different ways: * Type cardinality: (Required: 1, Optional: (0, 1), Repeated: *) * Batch cardinality: the number of rows in a batch. (Or, equivalently, the number of top-level values in a vector.) * Array cardinality: the number of values in an array (a single top-level value which is repeated) "Cardinality" is a good word, may just need disambiguation. ---
[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1228#discussion_r183264235 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java --- @@ -50,7 +50,7 @@ public class RecordBatchSizer { private static final int OFFSET_VECTOR_WIDTH = UInt4Vector.VALUE_WIDTH; private static final int BIT_VECTOR_WIDTH = UInt1Vector.VALUE_WIDTH; - private static final int STD_REPETITION_FACTOR = 10; + public static final int STD_REPETITION_FACTOR = 10; --- End diff -- This is another of those silly fudge factors that really have no meaning. The value of 10 came from the vector allocation code in `AllocationHelper` (or I thought it did, the magic number there is 5.) Maybe move this to `AllocationHelper` and set it to 5, then use it here and in `AllocationHelper` so we use a consistent guess everywhere. ---
[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1228#discussion_r183264996 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java --- @@ -536,6 +556,11 @@ public ColumnSize getColumn(String name) { */ private int netRowWidth; private int netRowWidthCap50; + + /** + * actual row size if input is not empty. Otherwise, standard size. + */ + private int rowAllocSize; --- End diff -- I wonder if this all would be clearer if we handed it at size estimation time. If the row count is 0, set up everything using the standard sizes. (Note: the whole reason this class exists is that the standard sizes turned out to be *very* poor estimators of actual size.) So, if we have no data, guess the same size as `AllocationHelper`, else use real sizes. And, again the question: under what situation do we want to use the sizer if we don't actually have any data? For the first batch, if no data, just throw away the empty batch and don't size it. Turn around and get another until we receive a non-empty batch. If we've already received at least one non-empty batch, then we receive an empty batch, we should just retain the estimates from the non-empty batch since they will be much better than just making up numbers. ---
[GitHub] drill pull request #1218: DRILL-6335: Refactor row set abstractions to prepa...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1218#discussion_r183262311 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/RowSetLoaderImpl.java --- @@ -95,4 +96,10 @@ public void endBatch() { @Override public int rowCount() { return rsLoader.rowCount(); } + + @Override + public ColumnMetadata schema() { +// No column for the row tuple +return null; --- End diff -- Expanded comment. Let me know if it is still not clear. ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
GitHub user sachouche opened a pull request: https://github.com/apache/drill/pull/1237 DRILL-6348: Fixed code so that Unordered Receiver reports its memory ⦠Problem Description - - The Unordered Receiver doesn't report any memory usage because the RPC infrastructure associated the allocated memory ownership to the minor fragment container Proposed Fix - - Modified the code to change the received DrillBuf memory ownership from the parent fragment to the operator (as discussed with @parthchandra) - Made sure that the buffer accounting logic is preserved @parthchandra, can you please review this pull request? Thanks! You can merge this pull request into a Git repository by running: $ git pull https://github.com/sachouche/drill DRILL-6348 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1237.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1237 commit 9ab501a3fc3ebbda44c1478ad9db6711192b8ca1 Author: Salim Achouche Date: 2018-04-23T01:02:35Z DRILL-6348: Fixed code so that Unordered Receiver reports its memory usage ---
[jira] [Created] (DRILL-6348) Unordered Receiver does not report its memory usage
salim achouche created DRILL-6348: - Summary: Unordered Receiver does not report its memory usage Key: DRILL-6348 URL: https://issues.apache.org/jira/browse/DRILL-6348 Project: Apache Drill Issue Type: Task Components: Execution - Flow Reporter: salim achouche Assignee: salim achouche Fix For: 1.14.0 The Drill Profile functionality doesn't show any memory usage for the Unordered Receiver operator. This is problematic when analyzing OOM conditions since we cannot account for all of a query memory usage. This Jira is to fix memory reporting for the Unordered Receiver operator. -- This message was sent by Atlassian JIRA (v7.6.3#76005)