[GitHub] drill issue #1238: DRILL-6281: Refactor TimedRunnable

2018-04-22 Thread vrozov
Github user vrozov commented on the issue:

https://github.com/apache/drill/pull/1238
  
@parthchandra Please review.

Note to a committer: please do *not* squash commits.


---


[GitHub] drill pull request #1238: DRILL-6281: Refactor TimedRunnable

2018-04-22 Thread vrozov
GitHub user vrozov opened a pull request:

https://github.com/apache/drill/pull/1238

 DRILL-6281: Refactor TimedRunnable



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vrozov/drill DRILL-6281

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1238.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1238


commit c523734a563c62f58ea1e7161ad366777ba62035
Author: Vlad Rozov 
Date:   2018-04-21T01:00:20Z

DRILL-6281: Refactor TimedRunnable (rename TimedRunnable to TimedCallable)

commit 63a483b393450e95d09911d756caf670f0a1fdb6
Author: Vlad Rozov 
Date:   2018-04-21T15:07:42Z

DRILL-6281: Refactor TimedRunnable




---


[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...

2018-04-22 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1228#discussion_r183264629
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -277,18 +286,29 @@ public boolean isRepeatedList() {
 /**
  * This is the average per entry width, used for vector allocation.
  */
-public int getEntryWidth() {
+private int getEntryWidthForAlloc() {
   int width = 0;
   if (isVariableWidth) {
-width = getNetSizePerEntry() - OFFSET_VECTOR_WIDTH;
+width = getAllocSizePerEntry() - OFFSET_VECTOR_WIDTH;
 
 // Subtract out the bits (is-set) vector width
-if (metadata.getDataMode() == DataMode.OPTIONAL) {
+if (isOptional) {
   width -= BIT_VECTOR_WIDTH;
 }
+
+if (isRepeated && getValueCount() == 0) {
+  return (safeDivide(width, STD_REPETITION_FACTOR));
--- End diff --

If the value count is zero, but the row count is non-zero, then a very low 
repetition rate is more realistic than 10. In earlier drafts, I found the 
repetition rate had to be a float since, in some data, the rate is something 
like 0.7 or 1.4. Rounding to an integer caused quite an error when multiplying 
by, say, 50K rows.

Any reason we can't use the actual computed amount here? If we really have 
1 or 2 rows, then a guess of 10 is fine. But, if we have 60K rows, with an 
actual estimate of 0, then guessing 10 will allocate 600K values when we 
probably needed close to 0. (Unless I'm missing something.)


---


[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...

2018-04-22 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1228#discussion_r183264768
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -277,18 +286,29 @@ public boolean isRepeatedList() {
 /**
  * This is the average per entry width, used for vector allocation.
  */
-public int getEntryWidth() {
+private int getEntryWidthForAlloc() {
   int width = 0;
   if (isVariableWidth) {
-width = getNetSizePerEntry() - OFFSET_VECTOR_WIDTH;
+width = getAllocSizePerEntry() - OFFSET_VECTOR_WIDTH;
 
 // Subtract out the bits (is-set) vector width
-if (metadata.getDataMode() == DataMode.OPTIONAL) {
+if (isOptional) {
   width -= BIT_VECTOR_WIDTH;
 }
+
+if (isRepeated && getValueCount() == 0) {
+  return (safeDivide(width, STD_REPETITION_FACTOR));
+}
   }
 
-  return (safeDivide(width, cardinality));
+  return (safeDivide(width, getEntryCardinalityForAlloc()));
+}
+
+/**
+ * This is the average per entry cardinality, used for vector 
allocation.
+ */
+private float getEntryCardinalityForAlloc() {
+  return getCardinality() == 0 ? (isRepeated ? STD_REPETITION_FACTOR : 
1) :getCardinality();
--- End diff --

I'm a bit curious: under what scenario do we want to allocate vectors given 
no input rows? Can we just wait to see data before we do our allocations? It is 
very hard to make reasonable estimates of future size based on no data...


---


[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...

2018-04-22 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1228#discussion_r183265234
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/record/TestRecordBatchSizer.java
 ---
@@ -45,7 +45,7 @@
   private void verifyColumnValues(ColumnSize column, int 
stdDataSizePerEntry, int stdNetSizePerEntry,
   int dataSizePerEntry, int 
netSizePerEntry, int totalDataSize,
   int totalNetSize, int valueCount, int 
elementCount,
-  int estElementCountPerArray, boolean 
isVariableWidth) {
+  int cardinality, boolean 
isVariableWidth) {
--- End diff --

Maybe explain the cardinality here since I've seen it (or have used it) in 
three different ways:

* Type cardinality: (Required: 1, Optional: (0, 1), Repeated: *)
* Batch cardinality: the number of rows in a batch. (Or, equivalently, the 
number of top-level values in a vector.)
* Array cardinality: the number of values in an array (a single top-level 
value which is repeated)

"Cardinality" is a good word, may just need disambiguation.


---


[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...

2018-04-22 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1228#discussion_r183264235
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -50,7 +50,7 @@
 public class RecordBatchSizer {
   private static final int OFFSET_VECTOR_WIDTH = UInt4Vector.VALUE_WIDTH;
   private static final int BIT_VECTOR_WIDTH = UInt1Vector.VALUE_WIDTH;
-  private static final int STD_REPETITION_FACTOR = 10;
+  public static final int STD_REPETITION_FACTOR = 10;
--- End diff --

This is another of those silly fudge factors that really have no meaning. 
The value of 10 came from the vector allocation code in `AllocationHelper` (or 
I thought it did, the magic number there is 5.)

Maybe move this to `AllocationHelper` and set it to 5, then use it here and 
in `AllocationHelper` so we use a consistent guess everywhere.


---


[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...

2018-04-22 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1228#discussion_r183264996
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -536,6 +556,11 @@ public ColumnSize getColumn(String name) {
*/
   private int netRowWidth;
   private int netRowWidthCap50;
+
+  /**
+   * actual row size if input is not empty. Otherwise, standard size.
+   */
+  private int rowAllocSize;
--- End diff --

I wonder if this all would be clearer if we handed it at size estimation 
time. If the row count is 0, set up everything using the standard sizes. (Note: 
the whole reason this class exists is that the standard sizes turned out to be 
*very* poor estimators of actual size.)

So, if we have no data, guess the same size as `AllocationHelper`, else use 
real sizes.

And, again the question: under what situation do we want to use the sizer 
if we don't actually have any data? For the first batch, if no data, just throw 
away the empty batch and don't size it. Turn around and get another until we 
receive a non-empty batch.

If we've already received at least one non-empty batch, then we receive an 
empty batch, we should just retain the estimates from the non-empty batch since 
they will be much better than just making up numbers.


---


[GitHub] drill pull request #1218: DRILL-6335: Refactor row set abstractions to prepa...

2018-04-22 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1218#discussion_r183262311
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/RowSetLoaderImpl.java
 ---
@@ -95,4 +96,10 @@ public void endBatch() {
 
   @Override
   public int rowCount() { return rsLoader.rowCount(); }
+
+  @Override
+  public ColumnMetadata schema() {
+// No column for the row tuple
+return null;
--- End diff --

Expanded comment. Let me know if it is still not clear.


---


[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-22 Thread sachouche
GitHub user sachouche opened a pull request:

https://github.com/apache/drill/pull/1237

DRILL-6348: Fixed code so that Unordered Receiver reports its memory …

Problem Description -
- The Unordered Receiver doesn't report any memory usage because the RPC 
infrastructure associated the allocated memory ownership to the minor fragment 
container

Proposed Fix -
- Modified the code to change the received DrillBuf memory ownership from 
the parent fragment to the operator (as discussed with @parthchandra)
- Made sure that the buffer accounting logic is preserved

@parthchandra, can you please review this pull request?

Thanks!

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sachouche/drill DRILL-6348

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1237.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1237


commit 9ab501a3fc3ebbda44c1478ad9db6711192b8ca1
Author: Salim Achouche 
Date:   2018-04-23T01:02:35Z

DRILL-6348: Fixed code so that Unordered Receiver reports its memory usage




---


[jira] [Created] (DRILL-6348) Unordered Receiver does not report its memory usage

2018-04-22 Thread salim achouche (JIRA)
salim achouche created DRILL-6348:
-

 Summary: Unordered Receiver does not report its memory usage
 Key: DRILL-6348
 URL: https://issues.apache.org/jira/browse/DRILL-6348
 Project: Apache Drill
  Issue Type: Task
  Components: Execution - Flow
Reporter: salim achouche
Assignee: salim achouche
 Fix For: 1.14.0


The Drill Profile functionality doesn't show any memory usage for the Unordered 
Receiver operator. This is problematic when analyzing OOM conditions since we 
cannot account for all of a query memory usage. This Jira is to fix memory 
reporting for the Unordered Receiver operator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)