[GitHub] drill pull request #723: DRILL-5207: Improve Parquet Scan pipelining.

2017-01-27 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/723#discussion_r98288738
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java
 ---
@@ -426,6 +459,7 @@ public void setup(OperatorContext operatorContext, 
OutputMutator output) throws
   }
 } catch (Exception e) {
   handleAndRaise("Failure in setting up reader", e);
+} finally {
--- End diff --

yes. removed





[GitHub] drill pull request #723: DRILL-5207: Improve Parquet Scan pipelining.

2017-01-27 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/723#discussion_r98288632
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java
 ---
@@ -399,9 +430,11 @@ public void setup(OperatorContext operatorContext, 
OutputMutator output) throws
 getTypeLengthInBits(column.getType()), -1, column, 
columnChunkMetaData, false, repeatedVector, schemaElement));
   }
   else {
-
columnStatuses.add(ColumnReaderFactory.createFixedColumnReader(this, 
fieldFixedLength,
+
+   ColumnReader cr = 
ColumnReaderFactory.createFixedColumnReader(this, fieldFixedLength,
--- End diff --

Once again, an artifact of some logging I put in and then removed. Actually, I 
find this easier to read. Can I keep it this way? Please?




[GitHub] drill pull request #723: DRILL-5207: Improve Parquet Scan pipelining.

2017-01-27 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/723#discussion_r98289161
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/filereader/BufferedDirectBufInputStream.java
 ---
@@ -179,10 +189,10 @@ private int getNextBlock() throws IOException {
 this.curPosInStream = getInputStream().getPos();
 bytesRead = nBytes;
 logger.trace(
--- End diff --

Sure. 




[GitHub] drill pull request #723: DRILL-5207: Improve Parquet Scan pipelining.

2017-01-27 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/723#discussion_r98288871
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/filereader/BufferedDirectBufInputStream.java
 ---
@@ -43,7 +45,7 @@
 
   private static final int DEFAULT_BUFFER_SIZE = 8192 * 1024; // 8 MiB
   private static final int DEFAULT_TEMP_BUFFER_SIZE = 8192; // 8 KiB
-  private static final int SMALL_BUFFER_SIZE = 64 * 1024; // 64 KiB
+  private static final int SMALL_BUFFER_SIZE = 256 * 1024; // 64 KiB
--- End diff --

done




[GitHub] drill pull request #723: DRILL-5207: Improve Parquet Scan pipelining.

2017-01-27 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/723#discussion_r98268266
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java
 ---
@@ -41,26 +42,33 @@
 import java.io.IOException;
 import java.nio.ByteBuffer;
 import java.util.concurrent.Callable;
+import java.util.concurrent.ConcurrentLinkedQueue;
 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Future;
+import java.util.concurrent.LinkedBlockingQueue;
 import java.util.concurrent.TimeUnit;
 
 import static org.apache.parquet.column.Encoding.valueOf;
 
 class AsyncPageReader extends PageReader {
   static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(AsyncPageReader.class);
 
-
   private ExecutorService threadPool;
-  private Future asyncPageRead;
+  private long queueSize;
+  private LinkedBlockingQueue pageQueue;
+  private ConcurrentLinkedQueue asyncPageRead;
--- End diff --

Added a comment. Hopefully it makes sense.




[GitHub] drill pull request #723: DRILL-5207: Improve Parquet Scan pipelining.

2017-01-27 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/723#discussion_r98288225
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java
 ---
@@ -197,6 +213,21 @@ public ParquetRecordReader(
   assert (numRecordsToRead >= 0);
   this.numRecordsToRead = Math.min(numRecordsToRead, 
footer.getBlocks().get(rowGroupIndex).getRowCount());
 }
+useAsyncColReader =
+
fragmentContext.getOptions().getOption(ExecConstants.PARQUET_COLUMNREADER_ASYNC).bool_val;
--- End diff --

That sort of does the same thing after several function calls, which seems a 
waste. Plus you have to cast the option validator to the right type.
Mind if I leave it as it is?




[GitHub] drill pull request #723: DRILL-5207: Improve Parquet Scan pipelining.

2017-01-27 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/723#discussion_r98269966
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java
 ---
@@ -109,17 +122,30 @@ private DrillBuf getDecompressedPageData(ReadStatus 
readStatus) {
   }
 
   // Read and decode the dictionary and the header
-  private void readDictionaryPage(final Future asyncPageRead,
+  private void readDictionaryPage(/*final Future 
asyncPageRead,*/
--- End diff --

OK




[GitHub] drill pull request #723: DRILL-5207: Improve Parquet Scan pipelining.

2017-01-27 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/723#discussion_r98271891
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java
 ---
@@ -282,89 +356,119 @@ public synchronized void setValuesRead(long 
valuesRead) {
   this.valuesRead = valuesRead;
 }
 
-public long getDiskScanTime() {
+public synchronized long getDiskScanTime() {
   return diskScanTime;
 }
 
-public void setDiskScanTime(long diskScanTime) {
+public synchronized void setDiskScanTime(long diskScanTime) {
   this.diskScanTime = diskScanTime;
 }
-  }
 
+  }
 
-  private class AsyncPageReaderTask implements Callable {
+  private class AsyncPageReaderTask implements Callable {
 
 private final AsyncPageReader parent = AsyncPageReader.this;
+private final LinkedBlockingQueue queue;
+private final String name;
 
-public AsyncPageReaderTask() {
+public AsyncPageReaderTask(String name, 
LinkedBlockingQueue queue) {
+  this.name = name;
+  this.queue = queue;
 }
 
-@Override public ReadStatus call() throws IOException {
+@Override
+public Boolean call() throws IOException {
   ReadStatus readStatus = new ReadStatus();
 
-  String oldname = Thread.currentThread().getName();
-  String name = 
parent.parentColumnReader.columnChunkMetaData.toString();
-  Thread.currentThread().setName(name);
-
   long bytesRead = 0;
   long valuesRead = 0;
+  final long totalValuesRead = parent.totalPageValuesRead;
   Stopwatch timer = Stopwatch.createStarted();
 
+  final long totalValuesCount = 
parent.parentColumnReader.columnChunkMetaData.getValueCount();
+
+  // if we are done, just put a marker object in the queue and we are 
done.
+  logger.trace("[{}]: Total Values COUNT {}  Total Values READ {} ", 
name, totalValuesCount, totalValuesRead);
+  if (totalValuesRead >= totalValuesCount) {
+try {
+  queue.put(ReadStatus.EMPTY);
+} catch (InterruptedException e) {
+  Thread.currentThread().interrupt();
+  // Do nothing.
+}
+return true;
+  }
+
   DrillBuf pageData = null;
+  timer.reset();
   try {
 long s = parent.dataReader.getPos();
 PageHeader pageHeader = Util.readPageHeader(parent.dataReader);
-long e = parent.dataReader.getPos();
-if (logger.isTraceEnabled()) {
-  logger.trace("[{}]: Read Page Header : ReadPos = {} : Bytes Read 
= {} ", name, s, e - s);
-}
+//long e = parent.dataReader.getPos();
--- End diff --

I kept it there because it is useful for debugging. I commented it out because I 
added logging at the trace level for performance measurement, and this was 
creating a lot of noise in the logs. I'd like to keep it around in case we ever 
need it again.




[GitHub] drill pull request #723: DRILL-5207: Improve Parquet Scan pipelining.

2017-01-27 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/723#discussion_r98270983
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java
 ---
@@ -176,12 +202,29 @@ private DrillBuf decompress(PageHeader pageHeader, 
DrillBuf compressedData) {
 return pageDataBuf;
   }
 
-  @Override protected void nextInternal() throws IOException {
+  @Override
+  protected void nextInternal() throws IOException {
 ReadStatus readStatus = null;
+String name = parentColumnReader.columnChunkMetaData.toString();
 try {
   Stopwatch timer = Stopwatch.createStarted();
-  readStatus = asyncPageRead.get();
+  
parentColumnReader.parentReader.getOperatorContext().getStats().startWait();
+  Future f = asyncPageRead.poll();
+  Boolean b = f.get(); // get the result of execution
--- End diff --

OK




[GitHub] drill pull request #723: DRILL-5207: Improve Parquet Scan pipelining.

2017-01-27 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/723#discussion_r98269034
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java
 ---
@@ -41,26 +42,33 @@
 import java.io.IOException;
 import java.nio.ByteBuffer;
 import java.util.concurrent.Callable;
+import java.util.concurrent.ConcurrentLinkedQueue;
 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Future;
+import java.util.concurrent.LinkedBlockingQueue;
 import java.util.concurrent.TimeUnit;
 
 import static org.apache.parquet.column.Encoding.valueOf;
 
 class AsyncPageReader extends PageReader {
   static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(AsyncPageReader.class);
 
-
   private ExecutorService threadPool;
-  private Future asyncPageRead;
+  private long queueSize;
+  private LinkedBlockingQueue pageQueue;
+  private ConcurrentLinkedQueue asyncPageRead;
+  private long totalPageValuesRead = 0;
 
   AsyncPageReader(ColumnReader parentStatus, FileSystem fs, Path path,
   ColumnChunkMetaData columnChunkMetaData) throws 
ExecutionSetupException {
 super(parentStatus, fs, path, columnChunkMetaData);
-if (threadPool == null) {
+if (threadPool == null & asyncPageRead == null) {
--- End diff --

Thanks for catching this!




[GitHub] drill pull request #723: DRILL-5207: Improve Parquet Scan pipelining.

2017-01-27 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/723#discussion_r98271038
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java
 ---
@@ -192,45 +235,74 @@ private DrillBuf decompress(PageHeader pageHeader, 
DrillBuf compressedData) {
 stats.timeDataPageLoads.addAndGet(timeBlocked + 
readStatus.getDiskScanTime());
   }
   pageHeader = readStatus.getPageHeader();
-  // reset this. At the time of calling close, if this is not null 
then a pending asyncPageRead needs to be consumed
-  asyncPageRead = null;
-} catch (Exception e) {
-  handleAndThrowException(e, "Error reading page data.");
-}
 
 // TODO - figure out if we need multiple dictionary pages, I believe 
it may be limited to one
 // I think we are clobbering parts of the dictionary if there can be 
multiple pages of dictionary
 
-do {
-  if (pageHeader.getType() == PageType.DICTIONARY_PAGE) {
-readDictionaryPageData(readStatus, parentColumnReader);
-// Ugly. Use the Async task to make a synchronous read call.
-readStatus = new AsyncPageReaderTask().call();
-pageHeader = readStatus.getPageHeader();
-  }
-} while (pageHeader.getType() == PageType.DICTIONARY_PAGE);
-
-if (parentColumnReader.totalValuesRead + readStatus.getValuesRead()
-< parentColumnReader.columnChunkMetaData.getValueCount()) {
-  asyncPageRead = threadPool.submit(new AsyncPageReaderTask());
-}
+  do {
+if (pageHeader.getType() == PageType.DICTIONARY_PAGE) {
+  readDictionaryPageData(readStatus, parentColumnReader);
+  asyncPageRead.poll().get(); // get the result of execution
+  synchronized (pageQueue) {
+boolean pageQueueFull = pageQueue.remainingCapacity() == 0;
+readStatus = pageQueue.take(); // get the data if no exception 
has been thrown
+if (readStatus.pageData == null || readStatus == 
ReadStatus.EMPTY) {
+  break;
+}
+//if the queue was full before we took a page out, then there 
would
+// have been no new read tasks scheduled. In that case, 
schedule a new read.
+if (pageQueueFull) {
+  asyncPageRead.offer(threadPool.submit(new 
AsyncPageReaderTask(debugName, pageQueue)));
+}
+  }
+  assert (readStatus.pageData != null);
+  pageHeader = readStatus.getPageHeader();
+}
+  } while (pageHeader.getType() == PageType.DICTIONARY_PAGE);
 
 pageHeader = readStatus.getPageHeader();
 pageData = getDecompressedPageData(readStatus);
-
+assert(pageData != null);
+} catch (InterruptedException e) {
+  Thread.currentThread().interrupt();
+} catch (Exception e){
+  handleAndThrowException(e, "Error reading page data");
+}
 
   }
 
-
   @Override public void clear() {
-if (asyncPageRead != null) {
+while (asyncPageRead != null && !asyncPageRead.isEmpty()) {
   try {
-final ReadStatus readStatus = asyncPageRead.get();
-readStatus.getPageData().release();
+Future f = asyncPageRead.poll();
+if(!f.isDone() && !f.isCancelled()){
+  f.cancel(true);
+} else {
+  Boolean b = f.get(1, TimeUnit.MILLISECONDS);
--- End diff --

OK
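
A stand-alone sketch of the pipelining pattern the diff above implements: a 
bounded queue holds completed page reads while a queue of Futures tracks the 
read tasks still in flight. This is illustrative only, not Drill code; for 
simplicity it schedules the next read after every page consumed rather than 
only when the queue was full, as the patch does, so the bounded queue alone 
supplies the back-pressure.

    import java.util.concurrent.Callable;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.LinkedBlockingQueue;

    public class PipelineSketch {
      private static final String EOF = "EOF"; // marker entry, like ReadStatus.EMPTY

      public static void main(String[] args) throws Exception {
        final int lastTask = 5;                // task 5 produces the EOF marker
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Bounded queue of completed "page reads"; its capacity limits read-ahead.
        final LinkedBlockingQueue<String> pageQueue = new LinkedBlockingQueue<>(2);
        // Futures for read tasks that have been submitted but not yet consumed.
        ConcurrentLinkedQueue<Future<Boolean>> inFlight = new ConcurrentLinkedQueue<>();

        class ReadTask implements Callable<Boolean> {
          private final int pageNum;
          ReadTask(int pageNum) { this.pageNum = pageNum; }
          @Override public Boolean call() throws Exception {
            // put() blocks when the queue is full, so the producer cannot run ahead.
            pageQueue.put(pageNum < lastTask ? "page-" + pageNum : EOF);
            return true;
          }
        }

        // Initial read-ahead of three pages.
        int next = 0;
        while (next < 3) {
          inFlight.offer(pool.submit(new ReadTask(next++)));
        }

        while (true) {
          inFlight.poll().get();               // surfaces any exception from the read task
          String page = pageQueue.take();
          if (EOF.equals(page)) { break; }
          System.out.println("consumed " + page);
          if (next <= lastTask) {              // keep the pipeline primed
            inFlight.offer(pool.submit(new ReadTask(next++)));
          }
        }
        pool.shutdownNow();
      }
    }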




[GitHub] drill pull request #723: DRILL-5207: Improve Parquet Scan pipelining.

2017-01-27 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/723#discussion_r98268204
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java
 ---
@@ -41,26 +42,33 @@
 import java.io.IOException;
 import java.nio.ByteBuffer;
 import java.util.concurrent.Callable;
+import java.util.concurrent.ConcurrentLinkedQueue;
 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Future;
+import java.util.concurrent.LinkedBlockingQueue;
 import java.util.concurrent.TimeUnit;
 
 import static org.apache.parquet.column.Encoding.valueOf;
 
 class AsyncPageReader extends PageReader {
   static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(AsyncPageReader.class);
 
-
   private ExecutorService threadPool;
-  private Future asyncPageRead;
+  private long queueSize;
+  private LinkedBlockingQueue pageQueue;
+  private ConcurrentLinkedQueue asyncPageRead;
--- End diff --

I had a return value of false on failure, but then I changed that to throw 
exceptions and left the boolean return value behind. Will change that.




[GitHub] drill pull request #723: DRILL-5207: Improve Parquet Scan pipelining.

2017-01-27 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/723#discussion_r98289478
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/filereader/DirectBufInputStream.java
 ---
@@ -81,7 +87,15 @@ public synchronized int read(DrillBuf buf, int off, int 
len) throws IOException
 ByteBuffer directBuffer = buf.nioBuffer(0, len);
 int lengthLeftToRead = len;
 while (lengthLeftToRead > 0) {
-  lengthLeftToRead -= CompatibilityUtil.getBuf(getInputStream(), 
directBuffer, lengthLeftToRead);
+  logger.trace("PERF: Disk read start. {}, StartOffset: {}, 
TotalByteSize: {}", this.streamId,
+  this.startOffset, this.totalByteSize);
+  Stopwatch timer = Stopwatch.createStarted();
+  int bytesRead = CompatibilityUtil.getBuf(getInputStream(), 
directBuffer, lengthLeftToRead);
+  lengthLeftToRead -= bytesRead;
+  logger.trace(
--- End diff --

Done




Re: Integration with Spark

2017-01-27 Thread Hanifi GUNES
I authored the initial Spark(DoS) plugin, but it was never released due to
priorities. The initial implementation allows full-duplex data exchange
between Drill and Spark. That is, you can use Drill to query your data lake
and do further iterative ML on it via Spark, or vice versa.

I will need to rework it in order to bring it to publishable quality. Let me
know if you are interested in contributing.


-Hanifi

2016-05-27 11:23 GMT-07:00 Zhenrui(Jerry) Zhang <
zhenrui.zh...@salesforce.com>:

> Hi,
>
> Does anyone have any updates on the integration with Spark? The feature
> mentioned in https://drill.apache.org/blog/2014/12/16/whats-coming-in-
> 2015/
> and
> http://www.slideshare.net/SparkSummit/adding-complex-
> data-to-spark-stackneeraja-rentachintala
> ? Also there is an issue opened in JIRA(
> https://issues.apache.org/jira/browse/DRILL-3184) for this.
>
> Thanks,
> Jerry
>


[jira] [Created] (DRILL-5230) Translation of millisecond duration into hours is incorrect

2017-01-27 Thread Kunal Khatua (JIRA)
Kunal Khatua created DRILL-5230:
---

 Summary: Translation of millisecond duration into hours is 
incorrect
 Key: DRILL-5230
 URL: https://issues.apache.org/jira/browse/DRILL-5230
 Project: Apache Drill
  Issue Type: Bug
  Components: Web Server
Reporter: Kunal Khatua
Assignee: Kunal Khatua


The method 
{code:JAVA}org.apache.drill.exec.server.rest.profile.TableBuilder.appendMillis(long,
 String){code}
has a bug where the human readable translation of a 1+ hr duration in 
milliseconds is reported incorrectly. 
This has to do with the {code:JAVA}SimpleDateFormat.format(){code} method 
incorrectly translating it.

For example:
{code:JAVA}
long x = 4545342L; // 1 hour 15 min 45.342 sec
appendMillis(x, null);
{code}
This formats the value as {noformat}17h15m{noformat} instead of 
{noformat}1h15m{noformat}.
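
For reference, a self-contained sketch of the underlying behavior (illustrative 
only, not the TableBuilder code; the format pattern and time zone below are 
assumptions for the demonstration). SimpleDateFormat treats the long as an epoch 
timestamp, so the default time zone offset leaks into the hour field, whereas 
deriving the fields arithmetically is time-zone independent:
{code:JAVA}
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;
import java.util.concurrent.TimeUnit;

public class DurationFormatDemo {
  public static void main(String[] args) {
    long millis = 4545342L; // 1 hour 15 min 45.342 sec

    // Formatting the duration as if it were an epoch timestamp applies the
    // time zone offset, shifting the hour field.
    SimpleDateFormat sdf = new SimpleDateFormat("HH'h'mm'm'");
    sdf.setTimeZone(TimeZone.getTimeZone("America/Los_Angeles")); // UTC-8 in January
    System.out.println(sdf.format(new Date(millis)));             // prints 17h15m

    // Deriving the fields arithmetically gives the expected result.
    long hours = TimeUnit.MILLISECONDS.toHours(millis);
    long minutes = TimeUnit.MILLISECONDS.toMinutes(millis) % 60;
    System.out.println(hours + "h" + minutes + "m");               // prints 1h15m
  }
}
{code}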







[GitHub] drill pull request #704: DRILL-5125: Provide option to use generic code for ...

2017-01-27 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/704#discussion_r98299585
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/svremover/TestSVRemover.java
 ---
@@ -34,4 +38,33 @@ public void testSVRWithNoFilter() throws Exception {
 int numOutputRecords = 
testPhysical(getFile("remover/sv_with_no_filter.json"));
 assertEquals(100, numOutputRecords);
   }
+
+  /**
+   * Test the generic version of the selection vector remover copier
+   * class. The code uses the traditional generated version by default.
+   * This test sets the option to use the generic version, then runs
+   * a query that exercises that version.
+   * 
+   * Note that the tests here exercise only the SV2 version of the
+   * selection remover; no tests exist for the SV4 version.
+   */
+
+  // TODO: Add an SV4 test once the improved mock data generator
+  // is available.
+
+  @Test
+  public void testGenericCopier() throws Exception {
+// TODO: replace this with new setup once revised test framework
+// is available.
+Properties config = new Properties( );
+config.put(ExecConstants.SYS_STORE_PROVIDER_LOCAL_ENABLE_WRITE, 
"false");
+config.put(ExecConstants.HTTP_ENABLE, "false");
+config.put(ExecConstants.REMOVER_ENABLE_GENERIC_COPIER, "true");
+updateTestCluster(1, DrillConfig.create(config));
+
+int numOutputRecords = testPhysical(getFile("remover/test1.json"));
+assertEquals(50, numOutputRecords);
+numOutputRecords = 
testPhysical(getFile("remover/sv_with_no_filter.json"));
+assertEquals(100, numOutputRecords);
+  }
 }
--- End diff --

Do these tests cover all the value vector types ?




[GitHub] drill pull request #704: DRILL-5125: Provide option to use generic code for ...

2017-01-27 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/704#discussion_r97992610
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/svremover/GenericSV4Copier.java
 ---
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.svremover;
+
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+
+/**
+ * Generic selection vector 4 copier implementation that can
+ * be used in place of the generated version. Relies on a
+ * virtual function in each value vector to choose the proper
+ * implementation. Tests suggest that this version performs
+ * better than the generated version for queries with many columns.
+ */
+
+public class GenericSV4Copier extends CopierTemplate4 {
+
+  private ValueVector[] vvOut;
+  private ValueVector[][] vvIn;
+
+  @SuppressWarnings("unused")
+  @Override
+  public void doSetup(FragmentContext context, RecordBatch incoming,
+  RecordBatch outgoing) {
--- End diff --

fix alignment




Dynamic UDF Registration

2017-01-27 Thread Charles Givre
I’m having some trouble registering a UDF I wrote.  The UDF was working in 
previous versions of Drill and now Drill doesn’t seem to recognize it.  When I 
try to register it I get the following error:

: jdbc:drill:zk=local> create function using jar 
'drill-geoip-functions-1.0.jar';
+--------+---------------------------------------------------------------+
|   ok   |                            summary                            |
+--------+---------------------------------------------------------------+
| false  | Jar drill-geoip-functions-1.0.jar does not contain functions  |
+--------+---------------------------------------------------------------+

I do have a drill-module.conf file and here are the contents:
drill.classpath.scanning.packages += "org.apache.drill.contrib.function"
I tried moving this file around and I put a copy in the /src folder as well as 
the /src/main folder to no avail.  Any suggestions?
Thanks,
— Charles







Re: Data types

2017-01-27 Thread Charles Givre
I’m actually one of the contributors to the forthcoming O’Reilly book on Drill 
(along with Ted and Ellen), and this is specific functionality I’m planning 
to write a chapter about. (Not the buffers, but how to get Drill to ingest 
other file formats.)



 
> On Jan 27, 2017, at 11:50, Paul Rogers  wrote:
> 
> Hi Charles,
> 
> Congrats! Unfortunately, no, there is no documentation. Drill seems to be of 
> the “code speaks for itself” persuasion. I try to document the bits I’ve had 
> to learn on my Github Wiki, but (until now) I’ve not looked at this 
> particular area.
> 
> IMHO, now that the plugins basically work, the API could use a good scrubbing 
> to make it simpler, easier to document, and easier to use. As it is, you have 
> to be an expert on Drill internals to understand all the little knick-knacks 
> that have to be in your code to make various Drill subsystems happy.
> 
> That said, perhaps you can use your own Git Wiki to document what you’ve 
> learned so that we capture that for the next plugin developer.
> 
> Thanks,
> 
> - Paul
> 
>> On Jan 27, 2017, at 8:42 AM, Charles Givre  wrote:
>> 
>> Hi Paul,
>> VICTORY!!  I just set the buffer size to 4096 and it worked perfectly 
>> without truncating my data! 
>> Is this documented anywhere?  I’ve been trying to really wrap my head around 
>> the mechanics of how Drill reads data and how the format plugins work and 
>> really haven’t found much.  I’ve hacked together a few other plugins like 
>> this—which work—but if I could find some docs, that would be great.
>> Thanks,
>> — Charles
>> 
>> 
>> 
>>> On Jan 27, 2017, at 02:11, Paul Rogers  wrote:
>>> 
>>> Looks like I gave you advice that was a bit off. The function you want is 
>>> either:
>>> 
>>>  this.buffer = fragmentContext.getManagedBuffer();
>>> 
>>> The above allocates a 256 byte buffer. You can initially allocate a larger 
>>> one:
>>> 
>>>  this.buffer = fragmentContext.getManagedBuffer(4096);
>>> 
>>> Or, to reallocate:
>>> 
>>> buffer = fragmentContext.replace(buffer, 8192);
>>> 
>>> Again, I’ve not used these methods myself, but they seem like they might do the 
>>> trick.
>>> 
>>> - Paul
>>> 
 On Jan 26, 2017, at 9:51 PM, Charles Givre  wrote:
 
 Thanks!  I’m hoping to submit a PR eventually once I have this all done.  
 I tried your changes and now I’m getting this error:
 
 0: jdbc:drill:zk=local> select * from dfs.client.`small.misolog`;
 Error: DATA_READ ERROR: Tried to remove unmanaged buffer.
 
 Fragment 0:0
 
 [Error Id: 52fc846a-1d94-4300-bcb4-7000d0949b3c on 
 charless-mbp-2.fios-router.home:31010] (state=,code=0)
 
 
 
 
> On Jan 26, 2017, at 23:08, Paul Rogers  wrote:
> 
> Hi Charles,
> 
> Very cool plugin!
> 
> My knowledge in this area is a bit sketchy… That said, the problem 
> appears to be that the code does not extend the Drillbuf to ensure it has 
> sufficient capacity. Try calling this method: reallocIfNeeded, something 
> like this:
> 
>   this.buffer.reallocIfNeeded(stringLength);
>   this.buffer.setBytes(0, bytes, 0, stringLength);
>   map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
> 
> Then, comment out the 256 length hack and see if it works.
> 
> To avoid memory fragmentation, maybe change your loop as:
> 
>int maxRecords = MAX_RECORDS_PER_BATCH;
>int maxWidth = 256;
>while(recordCount < maxRecords &&(line = this.reader.readLine()) 
> != null){
>…
>   if(stringLength > maxWidth) {
>  maxWidth = stringLength;
>  maxRecords = 16 * 1024 * 1024 / maxWidth;
>   }
> 
> The above is not perfect (the last record added might be much larger than 
> the others, causing the corresponding vector to grow larger than 16 MB, 
> but the occasional large vector should be OK.)
> 
> Thanks,
> 
> - Paul
> 
> On Jan 26, 2017, at 5:31 PM, Charles Givre 
> > wrote:
> 
> Hi Paul,
> Would you mind taking a look at my code?  I’m wondering if I’m doing this 
> correctly.  Just for context, I’m working on a generic log file reader 
> for drill (https://github.com/cgivre/drill-logfile-plugin 
> ), and I encountered some 
> errors when working with fields that were > 256 characters long.  It 
> isn’t a storage plugin, but it extends the EasyFormatPlugin.
> 
> I added some code to truncate the strings to 256 chars, it worked.  
> Before this it was throwing errors as shown below:
> 
> 
> 
> Error: DATA_READ ERROR: index: 0, length: 430 (expected: range(0, 256))
> 
> Fragment 0:0
> 
> [Error Id: 

Re: Data types

2017-01-27 Thread Paul Rogers
Hi Charles,

Congrats! Unfortunately, no, there is no documentation. Drill seems to be of 
the “code speaks for itself” persuasion. I try to document the bits I’ve had to 
learn on my Github Wiki, but (until now) I’ve not looked at this particular 
area.

IMHO, now that the plugins basically work, the API could use a good scrubbing 
to make it simpler, easier to document, and easier to use. As it is, you have 
to be an expert on Drill internals to understand all the little knick-knacks 
that have to be in your code to make various Drill subsystems happy.

That said, perhaps you can use your own Git Wiki to document what you’ve 
learned so that we capture that for the next plugin developer.

Thanks,

- Paul

> On Jan 27, 2017, at 8:42 AM, Charles Givre  wrote:
> 
> Hi Paul,
> VICTORY!!  I just set the buffer size to 4096 and it worked perfectly without 
> truncating my data! 
> Is this documented anywhere?  I’ve been trying to really wrap my head around 
> the mechanics of how Drill reads data and how the format plugins work and 
> really haven’t found much.  I’ve hacked together a few other plugins like 
> this—which work—but if I could find some docs, that would be great.
> Thanks,
> — Charles
> 
> 
> 
>> On Jan 27, 2017, at 02:11, Paul Rogers  wrote:
>> 
>> Looks like I gave you advice that was a bit off. The function you want is 
>> either:
>> 
>>   this.buffer = fragmentContext.getManagedBuffer();
>> 
>> The above allocates a 256 byte buffer. You can initially allocate a larger 
>> one:
>> 
>>   this.buffer = fragmentContext.getManagedBuffer(4096);
>> 
>> Or, to reallocate:
>> 
>>  buffer = fragmentContext.replace(buffer, 8192);
>> 
>> Again, I’ve not used these methods myself, but they seem like they might do the 
>> trick.
>> 
>> - Paul
>> 
>>> On Jan 26, 2017, at 9:51 PM, Charles Givre  wrote:
>>> 
>>> Thanks!  I’m hoping to submit a PR eventually once I have this all done.  I 
>>> tried your changes and now I’m getting this error:
>>> 
>>> 0: jdbc:drill:zk=local> select * from dfs.client.`small.misolog`;
>>> Error: DATA_READ ERROR: Tried to remove unmanaged buffer.
>>> 
>>> Fragment 0:0
>>> 
>>> [Error Id: 52fc846a-1d94-4300-bcb4-7000d0949b3c on 
>>> charless-mbp-2.fios-router.home:31010] (state=,code=0)
>>> 
>>> 
>>> 
>>> 
 On Jan 26, 2017, at 23:08, Paul Rogers  wrote:
 
 Hi Charles,
 
 Very cool plugin!
 
 My knowledge in this area is a bit sketchy… That said, the problem appears 
 to be that the code does not extend the Drillbuf to ensure it has 
 sufficient capacity. Try calling this method: reallocIfNeeded, something 
 like this:
 
this.buffer.reallocIfNeeded(stringLength);
this.buffer.setBytes(0, bytes, 0, stringLength);
map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
 
 Then, comment out the 256 length hack and see if it works.
 
 To avoid memory fragmentation, maybe change your loop as:
 
 int maxRecords = MAX_RECORDS_PER_BATCH;
 int maxWidth = 256;
 while(recordCount < maxRecords &&(line = this.reader.readLine()) 
 != null){
 …
if(stringLength > maxWidth) {
   maxWidth = stringLength;
   maxRecords = 16 * 1024 * 1024 / maxWidth;
}
 
 The above is not perfect (the last record added might be much larger than 
 the others, causing the corresponding vector to grow larger than 16 MB, 
 but the occasional large vector should be OK.)
 
 Thanks,
 
 - Paul
 
 On Jan 26, 2017, at 5:31 PM, Charles Givre 
 > wrote:
 
 Hi Paul,
 Would you mind taking a look at my code?  I’m wondering if I’m doing this 
 correctly.  Just for context, I’m working on a generic log file reader for 
 drill (https://github.com/cgivre/drill-logfile-plugin 
 ), and I encountered some 
 errors when working with fields that were > 256 characters long.  It isn’t 
 a storage plugin, but it extends the EasyFormatPlugin.
 
 I added some code to truncate the strings to 256 chars, it worked.  Before 
 this it was throwing errors as shown below:
 
 
 
 Error: DATA_READ ERROR: index: 0, length: 430 (expected: range(0, 256))
 
 Fragment 0:0
 
 [Error Id: b2250326-f983-440c-a73c-4ef4a6cf3898 on 
 charless-mbp-2.fios-router.home:31010] (state=,code=0)
 
 
 The query that generated this was just a SELECT * FROM dfs.`file`.  Also, 
 how do I set the size of each row batch?
 Thank you for your help.
 — C
 
 
 if (m.find()) {
 for( int i = 1; i <= m.groupCount(); i++ )
 {
//TODO Add option for date fields
String fieldName  = fieldNames.get(i - 1);
String 

Re: Data types

2017-01-27 Thread Charles Givre
Hi Paul,
VICTORY!!  I just set the buffer size to 4096 and it worked perfectly without 
truncating my data! 
Is this documented anywhere?  I’ve been trying to really wrap my head around 
the mechanics of how Drill reads data and how the format plugins work and 
really haven’t found much.  I’ve hacked together a few other plugins like 
this—which work—but if I could find some docs, that would be great.
Thanks,
— Charles
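
For reference, a minimal sketch of the buffer-handling pattern assembled from 
the snippets quoted below (a hypothetical reader fragment with illustrative 
class and method names, not the actual plugin code):

    import io.netty.buffer.DrillBuf;
    import org.apache.drill.exec.ops.FragmentContext;
    import org.apache.drill.exec.vector.complex.writer.BaseWriter.MapWriter;

    // Hypothetical fragment of an EasyFormatPlugin record reader, for illustration only.
    class ManagedBufferSketch {
      private DrillBuf buffer;

      void setup(FragmentContext fragmentContext) {
        // Ask Drill for a managed buffer; 4096 is only the starting size.
        buffer = fragmentContext.getManagedBuffer(4096);
      }

      void writeField(MapWriter map, String fieldName, String fieldValue) throws Exception {
        byte[] bytes = fieldValue.getBytes("UTF-8");
        // reallocIfNeeded may return a different DrillBuf, so keep the returned reference.
        buffer = buffer.reallocIfNeeded(bytes.length);
        buffer.setBytes(0, bytes, 0, bytes.length);
        map.varChar(fieldName).writeVarChar(0, bytes.length, buffer);
      }
    }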
 


> On Jan 27, 2017, at 02:11, Paul Rogers  wrote:
> 
> Looks like I gave you advice that was a bit off. The function you want is 
> either:
> 
>this.buffer = fragmentContext.getManagedBuffer();
> 
> The above allocates a 256 byte buffer. You can initially allocate a larger 
> one:
> 
>this.buffer = fragmentContext.getManagedBuffer(4096);
> 
> Or, to reallocate:
> 
>   buffer = fragmentContext.replace(buffer, 8192);
> 
> Again, I’ve not used these methods myself, but they seem like they might do the 
> trick.
> 
> - Paul
> 
>> On Jan 26, 2017, at 9:51 PM, Charles Givre  wrote:
>> 
>> Thanks!  I’m hoping to submit a PR eventually once I have this all done.  I 
>> tried your changes and now I’m getting this error:
>> 
>> 0: jdbc:drill:zk=local> select * from dfs.client.`small.misolog`;
>> Error: DATA_READ ERROR: Tried to remove unmanaged buffer.
>> 
>> Fragment 0:0
>> 
>> [Error Id: 52fc846a-1d94-4300-bcb4-7000d0949b3c on 
>> charless-mbp-2.fios-router.home:31010] (state=,code=0)
>> 
>> 
>> 
>> 
>>> On Jan 26, 2017, at 23:08, Paul Rogers  wrote:
>>> 
>>> Hi Charles,
>>> 
>>> Very cool plugin!
>>> 
>>> My knowledge in this area is a bit sketchy… That said, the problem appears 
>>> to be that the code does not extend the Drillbuf to ensure it has 
>>> sufficient capacity. Try calling this method: reallocIfNeeded, something 
>>> like this:
>>> 
>>> this.buffer.reallocIfNeeded(stringLength);
>>> this.buffer.setBytes(0, bytes, 0, stringLength);
>>> map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
>>> 
>>> Then, comment out the 256 length hack and see if it works.
>>> 
>>> To avoid memory fragmentation, maybe change your loop as:
>>> 
>>>  int maxRecords = MAX_RECORDS_PER_BATCH;
>>>  int maxWidth = 256;
>>>  while(recordCount < maxRecords &&(line = this.reader.readLine()) 
>>> != null){
>>>  …
>>> if(stringLength > maxWidth) {
>>>maxWidth = stringLength;
>>>maxRecords = 16 * 1024 * 1024 / maxWidth;
>>> }
>>> 
>>> The above is not perfect (the last record added might be much larger than 
>>> the others, causing the corresponding vector to grow larger than 16 MB, but 
>>> the occasional large vector should be OK.)
>>> 
>>> Thanks,
>>> 
>>> - Paul
>>> 
>>> On Jan 26, 2017, at 5:31 PM, Charles Givre 
>>> > wrote:
>>> 
>>> Hi Paul,
>>> Would you mind taking a look at my code?  I’m wondering if I’m doing this 
>>> correctly.  Just for context, I’m working on a generic log file reader for 
>>> drill (https://github.com/cgivre/drill-logfile-plugin 
>>> ), and I encountered some 
>>> errors when working with fields that were > 256 characters long.  It isn’t 
>>> a storage plugin, but it extends the EasyFormatPlugin.
>>> 
>>> I added some code to truncate the strings to 256 chars, it worked.  Before 
>>> this it was throwing errors as shown below:
>>> 
>>> 
>>> 
>>> Error: DATA_READ ERROR: index: 0, length: 430 (expected: range(0, 256))
>>> 
>>> Fragment 0:0
>>> 
>>> [Error Id: b2250326-f983-440c-a73c-4ef4a6cf3898 on 
>>> charless-mbp-2.fios-router.home:31010] (state=,code=0)
>>> 
>>> 
>>> The query that generated this was just a SELECT * FROM dfs.`file`.  Also, 
>>> how do I set the size of each row batch?
>>> Thank you for your help.
>>> — C
>>> 
>>> 
>>> if (m.find()) {
>>> for( int i = 1; i <= m.groupCount(); i++ )
>>> {
>>> //TODO Add option for date fields
>>> String fieldName  = fieldNames.get(i - 1);
>>> String fieldValue;
>>> 
>>> fieldValue = m.group(i);
>>> 
>>> if( fieldValue == null){
>>> fieldValue = "";
>>> }
>>> byte[] bytes = fieldValue.getBytes("UTF-8");
>>> 
>>> //Added this and it worked….
>>> int stringLength = bytes.length;
>>> if( stringLength > 256 ){
>>> stringLength = 256;
>>> }
>>> 
>>> this.buffer.setBytes(0, bytes, 0, stringLength);
>>> map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
>>> }
>>> 
>>> 
>>> 
>>> 
>>> On Jan 26, 2017, at 20:20, Paul Rogers 
>>> > wrote:
>>> 
>>> Hi Charles,
>>> 
>>> The Varchar column can hold any length of data. We’ve recently been working 
>>> on tests that have columns up to 8K in length.
>>> 
>>> The one caveat is that, when working with data larger than 256 bytes, you 
>>> must be extremely careful in your reader. The out-of-box text reader will 
>>> 

[GitHub] drill pull request #580: DRILL-4824: JSON with complex nested data produces ...

2017-01-27 Thread Serhii-Harnyk
Github user Serhii-Harnyk commented on a diff in the pull request:

https://github.com/apache/drill/pull/580#discussion_r98232198
  
--- Diff: 
exec/vector/src/main/java/org/apache/drill/exec/vector/complex/MapVector.java 
---
@@ -317,6 +317,12 @@ public Object getObject(int index) {
 if (v != null && index < v.getAccessor().getValueCount()) {
   Object value = v.getAccessor().getObject(index);
   if (value != null) {
+if ((v.getAccessor().getObject(index) instanceof Map
+&& ((Map) v.getAccessor().getObject(index)).size() == 
0)
+|| (v.getAccessor().getObject(index) instanceof List
+&& ((List) v.getAccessor().getObject(index)).size() == 
0)) {
+  continue;
+}
--- End diff --

@paul-rogers, map fields have the REQUIRED data mode and are part of the 
schema, which is why there is no difference between a field that is missing from 
some record and a field that exists but is empty. 
With this fix, your example will return the result:
`{"a":{"b":10},"c":[1,2,3]}`
`{"c":[4]}`
`{"b":[5]}`
`{"a":{"b":20}}`
`{"a":{"b":20}}`




[GitHub] drill pull request #731: DRILL-5224: CTTAS: fix errors connected with system...

2017-01-27 Thread arina-ielchiieva
GitHub user arina-ielchiieva opened a pull request:

https://github.com/apache/drill/pull/731

DRILL-5224: CTTAS: fix errors connected with system path delimiters (…

…Windows)

Replaced `java.nio.file.Paths.get()` with `org.apache.hadoop.fs.Path`, as the 
latter behaves the same regardless of the operating system. The errors caused by 
`java.nio.file.Paths.get()` usage are described in JIRA DRILL-5224.
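
A small sketch of why the swap matters (hypothetical paths, not taken from the 
patch):

    import java.nio.file.Paths;
    import org.apache.hadoop.fs.Path;

    public class PathSeparatorDemo {
      public static void main(String[] args) {
        // java.nio.file uses the platform separator, so this prints
        // "tmp\drill\session" on Windows but "tmp/drill/session" elsewhere.
        System.out.println(Paths.get("tmp", "drill", "session"));

        // org.apache.hadoop.fs.Path always uses "/", regardless of the OS.
        System.out.println(new Path("tmp/drill", "session"));
      }
    }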

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/arina-ielchiieva/drill DRILL-5224

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/731.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #731


commit 87a2d270633eeac7bd78172b6b0935ab4ead344e
Author: Arina Ielchiieva 
Date:   2017-01-26T18:14:28Z

DRILL-5224: CTTAS: fix errors connected with system path delimiters 
(Windows)






[GitHub] drill pull request #685: Drill 5043: Function that returns a unique id per s...

2017-01-27 Thread arina-ielchiieva
Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/685#discussion_r98186438
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/ContextFunctions.java
 ---
@@ -64,17 +65,45 @@ public void eval() {
 @Inject DrillBuf buffer;
 @Workspace int currentSchemaBytesLength;
 
+@Override
 public void setup() {
   final byte[] currentSchemaBytes = 
contextInfo.getCurrentDefaultSchema().getBytes();
   buffer = buffer.reallocIfNeeded(currentSchemaBytes.length);
   currentSchemaBytesLength= currentSchemaBytes.length;
   buffer.setBytes(0, currentSchemaBytes);
 }
 
+@Override
 public void eval() {
   out.start = 0;
   out.end = currentSchemaBytesLength;
   out.buffer = buffer;
 }
   }
+
+  /**
+   * Implement "session_id" function. Returns the unique id of the current 
session.
+   */
+  @FunctionTemplate(name = "session_id", scope = 
FunctionTemplate.FunctionScope.SIMPLE, isNiladic = true)
--- End diff --

Each Drill function is converted to a DrillFunctionHolder, and though Calcite 
recognizes current_time/current_date and other context functions (e.g. user, 
current_schema) as niladic, after the introduction of the isNiladic flag the 
DrillFunctionHolder would describe such functions as non-niladic, even though 
they are. So we would be intentionally holding incorrect info.

Let's say that in the future I need to take the list of all function holders in 
Drill (from the local function registry) and filter out all niladic functions. 
Since we do have the isNiladic flag, I would assume that I can use it as the 
filter, but because current_time/current_date and other context functions (e.g. 
user, current_schema) would have isNiladic=false, they would never be filtered out.

Also please add `isNiladic=true` to other niladic context functions [1].

[1] 
https://github.com/apache/drill/blob/5a4ad2a88331dfe7561ee76fc87e882afe170681/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/ContextFunctions.java


