[GitHub] drill issue #1144: DRILL-6202: Deprecate usage of IndexOutOfBoundsException ...

2018-04-02 Thread vrozov
Github user vrozov commented on the issue:

https://github.com/apache/drill/pull/1144
  
It is not clear why get/set Byte/Char/Short/Int/Long/Float/Double do not 
delegate to UDLE, while get/set Bytes delegates to UDLE and relies on netty's 
`AbstractByteBuf` for bounds checking. IMO, it would be good to have 
consistent behavior across all methods.

In many cases, including `VariableLengthVectors`, there is no need to rely 
on UDLE boundary checking, as the caller already provides (or can provide) a 
guarantee that an index is within the buffer's boundaries. In those cases, the 
boundary check becomes an extra cost. IMO, it would be good to have consistent 
behavior, with the ability to enable bounds checking for debugging.
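A minimal sketch of the consistent, toggleable bounds checking being suggested (class and property names here are illustrative, not Drill's actual API; note that with the flag off, this toy version still gets the JVM's own array check, whereas Drill's unchecked path goes through unsafe memory):

```java
// Sketch: every accessor, single-byte or bulk, goes through the same
// check, which is active only when a debug flag is set.
public final class SketchBuf {
  // Illustrative property name; in Drill this would come from a system property.
  private static final boolean CHECK_BOUNDS =
      Boolean.getBoolean("example.debug.boundsCheck");

  private final byte[] data;

  public SketchBuf(int capacity) {
    this.data = new byte[capacity];
  }

  private void check(int index, int length) {
    if (CHECK_BOUNDS && (index < 0 || index + length > data.length)) {
      throw new IndexOutOfBoundsException(
          "index " + index + ", length " + length + ", capacity " + data.length);
    }
  }

  public void setByte(int index, byte value) {
    check(index, 1);            // same policy for single-byte access...
    data[index] = value;
  }

  public void setBytes(int index, byte[] src) {
    check(index, src.length);   // ...and for bulk access
    System.arraycopy(src, 0, data, index, src.length);
  }

  public byte getByte(int index) {
    check(index, 1);
    return data[index];
  }
}
```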


---


[GitHub] drill pull request #258: DRILL-4091: Support for additional gis operations i...

2018-04-02 Thread ChrisSandison
Github user ChrisSandison commented on a diff in the pull request:

https://github.com/apache/drill/pull/258#discussion_r178652351
  
--- Diff: 
contrib/gis/src/main/java/org/apache/drill/exec/expr/fn/impl/gis/STUnionAggregate.java
 ---
@@ -0,0 +1,114 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.expr.fn.impl.gis;
+
+import javax.inject.Inject;
+
+import org.apache.drill.exec.expr.DrillAggFunc;
+import org.apache.drill.exec.expr.annotations.FunctionTemplate;
+import org.apache.drill.exec.expr.annotations.Output;
+import org.apache.drill.exec.expr.annotations.Param;
+import org.apache.drill.exec.expr.annotations.Workspace;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.IntHolder;
+import org.apache.drill.exec.expr.holders.NullableVarBinaryHolder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.UInt1Holder;
+
+import com.esri.core.geometry.SpatialReference;
+
+import io.netty.buffer.DrillBuf;
+
+@FunctionTemplate(name = "st_unionaggregate", scope = 
FunctionTemplate.FunctionScope.POINT_AGGREGATE)
--- End diff --

Is there documentation for that for aggregate functions?


---


Re: "Death of Schema-on-Read"

2018-04-02 Thread Ted Dunning
On Mon, Apr 2, 2018 at 10:54 AM, Aman Sinha  wrote:

> ...
> Although, one may argue that XML died because of the weight of the extra
> structure added to it and people just gravitated towards JSON.
>

My argument would be that it died because it couldn't distinguish well
between an element and a list of elements of length 1.

JSON avoids that kind of problem.


> In that respect,  Avro provides a good middle ground.   A similar approach
> is taken by MapR-DB  JSON database which has data type information for the
> fields of a JSON document.
>

True that.

But another middle-ground representation is a JSON file with a side file 
describing type information derived when the file was previously read.

> That said, we still have to (a) deal with JSON data, which is one of the
> most prevalent formats in the big data space, and (b) still have to handle schema
> changes even with Avro-like formats.
>

This is a big deal.

To some degree, a lot of this can be handled by two simple mechanisms:

1) record what we learn when scanning a file.  That is, if a column is null
(or missing) until the final record when it is a float, remember that. This
allows subsequent queries to look further ahead when deciding what is
happening in a query.

2) allow queries to be restarted when it is discovered that type
assumptions are untenable. Currently, schema change is what we call this
situation where we can't really recover from mistaken assumptions that are
derived incrementally as we scan the data. If we had (1), then the
information obtained by the reading that we have done up to the point that
schema change was noted could be preserved. That means that we could
restart the query with the knowledge of the data types that might later
cause a schema change exception. In many cases, that would allow us to
avoid that exception entirely on the second pass through the data.
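Mechanism (1) can be sketched as a per-file record of observed column types, widened as the scan progresses (a hypothetical sketch; the type names and the widening rules are illustrative, not Drill's):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: remember what a scan learns about each column's type, so that a
// later query (or a restart) can plan with that knowledge up front.
public final class ObservedSchema {
  public enum ObservedType { UNKNOWN, BIGINT, FLOAT8, VARCHAR }

  private final Map<String, ObservedType> columns = new HashMap<>();

  // Record an observation; nulls/missing values are recorded as UNKNOWN.
  public void observe(String column, ObservedType type) {
    columns.merge(column, type, ObservedSchema::widen);
  }

  // Minimal widening rule: UNKNOWN yields to any concrete type,
  // int + float widens to float, any other conflict falls back to varchar.
  private static ObservedType widen(ObservedType a, ObservedType b) {
    if (a == b) return a;
    if (a == ObservedType.UNKNOWN) return b;
    if (b == ObservedType.UNKNOWN) return a;
    if ((a == ObservedType.BIGINT && b == ObservedType.FLOAT8)
        || (a == ObservedType.FLOAT8 && b == ObservedType.BIGINT)) {
      return ObservedType.FLOAT8;
    }
    return ObservedType.VARCHAR;
  }

  public ObservedType typeOf(String column) {
    return columns.getOrDefault(column, ObservedType.UNKNOWN);
  }
}
```

A column that is null until the final record, where it turns out to be a float, ends up recorded as FLOAT8, which is exactly the "look further ahead" information a restarted query would want.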

In most cases, restarts would not be necessary. I know this because schema
change exceptions are currently pretty rare and they would be even more
rare if we learned about file schemas from experience. Even when a new file
is seen for the first time, schema change wouldn't happen. As such, the
amortized cost of restarts would be very low. On the other hand, the
advantage of such a mechanism would be that more queries would succeed and
users would be happier.


> ...
> From Drill's perspective, we have in the past discussed the need for 2
> modes:
>  - A fixed schema mode which operates in a manner similar to the RDBMSs.
> This is needed not just to resolve ambiguities but also for performance.
> Why treat a column as nullable when data is non-nullable ?
>  - A variable schema mode which is what it does today...but this part needs
> to be enhanced to be *'declarative' such that ambiguities are removed.*   A
> user may choose not to create any declaration, in which case Drill would
> default to certain documented set of rules that do type conversions.
>

The restart suggestion above avoids the need for modes but also allows the
performance of the fixed schema mode in most cases.


Re: "Death of Schema-on-Read"

2018-04-02 Thread Paul Rogers
Just to clarify, the article seemed to indicate that Comcast has an Avro file 
for each of their Kafka data sources, and that file contains the metadata. The 
analogy in Drill would be an ambiguous JSON file accompanied by an Avro file 
(say) that defined the columns, their data types, their names, and so on. The 
exact Comcast design probably wouldn't fit Drill; it is the concept that is 
thought-provoking.

Today, we can do a pretty good job with JSON as long as it has a very clean 
schema:
* No null or missing fields in the first record.
* Consistent data types.
* Same set of fields in every file.

The easiest problem to visualize is when something is missing, Drill has to 
guess, there are multiple choices, and Drill guesses wrong. A classic example: a 
field is missing and we guess "Nullable Int" when it is, in fact, a VarChar. 
Yes, we could drop this "dangling" field, but doing so might be somewhat 
surprising to the user.

For Parquet, we have to deal only with the missing-field problem (i.e. schema 
evolution.) With CSV, we have the problem of knowing the actual data type of a 
column (is column "price" really text, or is it text that represents a number? 
Of what type?) And so it goes.
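The CSV case can be illustrated with a toy type sniffer (purely hypothetical; Drill's readers do not work this way — every CSV column arrives as text, and any numeric treatment is an inference like the one below):

```java
// Sketch: given the text values seen in a CSV column, decide whether the
// column can safely be treated as an integer, a float, or only as text.
public final class CsvSniffer {
  static String sniffType(String[] values) {
    boolean allLongs = true;
    boolean allDoubles = true;
    for (String v : values) {
      try {
        Long.parseLong(v);
      } catch (NumberFormatException e) {
        allLongs = false;       // e.g. "1.5" or "abc"
      }
      try {
        Double.parseDouble(v);
      } catch (NumberFormatException e) {
        allDoubles = false;     // e.g. "abc"
      }
    }
    if (allLongs) return "BIGINT";
    if (allDoubles) return "FLOAT8";
    return "VARCHAR";
  }
}
```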

With regard to the two modes: we could even have a single mode in which we use 
metadata when it is present, and we guess otherwise. With clean schemas 
(Parquet files with identical schemas), Drill's guesses are sufficient. But, 
when the situation is ambiguous, we could allow the user to specify just enough 
metadata to resolve the ambiguity. (In Parquet, say, if field "x" was added 
later, then the metadata could just say that, "when column 'x' is missing, 
assume it is 'Date'".)


If the user wants to specify everything, then that is just a special case of 
the "just enough schema" model.

I believe Drill can use Hive, but only for the Hive readers. So, a possible 
first step is to combine the Hive metastore schema information with the Drill 
native readers.

In an ideal world, Drill would have a "schema plugin" along with its storage 
and format plugins, so Drill can integrate with a variety of metadata systems 
(including Comcast's unique Avro schema files.) Even better if the schema hints 
could also be provided via Drill's existing table functions for ad-hoc use.

All of this is just something to keep in the back of our minds as we think 
about how to resolve schema change issues.

Thanks,
- Paul

 

On Monday, April 2, 2018, 10:54:23 AM PDT, Aman Sinha wrote:
 
 It is certainly a huge advantage to have embedded data type information in
the data such as provided by Avro format.  In the past, XML also had
schemas and DTDs.
Although, one may argue that XML died because of the weight of the extra
structure added to it and people just gravitated towards JSON.
In that respect,  Avro provides a good middle ground.  A similar approach
is taken by MapR-DB  JSON database which has data type information for the
fields of a JSON document.

That said, we still have to (a) deal with JSON data, which is one of the
most prevalent formats in the big data space, and (b) still have to handle schema
changes even with Avro-like formats.
Comcast's viewpoint suggests a one-size-fits-all approach, but there are
counter-points to that, for instance as mentioned here [1].  It would be
very useful to have a survey of other users/companies that are dealing with
the schema evolution issues to get a better understanding of whether
Comcast's experience is a broader trend.

From Drill's perspective, we have in the past discussed the need for 2
modes:
 - A fixed schema mode which operates in a manner similar to the RDBMSs.
This is needed not just to resolve ambiguities but also for performance.
Why treat a column as nullable when data is non-nullable ?
 - A variable schema mode which is what it does today...but this part needs
to be enhanced to be *'declarative' such that ambiguities are removed.*  A
user may choose not to create any declaration, in which case Drill would
default to certain documented set of rules that do type conversions.


[1] https://www.marklogic.com/blog/schema-on-read-vs-schema-on-write/


-Aman


On Sun, Apr 1, 2018 at 10:46 PM, Paul Rogers wrote:

> ...is the name of a provocative blog post [1].
> Quote: "Once found, diverse data sets are very hard to integrate, since
> the data typically contains no documentation on the semantics of its
> attributes. ... The rule of thumb is that data scientists spend 70% of
> their time finding, interpreting, and cleaning data, and only 30% actually
> analyzing it. Schema on read offers no help in these tasks, because data
> gives up none of its secrets until actually read, and even when read has no
> documentation beyond attribute names, which may be inscrutable, vacuous, or
> even misleading."
> This quote relates to a discussion Salim & I have been having: that Drill
> struggles to extract a usable schema 

[GitHub] drill pull request #258: DRILL-4091: Support for additional gis operations i...

2018-04-02 Thread ChrisSandison
Github user ChrisSandison commented on a diff in the pull request:

https://github.com/apache/drill/pull/258#discussion_r178619522
  
--- Diff: 
contrib/gis/src/main/java/org/apache/drill/exec/expr/fn/impl/gis/STXFunc.java 
---
@@ -0,0 +1,64 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.expr.fn.impl.gis;
+
+import java.sql.Types;
+
+import javax.inject.Inject;
+
+import org.apache.drill.exec.expr.DrillSimpleFunc;
+import org.apache.drill.exec.expr.annotations.FunctionTemplate;
+import org.apache.drill.exec.expr.annotations.Output;
+import org.apache.drill.exec.expr.annotations.Param;
+import org.apache.drill.exec.expr.holders.Float8Holder;
+import org.apache.drill.exec.expr.holders.VarBinaryHolder;
+
+import com.esri.core.geometry.Geometry.Type;
+import com.esri.core.geometry.ogc.OGCPoint;
+
+import io.netty.buffer.DrillBuf;
+
+@FunctionTemplate(name = "st_x", scope = 
FunctionTemplate.FunctionScope.SIMPLE,
+  nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
+public class STXFunc implements DrillSimpleFunc {
+  @Param
+  VarBinaryHolder geomParam;
+
+  @Output
+  Float8Holder out;
+
+  @Inject
+  DrillBuf buffer;
+
+  public void setup() {
+  }
+
+  public void eval() {
+
+com.esri.core.geometry.ogc.OGCGeometry geom;
+
+geom = com.esri.core.geometry.ogc.OGCGeometry
+.fromBinary(geomParam.buffer.nioBuffer(geomParam.start, 
geomParam.end - geomParam.start));
+
+if(geom != null && geom.geometryType().equals("Point")){
+  out.value = ((com.esri.core.geometry.ogc.OGCPoint) geom).X();
+} else {
+  out.value = Double.NaN;
--- End diff --

@cgivre it looks like the assignment of `NaN` is breaking the test suite. Is 
this the intended behaviour, or could the default null handling be used 
instead? Ditto for the other places where this is assigned.
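For illustration, here is a pure-Java analogue of the alternative being asked about: signal "no value" explicitly rather than overloading `NaN`, which can also arise from a real computation. (In a Drill UDF this would mean a nullable output holder such as `NullableFloat8Holder` with `isSet = 0` under `NullHandling.INTERNAL`; the class below is a standalone sketch, not Drill code.)

```java
import java.util.OptionalDouble;

public final class GeomX {
  // Sketch: return an empty OptionalDouble for non-point geometries instead
  // of the ambiguous sentinel Double.NaN used in the PR's eval() method.
  static OptionalDouble xCoordinate(String geometryType, double x) {
    return "Point".equals(geometryType)
        ? OptionalDouble.of(x)
        : OptionalDouble.empty();
  }
}
```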


---


Re: "Death of Schema-on-Read"

2018-04-02 Thread Aman Sinha
It is certainly a huge advantage to have embedded data type information in
the data such as provided by Avro format.  In the past, XML also had
schemas and DTDs.
Although, one may argue that XML died because of the weight of the extra
structure added to it and people just gravitated towards JSON.
In that respect,  Avro provides a good middle ground.   A similar approach
is taken by MapR-DB  JSON database which has data type information for the
fields of a JSON document.

That said, we still have to (a) deal with JSON data, which is one of the
most prevalent formats in the big data space, and (b) still have to handle schema
changes even with Avro-like formats.
Comcast's viewpoint suggests a one-size-fits-all approach, but there are
counter-points to that, for instance as mentioned here [1].  It would be
very useful to have a survey of other users/companies that are dealing with
the schema evolution issues to get a better understanding of whether
Comcast's experience is a broader trend.

From Drill's perspective, we have in the past discussed the need for 2
modes:
 - A fixed schema mode which operates in a manner similar to the RDBMSs.
This is needed not just to resolve ambiguities but also for performance.
Why treat a column as nullable when data is non-nullable ?
 - A variable schema mode which is what it does today...but this part needs
to be enhanced to be *'declarative' such that ambiguities are removed.*   A
user may choose not to create any declaration, in which case Drill would
default to certain documented set of rules that do type conversions.


[1] https://www.marklogic.com/blog/schema-on-read-vs-schema-on-write/


-Aman


On Sun, Apr 1, 2018 at 10:46 PM, Paul Rogers wrote:

> ...is the name of a provocative blog post [1].
> Quote: "Once found, diverse data sets are very hard to integrate, since
> the data typically contains no documentation on the semantics of its
> attributes. ... The rule of thumb is that data scientists spend 70% of
> their time finding, interpreting, and cleaning data, and only 30% actually
> analyzing it. Schema on read offers no help in these tasks, because data
> gives up none of its secrets until actually read, and even when read has no
> documentation beyond attribute names, which may be inscrutable, vacuous, or
> even misleading."
> This quote relates to a discussion Salim & I have been having: that Drill
> struggles to extract a usable schema directly from anything but the
> cleanest of data sets, leading to unwanted and unexpected schema change
> exceptions due to inherent ambiguities in how to interpret the data. (E.g.
> in JSON, if we see nothing but nulls, what type is the null?)
> A possible answer is further down in the post: "At Comcast, for instance,
> Kafka topics are associated with Apache Avro schemas that include
> non-trivial documentation on every attribute and use common subschemas to
> capture commonly used data... 'Schema on read' using Avro files thus
> includes rich documentation and common structures and naming conventions."
> Food for thought.
> Thanks,
> - Paul
> [1] https://www.oreilly.com/ideas/data-governance-and-the-death-of-schema-on-read?imm_mid=0fc3c6=em-data-na-na-newsltr_20180328
>
>
>
>
>


[jira] [Created] (DRILL-6306) Should not be able to run queries against disabled storage plugins

2018-04-02 Thread Krystal (JIRA)
Krystal created DRILL-6306:
--

 Summary: Should not be able to run queries against disabled 
storage plugins
 Key: DRILL-6306
 URL: https://issues.apache.org/jira/browse/DRILL-6306
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - Other
Affects Versions: 1.13.0
Reporter: Krystal


Currently, queries against disabled storage plugins are returning data.  This 
should not be the case.  Queries against disabled storage plugins should fail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] drill pull request #1182: DRILL-6287: apache-release profile should be disab...

2018-04-02 Thread vdiravka
Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/1182#discussion_r178572285
  
--- Diff: pom.xml ---
@@ -66,6 +66,7 @@
 
 4096
 4096
+-Xdoclint:none
--- End diff --

Thanks


---


[GitHub] drill pull request #1182: DRILL-6287: apache-release profile should be disab...

2018-04-02 Thread vrozov
Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1182#discussion_r178567909
  
--- Diff: pom.xml ---
@@ -66,6 +66,7 @@
 
 4096
 4096
+-Xdoclint:none
--- End diff --

@vdiravka Please see DRILL-4547.


---


[GitHub] drill pull request #1182: DRILL-6287: apache-release profile should be disab...

2018-04-02 Thread vdiravka
Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/1182#discussion_r178561972
  
--- Diff: pom.xml ---
@@ -66,6 +66,7 @@
 
 4096
 4096
+-Xdoclint:none
--- End diff --

Do we need a task (new Jira) for refactoring the Drill java docs and moving 
onto Java 8 doclint?


---


JDBC Driver

2018-04-02 Thread Ravi Venugopal (C)
Hi

I am trying to POC Drill for a customer, and I am working on connecting the JDBC 
driver to RDS on AWS for Oracle.

Here is the security certificate in the TNS names; I do not see a syntax / 
key-value pair in the JSON to add this cert path (cert info hidden):


(SECURITY = (SSL_SERVER_CERT_DN = 
"C=US,ST=Somewhere,L=Cityname,O=Amazon.com,OU=RDS,CN=.y.us-ABCD-1.rds.amazonaws.com")))


PS: ODBC does not have an option to enable this either.

Can someone help, please?



[GitHub] drill issue #1182: DRILL-6287: apache-release profile should be disabled by ...

2018-04-02 Thread parthchandra
Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/1182
  
Sorry, Maven not being a strong point, I didn't understand initially what I 
was looking at.

+1



---


[GitHub] drill issue #1182: DRILL-6287: apache-release profile should be disabled by ...

2018-04-02 Thread vrozov
Github user vrozov commented on the issue:

https://github.com/apache/drill/pull/1182
  
There are two issues with enabling `apache-release` by default:
- it triggers creation of the source `apache-drill-...-src.tar.gz` and 
`apache-drill-...-src.zip` archives.
- the maven build for any sub-module fails.

The change disables activation of the `apache-release` profile based on JDK 
version and requires explicit activation during the Apache release process.
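For illustration, a profile of this shape has no activation conditions and therefore only runs when requested explicitly (a sketch, not the exact Drill pom.xml; the property inside is a placeholder):

```xml
<!-- Sketch: without an <activation> element the profile is inactive by
     default and must be selected explicitly, e.g.
     `mvn -Papache-release install` during the Apache release process. -->
<profile>
  <id>apache-release</id>
  <properties>
    <!-- illustrative release-only setting -->
    <drill.release.build>true</drill.release.build>
  </properties>
</profile>
```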

JDK 1.7 is not supported. See DRILL-1491 and #1143.


---


[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...

2018-04-02 Thread parthchandra
Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/1166
  
@rajrahul thanks for making all the changes (and of course for the fix)!


---


[GitHub] drill issue #1182: DRILL-6287: apache-release profile should be disabled by ...

2018-04-02 Thread parthchandra
Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/1182
  
I don't understand why the apache-release profile should be disabled by 
default. And I don't see how this change achieves that anyway.

Also, moving -Xdoclint:none to all profiles implies we are no longer 
supporting development using JDK 7? I'm OK with that, but I'm not sure we 
concluded that at the time of the 1.13 release.

If that's what we want to do, I'm fine with this change.



---