[GitHub] drill issue #1144: DRILL-6202: Deprecate usage of IndexOutOfBoundsException ...
Github user vrozov commented on the issue: https://github.com/apache/drill/pull/1144 It is not clear why get/set Byte/Char/Short/Int/Long/Float/Double do not delegate to UDLE, while get/set Bytes delegates to UDLE and relies on netty `AbstractByteBuf` for bounds checking. IMO, the behavior should be consistent across all methods. In many cases, including `VariableLengthVectors`, there is no need to rely on UDLE boundary checking, as the caller already provides (or can provide) a guarantee that an index is within the buffer's boundaries. In those cases, the boundary check becomes an extra cost. IMO, it would be good to have consistent behavior, with the ability to enable bounds checking for debugging. ---
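A minimal sketch of the "bounds checking that can be enabled for debugging" idea discussed above. The system property name drill.bounds.checking and the helper class are hypothetical illustrations, not the actual UDLE/DrillBuf code:

    // Hypothetical helper: a bounds check that can be switched on for debugging
    // and skipped on the hot path when the caller already guarantees the index.
    public final class BoundsChecking {

      // Resolved once at class-load time so the JIT can drop the dead branch
      // entirely when checking is disabled.
      public static final boolean ENABLED =
          Boolean.getBoolean("drill.bounds.checking");

      private BoundsChecking() { }

      public static void checkIndex(int index, int length, int capacity) {
        if (!ENABLED) {
          return;
        }
        if (index < 0 || length < 0 || index + length > capacity) {
          throw new IllegalArgumentException(
              "index: " + index + ", length: " + length + " exceeds capacity: " + capacity);
        }
      }
    }

A get/set method would call checkIndex(...) unconditionally; with the property unset the call reduces to a no-op, which is one way to get consistent behavior without paying for the check in production.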
[GitHub] drill pull request #258: DRILL-4091: Support for additional gis operations i...
Github user ChrisSandison commented on a diff in the pull request: https://github.com/apache/drill/pull/258#discussion_r178652351

--- Diff: contrib/gis/src/main/java/org/apache/drill/exec/expr/fn/impl/gis/STUnionAggregate.java ---
@@ -0,0 +1,114 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.expr.fn.impl.gis;
+
+import javax.inject.Inject;
+
+import org.apache.drill.exec.expr.DrillAggFunc;
+import org.apache.drill.exec.expr.annotations.FunctionTemplate;
+import org.apache.drill.exec.expr.annotations.Output;
+import org.apache.drill.exec.expr.annotations.Param;
+import org.apache.drill.exec.expr.annotations.Workspace;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.IntHolder;
+import org.apache.drill.exec.expr.holders.NullableVarBinaryHolder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.UInt1Holder;
+
+import com.esri.core.geometry.SpatialReference;
+
+import io.netty.buffer.DrillBuf;
+
+@FunctionTemplate(name = "st_unionaggregate", scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE)
--- End diff --

Is there documentation for that for aggregate functions?

---
Re: "Death of Schema-on-Read"
On Mon, Apr 2, 2018 at 10:54 AM, Aman Sinha wrote:

> ...
> Although, one may argue that XML died because of the weight of the extra
> structure added to it and people just gravitated towards JSON.

My argument would be that it died because it couldn't distinguish well between an element and a list of elements of length 1. JSON avoids that kind of problem.

> In that respect, Avro provides a good middle ground. A similar approach
> is taken by the MapR-DB JSON database, which has data type information for
> the fields of a JSON document.

True that. But another middle-ground representation is JSON with a side file describing type information derived when the file was previously read.

> That said, we still have to (a) deal with JSON data, which is one of the
> most prevalent formats in the big data space, and (b) still have to handle
> schema changes even with Avro-like formats.

This is a big deal. To some degree, a lot of this can be handled by two simple mechanisms:

1) Record what we learn when scanning a file. That is, if a column is null (or missing) until the final record, when it turns out to be a float, remember that. This allows subsequent queries to look further ahead when deciding what is happening in a query.

2) Allow queries to be restarted when it is discovered that type assumptions are untenable.

Currently, "schema change" is what we call the situation where we can't really recover from mistaken assumptions that are derived incrementally as we scan the data. If we had (1), then the information obtained from the reading done up to the point the schema change was noted could be preserved. That means we could restart the query with knowledge of the data types that might later cause a schema change exception. In many cases, that would allow us to avoid the exception entirely on the second pass through the data.

In most cases, restarts would not be necessary. I know this because schema change exceptions are currently pretty rare, and they would be even rarer if we learned about file schemas from experience. Even when a new file is seen for the first time, schema change typically wouldn't happen. As such, the amortized cost of restarts would be very low. On the other hand, the advantage of such a mechanism would be that more queries would succeed and users would be happier.

> ...
> From Drill's perspective, we have in the past discussed the need for 2
> modes:
> - A fixed schema mode which operates in a manner similar to the RDBMSs.
>   This is needed not just to resolve ambiguities but also for performance.
>   Why treat a column as nullable when data is non-nullable?
> - A variable schema mode which is what it does today...but this part needs
>   to be enhanced to be *'declarative' such that ambiguities are removed.* A
>   user may choose not to create any declaration, in which case Drill would
>   default to a certain documented set of rules that do type conversions.

The restart suggestion above avoids the need for modes but also allows the performance of the fixed schema mode in most cases.
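A hedged sketch of mechanism (1): recording per-file, per-column type observations so a later (or restarted) query can start from the widest type seen rather than guessing from the first record. The class name, the ObservedType enum, and the widening rule below are hypothetical illustrations, not an existing Drill API:

    import java.util.Collections;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical memo of column types observed while scanning a file.
    public class LearnedSchemaCache {

      public enum ObservedType { UNKNOWN, BIGINT, FLOAT8, VARCHAR }

      // file path -> (column name -> widest type observed so far)
      private final Map<String, Map<String, ObservedType>> byFile = new ConcurrentHashMap<>();

      public void observe(String file, String column, ObservedType type) {
        byFile.computeIfAbsent(file, f -> new ConcurrentHashMap<>())
              .merge(column, type, LearnedSchemaCache::wider);
      }

      public ObservedType lookup(String file, String column) {
        return byFile.getOrDefault(file, Collections.emptyMap())
                     .getOrDefault(column, ObservedType.UNKNOWN);
      }

      // Crude widening rule: anything beats UNKNOWN, VARCHAR beats numeric types.
      private static ObservedType wider(ObservedType a, ObservedType b) {
        if (a == ObservedType.UNKNOWN) return b;
        if (b == ObservedType.UNKNOWN) return a;
        if (a == ObservedType.VARCHAR || b == ObservedType.VARCHAR) return ObservedType.VARCHAR;
        if (a == ObservedType.FLOAT8 || b == ObservedType.FLOAT8) return ObservedType.FLOAT8;
        return ObservedType.BIGINT;
      }
    }

In this sketch, a scan would call observe() as it reads, and a restarted query would consult lookup() before choosing vector types, which is the "learn from experience" idea in the message above.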
Re: "Death of Schema-on-Read"
Just to clarify, the article seemed to indicate that Comcast has an Avro file for each of their Kafka data sources, and that file contains metadata information. The analogy in Drill would be if we had an ambiguous JSON file, along with an Avro file (say) that defined the columns, their data types, their names, and so on. The exact Comcast design probably wouldn't fit Drill; it is the concept that is thought-provoking.

Today, we can do a pretty good job with JSON as long as it has a very clean schema:

* No null or missing fields in the first record.
* Consistent data types.
* Same set of fields in every file.

The easiest problem to visualize is when something is missing, Drill has to guess, there are multiple choices, and Drill guesses wrong. Classic example: a field is missing and we guess "Nullable Int" when it is, in fact, a VarChar. Yes, we could drop this "dangling" field, but doing so might be somewhat surprising to the user.

For Parquet, we have to deal only with the missing-field problem (i.e., schema evolution). With CSV, we have the problem of knowing the actual data type of a column (is column "price" really text, or is it text that represents a number? Of what type?) And so it goes.

With regard to the two modes: we could even have a single mode in which we use metadata when it is present, and we guess otherwise. With clean schemas (Parquet files with identical schemas), Drill's guesses are sufficient. But when the situation is ambiguous, we could allow the user to specify just enough metadata to resolve the ambiguity. (In Parquet, say, if field "x" was added later, then the metadata could just say, "when column 'x' is missing, assume it is 'Date'".) If the user wants to specify everything, then that is just a special case of the "just enough schema" model.

I believe Drill can use the Hive metastore, but only for the Hive readers. So, a possible first step is to combine the Hive metastore schema information with the Drill native readers. In an ideal world, Drill would have a "schema plugin" alongside its storage and format plugins, so Drill could integrate with a variety of metadata systems (including Comcast's unique Avro schema files). Even better if the schema hints could also be provided via Drill's existing table functions for ad-hoc use.

All of this is just something to keep in the back of our minds as we think about how to resolve schema change issues.

Thanks,
- Paul

On Monday, April 2, 2018, 10:54:23 AM PDT, Aman Sinha wrote:

It is certainly a huge advantage to have embedded data type information in the data, such as provided by the Avro format. In the past, XML also had schemas and DTDs. Although, one may argue that XML died because of the weight of the extra structure added to it and people just gravitated towards JSON. In that respect, Avro provides a good middle ground. A similar approach is taken by the MapR-DB JSON database, which has data type information for the fields of a JSON document. That said, we still have to (a) deal with JSON data, which is one of the most prevalent formats in the big data space, and (b) still have to handle schema changes even with Avro-like formats. Comcast's viewpoint suggests the one-size-fits-all approach, but there are counter-points to that, for instance as mentioned here [1]. It would be very useful to have a survey of other users/companies that are dealing with schema evolution issues to get a better understanding of whether Comcast's experience is a broader trend.
From Drill's perspective, we have in the past discussed the need for 2 modes:

- A fixed schema mode which operates in a manner similar to the RDBMSs. This is needed not just to resolve ambiguities but also for performance. Why treat a column as nullable when data is non-nullable?
- A variable schema mode which is what it does today...but this part needs to be enhanced to be *'declarative' such that ambiguities are removed.* A user may choose not to create any declaration, in which case Drill would default to a certain documented set of rules that do type conversions.

[1] https://www.marklogic.com/blog/schema-on-read-vs-schema-on-write/

-Aman

On Sun, Apr 1, 2018 at 10:46 PM, Paul Rogers wrote:

> ...is the name of a provocative blog post [1].
> Quote: "Once found, diverse data sets are very hard to integrate, since
> the data typically contains no documentation on the semantics of its
> attributes. ... The rule of thumb is that data scientists spend 70% of
> their time finding, interpreting, and cleaning data, and only 30% actually
> analyzing it. Schema on read offers no help in these tasks, because data
> gives up none of its secrets until actually read, and even when read has no
> documentation beyond attribute names, which may be inscrutable, vacuous, or
> even misleading."
> This quote relates to a discussion Salim & I have been having: that Drill
> struggles to extract a usable schema
[GitHub] drill pull request #258: DRILL-4091: Support for additional gis operations i...
Github user ChrisSandison commented on a diff in the pull request: https://github.com/apache/drill/pull/258#discussion_r178619522

--- Diff: contrib/gis/src/main/java/org/apache/drill/exec/expr/fn/impl/gis/STXFunc.java ---
@@ -0,0 +1,64 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.expr.fn.impl.gis;
+
+import java.sql.Types;
+
+import javax.inject.Inject;
+
+import org.apache.drill.exec.expr.DrillSimpleFunc;
+import org.apache.drill.exec.expr.annotations.FunctionTemplate;
+import org.apache.drill.exec.expr.annotations.Output;
+import org.apache.drill.exec.expr.annotations.Param;
+import org.apache.drill.exec.expr.holders.Float8Holder;
+import org.apache.drill.exec.expr.holders.VarBinaryHolder;
+
+import com.esri.core.geometry.Geometry.Type;
+import com.esri.core.geometry.ogc.OGCPoint;
+
+import io.netty.buffer.DrillBuf;
+
+@FunctionTemplate(name = "st_x", scope = FunctionTemplate.FunctionScope.SIMPLE,
+  nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
+public class STXFunc implements DrillSimpleFunc {
+  @Param
+  VarBinaryHolder geomParam;
+
+  @Output
+  Float8Holder out;
+
+  @Inject
+  DrillBuf buffer;
+
+  public void setup() {
+  }
+
+  public void eval() {
+
+    com.esri.core.geometry.ogc.OGCGeometry geom;
+
+    geom = com.esri.core.geometry.ogc.OGCGeometry
+      .fromBinary(geomParam.buffer.nioBuffer(geomParam.start, geomParam.end - geomParam.start));
+
+    if (geom != null && geom.geometryType().equals("Point")) {
+      out.value = ((com.esri.core.geometry.ogc.OGCPoint) geom).X();
+    } else {
+      out.value = Double.NaN;
--- End diff --

@cgivre it looks like the assigning of `NaN` is breaking the test suite. Is this the intended behaviour, or could this use the default null handling that Drill provides? Ditto for the other places where this is assigned.

---
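A hedged sketch of the alternative the comment raises, returning SQL NULL instead of NaN for non-point geometries. The class name STXNullableFunc, the switch to nullable holders, and the use of NullHandling.INTERNAL are assumptions for illustration, not part of the PR:

    package org.apache.drill.exec.expr.fn.impl.gis;

    import javax.inject.Inject;

    import org.apache.drill.exec.expr.DrillSimpleFunc;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate;
    import org.apache.drill.exec.expr.annotations.Output;
    import org.apache.drill.exec.expr.annotations.Param;
    import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
    import org.apache.drill.exec.expr.holders.NullableVarBinaryHolder;

    import io.netty.buffer.DrillBuf;

    // Hypothetical variant: the function handles nulls itself and emits SQL NULL
    // (out.isSet = 0) for non-point geometries rather than NaN.
    @FunctionTemplate(name = "st_x_nullable", scope = FunctionTemplate.FunctionScope.SIMPLE,
        nulls = FunctionTemplate.NullHandling.INTERNAL)
    public class STXNullableFunc implements DrillSimpleFunc {
      @Param
      NullableVarBinaryHolder geomParam;

      @Output
      NullableFloat8Holder out;

      @Inject
      DrillBuf buffer;

      public void setup() {
      }

      public void eval() {
        if (geomParam.isSet == 0) {
          out.isSet = 0;   // NULL in -> NULL out
          return;
        }
        com.esri.core.geometry.ogc.OGCGeometry geom = com.esri.core.geometry.ogc.OGCGeometry
            .fromBinary(geomParam.buffer.nioBuffer(geomParam.start, geomParam.end - geomParam.start));
        if (geom != null && geom.geometryType().equals("Point")) {
          out.isSet = 1;
          out.value = ((com.esri.core.geometry.ogc.OGCPoint) geom).X();
        } else {
          out.isSet = 0;   // non-point geometry -> SQL NULL instead of NaN
        }
      }
    }

Whether NULL or NaN is the right contract for st_x on non-point input is the question the reviewer is asking; this sketch only shows what the NULL option might look like.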
Re: "Death of Schema-on-Read"
It is certainly a huge advantage to have embedded data type information in the data, such as provided by the Avro format. In the past, XML also had schemas and DTDs. Although, one may argue that XML died because of the weight of the extra structure added to it and people just gravitated towards JSON. In that respect, Avro provides a good middle ground. A similar approach is taken by the MapR-DB JSON database, which has data type information for the fields of a JSON document.

That said, we still have to (a) deal with JSON data, which is one of the most prevalent formats in the big data space, and (b) still have to handle schema changes even with Avro-like formats. Comcast's viewpoint suggests the one-size-fits-all approach, but there are counter-points to that, for instance as mentioned here [1]. It would be very useful to have a survey of other users/companies that are dealing with schema evolution issues to get a better understanding of whether Comcast's experience is a broader trend.

From Drill's perspective, we have in the past discussed the need for 2 modes:

- A fixed schema mode which operates in a manner similar to the RDBMSs. This is needed not just to resolve ambiguities but also for performance. Why treat a column as nullable when data is non-nullable?
- A variable schema mode which is what it does today...but this part needs to be enhanced to be *'declarative' such that ambiguities are removed.* A user may choose not to create any declaration, in which case Drill would default to a certain documented set of rules that do type conversions.

[1] https://www.marklogic.com/blog/schema-on-read-vs-schema-on-write/

-Aman

On Sun, Apr 1, 2018 at 10:46 PM, Paul Rogers wrote:

> ...is the name of a provocative blog post [1].
> Quote: "Once found, diverse data sets are very hard to integrate, since
> the data typically contains no documentation on the semantics of its
> attributes. ... The rule of thumb is that data scientists spend 70% of
> their time finding, interpreting, and cleaning data, and only 30% actually
> analyzing it. Schema on read offers no help in these tasks, because data
> gives up none of its secrets until actually read, and even when read has no
> documentation beyond attribute names, which may be inscrutable, vacuous, or
> even misleading."
> This quote relates to a discussion Salim & I have been having: that Drill
> struggles to extract a usable schema directly from anything but the
> cleanest of data sets, leading to unwanted and unexpected schema change
> exceptions due to inherent ambiguities in how to interpret the data. (E.g.
> in JSON, if we see nothing but nulls, what type is the null?)
> A possible answer is further down in the post: "At Comcast, for instance,
> Kafka topics are associated with Apache Avro schemas that include
> non-trivial documentation on every attribute and use common subschemas to
> capture commonly used data... 'Schema on read' using Avro files thus
> includes rich documentation and common structures and naming conventions."
> Food for thought.
> Thanks,
> - Paul
> [1] https://www.oreilly.com/ideas/data-governance-and-the-death-of-schema-on-read?imm_mid=0fc3c6=em-data-na-na-newsltr_20180328
[jira] [Created] (DRILL-6306) Should not be able to run queries against disabled storage plugins
Krystal created DRILL-6306:
-------------------------------

             Summary: Should not be able to run queries against disabled storage plugins
                 Key: DRILL-6306
                 URL: https://issues.apache.org/jira/browse/DRILL-6306
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Other
    Affects Versions: 1.13.0
            Reporter: Krystal

Currently, queries against disabled storage plugins are returning data. This should not be the case. Queries against disabled storage plugins should fail.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[GitHub] drill pull request #1182: DRILL-6287: apache-release profile should be disab...
Github user vdiravka commented on a diff in the pull request: https://github.com/apache/drill/pull/1182#discussion_r178572285

--- Diff: pom.xml ---
@@ -66,6 +66,7 @@
     4096
     4096
+    -Xdoclint:none
--- End diff --

Thanks

---
[GitHub] drill pull request #1182: DRILL-6287: apache-release profile should be disab...
Github user vrozov commented on a diff in the pull request: https://github.com/apache/drill/pull/1182#discussion_r178567909

--- Diff: pom.xml ---
@@ -66,6 +66,7 @@
     4096
     4096
+    -Xdoclint:none
--- End diff --

@vdiravka Please see DRILL-4547.

---
[GitHub] drill pull request #1182: DRILL-6287: apache-release profile should be disab...
Github user vdiravka commented on a diff in the pull request: https://github.com/apache/drill/pull/1182#discussion_r178561972

--- Diff: pom.xml ---
@@ -66,6 +66,7 @@
     4096
     4096
+    -Xdoclint:none
--- End diff --

Do we need a task (a new Jira) for refactoring the Drill javadocs and moving to Java 8 doclint?

---
JDBC Driver
Hi, I am trying to POC Drill for a customer and I am working on connecting the JDBC driver to RDS on AWS for Oracle. Below is the security certificate section from the TNS names entry; I do not see a syntax / key-value pair in the storage plugin JSON to add this cert path (cert info hidden):

(SECURITY = (SSL_SERVER_CERT_DN = "C=US,ST=Somewhere,L=Cityname,O=Amazon.com,OU=RDS,CN=.y.us-ABCD-1.rds.amazonaws.com")))

PS: ODBC does not have the option to enable this either. Can someone help please?
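Not a definitive answer, but one way to sanity-check the descriptor before wiring it into Drill's JDBC storage plugin is to try it with plain JDBC first. The sketch below is hypothetical: the host, service name, credentials, and truststore path are placeholders, and it assumes the Oracle thin driver is on the classpath and that the RDS CA certificate has already been imported into the truststore:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.Properties;

    public class OracleTcpsProbe {
      public static void main(String[] args) throws Exception {
        // Full TNS-style descriptor; the SSL_SERVER_CERT_DN mirrors the (redacted) DN above.
        String url = "jdbc:oracle:thin:@(DESCRIPTION="
            + "(ADDRESS=(PROTOCOL=TCPS)(HOST=myinstance.xxxx.us-east-1.rds.amazonaws.com)(PORT=2484))"
            + "(CONNECT_DATA=(SERVICE_NAME=ORCL))"
            + "(SECURITY=(SSL_SERVER_CERT_DN="
            + "\"C=US,ST=Somewhere,L=Cityname,O=Amazon.com,OU=RDS,CN=...\")))";

        Properties props = new Properties();
        props.setProperty("user", "admin");
        props.setProperty("password", "secret");
        // Standard JSSE truststore properties, honored by the Oracle thin driver;
        // the RDS CA bundle must already be imported into this JKS file.
        props.setProperty("javax.net.ssl.trustStore", "/path/to/rds-truststore.jks");
        props.setProperty("javax.net.ssl.trustStorePassword", "changeit");
        props.setProperty("javax.net.ssl.trustStoreType", "JKS");
        // Ask the driver to verify the server DN against the one in the descriptor.
        props.setProperty("oracle.net.ssl_server_dn_match", "true");

        try (Connection conn = DriverManager.getConnection(url, props)) {
          System.out.println("Connected: " + conn.getMetaData().getDatabaseProductVersion());
        }
      }
    }

If a URL like this works standalone, the same url string is what would go into the JDBC storage plugin configuration; whether the plugin passes the extra connection properties through is not verified here, so this only validates the descriptor itself.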
[GitHub] drill issue #1182: DRILL-6287: apache-release profile should be disabled by ...
Github user parthchandra commented on the issue: https://github.com/apache/drill/pull/1182 Sorry, Maven not being a strong point, I didn't understand initially what I was looking at. +1 ---
[GitHub] drill issue #1182: DRILL-6287: apache-release profile should be disabled by ...
Github user vrozov commented on the issue: https://github.com/apache/drill/pull/1182

There are two issues with enabling `apache-release` by default:
- it triggers creating source `apache-drill-...-src.tar.gz` and `apache-drill-...-src.zip` archives.
- maven build for any sub-module fails.

The change disables activation of the `apache-release` profile based on JDK version and requires explicit activation during the Apache release process. JDK 1.7 is not supported. See DRILL-1491 and #1143.

---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Github user parthchandra commented on the issue: https://github.com/apache/drill/pull/1166 @rajrahul thanks for making all the changes (and of course for the fix)! ---
[GitHub] drill issue #1182: DRILL-6287: apache-release profile should be disabled by ...
Github user parthchandra commented on the issue: https://github.com/apache/drill/pull/1182 I don't understand why apache-release should be disabled by default, and I don't see how this change achieves that anyway. Also, moving -Xdoclint:none to all profiles implies we are no longer supporting development using JDK 7? I'm OK with that, but I'm not sure we concluded that at the time of the 1.13 release. If that's what we want to do, I'm fine with this change. ---