[GitHub] drill pull request #1188: DRILL-6271: Updated copyright range in NOTICE

2018-04-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1188


---


[GitHub] drill pull request #1161: DRILL-6230: Extend row set readers to handle hyper...

2018-04-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1161


---


[GitHub] drill pull request #1197: DRILL-6279: UI indicates operators that spilled in...

2018-04-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1197


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-04-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1181


---


[GitHub] drill pull request #1199: DRILL-6303: Provide a button to copy the Drillbit'...

2018-04-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1199


---


[GitHub] drill pull request #1182: DRILL-6287: apache-release profile should be disab...

2018-04-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1182


---


[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...

2018-04-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1166


---


Build failed in Jenkins: drill-scm #953

2018-04-07 Thread Apache Jenkins Server
See 

Changes:

[arina.yelchiyeva] DRILL-6279: Indicate operators that spilled in-memory data to disk on

[arina.yelchiyeva] DRILL-6303: Provide a button to copy the Drillbit's JStack shown in

[arina.yelchiyeva] DRILL-6287: apache-release profile should be disabled by default

[arina.yelchiyeva] DRILL-6271: Updated copyright range in NOTICE

[arina.yelchiyeva] DRILL-6016: Fix for Error reading INT96 created by Apache Spark

[arina.yelchiyeva] DRILL-6230: Extend row set readers to handle hyper vectors

[arina.yelchiyeva] DRILL-6284: Add operator metrics for batch sizing for flatten

[arina.yelchiyeva] DRILL-6296: Add operator metrics for batch sizing for merge join

--
Started by an SCM change
[EnvInject] - Loading node environment variables.
Building remotely on H31 (ubuntu xenial) in workspace 

Cloning the remote Git repository
Cloning repository https://git-wip-us.apache.org/repos/asf/drill.git
 > git init  # timeout=10
Fetching upstream changes from https://git-wip-us.apache.org/repos/asf/drill.git
 > git --version # timeout=10
 > git fetch --tags --progress https://git-wip-us.apache.org/repos/asf/drill.git +refs/heads/*:refs/remotes/origin/*
 > git config remote.origin.url https://git-wip-us.apache.org/repos/asf/drill.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://git-wip-us.apache.org/repos/asf/drill.git # timeout=10
Fetching upstream changes from https://git-wip-us.apache.org/repos/asf/drill.git
 > git fetch --tags --progress https://git-wip-us.apache.org/repos/asf/drill.git +refs/heads/*:refs/remotes/origin/*
 > git rev-parse refs/remotes/origin/master^{commit} # timeout=10
 > git rev-parse refs/remotes/origin/origin/master^{commit} # timeout=10
Checking out Revision da241134fb88464139437b05b1feaafbb3014bb0 (refs/remotes/origin/master)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f da241134fb88464139437b05b1feaafbb3014bb0
Commit message: "DRILL-6296: Add operator metrics for batch sizing for merge join"
 > git rev-list --no-walk 9a6cb59b9b7a5b127e5f60309ce2f506ede9652a # timeout=10
[drill-scm] $ /home/jenkins/tools/maven/apache-maven-3.3.3/bin/mvn clean install -DskipTests
[INFO] Scanning for projects...
[INFO] 
[INFO] Detecting the operating system and CPU architecture
[INFO] 
[INFO] os.detected.name: linux
[INFO] os.detected.arch: x86_64
[INFO] os.detected.version: 4.4
[INFO] os.detected.version.major: 4
[INFO] os.detected.version.minor: 4
[INFO] os.detected.release: ubuntu
[INFO] os.detected.release.version: 16.04
[INFO] os.detected.release.like.ubuntu: true
[INFO] os.detected.release.like.debian: true
[INFO] os.detected.classifier: linux-x86_64
[INFO] 
[INFO] Reactor Build Order:
[INFO] 
[INFO] Apache Drill Root POM
[INFO] tools/Parent Pom
[INFO] tools/freemarker codegen tooling
[INFO] Drill Protocol
[INFO] Common (Logical Plan, Base expressions)
[INFO] Logical Plan, Base expressions
[INFO] exec/Parent Pom
[INFO] exec/memory/Parent Pom
[INFO] exec/memory/base
[INFO] exec/rpc
[INFO] exec/Vectors
[INFO] contrib/Parent Pom
[INFO] contrib/data/Parent Pom
[INFO] contrib/data/tpch-sample-data
[INFO] exec/Java Execution Engine
[INFO] exec/JDBC Driver using dependencies
[INFO] JDBC JAR with all dependencies
[INFO] Drill-on-YARN
[INFO] contrib/kudu-storage-plugin
[INFO] contrib/opentsdb-storage-plugin
[INFO] contrib/mongo-storage-plugin
[INFO] contrib/hbase-storage-plugin
[INFO] contrib/jdbc-storage-plugin
[INFO] contrib/hive-storage-plugin/Parent Pom
[INFO] contrib/hive-storage-plugin/hive-exec-shaded
[INFO] contrib/hive-storage-plugin/core
[INFO] contrib/drill-gis-plugin
[INFO] contrib/kafka-storage-plugin
[INFO] Packaging and Distribution Assembly
[INFO] contrib/mapr-format-plugin
[INFO] contrib/sqlline
[INFO] 
[INFO] 
[INFO] Building Apache Drill Root POM 1.14.0-SNAPSHOT
[INFO] 
[INFO] 
[INFO] --- maven-clean-plugin:3.0.0:clean (default-clean) @ drill-root ---
[INFO] 
[INFO] --- apache-rat-plugin:0.12:check (rat-checks) @ drill-root ---
[INFO] Enabled default license matchers.
[INFO] Will parse SCM ignores for exclusions...
[INFO] Parsing exclusions from 

[INFO] Finished adding exclusions from SCM ignore files.
[INFO] 89 implicit excludes (use -debug for more details)

[jira] [Resolved] (DRILL-6296) Add operator metrics for batch sizing for merge join

2018-04-07 Thread Arina Ielchiieva (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva resolved DRILL-6296.
-
Resolution: Fixed

Merged with commit id da241134fb88464139437b05b1feaafbb3014bb0.

> Add operator metrics for batch sizing for merge join
> 
>
> Key: DRILL-6296
> URL: https://issues.apache.org/jira/browse/DRILL-6296
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.13.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
>Priority: Major
> Fix For: 1.14.0
>
>
> Add operator metrics for batch sizing stats for merge join.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] drill issue #1203: DRILL-6289: Cluster view should show more relevant inform...

2018-04-07 Thread arina-ielchiieva
Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/1203
  
Before the review, I guess we need to clarify one thing. After DRILL-6044, the 
Shutdown button was shown only for the current drillbit. As far as I 
understood, you cannot shut down drillbits from the Web UI other than the 
current one. @dvjyothsna please confirm.


---


[jira] [Created] (DRILL-6312) Enable pushing of cast expressions to the scanner for better schema discovery.

2018-04-07 Thread Hanumath Rao Maduri (JIRA)
Hanumath Rao Maduri created DRILL-6312:
--

 Summary: Enable pushing of cast expressions to the scanner for 
better schema discovery.
 Key: DRILL-6312
 URL: https://issues.apache.org/jira/browse/DRILL-6312
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Relational Operators, Query Planning & 
Optimization
Affects Versions: 1.13.0
Reporter: Hanumath Rao Maduri


Drill is a schema-less engine which tries to infer the schema from disparate 
sources at read time. Currently the scanners infer the schema for each 
batch based on the data seen for each column in that batch. This covers many 
use cases but can error out when the data differs too much between batches, 
for example int versus array[int] (there are other cases as well; this is 
just one example).

There is also a mechanism to create a view by casting the columns to 
appropriate types. This solves the issue in some cases but fails in many 
others, because the cast expression is not pushed down to the scanner; it 
stays in the project, filter, or other operators further up the query plan.

This JIRA is to fix this by propagating the type information embedded in the 
cast function down to the scanners so that they can cast the incoming data 
appropriately.
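
For illustration only (not part of this JIRA or of Drill's current planner), a 
minimal Calcite-style sketch of what such a rule could look like; the rule name 
and the commented-out pushCast() hook are assumptions:

{noformat}
import org.apache.calcite.plan.RelOptRule;
import org.apache.calcite.plan.RelOptRuleCall;
import org.apache.calcite.rel.core.Project;
import org.apache.calcite.rel.core.TableScan;
import org.apache.calcite.rex.RexCall;
import org.apache.calcite.rex.RexInputRef;
import org.apache.calcite.rex.RexNode;
import org.apache.calcite.sql.SqlKind;

// Hypothetical rule: matches Project over TableScan and collects the target type
// of every CAST(col AS type) so the scanner could produce the column in that type.
public class CastIntoScanRule extends RelOptRule {

  public CastIntoScanRule() {
    super(operand(Project.class, operand(TableScan.class, none())), "CastIntoScanRule");
  }

  @Override
  public void onMatch(RelOptRuleCall call) {
    Project project = call.rel(0);
    TableScan scan = call.rel(1);
    for (RexNode expr : project.getProjects()) {
      // Handle only the simple case: a CAST applied directly to an input column.
      if (expr.getKind() == SqlKind.CAST
          && ((RexCall) expr).getOperands().get(0) instanceof RexInputRef) {
        RexInputRef col = (RexInputRef) ((RexCall) expr).getOperands().get(0);
        // Hand the (column index, target type) pair to the scan. pushCast() is an
        // assumed hook the scanner would have to expose; it does not exist today.
        // scan.pushCast(col.getIndex(), expr.getType());
      }
    }
    // A real rule would rebuild the Project with the pushed CASTs removed and
    // call.transformTo(...) with the rewritten plan fragment.
  }
}
{noformat}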





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: "Death of Schema-on-Read"

2018-04-07 Thread Hanumath Rao Maduri
Hello All,

I have created a JIRA to track this approach.
https://issues.apache.org/jira/browse/DRILL-6312

Thanks,
-Hanu

On Fri, Apr 6, 2018 at 7:38 PM, Paul Rogers 
wrote:

> Hi Aman,
>
> As we get into details, I suggested to Hanu that we move the discussion
> into a JIRA ticket.
>
>  >On the subject of CAST pushdown to Scans, there are potential drawbacks
>
>  >  - In general, the planner will see a Scan-Project where the Project
> has  CAST functions.  But the Project can have arbitrary expressions,  e.g
> CAST(a as INT) * 5
>
> Suggestion: push the CAST(a AS INT) down to the scan, do the a * 5 in the
> Project operator.
>
> >  or a combination of 2 CAST functions
>
> If the user does a two-stage cast, CAST(CAST(a AS INT) AS BIGINT), then
> one simple rule is to push only the innermost cast downwards.
>
> > or non-CAST functions etc.
>
> Just keep it in Project.
>
>  >It would be quite expensive to examine each expression (there could
> be hundreds) to determine whether it is eligible to be pushed to the Scan.
>
> Just push CAST(<column> AS <type>). Even that would be a huge win.
> Note, for CSV, it might have to be CAST(columns[2] AS INT), since "columns"
> is special for CSV.
>
> >   - Expressing Nullability is not possible with CAST.  If a column
> should be tagged as  (not)nullable, CAST syntax does not allow that.
>
> Can we just add keywords: CAST(a AS INT NULL), CAST(b AS VARCHAR NOT NULL)
> ?
>
>  >  - Drill currently supports CASTing to a SQL data type, but not to
> the complex types such as arrays and maps.  We would have to add support
> for that from a language perspective as well as the run-time.  This would
> be non-trivial effort.
>
> The term "complex type" is always confusing. Consider a map. The rules
> would apply recursively to the members of the map. (Problem: today, if I
> reference a map member, Drill pulls it to the top level: SELECT m.a creates
> a new top-level field, it does not select "a" within "m". We need to fix
> that anyway.  So, CAST(m.a AS INT) should imply the type of column "a"
> within map "m".
>
> For arrays, the problem is more complex. Perhaps more syntax: CAST(a[] AS
> INT) to force array elements to INT. Maybe use CAST(a[][] AS INT) for a
> repeated list (2D array).
>
> Unions don't need a solution as they are their own solution (they can hold
> multiple types.) Same for (non-repeated) lists.
>
> To resolve runs of nulls, maybe allow CAST(m AS MAP). Or we can imply that
> "m" is a Map from the expression CAST(m.a AS INT). For arrays, the
> previously suggested CAST(a[] AS INT). If columns "a" or "m" turn out to be
> a non-null scalar, then we have no good answer.
>
> CAST cannot solve the nasty cases of JSON in which some fields are
> complex, some scalar. E.g. {a: 10} {a: [20]} or {m: "foo"} {m: {value:
> "foo"}}. I suppose no solution is perfect...
>
> I'm sure that, if someone gets a chance to design this feature, they'll
> find lots more issues. Maybe cast push-down is only a partial solution.
> But, it seems to solve so many of the JSON and CSV cases that I've seen
> that it seems too good to pass up.
>
> Thanks,
>
>
> - Paul


Non-column filters in Drill

2018-04-07 Thread Ryan Shanks

Hi Drill Dev Team!

I am writing a custom storage plugin and I am curious if it is possible 
in Drill to pass a filter value, in the form of a where clause, that is 
not related to a column. What I would like to accomplish is something like:


select * from myTable where notColumn = 'value';

In the example, notColumn is not a column in myTable, or any other 
table, it is just a specific parameter that the storage plugin will use 
in the filtering process. Additionally, notColumn would not be returned 
as a column so Drill needs to not expect it as a part of the 'select *'. 
I created a rule that will push down and remove these non-column filter 
calls, but I need to somehow tell drill/calcite that the filter name is 
valid, without actually registering it as a column. The following error 
occurs prior to submitting any rules:


org.apache.drill.common.exceptions.UserRemoteException: VALIDATION 
ERROR: From line 1, column 35 to line 1, column 39: Column 'notColumn' 
not found in any table



Alternatively, can I manipulate star queries to only return a subset of 
all the columns for a table?


Any insight would be greatly appreciated!

Thanks,
Ryan


Jenkins build is back to normal : drill-scm #954

2018-04-07 Thread Apache Jenkins Server
See 



Drill Jenkins jobs

2018-04-07 Thread Vlad Rozov

Hi Vitalii,

To be able to update the Jenkins job configuration, you will need to ask Aman 
(PMC chair) to grant you (or any other Apache member or PMC member) 
permission. See 
https://cwiki.apache.org/confluence/display/INFRA/Jenkins. It is the Drill 
community's responsibility to maintain the Jenkins builds.


I am already a member of the hudson-jobadmin group and have updated the Java 
and Maven versions.


Thank you,

Vlad

On 3/21/18 04:53, Vitalii Diravka wrote:

Hi all!

I have noticed that builds.apache.org has "Jenkins builds" for the drill-scm
project.
The last builds failed due to an old Java version.
Does anybody know how to update the Java version there, or who is responsible
for it?

Thanks!

Kind regards
Vitalii

-- Forwarded message --
From: Apache Jenkins Server 
Date: Tue, Mar 20, 2018 at 10:59 PM
Subject: Build failed in Jenkins: drill-scm #948
To: dev@drill.apache.org


See 

Changes:

[vitalii.diravka] DRILL-6241: Saffron properties config has the excessive permissions

[vitalii.diravka] DRILL-6250: Sqlline start command with password appears in the

[vitalii.diravka] DRILL-6275: Fixed direct memory reporting in sys.memory.

[vitalii.diravka] DRILL-6199: Add support for filter push down and partition pruning with

[vitalii.diravka] DRILL-6145: Implement Hive MapR-DB JSON handler

--
[...truncated 140.51 KB...]
Downloading: http://repository.mapr.com/maven/net/sourceforge/fmpp/fmpp/0.9.14/fmpp-0.9.14.pom
Downloading: http://repo.dremio.com/release/net/sourceforge/fmpp/fmpp/0.9.14/fmpp-0.9.14.pom
Downloading: http://repository.mapr.com/nexus/content/repositories/drill/net/sourceforge/fmpp/fmpp/0.9.14/fmpp-0.9.14.pom
Downloading: https://repo.maven.apache.org/maven2/net/sourceforge/fmpp/fmpp/0.9.14/fmpp-0.9.14.pom
Downloaded: https://repo.maven.apache.org/maven2/net/sourceforge/fmpp/fmpp/0.9.14/fmpp-0.9.14.pom (3 KB at 135.3 KB/sec)
Downloading: http://repo.dremio.com/release/oro/oro/maven-metadata.xml
Downloading: http://conjars.org/repo/oro/oro/maven-metadata.xml
Downloading: http://repository.mapr.com/maven/oro/oro/maven-metadata.xml
Downloading: http://repository.mapr.com/nexus/content/repositories/drill-optiq/oro/oro/maven-metadata.xml
Downloading: http://repository.mapr.com/nexus/content/repositories/drill/oro/oro/maven-metadata.xml
Downloading: http://repository.apache.org/snapshots/oro/oro/maven-metadata.xml
Downloading: https://oss.sonatype.org/content/repositories/snapshots/oro/oro/maven-metadata.xml
Downloading: https://repo.maven.apache.org/maven2/oro/oro/maven-metadata.xml
Downloaded: https://repo.maven.apache.org/maven2/oro/oro/maven-metadata.xml (260 B at 13.4 KB/sec)
Downloading: http://conjars.org/repo/org/beanshell/bsh/maven-metadata.xml
Downloading: http://repo.dremio.com/release/org/beanshell/bsh/maven-metadata.xml
Downloading: http://repository.mapr.com/maven/org/beanshell/bsh/maven-metadata.xml
Downloading: http://repository.mapr.com/nexus/content/repositories/drill-optiq/org/beanshell/bsh/maven-metadata.xml
Downloading: http://repository.mapr.com/nexus/content/repositories/drill/org/beanshell/bsh/maven-metadata.xml
Downloading: https://oss.sonatype.org/content/repositories/snapshots/org/beanshell/bsh/maven-metadata.xml
Downloading: http://repository.apache.org/snapshots/org/beanshell/bsh/maven-metadata.xml
Downloading: https://repo.maven.apache.org/maven2/org/beanshell/bsh/maven-metadata.xml
Downloaded: https://repo.maven.apache.org/maven2/org/beanshell/bsh/maven-metadata.xml (323 B at 16.6 KB/sec)
Downloading: http://conjars.org/repo/org/beanshell/bsh/2.0b5/bsh-2.0b5.pom
Downloading: http://repository.mapr.com/maven/org/beanshell/bsh/2.0b5/bsh-2.0b5.pom
Downloading: http://repo.dremio.com/release/org/beanshell/bsh/2.0b5/bsh-2.0b5.pom
Downloading: http://repository.mapr.com/nexus/content/repositories/drill/org/beanshell/bsh/2.0b5/bsh-2.0b5.pom
Downloading: https://repo.maven.apache.org/maven2/org/beanshell/bsh/2.0b5/bsh-2.0b5.pom
Downloaded: https://repo.maven.apache.org/maven2/org/beanshell/bsh/2.0b5/bsh-2.0b5.pom (2 KB at 57.5 KB/sec)
Downloading: http://conjars.org/repo/xml-resolver/xml-resolver/maven-metadata.xml
Downloading: http://repository.mapr.com/nexus/content/repositories/drill-optiq/xml-resolver/xml-resolver/maven-metadata.xml
Downloading: http://repository.mapr.com/maven/xml-resolver/xml-resolver/maven-metadata.xml
Downloading: http://repo.dremio.com/release/xml-resolver/xml-resolver/maven-metadata.xml
Downloading: http://repository.mapr.com/nexus/content/repositories/drill/xml-

Re: Non-column filters in Drill

2018-04-07 Thread Hanumath Rao Maduri
Hello Ryan,

Thank you for trying out Drill. Drill/Calcite expects "notColumn" to be
supplied by the underlying scan.
However, I expect that this column will be present in the scan but not past
the filter (notColumn = 'value') in the plan.
In that case you may need to push down the filter to the groupScan and then
remove the column projections from your custom GroupScan.
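
For illustration (this sketch is not from the original thread), a skeleton of such 
a pushdown rule in roughly the shape Drill's existing storage-plugin pushdown rules 
take; MyGroupScan and cloneWithFilter() are invented names for whatever your plugin 
exposes, and the exact signatures should be treated as approximate:

import org.apache.calcite.plan.RelOptRuleCall;
import org.apache.drill.exec.planner.logical.DrillFilterRel;
import org.apache.drill.exec.planner.logical.DrillScanRel;
import org.apache.drill.exec.planner.logical.RelOptHelper;
import org.apache.drill.exec.store.StoragePluginOptimizerRule;

// Hypothetical sketch: push the pseudo-column predicate into the custom group scan.
public class PushPseudoColumnFilterIntoScan extends StoragePluginOptimizerRule {

  public PushPseudoColumnFilterIntoScan() {
    super(RelOptHelper.some(DrillFilterRel.class, RelOptHelper.any(DrillScanRel.class)),
        "PushPseudoColumnFilterIntoScan");
  }

  @Override
  public void onMatch(RelOptRuleCall call) {
    DrillFilterRel filter = call.rel(0);
    DrillScanRel scan = call.rel(1);
    // 1. Inspect filter.getCondition() for the pseudo-column predicate (notColumn = 'value').
    // 2. Build a copy of the plugin's group scan that carries the extracted value and no
    //    longer projects the pseudo-column, e.g. (assumed API):
    //    MyGroupScan newScan = ((MyGroupScan) scan.getGroupScan()).cloneWithFilter("value");
    // 3. Register the rewritten scan, dropping the now-redundant filter, via call.transformTo(...).
  }
}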

It would be easier for us to guess what the issue could be if you can post
the logical and physical query plans for this query.

Hope this helps. Please do let us know if you have any further issues.

Thanks,


On Sat, Apr 7, 2018 at 2:08 PM, Ryan Shanks 
wrote:

> Hi Drill Dev Team!
>
> I am writing a custom storage plugin and I am curious if it is possible in
> Drill to pass a filter value, in the form of a where clause, that is not
> related to a column. What I would like to accomplish is something like:
>
> select * from myTable where notColumn = 'value';
>
> In the example, notColumn is not a column in myTable, or any other table,
> it is just a specific parameter that the storage plugin will use in the
> filtering process. Additionally, notColumn would not be returned as a
> column so Drill needs to not expect it as a part of the 'select *'. I
> created a rule that will push down and remove these non-column filter
> calls, but I need to somehow tell drill/calcite that the filter name is
> valid, without actually registering it as a column. The following error
> occurs prior to submitting any rules:
>
> org.apache.drill.common.exceptions.UserRemoteException: VALIDATION ERROR:
> From line 1, column 35 to line 1, column 39: Column 'notColumn' not found
> in any table
>
>
> Alternatively, can I manipulate star queries to only return a subset of
> all the columns for a table?
>
> Any insight would be greatly appreciated!
>
> Thanks,
> Ryan
>


[jira] [Created] (DRILL-6313) ScanBatch.Mutator does not report new schema for empty first batch

2018-04-07 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-6313:
--

 Summary: ScanBatch.Mutator does not report new schema for empty 
first batch
 Key: DRILL-6313
 URL: https://issues.apache.org/jira/browse/DRILL-6313
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.13.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.14.0


Create a format plugin that honors an empty select list by returning no 
columns. This case occurs in a {{COUNT(*)}} query.

When run, the query fails with:

{noformat}
SYSTEM ERROR: IllegalStateException: next() returned OK without first returning 
OK_NEW_SCHEMA [#2, ScanBatch]
{noformat}

The reason is that the {{Mutator}} class uses a flag, {{schemaChanged}}, which 
defaults to {{false}}. It is set to {{true}} only when a field is 
added. But, since the query requested no fields, no field is added.

The fix is simple: just default {{schemaChanged}} to {{true}}.
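
A paraphrased sketch of the flag's life cycle (not the actual ScanBatch source) to 
make the failure mode concrete:

{noformat}
// Paraphrased sketch, not the actual ScanBatch.Mutator code.
class Mutator {
  // Current behavior as described above: the flag starts out false, so if no field
  // is ever added (empty select list, as in COUNT(*)), the first batch goes out with
  // plain OK instead of OK_NEW_SCHEMA and the downstream operator fails.
  // private boolean schemaChanged = false;

  // Proposed fix: start out true so the first (possibly empty) batch reports its schema.
  private boolean schemaChanged = true;

  void fieldAdded() {
    schemaChanged = true;     // set whenever a new field is added (existing behavior)
  }

  boolean isNewSchema() {
    boolean result = schemaChanged;
    schemaChanged = false;    // reset once the schema change has been reported
    return result;
  }
}
{noformat}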



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] DrillBuf

2018-04-07 Thread Vlad Rozov

Hi Paul,

My comments in-line.

Thank you,

Vlad

On 4/5/18 20:50, Paul Rogers wrote:

Hi Vlad,

  I'd suggest to keep focus on DrillBuf design and implementation as the only 
gate for accessing raw (direct) memory.

I was doing that. By explaining where DrillBuf fits in the overall design, we 
see that DrillBuf should be the only access point for direct memory. The 
context explains why this is the right decision. Changes to DrillBuf should 
support our design as DrillBuf only exists for that purpose.
My concern is not why Drill adopted Netty as Netty does provide a good 
amount of functionality on top of Java networking and NIO. I also do not 
propose to replace Netty with something else. My primary focus for this 
thread is the design and implementation of the DrillBuf Java class 
itself. Namely, why was it necessary to introduce DrillBuf and/or 
UnsafeDirectLittleEndian? What functionality do those classes provide 
that existing Netty classes do not? Netty already provides memory 
pooling, reference counting, slicing, composite buffers, working with 
direct and heap memory. By looking at the DrillBuf.java git history, the 
DrillBuf was introduced in 2014 and prior to that Netty classes were 
used directly. Unfortunately, the commit that introduced DrillBuf does 
not provide any info why it was introduced and does not have a reference 
to a JIRA.


One may argue that DrillBuf is a way for Drill to encapsulate Netty 
ByteBuf and guard other modules that use DrillBuf from Netty ByteBuf 
API, so if Netty decides to change ByteBuf API in the next major 
release, amount of changes will be limited to DrillBuf only. Problem is 
that DrillBuf inherits from Netty AbstractByteBuf, so the above goal is 
not achieved either.




1. Boundary checking (on/off based on a flag or assertions being 
enabled/disabled, always on, always off, any other suggestions)

By understanding the design, we can see that we do, in fact, need both checked 
and unchecked methods. The row set mechanisms takes it upon themselves to have 
sufficient context to ensure that memory access is always within bounds, and so 
can benefit from the unchecked methods. As we said, we need debug-time bounds 
checks to catch errors during development.

On the other hand, value vectors should probably be protected by using checked 
methods because they do not have intrinsic mechanisms that ensure correct 
access. With vectors, the memory location to access is set by the caller (each 
operator) and there is no guarantee that all this code is correct all the time. 
(Though, it probably is right now because if it wasn't we'd get errors.)
I don't see a difference in bounds checking requirements between row set 
mechanism and value vectors as value vectors do have "safe" methods or 
intrinsic mechanism that ensures correct access. If not all operators 
use "safe" methods, than that operator should provide a guarantee. At 
the end, if an operator accesses memory out of allocated bounds, 
boundary checking will not fix it. If there is a bug in row set 
mechanism, value vectors or an operator, the end result (crash of the 
JVM) is the same for all.


The proposal just made represents a change; currently the two mechanisms use the same set 
of methods, which puts us into the "should we turn on bounds checks for everyone or 
turn them off for everyone" dilemma.

This is a technical design decision, not a community preference (other than 
that we'd prefer that stuff works...)
On the dev list, almost everything is a technical design decision and 
the community (compared to Drill users who prefer that the community 
makes a choice and stuff works) needs to agree on how to proceed and 
what development practices to adopt and follow.



2. Ref count checking (delegate to netty or have a separate mechanism to 
enable/disable, always on or off)


Ref counts are absolutely necessary, in the current design, for the reasons 
explained earlier: a single memory block can be shared by multiple DrillBufs. 
We have no other way at present to know when the last reference goes away if we 
don't have ref counts.


To deprecate reference counts, we'd have to rework the way that memory is 
transferred between operators. We'd have to deprecate shared buffers. (Or, we'd 
have to move to the fixed blocks mentioned earlier; but even then we'd need ref 
counts if a single sender can feed data to multiple fragments without copies.)

Again, this is not a preference issue, it is a fundamental design issue (unless 
you know of a trick to remove the need for ref counts, in which case please do 
propose it.) Or, if there is a better way to implement bounds checks that is 
faster or simpler, please do propose that.
The question is not about deprecating reference count mechanism. 
Question is whether or not to check that reference count is not zero 
every time DrillBuf is used (see ensureAccessible()).
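
To make the trade-off concrete, a schematic (not the actual DrillBuf code; the class 
and field names are invented) of a per-access checked read versus an unchecked one:

import io.netty.util.internal.PlatformDependent;

// Schematic only; DrillBuf delegates differently, but the cost profile is the same.
public final class BufSketch {
  private final long addr;        // start address of the underlying direct memory
  private final int capacity;     // bytes owned by this buffer
  private volatile int refCnt = 1;

  BufSketch(long addr, int capacity) {
    this.addr = addr;
    this.capacity = capacity;
  }

  // Checked read: every access pays for a ref-count test and a bounds test.
  public int getInt(int index) {
    if (refCnt == 0) {
      throw new IllegalStateException("buffer already released");  // ensureAccessible()-style check
    }
    if (index < 0 || index + 4 > capacity) {
      throw new IndexOutOfBoundsException("index: " + index);      // bounds check
    }
    return PlatformDependent.getInt(addr + index);
  }

  // Unchecked read: the caller (e.g. the row set writers) guarantees validity,
  // so the hot loop is a single unsafe load.
  public int getIntUnchecked(int index) {
    return PlatformDependent.getInt(addr + index);
  }
}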




3. Usage of UDLE


If we meet the design goals stated earlier, and DrillBuf is the only in

Re: Non-column filters in Drill

2018-04-07 Thread Ted Dunning
Ryan

What would happen if you defined a column so that you could use the normal
pushdown mechanism? In most cases, you wouldn't return the value of the
column, since its only purpose is to act as a filter, but nothing should
prevent you from returning the value of this not-really-a-column.

By letting it be treated as if it were a column, all of the normal mechanisms
can be brought to bear.

Another case can be seen in how the CSV reader lets you inject separators
and such.

On Sat, Apr 7, 2018, 16:23 Ryan Shanks  wrote:

> Hi Drill Dev Team!
>
> I am writing a custom storage plugin and I am curious if it is possible
> in Drill to pass a filter value, in the form of a where clause, that is
> not related to a column. What I would like to accomplish is something like:
>
> select * from myTable where notColumn = 'value';
>
> In the example, notColumn is not a column in myTable, or any other
> table, it is just a specific parameter that the storage plugin will use
> in the filtering process. Additionally, notColumn would not be returned
> as a column so Drill needs to not expect it as a part of the 'select *'.
> I created a rule that will push down and remove these non-column filter
> calls, but I need to somehow tell drill/calcite that the filter name is
> valid, without actually registering it as a column. The following error
> occurs prior to submitting any rules:
>
> org.apache.drill.common.exceptions.UserRemoteException: VALIDATION
> ERROR: From line 1, column 35 to line 1, column 39: Column 'notColumn'
> not found in any table
>
>
> Alternatively, can I manipulate star queries to only return a subset of
> all the columns for a table?
>
> Any insight would be greatly appreciated!
>
> Thanks,
> Ryan
>


Re: Non-column filters in Drill

2018-04-07 Thread Aman Sinha
A better option would be to have a user-defined function that takes two
parameters and evaluates to a boolean value,
e.g. select * from myTable where MyUDF(notColumn, 'value') IS TRUE;

The storage plugin that you are developing would need to implement a
pushdown rule that looks at the filter condition and, if it contains
'MyUDF()', pushes down to the scan/reader corresponding to your plugin.
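
A rough sketch of such a marker UDF using Drill's simple-function annotations (the 
name my_udf and the always-true body are placeholders; the real filtering would 
still happen in your reader after the pushdown):

import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.holders.BitHolder;
import org.apache.drill.exec.expr.holders.VarCharHolder;

// Placeholder UDF: always evaluates to true, because its only job is to be a
// recognizable marker the plugin's pushdown rule can find in the filter condition.
@FunctionTemplate(name = "my_udf",
    scope = FunctionTemplate.FunctionScope.SIMPLE,
    nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
public class MyUdf implements DrillSimpleFunc {

  @Param VarCharHolder pseudoColumn;   // first argument, e.g. the pseudo-column value
  @Param VarCharHolder filterValue;    // second argument, e.g. 'value'
  @Output BitHolder out;

  @Override
  public void setup() { }

  @Override
  public void eval() {
    out.value = 1;   // TRUE; the plugin's rule pushes the arguments to the reader
  }
}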


On Sat, Apr 7, 2018 at 6:58 PM, Hanumath Rao Maduri 
wrote:

> Hello Ryan,
>
> Thank you for trying out Drill. Drill/Calcite expects "notColumn" to be
> supplied by the underlying scan.
> However, I expect that this column will be present in the scan but not past
> the filter (notColumn = 'value') in the plan.
> In that case you may need to pushdown the filter to the groupScan and then
> remove the column projections from your custom groupscan.
>
> It would be easy for us to guess what could be the issue, if you can post
> the logical and physical query plan's for this query.
>
> Hope this helps. Please do let us know if you have any further issues.
>
> Thanks,
>
>
> On Sat, Apr 7, 2018 at 2:08 PM, Ryan Shanks 
> wrote:
>
> > Hi Drill Dev Team!
> >
> > I am writing a custom storage plugin and I am curious if it is possible
> in
> > Drill to pass a filter value, in the form of a where clause, that is not
> > related to a column. What I would like to accomplish is something like:
> >
> > select * from myTable where notColumn = 'value';
> >
> > In the example, notColumn is not a column in myTable, or any other table,
> > it is just a specific parameter that the storage plugin will use in the
> > filtering process. Additionally, notColumn would not be returned as a
> > column so Drill needs to not expect it as a part of the 'select *'. I
> > created a rule that will push down and remove these non-column filter
> > calls, but I need to somehow tell drill/calcite that the filter name is
> > valid, without actually registering it as a column. The following error
> > occurs prior to submitting any rules:
> >
> > org.apache.drill.common.exceptions.UserRemoteException: VALIDATION
> ERROR:
> > From line 1, column 35 to line 1, column 39: Column 'notColumn' not found
> > in any table
> >
> >
> > Alternatively, can I manipulate star queries to only return a subset of
> > all the columns for a table?
> >
> > Any insight would be greatly appreciated!
> >
> > Thanks,
> > Ryan
> >
>


Re: "Death of Schema-on-Read"

2018-04-07 Thread Paul Rogers
Hi Hanu,
Thanks! After sleeping on the idea, I realized that it can be generalized for 
any kind of expression. But, I also realized that the cast mechanism, by 
itself, cannot be a complete solution. Details posted in the JIRA for anyone 
who is interested.
Thanks,
- Paul

 

On Saturday, April 7, 2018, 8:05:34 AM PDT, Hanumath Rao Maduri 
 wrote:  
 
 Hello All,

I have created a JIRA to track this approach.
https://issues.apache.org/jira/browse/DRILL-6312

Thanks,
-Hanu

  

Re: Non-column filters in Drill

2018-04-07 Thread Paul Rogers
Hi Ryan,

There is an obscure, but very handy feature of Drill called table functions. 
[1] These allow you to set parameters of your format plugin as part of a query.

You mentioned a storage plugin. I've not tried a table function with a storage 
plugin. I have tested table functions with a format plugin.

Your format or storage plugin has a Jackson-serializable Java class. Normally 
you set the properties for your plugin in the Drill web console. But, these can 
also be set in the table function.

I had a use case something like yours. I defined an example "regex" plugin 
where the user can specify a regular expression to apply to a text file to 
parse columns. The user can then provide a list of column names. Using the table 
function, I could specify the regex and column names per query.
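
For concreteness, a hypothetical Jackson-serializable config for such a "regex" 
plugin, with a comment showing how a table function might override its properties 
per query (field names and the sample query are illustrative assumptions, not the 
actual plugin described here):

import com.fasterxml.jackson.annotation.JsonTypeName;
import org.apache.drill.common.logical.FormatPluginConfig;

// Hypothetical config class. Every public property can be set either in the
// web console storage config or inline through a table function.
@JsonTypeName("regex")
public class RegexFormatConfig implements FormatPluginConfig {
  public String regex;      // pattern applied to each line of the text file
  public String fields;     // comma-delimited column names (simple types only, see DRILL-6169)
  public String extension;  // file extension the plugin claims

  // A query overriding these per query with a table function might read:
  //   SELECT * FROM table(dfs.`/logs/app.log`(
  //       type => 'regex', regex => '(\d+) (\w+)', fields => 'ts,level', extension => 'log'));
}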

This exercise did, however, point out two current limitations of table 
functions. First, they work only with simple data types (strings, ints). 
(DRILL-6169) So, my list of columns has to be a single string with a comma 
delimited list of columns. I could not use the more natural list of strings. 
Second, table functions do not retain the configured value of parameters: you 
have to include all parameters in the function, not just the ones you want to 
change. (DRILL-6168)

Yet another option is to set a session option. However, unless you do a bit of 
clever coding, format plugins don't have visibility to session options 
(DRILL-5181).

Perhaps your use case provides a compelling reason to fix some of these 
limitations...

Thanks,

- Paul

[1] 
https://drill.apache.org/docs/plugin-configuration-basics/#using-the-formats-attributes-as-table-function-parameters,
 see the section "Using the Formats Attributes as Table Function Parameters".


On Saturday, April 7, 2018, 10:37:05 PM PDT, Aman Sinha 
 wrote:  
 
 A better option would be to have a user-defined function that takes 2
parameters and evaluates to a boolean value.
 e.g  select * from myTable where MyUDF(notColumn, 'value')  IS TRUE;

The Storage Plugin that you are developing would need to implement a
pushdown rule that  looks
at the filter condition and if it contains 'MyUDF()', it would pushdown to
the scan/reader corresponding to your plugin.


On Sat, Apr 7, 2018 at 6:58 PM, Hanumath Rao Maduri 
wrote:

> Hello Ryan,
>
> Thank you for trying out Drill. Drill/Calcite expects "notColumn" to be
> supplied by the underlying scan.
> However, I expect that this column will be present in the scan but not past
> the filter (notColumn = 'value') in the plan.
> In that case you may need to pushdown the filter to the groupScan and then
> remove the column projections from your custom groupscan.
>
> It would be easy for us to guess what could be the issue, if you can post
> the logical and physical query plan's for this query.
>
> Hope this helps. Please do let us know if you have any further issues.
>
> Thanks,
>
>
> On Sat, Apr 7, 2018 at 2:08 PM, Ryan Shanks 
> wrote:
>
> > Hi Drill Dev Team!
> >
> > I am writing a custom storage plugin and I am curious if it is possible
> in
> > Drill to pass a filter value, in the form of a where clause, that is not
> > related to a column. What I would like to accomplish is something like:
> >
> > select * from myTable where notColumn = 'value';
> >
> > In the example, notColumn is not a column in myTable, or any other table,
> > it is just a specific parameter that the storage plugin will use in the
> > filtering process. Additionally, notColumn would not be returned as a
> > column so Drill needs to not expect it as a part of the 'select *'. I
> > created a rule that will push down and remove these non-column filter
> > calls, but I need to somehow tell drill/calcite that the filter name is
> > valid, without actually registering it as a column. The following error
> > occurs prior to submitting any rules:
> >
> > org.apache.drill.common.exceptions.UserRemoteException: VALIDATION
> ERROR:
> > From line 1, column 35 to line 1, column 39: Column 'notColumn' not found
> > in any table
> >
> >
> > Alternatively, can I manipulate star queries to only return a subset of
> > all the columns for a table?
> >
> > Any insight would be greatly appreciated!
> >
> > Thanks,
> > Ryan
> >
>
  

Re: [DISCUSS] DrillBuf

2018-04-07 Thread Paul Rogers
Hi Vlad,

Thanks for the clarifications. My general comment is that it is always good to 
refactor things if that is the fastest way to achieve some goal. It is not 
clear, however, what the goal is here other than code improvement. Since Drill 
still has plenty of opportunities for improvement that will help users, I guess 
it's a bit unclear why we'd clean up code just for its own sake if doing so 
will entail a large amount of work and provide no user benefit.

I say this because I've done quite a bit of work in surrounding code and 
learned how much work it takes to make these kinds of changes and then 
stabilize the result.

On the other hand, if you are working on a memory-related project that is 
making major changes, and these issues are getting in your way, then 
refactoring could well be the fastest way to achieve your project goals. Are 
you working on such a project?

> why was it necessary to introduce DrillBuf and/or UnsafeDirectLittleEndian? 
> What functionality do those classes provide that existing Netty classes do 
> not?
> IMO, it will be good to make DrillBuf code simpler and consistent.

Good questions!  So, it seems that DrillBuf turns out to be a bit of a muddle. 
But, unless it is preventing us from making some desired change, or changing 
things will provide a significant performance boost, I guess I'd just hold my 
nose and leave it as is for now.

The same argument applies, by the way, to the value vectors. The value vector 
classes become almost entirely redundant once the row set mechanisms are 
adopted. But, I suspect that value vectors will live on anyway until there is a 
reason to do something with them.

> Question is whether or not to check that reference count is not zero every 
> time DrillBuf is used (see ensureAccessible()).

IMHO, there is no reason to check on every access, except in a "paranoid" debug 
mode. Removing the check might provide a nice little performance bump. Avoiding 
bounds and ref checks was one of the goals of the "unchecked" methods that I 
had in DrillBuf but which we decided to remove... If it turns out that the 
methods used by the row set abstractions do, in fact, do bounds checks, then 
this is a strong case to put the "unsafe" (unchecked) methods back.

> 5. Moving DrillBuf to a different package


I agree with your explanation. However, I assume the original authors were 
forced to put DrillBuf in the Netty package for some reason or other. If that 
reason is no longer valid, then it is usually pretty simple to have your IDE 
move the class to a different package and adjust all its references. This 
improvement, if possible, seems low cost and so might be worth doing.

On the other hand, if the move causes things to break, which causes effort to 
go into changing things, I guess I'd wonder why we can't just leave well enough 
alone and focus on things which are actually broken or could use a performance 
boost.

In short, I'm all for refactoring when it helps us deliver fixes or new 
features to customers. But, I'm struggling to see the user benefit in this 
case. Can you help me to understand the user benefit of these changes?

Thanks,

- Paul

 

On Saturday, April 7, 2018, 9:35:06 PM PDT, Vlad Rozov  
wrote:  
 
 Hi Paul,

My comments in-line.

Thank you,

Vlad

On 4/5/18 20:50, Paul Rogers wrote:
> Hi Vlad,
>>  I'd suggest to keep focus on DrillBuf design and implementation as the only 
>>gate for accessing raw (direct) memory.
> I was doing that. By explaining where DrillBuf fits in the overall design, we 
> see that DrillBuf should be the only access point for direct memory. The 
> context explains why this is the right decision. Changes to DrillBuf should 
> support our design as DrillBuf only exists for that purpose.
My concern is not why Drill adopted Netty as Netty does provide a good 
amount of functionality on top of Java networking and NIO. I also do not 
propose to replace Netty with something else. My primary focus for this 
thread is the design and implementation of the DrillBuf Java class 
itself. Namely, why was it necessary to introduce DrillBuf and/or 
UnsafeDirectLittleEndian? What functionality do those classes provide 
that existing Netty classes do not? Netty already provides memory 
pooling, reference counting, slicing, composite buffers, working with 
direct and heap memory. By looking at the DrillBuf.java git history, the 
DrillBuf was introduced in 2014 and prior to that Netty classes were 
used directly. Unfortunately, the commit that introduced DrillBuf does 
not provide any info why it was introduced and does not have a reference 
to a JIRA.

One may argue that DrillBuf is a way for Drill to encapsulate Netty 
ByteBuf and guard other modules that use DrillBuf from Netty ByteBuf 
API, so if Netty decides to change ByteBuf API in the next major 
release, amount of changes will be limited to DrillBuf only. Problem is 
that DrillBuf inherits from Netty AbstractByteBuf, so the above goal is 
not achieved either.