Re: License headers inside Javadoc comments

2024-01-29 Thread Paul Rogers
James,

If the extra check is costly, you might also observe that all (most?)
existing files have the proper header format. It is only new or changed
files that must be checked. So, you can use Git to determine the change set
on each PR and do the extra format check only on those files.

- Paul

On Mon, Jan 29, 2024 at 7:37 AM James Turton  wrote:

> Thank you for these explanations Claude.
>
> Looking at your second paragraph about the proposal to enhance the code
> that inserts headers, a comment start definition for Java files of
> '/*\n' (newline after the '/*') should work to accept the Apache license
> header in a Java comment but reject it if it's in a Javadoc comment.
> That seems promising, and I'll take a look at RAT-330, but I'm also able
> to move forward in Drill using alternatives in the interim.
>
> Regards
> James
>
>
> On 2024/01/29 09:33, Claude Warren wrote:
> > James,
> >
> > The general processing for matching licenses strips out all non-essential
> > text (e.g. '/' and '*'), so the current implementation cannot determine
> > whether the license text is within a Javadoc block or not.  Some matchers
> > (e.g. Copyright, SPDX, and regex) do use the unmodified text, but they are
> > generally much slower.  In fact, the original SPDX and Copyright
> > implementations caused a significant (two orders of magnitude or more)
> > increase in processing time.  It would be possible to create a custom
> > matcher to do what you want.  But there is no mechanism currently
> available
> > in the code base to only call a matcher on specific file types.
> >
> > There is a section of code that understands file types, but this is the
> > code that inserts headers into files that don't have them.  It may be
> > possible to build on that to create a custom matcher to ensure that license
> > comments are not within Javadoc.  There is a ticket open to modify how
> > this code works so that new file types, with comment start/stop definitions
> > and restrictions on first lines and such, can be defined outside of the
> > codebase, making it possible to insert headers in as-yet-unrecognized file
> > formats. [1]  This might be extended to provide input to the process you
> > are requesting.
> >
> > There is also a section of code that removes the non-essential text.  The
> > 'prune' method could be modified to remove blocks of code between the
> > opening Javadoc '/**' and the closing '*/'.  But this may lead to problems
> > with non-Java files.  Speaking of non-Java files, have you thought about
> > ensuring that the license does not appear in other Javadoc-like systems?
> > [2]  Once this can of worms is opened we will need a way to manage all the
> > requests that will follow for other file types.
> >
> > If you have any ideas for implementing the change I would be interested
> to
> > hear them.
> >
> > Claude
> >
> > [1] https://issues.apache.org/jira/browse/RAT-330?
> > [2]
> >
> https://stackoverflow.com/questions/5334531/using-javadoc-for-python-documentation
> >
> > On Fri, Jan 26, 2024 at 2:38 PM James Turton  wrote:
> >
> >> Thanks Phil.
> >>
> >> Here's some background [1] which comes from before I was involved with
> >> Drill. What they wanted was for the license header checker to accept, in
> >> .java files,
> >>
> >> /*
> >>* Licensed to the Apache Software Foundation (ASF) under one
> >>* or more contributor license agreements.  See the NOTICE file
> >>* distributed with this work for additional information
> >>  etc.
> >>
> >> but reject
> >>
> >> /**
> >>* Licensed to the Apache Software Foundation (ASF) under one
> >>* or more contributor license agreements.  See the NOTICE file
> >>* distributed with this work for additional information
> >>  etc.
> >>
> >> Notice the two asterisks that open the Java comment block in the second
> >> form, thereby making it a Javadoc comment that will appear in generated
> >> Javadoc. There are no longer any examples of the latter in Drill, but
> >> this has been enforced by the addition of the license-maven-plugin.
> >>
> >> I got here because I want to remove that plugin, which essentially
> >> duplicates RAT, in favour of another (with exactly the same name :()
> >> that can generate license and notice information for our third-party
> >> code. This last task is what I'm really doing; the Javadoc license
> >> header rejection matter is yak shaving that came up along the way.
> >>
> >> So my yak-shaving question is: if I make RAT Drill's only license header
> >> checker, could I make it reject license headers of the second form?
> >> Even if I can't, I'm inclined to make it the only header checker since I
> >> think that it's in any case mandatory and authoritative. But in an
> >> effort to retain the work of the previous Drill developers, I'm trying to
> >> preserve what they implemented.
> >>
> >> 1. https://issues.apache.org/jira/browse/DRILL-6320
> >>
> >> On 2024/01/26 14:06, P. Ottlinger wrote:
> >>> Hi James,
> >>>
> 

Re: [Important] GSoC 2024 Project Ideas

2024-01-27 Thread Paul Rogers
Some ideas:

* Time marches on. Drill has a design from ten years back. What modern
environment things do current users need? Integration with Amazon Glue?
Delta lake/lakehouse/whatever the cool new thing is? Integration with the
latest & greatest BI tools?
* Seems many folks use Drill as a desktop tool. But, Drill is designed for
a distributed environment. Could we provide an in-process exchange operator
that just shifts ownership of vectors rather than serializing them over the
network back to the same process? What other changes would be helpful?
* Implementing modern JSON support: store complex types as Java objects
using the Object Vector. Implement the standard SQL JSON functions.
* Add an Avatica-based JDBC interface. I can provide some of the
server-side stuff from a project I did many moons ago. The benefit is the
ability to use Drill without pulling in a large amount of the Drill code
base with its Guava dependencies, etc.
* Fix the timestamp issue: use UTC throughout rather than the current mix
of UTC and local time. Ensure tests pass regardless of the timezone on the
local machine.
* Implement a 64-bit timestamp type to help with that Parquet extension
that someone is adding. I might be able to dig up a proposal that was done
a few years back. Basically, use the Int64 vector for storage, add support
for nanos in the type functions.
* Review compilation performance using the old-school Janino + bytecode
fixups vs. letting modern Java do the work. Five years ago, Java was
faster. Today, it is probably even better. Scrap all the complex code
associated with the old way of doing the work if Java is, in fact, faster.
* Fancy up our Docker and K8s support. Build that all-in-one desktop Drill
image. Ensure the Drill images are up to date on DockerHub. Finish and/or
update the K8s support: Helm chart? Something newer?
* Test Drill on the latest Java versions. Any code changes or library
issues with compiling with the latest? If so, file a JIRA with all the
library issues so they can be tackled. Fix any Drill issues.
* Create a demo data science environment in Python: the equivalent of
SqlLine, but with Pandas, charts, conversion to numpy arrays, etc. Maybe
have this be a Docker container that can run alongside the improved Drill
one. Write a blog post on Medium or whatever people use these days. Note
when to use the simpler Arrow-based stack vs. when to move up to a true DB
engine.
* Extend the Daffodil work. Address the questions from one of my emails:
can we find a common metadata format so that Daffodil is just one of
several supported sources of metadata? Allow Daffodil to describe any
supported Drill datasource. Integrate Daffodil's file format data with
"statistics" about which files hold which data. Etc.
* For someone from a marketing background: try to find out where Drill is
used today and what that new user base needs. Extra credit: figure out how
to reach similar people who may not have heard of the project, but who
would also benefit from it.

Many of those are non-trivial projects that would appeal to overachiever
types. Sounds like James can prepare a list of projects for the folks with
more typical skills and time commitments.

Thanks,

- Paul

On Sat, Jan 27, 2024 at 8:11 AM James Turton  wrote:

> Supplement: a recent article and commentary on said DBs.
>
> https://news.ycombinator.com/item?id=39119198
>
> On 2024/01/27 18:08, James Turton wrote:
> > I thought of vector database storage / format plugins for Drill  to
> > tick their AI/ML box but it isn't clear to me that doing SQL over
> > those datasets is of any use to anyone. I think that we do have other
> > interesting, if unfashionable, lines of work that we could propose.
> >
> > On 2024/01/25 14:20, Priya Sharma wrote:
> >> Hello PMCs,
> >>
> >> Google Summer of Code is the ideal opportunity for you to attract new
> >> contributors to your projects and GSoC 2024 is here.
> >>
> >> The ASF will be applying as a participating organization for GSoC 2024.
> >> As a part of the application we need you all to *mandatorily* start
> >> recording your ideas now [1], by 3rd Feb at the latest.
> >>
> >> There is a slight change in the rules this year, just reiterating here:
> >> - For the 2024 program, there will be three options for project scope:
> >> medium at ~175 hours, large at ~350 hours, and a new size: small at ~90
> >> hours.
> >>   Please add the "*full-time*" label to the JIRA for a 350-hour project,
> >> the "*part-time*" label for a 175-hour project, and "*small*" for a 90-hour
> >> project.
> >>
> >> Note: They are looking to bring more open source projects in the AI/ML
> >> field into GSoC 2024, so we encourage more projects from this domain
> >> to participate.
> >>
> >> If you are a new mentor or your project is participating for the first
> >> time, please read [2][3].
> >>
> >> On behalf of the GSoC 2024 admins,
> >> Please feel free to reach out to us in case of queries or concerns.
> >>
> >> [1] https://s.apache.org/gsoc2024ideas
> >> 

Re: License headers inside Javadoc comments

2024-01-26 Thread Paul Rogers
Hi James,

For some reason, Drill started with the license headers in Javadoc
comments. The (weak) explanation I got was that we never generate Javadoc,
so it didn't really matter. Later, we started converting the headers to
regular comments when convenient.

If we were to generate Javadoc, having the license at the top of each page
as the summary for each class would probably not be something that anyone
finds useful.

I don't know how to configure the license plugin. But, I do suspect a
Python file (or shell script) could make a one-time pass over the files to
standardize headers into whatever format the team chooses. Only the first
line of each file would change.
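
(Purely as an illustration of such a one-time pass, here is a sketch written
in Java rather than Python, assuming the Javadoc-style headers always open
with a bare "/**" on the first line:)

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Stream;

public class StandardizeHeaders {
  public static void main(String[] args) throws IOException {
    Path root = Paths.get(args.length > 0 ? args[0] : ".");
    try (Stream<Path> files = Files.walk(root)) {
      files.filter(p -> p.toString().endsWith(".java")).forEach(p -> {
        try {
          List<String> lines = Files.readAllLines(p);
          // Only the first line changes: "/**" becomes "/*".
          if (!lines.isEmpty() && lines.get(0).trim().equals("/**")) {
            lines.set(0, "/*");
            Files.write(p, lines);
          }
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      });
    }
  }
}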

- Paul

On Thu, Jan 25, 2024 at 11:22 PM James Turton  wrote:

> Good morning!
>
> I'd like to ask about a feature to prevent RAT from allowing license
> headers to appear inside Javadoc comments  (/**) while still requiring
> them in Java comments (/*) in .java files. Currently the Drill project
> makes use of com.mycila.license-maven-plugin to reject licenses in
> Javadoc comments because the developers at the time didn't want license
> headers cluttering the Javadoc website that is generated from the
> source. Are you aware of  a general view on Apache license headers
> appearing in Javadoc pages? If preventing them from doing so is a good
> idea, could this become a (configurable) feature in RAT?
>
> Thanks
> James Turton
>


Re: Possible Regression: Can't build current master

2024-01-25 Thread Paul Rogers
I updated my GitHub branch from master, then pulled that branch to my
computer, where I did a build. It is clean.

Now, I did have to do some git tricks to clean up the stray commits from my
botched merge. See below.

Then, I pulled cgivre/master and did a diff with my local copy of master.
There are *many* different files. So, I wonder if your master branch has
somehow diverged from the Drill master. Or, are you tracking some other
branch?

As a sanity check, I suggest you reset your master to the Drill master and
do a clean build. Some commands that might help:

git fetch origin
git checkout master
git reset --hard origin/master

Use these with caution: I used a slightly different set to update my own
branch. Caveat emptor. This assumes that your Drill clone is "origin".

- Paul


On Thu, Jan 25, 2024 at 12:48 PM Paul Rogers  wrote:

> The symbols in question are some I modified in my recent PR. I wonder if
> there was a merge issue somewhere? The PR did get a clean build on the
> master branch.
>
> I'll try a build myself to see if I can locate the issue.
>
> - Paul
>
>
> On Thu, Jan 25, 2024 at 10:56 AM Charles Givre  wrote:
>
>> All,
>> I just rebased my local Drill on the current master and I'm getting the
>> following error when I try to build it.   Is anyone else encountering this?
>>
>> [ERROR] Failed to execute goal
>> org.apache.maven.plugins:maven-compiler-plugin:3.11.0:compile
>> (default-compile) on project vector: Compilation failure
>> [ERROR]
>> /Users/charlesgivre/github/drill/exec/vector/src/main/java/org/apache/drill/exec/vector/accessor/writer/UnionVectorShim.java:[78,10]
>> error: cannot find symbol
>> [ERROR]   symbol:   class UnionWriter
>> [ERROR]   location: class UnionVectorShim
>> [ERROR]
>> [ERROR] -> [Help 1]
>> [ERROR]
>> [ERROR] To see the full stack trace of the errors, re-run Maven with the
>> -e switch.
>> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>> [ERROR]
>> [ERROR] For more information about the errors and possible solutions,
>> please read the following articles:
>> [ERROR] [Help 1]
>> http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
>
>


Re: Possible Regression: Can't build current master

2024-01-25 Thread Paul Rogers
The symbols in question are some I modified in my recent PR. I wonder if
there was a merge issue somewhere? The PR did get a clean build on the
master branch.

I'll try a build myself to see if I can locate the issue.

- Paul


On Thu, Jan 25, 2024 at 10:56 AM Charles Givre  wrote:

> All,
> I just rebased my local Drill on the current master and I'm getting the
> following error when I try to build it.   Is anyone else encountering this?
>
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-compiler-plugin:3.11.0:compile
> (default-compile) on project vector: Compilation failure
> [ERROR]
> /Users/charlesgivre/github/drill/exec/vector/src/main/java/org/apache/drill/exec/vector/accessor/writer/UnionVectorShim.java:[78,10]
> error: cannot find symbol
> [ERROR]   symbol:   class UnionWriter
> [ERROR]   location: class UnionVectorShim
> [ERROR]
> [ERROR] -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the
> -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException


Re: Parquet files with microsecond columns

2024-01-22 Thread Paul Rogers
Hi Peter,

It sounds like you are on the right track: the new option is the quick
short-term solution. The best long-term solution is to generalize Drill's
date/time type, but that would take much more work. (Drill also has a bug
where the treatment of timezones is incorrect, which forces Drill to run in
the UTC time zone -- something that will also require difficult work.)

Given that JDBC works, the problem must be in the web interface, not in
your Parquet implementation. You've solved the problem with a new session
option. The web interface, however, has no sessions: if you set an option
in one call, and do your query in another, Drill will have "forgotten" your
option. Instead, there is a way to attach options to each query. Are you
using that feature?

As I recall, the JSON message to submit a query has an additional field to
hold session options. I do not recall, however, if the web UI added that
feature. Does anyone else know? Two workarounds. First, use your favorite
JSON request tool to submit a query with the option set. Second, set your
option as a system option so it is available to all sessions: ALTER SYSTEM
SET...

Thanks,

- Paul

On Mon, Jan 22, 2024 at 1:38 AM Peter Franzen  wrote:

> Hi,
>
> I am using Drill to query Parquet files that have fields of type
> timestamp_micros. By default, Drill truncates those microsecond
> values to milliseconds when reading the Parquet files in order to convert
> them to SQL timestamps.
>
> In some of my use cases I need to read the original microsecond values (as
> 64-bit values, not SQL timestamps) through Drill, but
> this doesn’t seem to be possible (unless I’ve missed something).
>
> I have explored a possible solution to this, and would like to run it by
> some developers more experienced with the Drill code base
> before I create a pull request.
>
> My idea is to add two options similar to
> “store.parquet.reader.int96_as_timestamp" to control whether or not
> microsecond
> times and timestamps are truncated to milliseconds. These options would be
> added to “org.apache.drill.exec.ExecConstants" and
> "org.apache.drill.exec.server.options.SystemOptionManager", and to
> drill-module.conf:
>
> store.parquet.reader.time_micros_as_int64: false,
> store.parquet.reader.timestamp_micros_as_int64: false,
>
> These options would then be used in the same places as
> “store.parquet.reader.int96_as_timestamp”:
>
> org.apache.drill.exec.store.parquet.columnreaders.ColumnReaderFactory
>
> org.apache.drill.exec.store.parquet.columnreaders.ParquetToDrillTypeConverter
> org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter
>
> to create an int64 reader instead of a time/timestamp reader when the
> corresponding option is set to true.
>
> In addition to this,
> “org.apache.drill.exec.store.parquet.metadata.FileMetadataCollector” must
> be altered to _not_ truncate the min and max
> values for time_micros/timestamp_micros if the corresponding option is
> true. This class doesn’t have a reference to an OptionManager, so
> my guess is that the two new options must be extractred from the
> OptionManager when the ParquetReaderConfig instance is created.
>
> Filtering on microsecond columns would be done using 64-bit values rather
> than TIME/TIMESTAMP values, e.g.
>
> select * from <table> where <microsecond column> = 1705914906694751;
>
> I’ve tested the solution outlined above, and it seems to work when using
> sqlline and with the JDBC driver, but not with the web-based interface.
> Any pointers to the relevant code for that would be appreciated.
>
> An alternative solution to the above could be to intercept all reading of
> the Parquet schemas and modify the schema to report the
> microsecond columns as int64 columns, i.e. to completely discard the
> information that the columns contain time/timestamp values.
> This could potentially make parts of the code, where it is not obvious that
> the time/timestamp properties of columns are used, behave as expected.
> However, this variant would not align with how INT96
> timestamps are handled.
>
> Any thoughts on this idea for how to access microsecond values would be
> highly appreciated.
>
> Thanks,
>
> /Peter
>
>


[jira] [Resolved] (DRILL-8375) Incomplete support for non-projected complex vectors

2024-01-07 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-8375.

Resolution: Fixed

> Incomplete support for non-projected complex vectors
> 
>
> Key: DRILL-8375
> URL: https://issues.apache.org/jira/browse/DRILL-8375
> Project: Apache Drill
>  Issue Type: Bug
>        Reporter: Paul Rogers
>    Assignee: Paul Rogers
>Priority: Major
>
> The `ResultSetLoader` implementation supports all of Drill's vector types. 
> However, DRILL-8188 discovered holes in support for non-projected vectors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


UNION and LIST non-projection support in EVF

2024-01-06 Thread Paul Rogers
Hi All,

Happy New Year!

I dusted off the work to add non-projection support in EVF for the UNION
and LIST types. I believe that only REPEATED LIST is missing.

"Non-projection support" just means that you can read a JSON file that
requires a UNION or LIST vector, and tell EVF to NOT actually project that
vector. That is, if the union column is u and the list is l, and you have
other columns a, b, c, you can write "SELECT a, b, c FROM 'myFile.json'".
JSON will read all columns, but EVF won't actually create vectors for the
columns u and l that you didn't want.

If you write a format plugin, your code can write to the "dummy" UNION
column writer, blissfully ignorant of the fact that there is no actual
vector underneath: the data just goes into the bit bucket.
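
To make that concrete, here is a rough sketch of the plugin side (illustrative
only; the column names are made up and the writer calls are from my memory of
the EVF interfaces, not taken from the PR):

// rowWriter is the RowSetLoader that EVF hands the format plugin.
rowWriter.start();
rowWriter.scalar("a").setInt(1);          // projected: value lands in a vector
rowWriter.scalar("d").setString("junk");  // unprojected: a "dummy" writer, value is discarded
// The same transparency applies to the dummy UNION and LIST writers described above.
rowWriter.save();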

The work mostly required fiddling with existing mechanisms. A new test file
demonstrates that the changes work. All the existing tests suggest that the
work didn't break anything.

Reviews greatly appreciated. https://github.com/apache/drill/pull/2867

It would be great to get this into the next release, but doing so isn't
critical, unless Luoc still needs this work.

Thanks,

- Paul


Re: Next Version

2024-01-01 Thread Paul Rogers
s things in many operators, in favour of a
> different approach where we kick any invalid data encountered while loading
> column FOO out to a generated _FOO_EXCEPTIONS VARCHAR (or VARBINARY, though
> binary data formats tend not to be malformed?) column. This would let a
> query over dirty data complete without invisible data swallowing, and would
> mean we could cut further effort on UNION support.
> >>
> >> On 2024/01/01 07:11, James Turton wrote:
> >>> Happy New Year!
> >>>
> >>> Here's another two cents. Make that five now that I scan this email
> again!
> >>>
> >>> Excluding our Docker Hub images (which are popular), Drill is
> downloaded ~1000 times a month [1] (order of magnitude, it's hard to count
> genuinely new installations from web server downloads).
> >>>
> >>> What roles are these folks in? I'm a data engineer by day and I don't
> think that we count for a large share of those downloads. The DEs I work
> with are risk averse sorts that tend to favour setups with rigid schemas
> early on and no surprises for their users at query time. Add to that a
> second stat from the download data: the biggest single download user OS is
> Windows, at about 50% [1]. Some of these users may go on to copy that
> download to a server environment but I have a theory that many of them go
> on to run embedded Drill right there on beefy Windows laptops.
> >>>
> >>> I conjecture that most of the people reaching for Drill are analysts
> or developers working _away_ from an established, shared data
> infrastructure. There may not be any shared data engineering where they
> are, or they may find themselves in a fashionable "Data Mesh" environment
> [2]. I'm probably abusing Data Mesh a bit here in that I'm told that it
> mainly proposes a federation of distinct data _teams_, rather than of data
> _systems_ but, if you entertain my cynical formulation of "Data Mesh guys!
> Silos aren't uncool any more!" just a bit, then you can well imagine why a
> user in a Data Mesh might look for something like Drill to combine data
> from different silos on their own machine. Tangentially this suggests to me
> that we should keep putting effort into: embedded Drill, Windows support,
> rapid installation and setup, low "time to insight".
> >>>
> >>> MongoDB questions still come up frequently giving a reason beyond the
> JSON files questions to think that the JSON data model is still very
> important. Wherever we decide to bound the current EVF v2 data model
> implementation, maybe we can sketch out a design of whatever is
> unimplemented in some updates to the Drill wiki pages? This would give
> other devs a head start if we decide that some unsupported complex data
> type is worth implementing down the road?
> >>>
> >>> 1. https://infra-reports.apache.org/#downloads=drill
> >>> 2. https://martinfowler.com/articles/data-mesh-principles.html
> >>>
> >>> Regards
> >>> James
> >>>
> >>> On 2024/01/01 03:16, Charles Givre wrote:
> >>>> I'll throw my .02 here...  As a user of Drill, I've only had the
> occasion to use the Union once. However, when I used it, it consumed so
> much memory, we ended up finding a workaround anyway and stopped using it.
> Honestly, since we improved the implicit casting rules, I think Drill is a
> lot smarter about how it reads data anyway. Bottom line, I do think we
> could drop the union and repeated union.
> >>>>
> >>>> The repeated lists and maps however are unfortunately something that
> does come up a bit.   Honestly, I'm not sure what work is remaining here
> but TBH Drill works pretty well at the moment with most of the data I'm
> using it for.  This would include some really nasty nested JSON objects.
> >>>>
> >>>> -- C
> >>>>
> >>>>
> >>>>> On Dec 31, 2023, at 01:38, Paul Rogers  wrote:
> >>>>>
> >>>>> Hi Luoc,
> >>>>>
> >>>>> Thanks for reminding me about the EVF V2 work. I got mostly done
> adding
> >>>>> projection for complex types, then got busy on other projects. I've
> yet to
> >>>>> tackle the hard cases: unions, repeated unions and repeated lists
> (which
> >>>>> are, in fact, repeated repeated unions).
> >>>>>
> >>>>> The code to handle unprojected fields in these areas is getting
> awfully
> >>>>> complicated. In doing that work, and then seeing a trick that Druid
> uses,
> >>>>> I

Re: Next Version

2024-01-01 Thread Paul Rogers
rent silos on their own machine. Tangentially this suggests to me
> that we should keep putting effort into: embedded Drill, Windows support,
> rapid installation and setup, low "time to insight".
> >>>
> >>> MongoDB questions still come up frequently giving a reason beyond the
> JSON files questions to think that the JSON data model is still very
> important. Wherever we decide to bound the current EVF v2 data model
> implementation, maybe we can sketch out a design of whatever is
> unimplemented in some updates to the Drill wiki pages? This would give
> other devs a head start if we decide that some unsupported complex data
> type is worth implementing down the road?
> >>>
> >>> 1. https://infra-reports.apache.org/#downloads=drill
> >>> 2. https://martinfowler.com/articles/data-mesh-principles.html
> >>>
> >>> Regards
> >>> James
> >>>
> >>> On 2024/01/01 03:16, Charles Givre wrote:
> >>>> I'll throw my .02 here...  As a user of Drill, I've only had the
> occasion to use the Union once. However, when I used it, it consumed so
> much memory, we ended up finding a workaround anyway and stopped using it.
> Honestly, since we improved the implicit casting rules, I think Drill is a
> lot smarter about how it reads data anyway. Bottom line, I do think we
> could drop the union and repeated union.
> >>>>
> >>>> The repeated lists and maps however are unfortunately something that
> does come up a bit.   Honestly, I'm not sure what work is remaining here
> but TBH Drill works pretty well at the moment with most of the data I'm
> using it for.  This would include some really nasty nested JSON objects.
> >>>>
> >>>> -- C
> >>>>
> >>>>
> >>>>> On Dec 31, 2023, at 01:38, Paul Rogers  wrote:
> >>>>>
> >>>>> Hi Luoc,
> >>>>>
> >>>>> Thanks for reminding me about the EVF V2 work. I got mostly done
> adding
> >>>>> projection for complex types, then got busy on other projects. I've
> yet to
> >>>>> tackle the hard cases: unions, repeated unions and repeated lists
> (which
> >>>>> are, in fact, repeated repeated unions).
> >>>>>
> >>>>> The code to handle unprojected fields in these areas is getting
> awfully
> >>>>> complicated. In doing that work, and then seeing a trick that Druid
> uses,
> >>>>> I'm tempted to rework the projection bits of the code to use a
> cleaner
> >>>>> approach. However, it might be better to commit the work done thus
> far so
> >>>>> folks can use it before I wander off to take another approach.
> >>>>>
> >>>>> Then, I wondered if anyone actually still uses this stuff. Do you
> still
> >>>>> need the code to handle non-projection of complex types?
> >>>>>
> >>>>> Of course, perhaps no one will ever need the hard cases: I've never
> been
> >>>>> convinced that unions, repeated lists, or arrays of repeated lists
> are
> >>>>> things that any sane data engineer will want to use -- or use more
> than
> >>>>> once.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> - Paul
> >>>>>
> >>>>>
> >>>>> On Sat, Dec 30, 2023 at 10:26 PM James Turton 
> wrote:
> >>>>>
> >>>>>> Hi Luoc and Drill devs!
> >>>>>>
> >>>>>> It's best to email Paul directly since he doesn't follow these lists
> >>>>>> closely. In the meantime I've prepared a PR of backported fixes for
> >>>>>> 1.21.2 to the 1.21 branch [1]. I think we can try to get the Netty
> >>>>>> upgrade that Maksym is working on, and which looks close to done,
> >>>>>> included? There's at least one CVE  applicable to our current
> version of
> >>>>>> Netty...
> >>>>>>
> >>>>>> Regards
> >>>>>> James
> >>>>>>
> >>>>>>
> >>>>>> 1. https://github.com/apache/drill/pull/2860
> >>>>>>
> >>>>>> On 2023/12/11 04:41, luoc wrote:
> >>>>>>> Hello all,
> >>>>>>>1.22 will be a more stable version. This is a digression: Is
> Paul
> >>>>>> still interested in participating in the EVF V2 refactoring in the
> >>>>>> framework? I would like to offer time to assist him.
> >>>>>>> luoc
> >>>>>>>
> >>>>>>>> 2023年12月9日 01:01,Charles Givre  写道:
> >>>>>>>>
> >>>>>>>> Hello all,
> >>>>>>>> Happy Friday everyone!   I wanted to raise the topic of getting a
> Drill
> >>>>>> minor release out the door before the end of the year. My opinion
> is that
> >>>>>> I'd really like to release Drill 1.22 once the integration with
> Apache
> >>>>>> Daffodil is complete, but it sounds like that is still a few weeks
> away.
> >>>>>>>> What does everyone think about issuing a maintenance release
> before the
> >>>>>> end of the year?  There are a number of significant fixes including
> some
> >>>>>> security updates and a major bug in the ES plugin that basically
> makes it
> >>>>>> unusable.
> >>>>>>>> Best,
> >>>>>>>> -- C
> >>>>>>
> >>>
> >>
> >
>
>


Re: Next Version

2023-12-30 Thread Paul Rogers
Hi Luoc,

Thanks for reminding me about the EVF V2 work. I got mostly done adding
projection for complex types, then got busy on other projects. I've yet to
tackle the hard cases: unions, repeated unions and repeated lists (which
are, in fact, repeated repeated unions).

The code to handle unprojected fields in these areas is getting awfully
complicated. In doing that work, and then seeing a trick that Druid uses,
I'm tempted to rework the projection bits of the code to use a cleaner
approach. However, it might be better to commit the work done thus far so
folks can use it before I wander off to take another approach.

Then, I wondered if anyone actually still uses this stuff. Do you still
need the code to handle non-projection of complex types?

Of course, perhaps no one will ever need the hard cases: I've never been
convinced that unions, repeated lists, or arrays of repeated lists are
things that any sane data engineer will want to use -- or use more than
once.

Thanks,

- Paul


On Sat, Dec 30, 2023 at 10:26 PM James Turton  wrote:

> Hi Luoc and Drill devs!
>
> It's best to email Paul directly since he doesn't follow these lists
> closely. In the meantime I've prepared a PR of backported fixes for
> 1.21.2 to the 1.21 branch [1]. I think we can try to get the Netty
> upgrade that Maksym is working on, and which looks close to done,
> included? There's at least one CVE  applicable to our current version of
> Netty...
>
> Regards
> James
>
>
> 1. https://github.com/apache/drill/pull/2860
>
> On 2023/12/11 04:41, luoc wrote:
> > Hello all,
> >1.22 will be a more stable version. This is a digression: Is Paul
> still interested in participating in the EVF V2 refactoring in the
> framework? I would like to offer time to assist him.
> >
> > luoc
> >
> >> 2023年12月9日 01:01,Charles Givre  写道:
> >>
> >> Hello all,
> >> Happy Friday everyone!   I wanted to raise the topic of getting a Drill
> minor release out the door before the end of the year.   My opinion is that
> I'd really like to release Drill 1.22 once the integration with Apache
> Daffodil is complete, but it sounds like that is still a few weeks away.
> >>
> >> What does everyone think about issuing a maintenance release before the
> end of the year?  There are a number of significant fixes including some
> security updates and a major bug in the ES plugin that basically makes it
> unusable.
> >> Best,
> >> -- C
>
>


Re: assistance needed debugging drill + daffodil

2023-12-07 Thread Paul Rogers
Hi Mike,

I wonder if you've got an array in there somewhere? Either in the data, or
you're creating an array in your code in response to the data?

If you have just scalars, then all you need to do is start a row, write the
scalars, and end the row. The starting and ending are done automagically by
the framework. If your row has a (non-repeated) map, the same rules apply.
This pattern works because every row has zero or one values for each scalar
(zero values means the value is null or default).

However, if you create an array, you need to help the row set loader a bit:
you have to tell it where one array element ends and another begins. Thus,
you must call the end element method on each array for each element. If you
have nested arrays, you must handle the events for each layer of array. In
this case, if you have a repeated map, you have an array in which each
element is a map: you have to tell the array where one map ends and the
next one begins.
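
As a rough sketch (writer method and package names from memory of the EVF
interfaces, so treat them as approximate), writing a repeated map looks
something like this:

// One row containing a repeated map column "m" with two map elements.
rowWriter.start();
ArrayWriter maps = rowWriter.array("m");   // array of maps (repeated map)
TupleWriter map = maps.tuple();            // writer for the current element
map.scalar("x").setInt(1);
map.scalar("y").setInt(2);
maps.save();                               // end of the first map element
map.scalar("x").setInt(3);
map.scalar("y").setInt(4);
maps.save();                               // end of the second map element
rowWriter.save();                          // end of the row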

So, your description is a couple of scalars, one (non-repeated) map and a
couple of scalar map entries. You should not be hitting the array code
shown in your message. That you are suggests to me that you are reading
something as an array. Either a) change it to read as a non-repeated map,
or b) insert the required array events.

Take a look at the many tests for arrays and nested arrays for the required
calls.

Thanks,

- Paul


On Thu, Dec 7, 2023 at 2:37 PM Mike Beckerle  wrote:

> I am blocked on getting a test (testComplexQuery3) to work that contains a
> row of a couple int columns plus a map column where that map contains 2
> additional int fields.
> Rows just containing simple integer fields work. The next step is to let a
> column of the top-level row be a map that is a pair of additional fields, and
> that's failing.
>
> The test fails in the assert here:
>
> @Override
> public void endArrayValue() {
>   assert state == State.IN_ROW;  // FAILS HERE WITH State.IDLE
>   for (AbstractObjectWriter writer : writers) {
> writer.events().endArrayValue();
>   }
> }
>
> (That is at line 306 of AbstractTupleWriter.java)
>
> This recursively calls endArrayValue on the child writers, and the
> state of the first of these is IDLE, not IN_ROW, so it fails the
> assert.
>
> This must mean I am doing something wrong with the setup/creation of
> the metadata for the map column (line 193 of
> DrillDaffodilSchemaVisitor.java) ...
>
> and/or creating and populating the data for this map column (line 177
> of DaffodilDrillInfosetOutputter.java).
>
> Any insights would be helpful.
>
> The PR is here: https://github.com/apache/drill/pull/2836
>
> My fork is here: https://github.com/mbeckerle/drill/tree/daffodil-2835
> (that's branch daffodil-2835)
>
> Note this fork works with the current 3.7.0-SNAPSHOT version of Apache
> Daffodil, but the features in Daffodil it needs are not yet in an
> "official" release.
>
> On Linux, running 'sbt publishM2' in daffodil before rebuilding drill should
> do it, once you have everything installed that is needed to build daffodil
> (see BUILD.md in Daffodil).
>
> Mike Beckerle
> Apache Daffodil PMC | daffodil.apache.org
> OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
> Owl Cyber Defense | www.owlcyberdefense.com
>


Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

2023-10-18 Thread Paul Rogers
Hi Mike,

Earlier on, there were two approaches discussed:

1. Using a Daffodil schema to map to a Drill schema, and use Drill's
existing schema mechanisms for all of Drill's existing input formats.
2. Using a Daffodil-specific reader so that Daffodil does the data parsing.

Some of my earlier answers assumed you were doing option 1. The code shows
you are doing option 2. There are pros and cons, but let's just focus on
option 2 for now.

You need a way for a reader (running on Drillbit 2) to get a schema from a
query (planned on Drillbit 1). How does the Daffodil schema get from Node 1
to Node 2? Charles suggested ZK; I suggested that is not such a great idea,
for a number of reasons. A more "Drill-like" way would be to include the
Daffodil schema in the query plan: either as JSON or as a binary blob. The
planner attaches the schema when creating the reader definition; the reader
deserializes the schema at run time.
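
For illustration only (this is not the actual plugin code, just a sketch of
the "attach it to the plan" idea; Drill serializes plan objects with Jackson,
and the class and field names here are hypothetical):

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;

// Hypothetical plan-side spec: whatever the planner puts here travels inside
// the serialized query plan and is deserialized on the executing Drillbit.
public class DaffodilScanSpec {
  private final String dfdlSchemaUri;  // or an embedded schema blob
  private final String rootName;

  @JsonCreator
  public DaffodilScanSpec(
      @JsonProperty("dfdlSchemaUri") String dfdlSchemaUri,
      @JsonProperty("rootName") String rootName) {
    this.dfdlSchemaUri = dfdlSchemaUri;
    this.rootName = rootName;
  }

  @JsonProperty("dfdlSchemaUri")
  public String getDfdlSchemaUri() { return dfdlSchemaUri; }

  @JsonProperty("rootName")
  public String getRootName() { return rootName; }
}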

I believe you said schemas can be large. So, you could instead serialize a
reference. To do that, you'd need a location visible to all Drill nodes:
HDFS, S3, web server, etc. A crude-but-effective approach to get started is
the one mentioned for Drill's own metadata: the schema must reside in the
same directory as the data. This opens up issues with update race
conditions, as noted earlier. But, it could work if you are "careful." If
there is a Daffodil schema server, that would be better.

Given all that, your DaffodilBatchReader is generally headed in the right
direction. The same is true of DaffodilDrillInfosetOutputter, though, for
performance, you'll want to cache the column readers rather than do
name-based lookups for every column for every row. (Drill is designed to
read billions of rows; that's a lot of lookups!) But, that can be optimized
once things work.
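
For example (just a sketch, assuming the lookups in question are the
name-based writer lookups; class and package names are from memory, so treat
them as approximate):

import org.apache.drill.exec.physical.resultSet.RowSetLoader;
import org.apache.drill.exec.vector.accessor.ScalarWriter;

// Illustrative fragment only: resolve the writers once, reuse them per row.
class CachedWriters {
  private ScalarWriter aWriter;
  private ScalarWriter bWriter;

  void open(RowSetLoader rowWriter) {
    aWriter = rowWriter.scalar("a");   // one name-based lookup, up front
    bWriter = rowWriter.scalar("b");
  }

  void writeRow(RowSetLoader rowWriter, int a, int b) {
    rowWriter.start();
    aWriter.setInt(a);                 // no lookup on the per-row hot path
    bWriter.setInt(b);
    rowWriter.save();
  }
}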

You'll soon be at a place where you'll want to do some debugging. The
S-L-O-W way is to build Drill, fire off a query, and sort out what went
wrong, perhaps attaching a debugger. Another slow way is to fire up a
Drillbit in your test and run a query. (Such a test is a great integration
test, however.)

A good way to debug is to create a test that includes just your reader and
surrounding plumbing. This way, you can set up very specific cases and
easily debug, in a single thread, right from your IDE. The JSON reader
tests may have some examples. Charles may have others.

Thanks,

- Paul

On Wed, Oct 18, 2023 at 4:06 PM Charles Givre  wrote:

> Got it.  I’ll review today and tomorrow and hopefully we can get you
> unblocked.
> Sent from my iPhone
>
> > On Oct 18, 2023, at 18:01, Mike Beckerle  wrote:
> >
> > I am very much hoping someone will look at my open PR soon.
> > https://github.com/apache/drill/pull/2836
> >
> > I am basically blocked on this effort until you help me with one key area
> > of that.
> >
> > I expect the part I am puzzling over is routine to you, so it will save
> me
> > much effort.
> >
> > This is the key area in the DaffodilBatchReader.java code:
> >
> >  // FIXME: Next, a MIRACLE occurs.
> >  //
> >  // We get the dfdlSchemaURI filled in from the query, or a default
> config
> > location
> >  // We get the rootName (or null if not supplied) from the query, or a
> > default config location
> >  // We get the rootNamespace (or null if not supplied) from the query, or
> > a default config location
> >  // We get the validationMode (true/false) filled in from the query or a
> > default config location
> >  // We get the dataInputURI filled in from the query, or from a default
> > config location
> >  //
> >  // For a first cut, let's just fake it. :-)
> >  boolean validationMode = true;
> >  URI dfdlSchemaURI = new URI("schema/complexArray1.dfdl.xsd");
> >  String rootName = null;
> >  String rootNamespace = null;
> >  URI dataInputURI = new URI("data/complexArray1.dat");
> >
> >
> > I imagine this is just a few lines of code to grab these from the query,
> > and i don't even care about config files for now.
> >
> > I gave up on trying to figure out how to do this myself. It was actually
> > quite unclear from looking at the other format plugins. The way Drill
> does
> > configuration is obviously motivated by the distributed architecture
> > combined with pluggability, but all that combined with the negotation
> over
> > schemas which extends into runtime, and it all became quite muddy to me.
> I
> > think what I need is super straightforward, so i figured I should just
> > ask.
> >
> > This is just to get enough working (against local files only) that I can
> be
> > unblocked on creating and testing the rest of the Daffodil-to-Drill
> > metadata bridge a

Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

2023-10-18 Thread Paul Rogers
 be helpful.
> >>
> > Yes, I've found the JsonLoaderImpl.readBatch() method, and Daffodil can
> work similarly.
> >
> > This will take me a few more days to get to a pull request. The first
> one will be initial review, i.e., not intended to merge without more tests.
> Probably it will support only integer data fields, but should support lots
> of data shapes including vectors, choices, sequences, nested records, etc.
> >
> > Thanks for the help.
> >
> >>
> >>> On Oct 12, 2023, at 2:58 PM, Mike Beckerle wrote:
> >>>
> >>> So when a data format is described by a DFDL schema, I can generate
> >>> equivalent Drill schema (TupleMetadata). This schema is always
> complete. I
> >>> have unit tests working with this.
> >>>
> >>> To do this for a real SQL query, I need the DFDL schema to be
> identified on
> >>> the SQL query by a file path or URI.
> >>>
> >>> Q: How do I get that DFDL schema File/URI parameter from the SQL query?
> >>>
> >>> Next, assuming I have the DFDL schema identified, I generate an
> equivalent
> >>> Drill TupleMetadata from it. (Or, hopefully retrieve it from a cache)
> >>>
> >>> What objects do I call, or what classes do I have to create to make
> this
> >>> Drill TupleMetadata available to Drill so it uses it in all the ways a
> >>> static Drill schema can be useful?
> >>>
> >>> I just need pointers to the code that illustrate how to do this. Thanks
> >>>
> >>> -Mike Beckerle
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers wrote:
> >>>
> >>>> Mike,
> >>>>
> >>>> This is a complex question and has two answers.
> >>>>
> >>>> First, the standard enhanced vector framework (EVF) used by most
> readers
> >>>> assumes a "pull" model: read each record. This is where the next()
> comes
> >>>> in: readers just implement this to read the next record. But, the code
> >>>> under EVF works with a push model: the readers write to vectors, and
> signal
> >>>> the next record. EVF translates the lower-level push model to the
> >>>> higher-level, easier-to-use pull model. The best example of this is
> the
> >>>> JSON reader which uses Jackson to parse JSON and responds to the
> >>>> corresponding events.
> >>>>
> >>>> You can thus take over the task of filling a batch of records. I'd
> have to
> >>>> poke around the code to refresh my memory. Or, you can take a look at
> the
> >>>> (quite complex) JSON parser, or the EVF itself to see what it does.
> There
> >>>> are many unit tests that show this at various levels of abstraction.
> >>>>
> >>>> Basically, you have to:
> >>>>
> >>>> * Start a batch
> >>>> * Ask if you can start the next record (which might be declined if the
> >>>> batch is full)
> >>>> * Write each field. For complex fields, such as records, recursively
> do the
> >>>> start/end record work.
> >>>> * Mark the record as complete.
> >>>>
> >>>> You should be able to map event handlers to EVF actions as a result.
> Even
> >>>> though DFDL wants to "drive", it still has to give up control once the
> >>>> batch is full. EVF will then handle the (surprisingly complex) task of
> >>>> finishing up the batch and returning it as the output of the Scan
> operator.
> >>>>
> >>>> - Paul
> >>>>
> >>>> On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle
> >>>> wrote:
> >>>>
> >>>>> Daffodil parsing generates event callbacks to an InfosetOutputter,
> which
> >>>> is
> >>>>> analogous to a SAX event handler.
> >>>>>
> >>>>> Drill is expecting an iterator style of calling next() to advance
> through
> >>>>> the input, i.e., Drill has the control thread and expects to do pull
> >>>>> parsing. At least from the code I studied in the format-xml contrib.
> >>>>>
> >>>>> Is there any alternative? Before I dig into creating another one of
> these
> >>>>> co-routine-style control inversions (which have proven to be
> problematic
> >>>>> for performance.
>
>


Re: [apache/drill] WIP: Preliminary Review on adding Daffodil to Drill (PR #2836)

2023-10-15 Thread Paul Rogers
Hi Mike,

Congrats on the PR. I'll take a look soon.

You asked about initialization. Initialization is a bit tricky in a
distributed system such as Drill. There are a number of things
"initialization" could mean:

* Global, one-time initialization (per Drillbit): Unlike Druid, Drill has
no "lifecycle" that you can plug into, sadly. Instead, you can use a
singleton created on demand. Drill is multi-threaded, so singleton creation
must be protected by a lock.
* Per-query initialization: there is no such thing in Drill, since queries
are distributed. In particular, queries will execute, in general, on a node
different than the one that did the planning.
* Per-fragment initialization: this is also hard: there is no code you can
provide that connects to the fragment lifecycle, unless you create your own
operator (by extending the existing one), but this is not at all easy (nor
the best approach).
* Per-reader (i.e. file) initialization: Such work is done in the open()
call for each reader (which, in EVF2, is actually done in the reader
constructor.) De-initialization can be done in close() for the reader. Of
course, these two methods could refer to a per-Drillbit global singleton,
if that is what is wanted.

One would think that plugins such as those for an RDBMS (i.e. the JDBC
plugin) would maintain state about the target DB so that sequential queries
against the same schema could use cached metadata. Drill wasn't designed
for this, but it can be done. The Drill metastore probably does some
caching, but I'm not as familiar with that code as I'd like.

For Daffodil, the logical approach would be to cache each schema when it is
first needed. In a cluster, each Drillbit would end up caching each schema,
since Drill randomly routes connections to Drillbits. Of course, with a
cache, we'd have to detect when the cache becomes stale (that is, a new
version of the file is created). And, we'd have to handle race conditions
(a new version of the file is written exactly when Drill tries to read it,
and Drill sees a partial file.)
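
A rough sketch of that per-Drillbit cache (purely illustrative; the class is
hypothetical and "Object" stands in for whatever compiled-schema type the
plugin uses), ignoring the staleness and partial-write problems noted above:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// One instance per Drillbit, created on demand and safe under concurrent queries.
class DaffodilSchemaCache {
  private static volatile DaffodilSchemaCache instance;
  private final ConcurrentMap<String, Object> schemas = new ConcurrentHashMap<>();

  static DaffodilSchemaCache instance() {
    if (instance == null) {
      synchronized (DaffodilSchemaCache.class) {   // lock-protected lazy creation
        if (instance == null) {
          instance = new DaffodilSchemaCache();
        }
      }
    }
    return instance;
  }

  Object get(String schemaUri) {
    // Compile each schema at most once per Drillbit, on first use.
    return schemas.computeIfAbsent(schemaUri, uri -> compile(uri));
  }

  private Object compile(String schemaUri) {
    // Stand-in for loading and compiling the DFDL schema at this URI.
    return new Object();
  }
}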

In short, it is best to identify exactly what you want to initialize; both
for planning and execution. Then, we can point you to a good place to do
that work.

- Paul

On Fri, Oct 13, 2023 at 8:17 PM Mike Beckerle  wrote:

> My PR needs input from drill developers.
>
> Please look for TODO and FIXME in this PR and help me get to where I can
> initialize this plugin.
>
> In general I copied things from format-xml contrib, but then took ideas
> from Json. I was unable to figure out how initialization works from the
> Excel plugin.
>
> The metadata bridge is here, and a stub of the data bridge - handles only
> simple type "INT" right now, and of course doesn't compile yet.
>
> https://github.com/apache/drill/pull/2836
>
>


Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

2023-10-12 Thread Paul Rogers
approach the implementation.

I hope this provides a few hints to get you started.

- Paul


On Thu, Oct 12, 2023 at 11:58 AM Mike Beckerle  wrote:

> So when a data format is described by a DFDL schema, I can generate
> equivalent Drill schema (TupleMetadata). This schema is always complete. I
> have unit tests working with this.
>
> To do this for a real SQL query, I need the DFDL schema to be identified on
> the SQL query by a file path or URI.
>
> Q: How do I get that DFDL schema File/URI parameter from the SQL query?
>
> Next, assuming I have the DFDL schema identified, I generate an equivalent
> Drill TupleMetadata from it. (Or, hopefully retrieve it from a cache)
>
> What objects do I call, or what classes do I have to create to make this
> Drill TupleMetadata available to Drill so it uses it in all the ways a
> static Drill schema can be useful?
>
> I just need pointers to the code that illustrate how to do this. Thanks
>
> -Mike Beckerle
>
>
>
>
>
>
>
>
>
>
> On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers  wrote:
>
> > Mike,
> >
> > This is a complex question and has two answers.
> >
> > First, the standard enhanced vector framework (EVF) used by most readers
> > assumes a "pull" model: read each record. This is where the next() comes
> > in: readers just implement this to read the next record. But, the code
> > under EVF works with a push model: the readers write to vectors, and
> signal
> > the next record. EVF translates the lower-level push model to the
> > higher-level, easier-to-use pull model. The best example of this is the
> > JSON reader which uses Jackson to parse JSON and responds to the
> > corresponding events.
> >
> > You can thus take over the task of filling a batch of records. I'd have
> to
> > poke around the code to refresh my memory. Or, you can take a look at the
> > (quite complex) JSON parser, or the EVF itself to see what it does. There
> > are many unit tests that show this at various levels of abstraction.
> >
> > Basically, you have to:
> >
> > * Start a batch
> > * Ask if you can start the next record (which might be declined if the
> > batch is full)
> > * Write each field. For complex fields, such as records, recursively do
> the
> > start/end record work.
> > * Mark the record as complete.
> >
> > You should be able to map event handlers to EVF actions as a result. Even
> > though DFDL wants to "drive", it still has to give up control once the
> > batch is full. EVF will then handle the (surprisingly complex) task of
> > finishing up the batch and returning it as the output of the Scan
> operator.
> >
> > - Paul
> >
> > On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle 
> > wrote:
> >
> > > Daffodil parsing generates event callbacks to an InfosetOutputter,
> which
> > is
> > > analogous to a SAX event handler.
> > >
> > > Drill is expecting an iterator style of calling next() to advance
> through
> > > the input, i.e., Drill has the control thread and expects to do pull
> > > parsing. At least from the code I studied in the format-xml contrib.
> > >
> > > Is there any alternative? Before I dig into creating another one of
> these
> > > co-routine-style control inversions (which have proven to be
> problematic
> > > for performance.
> > >
> >
>


Re: Drill expects pull parsing? Daffodil is event callbacks style

2023-10-11 Thread Paul Rogers
Mike,

This is a complex question and has two answers.

First, the standard enhanced vector framework (EVF) used by most readers
assumes a "pull" model: read each record. This is where the next() comes
in: readers just implement this to read the next record. But, the code
under EVF works with a push model: the readers write to vectors, and signal
the next record. EVF translates the lower-level push model to the
higher-level, easier-to-use pull model. The best example of this is the
JSON reader which uses Jackson to parse JSON and responds to the
corresponding events.

You can thus take over the task of filling a batch of records. I'd have to
poke around the code to refresh my memory. Or, you can take a look at the
(quite complex) JSON parser, or the EVF itself to see what it does. There
are many unit tests that show this at various levels of abstraction.

Basically, you have to:

* Start a batch
* Ask if you can start the next record (which might be declined if the
batch is full)
* Write each field. For complex fields, such as records, recursively do the
start/end record work.
* Mark the record as complete.
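
A minimal sketch of how those steps look in a reader's per-batch loop
(illustrative only; names are from my memory of the EVF interfaces, and
moreInput()/readOneRecord() are hypothetical stand-ins for the format's own
parsing, e.g. the Daffodil InfosetOutputter callbacks):

import org.apache.drill.exec.physical.resultSet.RowSetLoader;

// Fill one batch; return false when the input is exhausted.
boolean nextBatch(RowSetLoader rowWriter) {
  while (!rowWriter.isFull()) {        // stop when the batch is full
    if (!moreInput()) {
      return false;                    // end of input: no more batches
    }
    rowWriter.start();                 // start the next record
    readOneRecord(rowWriter);          // write each field via the column writers
    rowWriter.save();                  // mark the record as complete
  }
  return true;                         // batch full; more data remains
}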

You should be able to map event handlers to EVF actions as a result. Even
though DFDL wants to "drive", it still has to give up control once the
batch is full. EVF will then handle the (surprisingly complex) task of
finishing up the batch and returning it as the output of the Scan operator.

- Paul

On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle  wrote:

> Daffodil parsing generates event callbacks to an InfosetOutputter, which is
> analogous to a SAX event handler.
>
> Drill is expecting an iterator style of calling next() to advance through
> the input, i.e., Drill has the control thread and expects to do pull
> parsing. At least from the code I studied in the format-xml contrib.
>
> Is there any alternative? Before I dig into creating another one of these
> co-routine-style control inversions (which have proven to be problematic
> for performance).
>


Re: Question about Drill internal data representation for Daffodil tree infosets

2023-10-11 Thread Paul Rogers
Mike,

Just to echo Charles, thanks for the work; sounds like you are making good
progress.

The question you asked is tricky. Charles is right, the type of the data
structure is a map. The output you showed appears to be from  the sqlline
tool. If so, then it helps to understand that sqlline "cheats" by
converting maps to strings for display, making it look like you have a
string column.

Also, remember that Drill uses the standard JSON structure internally, just
as you described. However, referencing any column projects it to the top
level. Clients don't understand complex JSON types (maps, arrays, etc.).
Sqlline compensates by converting the data to strings for display.

- Paul

On Tue, Oct 10, 2023 at 12:55 PM Charles Givre  wrote:

> Hi Mike,
> Thanks for all the work you are doing on Drill.
>
> To answer your question, sub1 should be treated as a map in Drill.  You
> can verify this with the following query:
>
> SELECT drillTypeOf(sub1) FROM...
>
> In general, I'm pretty sure that Drill doesn't output strings that look
> like JSON objects unless they actually are complex objects.
>
> Take a look here for data type functions:
> https://drill.apache.org/docs/data-type-functions/
> Best,
> -- C
>
>
> > On Oct 10, 2023, at 7:56 AM, Mike Beckerle  wrote:
> >
> > I am trying to understand the options for populating Drill data from a
> > Daffodil data parse.
> >
> > Suppose you have this JSON
> >
> > {"parent": { "sub1": { "a1":1, "a2":2}, sub2:{"b1":3, "b2":4, "b3":5}}}
> >
> > or this equivalent XML:
> >
> > <parent>
> >   <sub1><a1>1</a1><a2>2</a2></sub1>
> >   <sub2><b1>3</b1><b2>4</b2><b3>5</b3></sub2>
> > </parent>
> >
> > Unlike those texts, Daffodil is going to have a tree data structure
> where a
> > parent node contains two child nodes sub1 and sub2, and each of those has
> > children a1, a2, and b1, b2, b3 respectively.
> > It's analogous roughly to the DOM tree of the XML, or the tree of nested
> > JSON map nodes you'd get back from a JSON parse of that text.
> >
> > In Drill to query the JSON like:
> >
> > select parent.sub1 from myStructure
> >
> > gives you back a single column containing what seems to be a string like
> >
> > |sub1|
> > --
> > | { "a1":1, "a2":2}  |
> >
> > So, my question is this. Is this actually a string in Drill, (what is the
> > type of sub1?) or is sub1 actually a Drill data row/map node value with
> two
> > node children, that just happens to print out looking like a JSON string?
> >
> > Thanks for any insight here.
> >
> > Mike Beckerle
> > Apache Daffodil PMC | daffodil.apache.org
> > OGF DFDL Workgroup Co-Chair |
> www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
> > Owl Cyber Defense | www.owlcyberdefense.com
>
>


Re: Question on Representing DFDL/XSD choice data for Drill (Unions required?)

2023-09-13 Thread Paul Rogers
Hi Mike,

Looks like you are wrestling with two separate issues. The first is how to
read the encoded data that you showed. In Drill, each data format generally
needs its own reader. Drill's reader operator provides all the plumbing
needed to handle multiple format readers, pack data into vectors, handle
projection and all the rest. But, to actually parse a stream of bytes into
data values, Drill needs a reader (AKA format plugin).

If you were to write such a reader for your encoded data format, then each
parse of a data value would write that value to a vector, potentially
creating the vector as needed (and back filling null values). All this is
handled automagically by the enhanced vector framework (EVF).

Now, you *could* do something that reads the encoded records, emits JSON,
and lets the JSON parser parse it again. I suspect you'd find that doing so
is a) much slower, and b) more work than just creating the required reader.
Either way, you'll need code for each format that Daffodil supports but
which Drill does not yet support.

One quick answer: avoid the UNION type if you can. It works, but barely. It
is slow, inefficient, not supported in many operators, and is unknown to
client libraries. Since Daffodil is all about schemas, use the schema to
figure out the correct type. If Daffodil allows UNION types, only then
would it make sense to map them to Drill's UNION type, and deal with the
many limitations.

Finally, on to your query. Let's reference the JSON form. Let's assume that
if you had a COBOL parser, it would end up with the same vector structure
as the JSON parser would produce: different ways to parse data, but same
internal data structures. To do the query you want, you'd have to:

* Flatten the content of `record` in each record. That is, JSON would read
the above as a set of records, each of which has one field called `record`
which is an array of maps (i.e. a repeated map.) Flattening produces a
stream of maps.
* Then, you'd project `a` and `b` to the top level, giving you two
top-level fields called `a` and `b`. Alternatively, project `a.a1`, `a.a2`,
`b.b1` and `b.b2` to the top level. The values for the "missing" map will
be SQL NULL.
* Finally, state your query as usual: SELECT b1, b2 FROM ...nested unpack
queries here... WHERE b1 > 10. A sketch of the whole thing follows below.
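
A minimal, untested sketch of those three steps (the file path and aliases are
made up):

  SELECT sub.rec.b.b1 AS b1, sub.rec.b.b2 AS b2
  FROM (
    SELECT FLATTEN(t.`record`) AS rec
    FROM dfs.`/tmp/records.json` t
  ) sub
  WHERE sub.rec.b.b1 > 10;

Rows whose flattened map contains only `a` yield NULL for rec.b.b1 and are
dropped by the WHERE clause, which matches the behavior described below.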

As noted in a previous response, SQL (and Drill) doesn't have the
expressiveness to do complex queries while leaving data in its original
structured form. That said, this query might actually work:

SELECT `record` FROM ... WHERE `record`.`b`.`b1` > 10

I *think* this will return a set of records, with a `record` array, a `b`
map, and `b1` and `b2` members that satisfies the query. This works because
the '`record`.`b`.`b1` > 10' expression will be FALSE if b1 (or b) is NULL.
Still, for the sake of clients, you'd want to flatten the results to the
top level.

Charles is really the query expert, he might have other tricks that he's
found that work better.

Drill is smart enough to push projection down into the reader: that's one
of the fancy bits that EVF handles. EVF will notice that we only want
`record`.`b`.`b1` and `b2` and won't project the map `a` or any of its
contents. When the reader provides those values, they will simply go into
the bit bucket. (Caveat: there are some limitations on this feature: I have
some long-delayed fixes that you might need.)

I hope this helps.

Thanks,

- Paul

On Wed, Sep 13, 2023 at 8:09 AM Mike Beckerle  wrote:

> I'm thinking whether a first prototype of DFDL integration to Drill should
> just use JSON.
>
> But please consider this JSON:
>
> { "record": [
> { "a": { "a1":5, "a2":6 } },
> { "b": { "b1":55, "b2":66, "b3":77 } }
> { "a": { "a1":7, "a2":8 } },
> { "b": { "b1":77, "b2":88, "b3":99 } }
>   ] }
>
> It corresponds to this text data file, parsed using Daffodil:
>
> 105062556677107082778899
>
> The file is a stream of records. The first byte is a tag value 1 for type
> 'a' records, and 2 for type 'b' records.
> The 'a' records are 2 fixed length fields, each 2 bytes long, named a1 and
> a2. They are integers.
> The 'b' records are 3 fixed length fields, each 2 bytes long, named b1, b2,
> and b3. They are integers.
> This kind of format is very common, even textualized like this (from COBOL
> programs for example)
>
> Can Drill query the JSON above to get (b1, b2) where b1 > 10 ?
> (and ... does this require the experimental Union feature?)
>
> b1, b2
> -
> (55, 66)
> (77, 88)
>
> I ask because in an XML Schema or DFDL schema choices with dozens of
> 'branches' are very common.
> Ex: schema for the above data:
>
> <xs:choice>
>   <xs:element name="a">
>     <xs:complexType>
>       <xs:sequence>
>         ... many child elements let's say named a1, a2, ...
>       </xs:sequence>
>     </xs:complexType>
>   </xs:element>
>   <xs:element name="b">
>     <xs:complexType>
>       <xs:sequence>
>         ... many child elements let's say named b1, b2, b3
>         ...
>       </xs:sequence>
>     </xs:complexType>
>   </xs:element>
> </xs:choice>
>

Re: Discuss: JSON and XML and Daffodil - same Infoset, same Query should create same rowset?

2023-09-13 Thread Paul Rogers
Hi Mike,

I believe I sent a detailed response to this. Did it get through? If not,
I'll try sending it again...

- Paul

On Wed, Sep 13, 2023 at 6:44 AM Mike Beckerle  wrote:

> ... sound of crickets on a summer night .
>
> It would really help me if I could get a response to this inquiry, to help
> me better understand Drill.
>
> I do realize people are busy, and also this was originally sent Aug 25,
> which is the wrong time of year to get timely response to anything.
> Hence, this re-up of the message.
>
>
> On Fri, Aug 25, 2023 at 7:39 PM Mike Beckerle 
> wrote:
>
> > Below is a small JSON output from Daffodil and below that is the same
> > Infoset output as XML.
> > (They're inline in this message, but I also attached them as files)
> >
> > This is just a parse of a small PCAP file with a few ICMP packets in it.
> > It's an example DFDL schema used to illustrate binary file parsing.
> >
> > (The schema is here https://github.com/DFDLSchemas/PCAP which uses this
> > component schema: https://github.com/DFDLSchemas/ethernetIP)
> >
> > My theory is that Drill queries against these should be identical to
> > obtain the same output row contents.
> > That is, since this data has the same schema, whether it is JSON or XML
> > shouldn't affect how you query it.
> > To do that the XML Reader will need the XML schema (or some hand-provided
> > metadata) so it knows what is an array. (Specifically PCAP.Packet is the
> > array.)
> >
> > E.g., if you wanted to get the IPSrc and IPDest fields in a table from
> all
> > ICMP packets in this file, that query should be the same for the JSON and
> > the XML data.
> >
> > First question: Does that make sense? I want to make sure I'm
> > understanding this right.
> >
> > Second question, since I don't really understand Drill SQL yet.
> >
> > What is a query that would pluck the IPSrc.value and IPDest.value from
> > this data and make a row of each pair of those?
> >
> > The top level is a map with a single element named PCAP.
> > The "table" is PCAP.Packet which is an array (of maps).
> > And within each array item's map the fields of interest are within
> > LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header
> > (so maybe IPv4Header is the table?)
> > The two fields within there are IPSrc.value (AS src) and IPDest.value (AS
> > dest)
> >
> > I'm lost on how to tell the query that the table is the array
> PCAP.Packet,
> > or the IPv4Header within those maybe?
> >
> > Maybe this is easy, but I'm just not grokking it yet so I could use some
> > help here.
> >
> > Thanks in advance.
> >
> > {
> > "PCAP": {
> > "PCAPHeader": {
> > "MagicNumber": "D4C3B2A1",
> > "Version": {
> > "Major": "2",
> > "Minor": "4"
> > },
> > "Zone": "0",
> > "SigFigs": "0",
> > "SnapLen": "65535",
> > "Network": "1"
> > },
> > "Packet": [
> > {
> > "PacketHeader": {
> > "Seconds": "1371631556",
> > "USeconds": "838904",
> > "InclLen": "74",
> > "OrigLen": "74"
> > },
> > "LinkLayer": {
> > "Ethernet": {
> > "MACDest": "005056E01449",
> > "MACSrc": "000C29340BDE",
> > "Ethertype": "2048",
> > "NetworkLayer": {
> > "IPv4": {
> > "IPv4Header": {
> > "Version": "4",
> > "IHL": "5",
> > "DSCP": "0",
> > "ECN": "0",
> > "Length": "60",
> > "Identification": "55107",
> > "Flags": "0",
> > "FragmentOffset": "0",
> > "TTL": "128",
> > "Protocol": "1",
> > "Checksum": "11123",
> > "IPSrc": {
> > "value": "192.168.158.139"
> > },
> > "IPDest": {
> > "value": "174.137.42.77"
> > },
> > "ComputedChecksum": "11123"
> > },
> > "Protocol": "1",
> > "ICMPv4": {
> > "Type": "8",
> > "Code": "0",
> > "Checksum": "10844",
> > "EchoRequest": {
> > "Identifier": "512",
> > "SequenceNumber": "8448",
> > "Payload":
> > "6162636465666768696A6B6C6D6E6F7071727374757677616263646566676869"
> > }
> > }
> > }
> > }
> > }
> > }
> > },
> > {
> > "PacketHeader": {
> > "Seconds": "1371631557",
> > "USeconds": "55699",
> > "InclLen": "74",
> > "OrigLen": "74"
> > },
> > "LinkLayer": {
> > "Ethernet": {
> > "MACDest": "000C29340BDE",
> > "MACSrc": "005056E01449",
> > "Ethertype": "2048",
> > "NetworkLayer": {
> > "IPv4": {
> > "IPv4Header": {
> > "Version": "4",
> > "IHL": "5",
> > "DSCP": "0",
> > "ECN": "0",
> > "Length": "60",
> > "Identification": "30433",
> > "Flags": "0",
> > "FragmentOffset": "0",
> > "TTL": "128",
> > "Protocol": "1",
> > "Checksum": "35797",
> > "IPSrc": {
> > "value": "174.137.42.77"
> > },
> > "IPDest": {
> > "value": "192.168.158.139"
> > },
> > "ComputedChecksum": "35797"
> > },
> > "Protocol": "1",
> > "ICMPv4": {
> > "Type": "0",
> > "Code": "0",
> > "Checksum": "12892",
> > "EchoReply": {
> > "Identifier": "512",
> > "SequenceNumber": "8448",
> > "Payload":
> > "6162636465666768696A6B6C6D6E6F7071727374757677616263646566676869"
> > }
> > }
> > }
> > }
> > }
> > }
> > },
> > {
> > "PacketHeader": {
> > "Seconds": "1371631557",
> > "USeconds": "840049",
> > "InclLen": "74",
> > "OrigLen": "74"
> > },
> > "LinkLayer": {
> > "Ethernet": {
> > "MACDest": "005056E01449",
> > "MACSrc": "000C29340BDE",
> 

Re: Discuss: JSON and XML and Daffodil - same Infoset, same Query should create same rowset?

2023-08-25 Thread Paul Rogers
Hi Mike,

You asked about how to work with nested data items. As noted in a previous
email, this can be a bit tricky. Drill uses SQL, and SQL does not have good
native support for structured data: it was designed in the 1970's for
record oriented data (tuples). Several attempts were made to extend SQL for
structured data, but they didn't really catch on. The one thing that seems
to have "stuck" are the JSON extensions: a field can be of a JSON type,
then you use various functions to work with the data nested within the
JSON. Not very satisfying, but it seems to work: Apache Druid went this
route, for example.

Drill provides the ability to reference a structured item, but doing so
implicitly projects that item to the top level. Suppose we want to display
statistics about packet length. We want only
Packet.LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header.Length:

SELECT
  LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header.Length AS Length
FROM ...

The above picks out the item you want (I'm supposing that all the layers
are simple maps), but it projects the item to the top level. There is no
syntax in SQL that lets us say, "pick out just that one item, but leave the
existing nested structure". That is, there is no way to say, "Within
IPV4Header, keep Length and IPSrc but skip all the others." Oddly, the EVF
code can do such a projection, but the instructions to do so must come from
the provided schema, not the SQL statement.

The second issue concerns the client using Drill. SQL clients know nothing
about structured data. You could not get Airflow or Tableau or Pandas to
understand the Packet and do anything useful with it: all SQL tools expect
a flattened record. (I'm sure some of these tools can work with data
encoded as JSON, so that is perhaps an option, though it has all manner of
issues.) Indeed, neither ODBC nor JDBC understand structured data. One
would have to use Drill's native API, which is not for the faint of heart.

So, a reasonable goal would be to use Drill to query structured data AND to
project that data into a flat record structure that the client can consume.
This is where you'd need the flatten operator, etc. We'd have to remember
that flattening works down one branch of a tree: one cannot flatten two or
more sibling arrays. Drill also supports lateral joins, which is the fancy
SQL way to express flattening.
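
For instance, a hedged, untested sketch of that flatten-then-project pattern
for the PCAP data quoted below (the path and aliases are invented):

  SELECT
    t.pkt.LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header.IPSrc.`value` AS IPSrc,
    t.pkt.LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header.IPDest.`value` AS IPDest
  FROM (
    SELECT FLATTEN(p.PCAP.Packet) AS pkt
    FROM dfs.`/tmp/pcap.json` p
  ) t;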

You asked, "What is a query that would pluck the IPSrc.value and
IPDest.value from this data and make a row of each pair of those?"

SELECT
  LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header.IPSrc.value AS IPSrc,
  LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header.IPDest.value AS IPDest
FROM ...

Or:

SELECT
  header.IPSrc.value AS IPSrc,
  header.IPDest.value AS IPDest
FROM (
  SELECT
    LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header AS header
  FROM ...
)

This gives you an output tuple with two columns: (IPSrc, IPDest).

Internally, Drill will notice that you don't want most of the maps and leaf
values. EVF will do the magic to discard them at scan time. A later project
operator will take the remaining rump maps and project the two remaining
values to the top level. Kinda confusing, but it should work.

Just a side comment: if the "value" fields of "IPSrc" and "IPDest" are just
a syntax convention, it would be handy to automatically trim away the
value, and instead treat IPSrc and IPDest as the scalar values. We do
something like this for the Mongo extended JSON types in the JSON reader.

Thanks,

- Paul



On Fri, Aug 25, 2023 at 4:39 PM Mike Beckerle  wrote:

> Below is a small JSON output from Daffodil and below that is the same
> Infoset output as XML.
> (They're inline in this message, but I also attached them as files)
>
> This is just a parse of a small PCAP file with a few ICMP packets in it.
> It's an example DFDL schema used to illustrate binary file parsing.
>
> (The schema is here https://github.com/DFDLSchemas/PCAP which uses this
> component schema: https://github.com/DFDLSchemas/ethernetIP)
>
> My theory is that Drill queries against these should be identical to
> obtain the same output row contents.
> That is, since this data has the same schema, whether it is JSON or XML
> shouldn't affect how you query it.
> To do that the XML Reader will need the XML schema (or some hand-provided
> metadata) so it knows what is an array. (Specifically PCAP.Packet is the
> array.)
>
> E.g., if you wanted to get the IPSrc and IPDest fields in a table from all
> ICMP packets in this file, that query should be the same for the JSON and
> the XML data.
>
> First question: Does that make sense? I want to make sure I'm
> understanding this right.
>
> Second question, since I don't really understand Drill SQL yet.
>
> What is a query that would pluck the IPSrc.value and IPDest.value from
> this data and make a row of each pair of those?
>
> The top level is a map with a single element named PCAP.
> The "table" is PCAP.Packet which is an array (of maps).
> And within each array item's map the 

Re: Discuss: JSON and XML and Daffodil - same Infoset, same Query should create same rowset?

2023-08-25 Thread Paul Rogers
Great progress, Mike!

First, let's address the schema issue. As you've probably noticed, Drill's
original notion was that data needed no schema: the data itself provides
sufficient syntactic structure to let Drill infer schema. Also as you've
noticed, this assumption turned out to be more marketing enthusiasm than
technical reality. My favorite example is this JSON: {a: 1}{a: 1.2}. Drill
will fail: the first record has `a` as an integer, but the second value is
a float. By the second value, Drill has already allocated an integer
vector. As Charles has heard me say many times, "Drill cannot predict the
future." If the syntax is ambiguous, Drill may not do the right thing.

To address this, the team added the notion of a "provided schema", but
support is quite limited. There are two parts: planning and execution. At
execution time, a storage plugin can make use of a provided schema: the
JSON reader does so. To solve the ambiguous JSON example above, one
provides a schema that insists that `a` is a DOUBLE (say) regardless of
what the JSON text might suggest. Sounds like the XML parser needs to be
modified to make use of the provided schema. Here, again, there are two
parts. The "EVF" will handle the vector side of things. However, the parser
needs to know to expect the indicated type. That is, if a field `x` is
defined as a repeated DOUBLE (i.e. ARRAY<DOUBLE>), then the XML parser has
to understand that it will see multiple "x" elements one after another, and
to stick the values into the "x" array vector. I believe that Charles said
this functionality is missing.
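
For the ambiguous JSON example above, that provided schema could be supplied
through a table function, in the same style as the XML test quoted later in
this archive (a sketch; the file is invented and this assumes the
schema-aware JSON reader is enabled):

  SELECT * FROM table(dfs.`/tmp/ambiguous.json`
    (type => 'json', schema => 'inline=(`a` DOUBLE)'));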

We then turn to the other aspect of schema: planning. We'd like the planner
to know the schema so it can validate expressions at plan time. Currently,
Drill assumes that items have a "type to be named later", and makes some
light inferences. If you have SELECT a + b AS c, then we know that a, b and
c are numeric types, but we don't know which specific types they might be:
Drill works those out at run time. With a schema, we should assign types at
plan time, though I suppose Drill should work even without that aspect
(except that I'm sure someone will find a corner case that produces a
different result depending on whether we use the proper types at plan time.)

Another aspect is how the execution planner knows to include an included
schema in the JSON sent to the XML reader. My memory here is hazy: we have
something as demonstrated by the JSON reader. Presumably this same
technique can be used for XML. In particular, there is some JSON element
that holds the provided schema, and there is some way in the planner to
populate that element.

Another question is how Drill is made aware of the schema. This is another
area where Drill's features are somewhat incomplete. Originally, there was
a hidden JSON file in the target HDFS folder that could gather Parquet
metadata. Toward the end of the MapR era, the team added a metastore. The
Hive-based readers use the Hive metastore (HMS). Oddly, Drill cannot use
HMS for Drill's own native readers because of the issue mentioned at the
top: Drill was designed to not need a schema.

We can then ask, where would the Daffodil schemas reside? In a directory
somewhere? In a web service? In a "Daffodil metastore"? How does Daffodil
associate a schema with a file? Or, is that something that the user has to
do? The answer to this will determine how to integrate the Daffodil schema
with Drill.

Drill provides the ability to use table functions to provide extra
properties to a reader. Again, the details have become hazy, but the JSON
tests have examples, I believe. So, one solution is to convert the Daffodil
schema to the Drill schema, and have Drill read that file for each query.
Not very satisfying. Better would be to point Drill to the Daffodil schema,
and let Drill do the conversion. I don't believe we have such a mechanism
at present.

The ideal would be a unified concept of schema: a schema reader that
converts the schema to Drill format, and consistent planner and execution
use of that schema. You would just then need to create a "Daffodil schema
provider." Perhaps there is a way to leverage the metastore API to do this?

The final step would be to automate the association of files with Daffodil
schema: some kind of metastore or registry that says that "file foo.xml
uses schema pcap.dfdl" or whatever.

FWIW, Drill has the notion of a "strict" vs. "lenient" schema. A strict
schema says that the file must include only those fields in the schema. A
lenient schema says that the file may have additional fields, and those
fields use Drill's plain old data inference for those extras.

Sounds like the next three steps for you would be:

1. Extend the XML reader to use a provided schema.
2. Extend the XML reader to support arrays, as indicated by the schema.
3. Test the above using the query-time schema mechanism as demonstrated by
the schema-aware JSON reader tests.

Once that works, you can then move onto the plan-time issues.


Re: Drill SQL questions - JSON context

2023-08-18 Thread Paul Rogers
Hi Mike,

Good progress! There are a number of factors to consider. Let's work
through them one by one.

First, try the simplest possible query:

SELECT * FROM 

If you are using the row set mechanism, grab the schema and print it. (My
memory is hazy, but I do believe that there are methods and classes that
will do this for you.) What you should see is the nested structure you
created. The JSON reader has a super-complex parser that will work out
structure and types based on the first value seen. In your example, it
should guess VARCHAR and INT for your data items.

Once you confirm that the JSON parser has correctly interpreted your data,
you can move onto the second question: how SQL works with structured data.
Here we have to realize that SQL wasn't designed for structured data: SQL
only knows how to work with variables projected to the top level. This
leads to a quantum-like result that observing a variable changes its
structure. The JSON reader uses something called the Enhanced Vector
Framework (EVF) to do the projecting at scan time. Let's work out what it
should be doing. (The code is gawd-awful complex, so there is a chance that
something might be broken.)

In your query, you are projecting c.f.g to a top level variable g. Fields c
and f are arrays of maps. I can't recall testing this kind of projection,
but I'd expect it to result in the projected variable g being an array of
arrays: what Drill calls a repeated list. Although I wrote this stuff, I
don't recall any code that will convert a repeated map into a repeated
list: so this area may be a bit tender. Or, maybe it just punts and leaves
the repeated map, but with a single entry? That wouldn't quite work. This
could use a bit of testing.

The third question is how to flatten rows. Flattening occurs via a separate
flatten operator. You'd need to flatten twice: once for each level. This
whole area is a bit hazy for me (I'm not super familiar with the details),
but I suspect you'd need to use a set of nested SELECT statements, each of
which flattens the outermost level, which will project the result to the
top level where it can be manipulated by the SELECT at the next outer
level. To try this, extend your SELECT * to select just a top-level field
(a) and flatten a top-level repeated map (f). The result should be rows
with a scalar and a repeated map. Then, add another level of SELECT to
flatten the repeated map: you'll get a scalar and a map. Then, use yet
another SELECT to pick out the map fields to top-level fields, and do the
WHERE clause. I *think* that should more-or-less work.
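
Stitching those levels together, a rough, untested sketch of the nested
SELECTs (the file path and aliases are invented) might look like:

  SELECT t2.a, t2.b, t2.c2.d AS d, t2.c2.e AS e,
         t2.f2.g AS g, t2.f2.h AS h
  FROM (
    SELECT t1.a, t1.b, t1.c2, FLATTEN(t1.c2.f) AS f2
    FROM (
      SELECT t0.a, t0.b, FLATTEN(t0.c) AS c2
      FROM dfs.`/tmp/nested.json` t0
    ) t1
  ) t2
  WHERE MOD(t2.f2.g, 10) = 3 OR t2.f2.g = 5;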

- Paul

On Fri, Aug 18, 2023 at 2:01 PM Mike Beckerle  wrote:

> I'm using Apache Daffodil in the mode where it outputs JSON data. (For the
> moment, until we build a tighter integration. This is my conceptual test
> framework for that integration.)
>
> I have parsed data to create this JSON which represents 2-level nested
> repeating subrecords.
>
> All the simple fields are int.
>
> [{"a":1,  "b":2,  "c":[{"d":3,  "e":4,  "f":[{"g":5,  "h":6 },
>  {"g":7,  "h":8 }]},
>{"d":9,  "e":10, "f":[{"g":11, "h":12},
>  {"g":13, "h":14}]}]},
>  {"a":21, "b":22, "c":[{"d":23, "e":24, "f":[{"g":25, "h":26 },
>  {"g":27, "h":28 }]},
>{"d":29, "e":30, "f":[{"g":31, "h":32},
>  {"g":33, "h":34}]}]}]
>
> So, the top level is a vector of maps,
> within that, field "c" is a vector of maps,
> and within "c" is a field f which is a vector of maps.
>
> The reason I created this is I'm trying to understand the arrays and how
> they work with Drill SQL.
>
> I'm trying to figure out how to get this rowset of 3 rows from a query, and
> I'm stumped.
>
>   a   b   d   e   g   h
> ( 1,  2,  3,  4,  5,  6)
> ( 1,  2,  9, 10, 13, 14)
> (21, 22, 29, 30, 33, 34)
>
> This is the SQL that is my conceptual framework, but I'm sure it won't
> work.
>
> SELECT a, b, c.d AS d, c.e AS e, c.f.g AS g, c.f.h AS h
> FROM ... the json file...
> WHERE g mod 10 == 3 OR g == 5
>
> But I know it's not going to be that easy to get the query to traverse the
> vector inside the vector.
>
> From the doc, the FLATTEN operator seems to be needed, but I can't really
> figure it out.
>
> This is what all my data is like. Trees of nested vectors of sub-records.
>
> Can anyone advise on what the SQL might look like, or where there's an
> example doing something like this I can learn from?
>
> Thanks for any help
>
> Mike Beckerle
> Apache Daffodil PMC | daffodil.apache.org
> OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
> Owl Cyber Defense | www.owlcyberdefense.com
>


Re: is there a way to provide inline array metadata to inform the xml_reader?

2023-08-14 Thread Paul Rogers
IIRC, the syntax for the "provided schema" for arrays is "ARRAY<type>", such
as "ARRAY<VARCHAR>". This works, however, only if the XML reader uses the
(very complex) EVF framework and has a way to control parsing based on the
data type (and to set the data type based on parsing). The JSON reader has
such an integration. Charles, did you do the work to add that kind of
dynamic state machine to the XML parser?
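
If the XML reader did support it, the inline form would presumably mirror the
datatypes test quoted below, something like the following (the file name and
column are invented, and this depends on exactly the array support in
question):

  SELECT * FROM table(cp.`xml/simple_with_arrays.xml`
    (type => 'xml', schema => 'inline=(`int_array` ARRAY<INT>)'));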

- Paul

On Mon, Aug 14, 2023 at 6:28 PM Charles Givre  wrote:

> Hi Mike,
> It is theoretically possible but I don't have an example of the syntax.
> As you've probably figured out, Drill vectors have both a type and data
> mode.  The mode is either NULLABLE or REPEATED if I remember correctly.
> Thus, you could tell Drill via the inline schema that the data mode for a
> given field is REPEATED and that would be the Drill equivalent of an
> Array.  I've never actually done this, so I don't really know if it would
> work for inline schemata but I'd assume that it would.
>
> I'll do some digging to see whether I have any examples of this.
> Best,
> --C
>
>
>
>
>
> > On Aug 14, 2023, at 3:36 PM, Mike Beckerle  wrote:
> >
> > I'm trying to get my Drill SQL queries to produce the right thing from
> XML.
> >
> > A major thing that you can't easily infer from looking at just XML data
> is
> > what is an array. XML lacks an array starting indicator.
> >
> > Is there an inline schema notation in the Drill Query language for
> > array-ness, so that one can inform Drill what is an array?
> >
> > For example this provides simple types for all the fields directly in the
> > query.
> >
> > @Test
> >
> > public void testSimpleProvidedSchema() throws Exception {
> >
> >  String sql = "SELECT * FROM table(cp.`xml/simple_with_datatypes.xml`
> > (type => 'xml', schema " +
> >
> >"=> 'inline=(`int_field` INT, `bigint_field` BIGINT, `float_field`
> > FLOAT, `double_field` DOUBLE, `boolean_field` " +
> >
> >"BOOLEAN, `date_field` DATE, `time_field` TIME, `timestamp_field`
> > TIMESTAMP, `string_field`" +
> >
> >" VARCHAR, `date2_field` DATE properties {`drill.format` =
> > `MM/dd/`})'))";
> >
> >  RowSet results = client.queryBuilder().sql(sql).rowSet();
> >
> >  assertEquals(2, results.rowCount());
> >
> >
> > Can one also tell Drill what fields or child elements are arrays?
>
>


Re: UserBitShared.proto question

2023-08-08 Thread Paul Rogers
Unless something changed, Drill's build does not compile the .proto files.
Instead, the files are generated manually, and checked into git, on those
rare occasions that the API changes. I seem to recall that there are some
instructions somewhere, but a quick search didn't reveal anything.

- Paul

On Tue, Aug 8, 2023 at 6:46 AM Mike Beckerle  wrote:

> So is UserBitShared.java generated from UserBitShared.proto ?
>
> It looks like it is, but mvn clean install -DskipTests=true doesn't seem to
> cause it to be regenerated.
>
> What do I do to cause the regeneration?
>
> Right now I've edited both files to add a new ErrorType.SCHEMA, but I think
> I should only have to edit in one spot.
>


Re: drill tests not passing

2023-07-11 Thread Paul Rogers
Hi Mike,

A quick glance at the log suggests a failure in the tests for the JSON
reader, in the Mongo extended types. Drill's date/time support has
historically been fragile. Some tests only work if your machine is set to
use the UTC time zone (or Java is told to pretend that the time is UTC.)
The Mongo types test failure seems to be around a date/time test so maybe
this is the issue?

There are also failures indicating that the Drillbit (Drill server) died.
Not sure how this can happen, as tests run Drill embedded (or used to.)
Looking earlier in the logs, it seems that the Drillbit didn't start due to
UDF (user-defined function) failures:

Found duplicated function in drill-custom-lower.jar:
custom_lower(VARCHAR-REQUIRED)
Found duplicated function in built-in: lower(VARCHAR-REQUIRED)

Not sure how this could occur: it should have failed in all builds.

Also:

File
/opt/drill/exec/java-exec/target/org.apache.drill.exec.udf.dynamic.TestDynamicUDFSupport/home/drill/happy/udf/staging/drill-custom-lower-sources.jar
does not exist on file system file:///

This is complaining that Drill needs the source code (not just class file)
for its built-in functions. Again, this should not fail in a standard
build, because if it did, it would fail in all builds.

There are other odd errors as well.

Perhaps we should ask: is this a "stock" build? Check out Drill and run
tests? Or, have you already started making changes for your project?

- Paul


On Tue, Jul 11, 2023 at 9:07 AM Mike Beckerle  wrote:

>
> I have drill building and running its tests. Some tests fail: [ERROR]
> Tests run: 4366, Failures: 2, Errors: 1, Skipped: 133
>
> I am wondering if there is perhaps some setup step that I missed in the
> instructions.
>
> I have attached the output from the 'mvn clean install -DskipTests=false'
> execution. (zipped)
> I am running on Ubuntu 20.04, definitely have Java 8 setup.
>
> I'm hoping someone can skim it and spot the issue(s).
>
> Thanks for any help
>
> Mike Beckerle
> Apache Daffodil PMC | daffodil.apache.org
> OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
> Owl Cyber Defense | www.owlcyberdefense.com
>
>
>


Re: Drill and Highly Hierarchical Data from Daffodil

2023-07-11 Thread Paul Rogers
Drill can internally handle scalars, arrays (AKA vectors) and maps (AKA
tuples, structs). SQL, however, prefers to work with scalars: there is no
good syntax to reach inside a complex object for, say, a WHERE condition
without also projecting that item as a top-level scalar.

The cool thing, for ML use cases, is that Drill's arrays can also be
structured: a vector of input values each of which is a vector of data
points along with a class label.

That said, if you have a record with a field "obj" that is a map (struct,
object) that contains a field "coord" that is an array of two (or three)
doubles, you can project it as:

SELECT obj.coord FROM something

The value you get back will be an array. Drill's native API handles this
just fine. JDBC does not really speak "vector". So, in that case, you could
project the elements:

SELECT obj.coord[0] AS x, obj.coord[1] AS y FROM something

I find it helpful to first think about how Drill's internal data vectors
will look, then work from there to the SQL that will do what needs doing.

- Paul

On Tue, Jul 11, 2023 at 11:46 AM Charles Givre  wrote:

> HI Mike,
> When you say "you want all of them', can you clarify a bit about what
> you'd want the data to look like?
> Best,
> -- C
>
>
>
> > On Jul 11, 2023, at 12:33 PM, Mike Beckerle 
> wrote:
> >
> > In designing the integration of Apache Daffodil into Drill, I'm trying to
> > figure out how queries would look operating on deeply nested data.
> >
> > Here's an example.
> >
> > This is the path to many geo-location latLong field pairs in some
> > "messageSet" data:
> >
> >
> messageSet/noc_message[*]/message_content/content/vmf/payload/message/K05_17/overlay_message/r1_group/item[*]/points_group/item[*]/latLong
> >
> > This is sort-of like XPath, except in the above I have put "[*]" to
> > indicate the child elements that are vectors. You can see there are 3
> > nested vectors here.
> >
> > Beneath that path are these two fields, which are what I would want out
> of
> > my query, along with some fields from higher up in the nest.
> >
> > entity_latitude_1/degrees
> > entity_longitude_1/degrees
> >
> > The tutorial information here
> >
> >https://drill.apache.org/docs/selecting-nested-data-for-a-column/
> >
> > describes how to index into JSON arrays with specific integer values,
> but I
> > don't want specific integers, I want all values of them.
> >
> > Can someone show me what a hypothetical Drill query would look like that
> > pulls out all the values of this latLong pair?
> >
> > My stab is:
> >
> > SELECT pairs.entity_latitude_1.degrees AS lat,
> > pairs.entity_longitude_1.degrees AS lon FROM
> >
> messageSet.noc_message[*].message_content.content.vmf.payload.message.K05_17.overlay_message.r1_group.item[*].points_group.item[*].latLong
> > AS pairs
> >
> > I'm not at all sure about the vectors in that though.
> >
> > The other idea was this quasi-notation (that I'm making up on the fly
> here)
> > which treats each vector as a table.
> >
> > SELECT pairs.entity_latitude_1.degrees AS lat,
> > pairs.entity_longitude_1.degrees AS lon FROM
> >  messageSet.noc_message AS messages,
> >
> >
> messages.message_content.content.vmf.payload.message.K05_17.overlay_message.r1_group.item
> > AS parents
> >  parents.points_group.item AS items
> >  items.latLong AS pairs
> >
> > I have no idea if that makes any sense at all for Drill
> >
> > Any help greatly appreciated.
> >
> > -Mike Beckerle
>
>


[jira] [Created] (DRILL-8375) Incomplete support for non-projected complex vectors

2022-12-24 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8375:
--

 Summary: Incomplete support for non-projected complex vectors
 Key: DRILL-8375
 URL: https://issues.apache.org/jira/browse/DRILL-8375
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers
Assignee: Paul Rogers


The `ResultSetLoader` implementation supports all of Drill's vector types. 
However, DRILL-8188 discovered holes in support for non-projected vectors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Small dataset query issue and the workaround we found

2022-08-30 Thread Paul Rogers
Hi All,

As others have said, the only difference between plans for “small” and “large” 
queries is the queue size and memory. As I recall, those are spelled out in the 
docs.

Ensure that there is sufficient memory for the slicing up done by the queue, 
and the query. Memory is allocated to sorts, joins and hash aggregations.
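
The queueing knobs discussed in this thread can be adjusted at runtime, for
example (option names as mentioned below; verify them against the docs for
your Drill version):

  ALTER SYSTEM SET `exec.queue.memory_ratio` = 1.0;
  ALTER SYSTEM SET `exec.queue.threshold` = 6000;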

- Paul

Sent from my iPhone

> On Aug 30, 2022, at 8:21 AM, James Turton  wrote:
> 
> Hi François
> 
> This is quite intriguing. I believe that one difference in execution 
> conditions between the small and large queues is the amount of memory that the 
> query will be allowed. It could be interesting to see if the problem is 
> connected to this difference by setting exec.queue.memory_ratio to 1.0. 
> Enabling debug logging in logback.xml might also reveal more about what 
> differs between these two executions of the same plan.
> 
> Regards
> James
> 
>> On 2022/08/25 18:13, François Méthot wrote:
>> Hi,
>> 
>> I am looking for an explanation to a workaround I have found to an issue
>> that has been bugging my team for the past weeks.
>> 
>> Last June we moved from Drill 1.12 to Drill 1.19... Long overdue upgrade!
>> Not long after we started getting the issue described  below.
>> 
>> We run a query daily on about 410GB of text data spread over  ~2200 files,
>> it has a cost of ~22 Billions queued as Large query
>> When the same query runs on 200MB spread over 130 files  (same data
>> format), with cost of 36 Millions queued as Large query, it never completes.
>> 
>> The small dataset query would stop doing any progress after a few minutes,
>> leaving on for hours, no progress, never complete.
>> 
>> The last running fragment is a HASH_PARTITION_SENDER showing 100% fragment
>> time.
>> 
>> After much shoot in the dark debugging session, analysing the src data etc.
>> 
>> We reviewed our cluster configuration,
>> when we changed
>>  exec.queue.threshold=3000 to 6000
>> to categorize our 200MB dataset as a small query,
>> 
>> The small dataset query started to work consistently in less than 10
>> seconds.
>> 
>> The physical plan is identical whether the query is Large or Small.
>> 
>> Is there a difference internally in drill execution whether the query is
>> small or large?
>> Would you be able to provide an explanation why this workaround works?
>> 
>> Cluster detail:
>> 8 drillbits running in kubernetes
>> 16GB direct mem
>> 4GB Heap
>> data stored on a very efficient nfs, exposed as k8s pv/pvc to drillbit pods.
>> 
>> Thanks for any insight you can provide on that setting or in regards to our
>> initial problem.
>> François
>> 
> 


Re: [DISCUSS] Add schema support for the XML format

2022-04-06 Thread Paul Rogers
Hi Luoc,

First, what poor soul is asked to deal with large amounts of XML in this
day and age? I thought we were past the XML madness, except in Maven and
Hadoop config files.

XML is much like JSON, only worse. JSON at least has well-defined types
that can be gleaned from JSON syntax. With XML...? Anything goes because
XML is a document mark-up language, not a data structure description
language.

The classic problem with XML is that if XML is used to describe a
reasonable data structure (rows and columns), then it can reasonably be
parsed into rows and columns. If XML represents a document (or a
relationship graph), then there is no good mapping to rows and columns.
This was true 20 years ago and it is true today.

So, suppose your XML represents row-like data. Then an XML parser could
hope for the best and make a good guess at the types and structure. The XML
parser could work like the new & improved JSON parser (based on EVF2) which
Vitalii is working on. (I did the original work and Vitalli has the
thankless task of updating that work to match the current code.) That JSON
parser is VERY complex as it infers types on the fly. Quick, what type is
"a" in [{"a": null}, {"a": null}, {"a": []}]. We don't know. Only when
{"a": [10]} appears can we say, "Oh! All those "a" were REPEATED INTs!"

An XML parser could use the same tricks. In fact, it can probably use the
same code. In JSON, the parser sends events, and the Drill code does its
type inference magic based on those events. An XML parser can emit similar
events, and make similar decisions.

As you noted, if we have a DTD, we don't have to do schema inference. But,
we do have to do DTD-to-rows-and-columns inference. Once do that, we use
the provided schema as you suggested. (The JSON reader I mentioned already
supports a provided schema to add sanity to the otherwise crazy JSON type
inference process when data is sparse and changing.)

In fact, if you convert XML to JSON, then the XML-to-JSON parser has to
make those same decisions. Hopefully someone has already done that and
users would be willing to use that fancy tool to convert their XML to JSON
before using Drill. (Of course, if they want good performance, they should
have converted XML to Parquet instead.)

So, rather than have a super-fancy Drill XML reader, maybe find a
super-fancy XML-to-Parquet converter, use that once, and then let Drill
quickly query Parquet. The results will be much better than trying to parse
XML over and over on each query. Just because we *can* do it doesn't mean
we *should*.
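
For data Drill can already read (for example, after a one-time XML-to-JSON
conversion), that one-off conversion to Parquet is just a CTAS; a sketch with
invented paths:

  ALTER SESSION SET `store.format` = 'parquet';  -- already the default
  CREATE TABLE dfs.tmp.`from_xml` AS
  SELECT * FROM dfs.`/staging/converted_from_xml.json`;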

Thanks,

- Paul



On Wed, Apr 6, 2022 at 5:01 AM luoc  wrote:

> 
> Hello dear driller,
>
> Before starting the topic, I would like to do a simple survey :
>
> 1. Did you know that Drill already supports XML format?
>
> 2. If yes, what is the maximum size for the XML files you normally read? 1MB,
> 10MB or 100MB
>
> 3. Do you expect that reading XML will be as easy as JSON (Schema
> Discovery)?
>
> Thank you for responding to those questions.
>
> XML is different from the JSON file, and if we rely solely on the Drill
> drive to deduce the structure of the data. (or called *SCHEMA*), the code
> will get very complex and delicate.
>
> For example, inferring array structure and numeric range. So, "provided
> schema" or "TO_JSON" may be good medicine :
>
> *Provided Schema*
>
> We can add the DTD or XML Schema (XSD) support for the XML. It can build
> all value vectors (Writer) before reading data, solving the fields, types,
> and complex nested.
>
> However, a definition file is actually a rule validator that allows
> elements to appear 0 or more times. As a result, it is not possible to know
> if all elements exist until the data is read.
>
> Therefore, avoid creating a large number of value vectors that do not
> actually exist before reading the data.
>
> We can build the top schema at the initial stage and add new value vectors
> as needed during the reading phase.
>
> *TO_JSON*
>
> Read and convert XML directly to JSON, using the JSON Reader for data
> resolution.
>
> It makes it easier for us to query the XML data such as JSON, but requires
> reading the whole XML file in memory.
>
> I think the two can be done, so I look forward to your spirited discussion.
>
> Thanks.
>
> - luoc
>


[jira] [Created] (DRILL-8185) EVF 2 doen't handle map arrays or nested maps

2022-04-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8185:
--

 Summary: EVF 2 doen't handle map arrays or nested maps
 Key: DRILL-8185
 URL: https://issues.apache.org/jira/browse/DRILL-8185
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.20.0
Reporter: Paul Rogers
Assignee: Paul Rogers


When converting Avro, Luoc found two bugs in how EVF 2 (the projection 
mechanism) handles map array and nested maps



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: [VOTE] Adopt the Drill Test Framework from MapR

2022-03-17 Thread Paul Rogers
Abhishek used to have that thing running like a charm. Great to see it
getting attention again.

+1

- Paul

On Thu, Mar 17, 2022 at 2:03 AM James Turton  wrote:

> Hi dev community!
>
> Many of you need no introduction to the test framework developed by MapR
>
> https://github.com/mapr/drill-test-framework
>
> . For those who don't know, the test framework contains around 10k tests
> often exercising scenarios not covered by Drill's unit tests. Just weeks
> ago it revealed a regression in a Drill 1.20 RC and saved us from
> shipping with that bug. The linked repository has been dormant for going
> on two years but I am aware of bits of work that have been done on the
> test framework since, and today Anton is actively dusting off and
> updating it. Since the codebase is under the Apache 2.0 license, we are
> free to bring a copy into the Drill project. I've created a new
> (currently empty) possible home for the test framework at
>
> https://github.com/apache/drill-test-framework
>
> Before I proceed to push a clone there, please vote if you support or
> oppose our adoption of the test framework.
>
> P.S. I have also sent a message to a contact at HPE just in case they
> might be aware of some concern applicable to our copying this repo but,
> given the license applied, I cannot see that there will be be one.
> Should anything get raised (and we'd decided to proceed) I would, of
> course, pause so that we can discuss.
>
> Regards
> James
>


[jira] [Created] (DRILL-8159) Upgrade HTTPD, Text readers to use EVF3

2022-03-06 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8159:
--

 Summary: Upgrade HTTPD, Text readers to use EVF3
 Key: DRILL-8159
 URL: https://issues.apache.org/jira/browse/DRILL-8159
 Project: Apache Drill
  Issue Type: New Feature
Reporter: Paul Rogers
Assignee: Paul Rogers


Continuation of work originally in the DRILL-8085 PR.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: [DISCUSS] Some ideas for Drill 1.21

2022-02-09 Thread Paul Rogers
That is a vote of confidence for the plan. Furthermore, depending
> on how this is implemented the chance of a false positive change detection
> may increase with very long histories which gives a desirable side effect
> of occasional replanning.
>
> What do other people think about this?
>
>
>
>
> On Tue, Feb 8, 2022 at 4:48 AM James Turton  wrote:
>
> > My 2c while I wait for 1.20.0 RC1 to upload.
> >
> > I think it's good that we continue to bring every design decision out
> > here to the community like Charles did with this one.  Some relevant
> > excerpts I turned up while zipping around a few ASF docs now.
> >
> > "Mailing lists are the virtual rooms where ASF communities live,
> > form and grow. All formal decisions the project's PMC makes need to
> > have an email thread (possibly with a recorded vote) as an audit
> > trail that this was an official decision." [1]
> >
> > "We firmly believe in hats
> > <https://www.apache.org/foundation/how-it-works.html#hats>. Your
> > role at the ASF is one assigned to you personally, and is bestowed
> > on you by your peers. It is not tied to your job or current employer
> > or company." [2]
> >
> > "Unless they specifically state otherwise, whatever an ASF
> > participant posts on any mailing list is done /as themselves/. It is
> > the individual point-of-view, wearing their personal hat and not as
> > a mouthpiece for whatever company happens to be signing their
> > paychecks right now, and not even as a director of the ASF." [2]
> >
> >
> > Info schema.  Info schema is slow when the set of enabled storage
> > plugins is slow to register schemas.  Flaky plugins can be so slow to do
> > this as to make info schema appear broken.  Info schema recently had its
> > filter push down improved so that unneeded schema registration is
> > avoidable [3], and I tested it working in the case of an unreachable
> > active PostgreSQL plugin (provided my WHERE clause excluded said pg).
> >
> > In my opinion making today's "on-demand" info schema, which re-fetches
> > schema metadata from sources whenever a query requests it, more
> > efficient is the right place to start.  Rewriting it on EVF2 would, I
> > understand, gain it limit push down support for free, though filter push
> > down seems more likely to be helpful on this kind of data to me.  There
> > is also no reason I can see for info schema not to fetch schema metadata
> > from plugins concurrently.  I don't know if this would be best achieved
> > by explicit programming of the concurrency, or by making the info schema
> > look "splittable" to Drill so that multiple fragments get created.
> >
> > Lastly, I'm generally against introducing any sort of results caching,
> > data or metadata, except in special circumstances such as when the
> > planner can be certain that the underlying data has not changed (seldom
> > or never the case for Drill because it doesn't control its own storage
> > layer).  I think that databases, reliable ones anyway, tend to shun
> > results caching and push it to the application layer, since only that
> > layer can decide what kind of staleness is acceptable, but correct me if
> > I'm wrong.  My conclusion here is that I'd rather do this last, and only
> > after careful consideration.
> >
> > [1] https://infra.apache.org/mailing-list-moderation.html
> > [2] https://www.apache.org/foundation/how-it-works.html#management
> > [3] https://github.com/apache/drill/pull/2388
> >
> > On 2022/02/07 21:05, Ted Dunning wrote:
> > > Another option is to store metadata as data in a distributed data
> store.
> > > For static resources, that can scale very well. For highly dynamic
> > > resources like conventional databases behind JDBC connections, you can
> > > generally delegate metadata to that layer. Performance for delegated
> > > metadata won't necessarily be great, but those systems are usually
> either
> > > small (like Postgress or mySQL) or fading away (like Hive).
> > >
> > > Focusing metadata and planning to a single node will make query
> > concurrency
> > > much worse (and it's already not good).
> > >
> > >
> > > On Sun, Feb 6, 2022 at 6:28 PM Paul Rogers  wrote:
> > >
> > >> Hi All,
> > >>
> > >> Drill, like all open source projects, exists to serve those that use
> > it. To
> > >> that end, the best contributions come when some company ne

Re: [DISCUSS] Some ideas for Drill 1.21

2022-02-06 Thread Paul Rogers
Hi All,

Drill, like all open source projects, exists to serve those that use it. To
that end, the best contributions come when some company needs a feature
badly enough that it is worth the effort to develop and contribute a
solution. That's pretty standard, as along as the contribution is general
purpose. In fact, I hope everyone using Drill in support of their company
will contribute enhancements back to Drill. If you maintain your own
private fork, you're not helping the community that provided you with the
bulk of the code.

For the info schema, I'm at a loss to guess why this would be slow, unless
every plugin is going off and scanning some external source. Knowing that
we have a dozen plugins is not slow. Looking at plugin configs is not slow.
What could be slow is if you want to know about every possible file in HDFS
or S3, every database and table in an external DB, etc. In this case, the
bottleneck is either the external system, or the act of querying a dozen
different external systems. Perhap, Charles, you can elaborate on the
specific scenario you have in mind.

Depending on the core issues, there are various solutions. One solution is
to cache all the external metadata in Drill. That's what Impala did with
the Hive Metastore, and it was a mess. I don't expect Drill would do any
better a job. One reason it was a mess is that, in a production system,
there is a vast amount of metadata. You end up playing all manner of tricks
to try to compress it. Since Drill (and Impala) are fully symmetric, each
node has to hold the entire cache. That is memory that can't be used to run
queries. So, to gain performance (for metadata) you give up performance (at
run time.)

One solution is to create a separate metadata cache node. The query goes to
some Drillbit that acts as Foreman. The Foreman plans the query and
retrieves the needed metadata from the metadata node. The challenge here is
that there will be a large amount of metadata transferred, and the next
thing we know we'll want to cache it in each Drillbit, putting us back
where we started.

So, one can go another step: shift all query planning to the metadata node
and have a single planner node. The user connects to any Drillbit as
Foreman, but that Foreman first talks to the "planner/metadata" node to
give it SQL and get back a plan. The Foreman then runs the plan as usual.
(The Foreman runs the root fragment of the plan, which can be compute
intensive, so we don't want the planner node to also act as the Foreman.)
The notion here is that the SQL in/plan out is much smaller than the
metadata that is needed to compute the plan.

The idea about metadata has long been that Drill should provide a metadata
API. The Drill metastore should be seen as just one of many metadata
implementations. The Drill metastore is a "starter solution" for those who
have not already invested in another solution. (Many shops have HMS or
Amazon Glue, which is Amazon's version of HMS, or one of the newer
metadata/catalog solutions.)

One can go even further. Consider file and directory pruning in HMS. Every
tool has to do the exact same thing: given a set of predicates, find the
directories and files that match. Impala does it. Spark must do it.
Preso/Trino probably does it. Drill, when operating in Hive/HMS mode must
do it. Maybe someone has come with the One True Metadata Pruner and Drill
can just delegate the task to that external tool, and get back the list of
directories and files to scan. Far better than building yet another pruner.
(I think Drill currently has two Parquet metadata pruners, duplicating what
many other tools have done.)

If we see the source of metadata as plugable, then a shop such as DDR that
has specific needs (maybe caching those external schemas), can build a
metadata plugin for that use case. If the solution is general, it can be
contributed to Drill as another metadata option.

In any case, if we can better understand the specific problem you are
encountering, we can perhaps offer more specific suggestions.

Thanks,

- Paul

On Sun, Feb 6, 2022 at 8:11 AM Charles Givre  wrote:

> Hi Luoc,
> Thanks for your concern.  Apache projects are often backed unofficially by
> a company.  Drill was, for years, backed my MapR as evident by all the MapR
> unique code that is still in the Drill codebase. However, since MapR's
> acquisition, I think it is safe to say that Drill really has become a
> community-driven project.  While some of the committers are colleagues of
> mine at DataDistillr, and Drill is a core part of DataDistillr, from our
> perspective, we've really just been focusing on making Drill better for
> everyone as well as building the community of Drill users, regardless of
> whether they use DataDistillr or not.  We haven't rejected any PRs because
> they go against our business model or tried to steer Drill against the
> community or anything like that.
>
> Just for your awareness, there are other OSS projects, including some
> Apache projects 

[jira] [Created] (DRILL-8124) Fix implicit file issue with EVF 2

2022-02-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8124:
--

 Summary: Fix implicit file issue with EVF 2
 Key: DRILL-8124
 URL: https://issues.apache.org/jira/browse/DRILL-8124
 Project: Apache Drill
  Issue Type: New Feature
Reporter: Paul Rogers
Assignee: Paul Rogers


Unit testing with EVF 2 found an issue in the handling of implicit columns.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (DRILL-8123) Revise scan limit pushdown

2022-02-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8123:
--

 Summary: Revise scan limit pushdown
 Key: DRILL-8123
 URL: https://issues.apache.org/jira/browse/DRILL-8123
 Project: Apache Drill
  Issue Type: New Feature
Reporter: Paul Rogers
Assignee: Paul Rogers


Recent work added a push down of the limit into a scan. The work had a few 
holes, one of which was plugged by the recent update of EVF to manage the 
limit. Another hole is that the physical plan uses a value of 0 to indicate no 
limit, but 0 is a perfectly valid limit, it means "no data, only schema." The 
field name is "maxRecords", but should be "limit" to indicate the purpose.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (DRILL-8115) LIMIT pushdown into EVF

2022-01-28 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8115:
--

 Summary: LIMIT pushdown into EVF
 Key: DRILL-8115
 URL: https://issues.apache.org/jira/browse/DRILL-8115
 Project: Apache Drill
  Issue Type: New Feature
Reporter: Paul Rogers
Assignee: Paul Rogers


Add LIMIT support to the scan framework and EVF so that plugins don't have to 
implement it themselves.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: [ANNOUNCE] James Turton as PMC Member

2022-01-24 Thread Paul Rogers
Congratulations James!

- Paul

On Mon, Jan 24, 2022 at 9:34 AM Charles Givre 
wrote:

> The Project Management Committee (PMC) for Apache Drill is pleased to
> announce that we have invited James Turton to join us as a PMC member of
> the Drill project and he has accepted.  Please join me in congratulating
> James and welcoming him to the PMC!
>
>
> Best,
> Charles Givre
> PMC Chair, Apache Drill
>
>
>
>
> Charles Givre
> Founder, CEO DataDistillr
> Email:  char...@datadistillr.com
> Phone:  + 443-762-3286
> Book a Meeting 30 min  • 60
> min 
> LinkedIn @cgivre 
> GitHub @cgivre 
>  
>


Re: [ANNOUNCE] New Committer: PJ Fanning

2022-01-24 Thread Paul Rogers
Congratulations!

- Paul

On Mon, Jan 24, 2022 at 9:15 AM Charles Givre  wrote:

> The Project Management Committee (PMC) for Apache Drill is pleased to
> announce that we have invited PJ Fanning to join us as a committer to the
> Drill project.  PJ is a committer and PMC member for the Apache POI project
> and author of the Excel Streaming library which Drill uses for the Excel
> reader.  He has contributed numerous fixes and assistance to Drill relating
> to the Drill's Excel reader.  Please join me in congratulating PJ and
> welcoming him as a committer!
>
> Best,
> Charles Givre
> PMC Chair, Apache Drill
>
>


Re: [DISCUSS] Lombok - friend or foe?

2022-01-24 Thread Paul Rogers
A quick check of the source suggests that the Easy Format config builder
(which is a nice addition) does not use Lombok. Someone coded up (or had
their IDE code up) the setters one-by-one. Makes sense, Lombok isn't for
the builder pattern.

Note that allowing Lombok in any part of Drill is the same as allowing it
everywhere. The old CS thing that the only numbers that matter are 0, 1 and
infinity. To do a PR, all tests should pass, which means that the IDE needs
to be able to debug any that have problems. If any plugin uses Lombok, then
developers have to wrestle with it. (But, what is a plugin doing with data
objects?)

So, perhaps remove it entirely for now. It can be added back for extensions
when those extensions are separate projects. (Though, adding that
dependency on one extension adds it for everyone. Will there be Lombok
version conflicts? Should we wait for class loader isolation before
allowing it back?)

In general, Drill is so large that it should not take on more dependencies
unless they are a huge win. This is a reason to move the obscure plugins
out of the core: mucking with distributed systems should not also require one
to muck with Excel.

- Paul

On Sun, Jan 23, 2022 at 11:59 PM James Turton  wrote:

> I'll prepare a PR that unlomboks everything except contrib.  Since we're
> talking about contrib splitting off into one or many independent code
> bases (c.f. install "Drill 2 and plug-in organisation"), working to make
> it conform to coding standards that we're selecting for core Drill
> probably won't pay.
>
> On 2022/01/23 01:36, Charles Givre wrote:
> > I guess the question is do we de-lombok what has already been done?  I
> really like the builders for plugin configs, but I'm generally in agreement
> that if it is causing problems building, we should ditch it.
> > Best,
> > -- C
> >
> >
> >
> >> On Jan 22, 2022, at 5:02 PM, Ted Dunning  wrote:
> >>
> >> The Lombok story is better in Intellij, possibly because the Lombok devs
> >> use IntelliJ like the majority of devs. Once I knew to install the
> plugin,
> >> things were at least comprehensible.
> >>
> >> But the problem is that it isn't obvious. As a newcomer, you don't know
> >> what you don't know and because Lombok's major effect is code that isn't
> >> there, it isn't obvious where to look.
> >>
> >> The point about it not helping that much due to Drill's design (good
> point,
> >> paul) is apposite, but I think the naive reader issue is even bigger.
> >>
> >> Net, as a person who isn't developing anything for Drill just lately, I
> >> don't think it's a good idea at all.
> >>
> >>
> >>
> >> On Sat, Jan 22, 2022 at 6:37 AM luoc  wrote:
> >>
> >>> Hi all,
> >>>
> >>> I have a story here. In Oct 2021, I upgraded Eclipse to the latest release
> >>> (2021-09) and then found out that the Lombok dependency had been added to the
> >>> Drill repository, so I installed Lombok (as a new plugin) from Eclipse
> >>> Marketplace as I used to. Finally, I restarted the IDE and prepared to open
> >>> the Drill project, but it crashed because of issue #2956 <
> >>> https://github.com/projectlombok/lombok/issues/2956>, and Lombok was not
> >>> available until I found a temporary solution.
> >>>
> >>> I use both Eclipse and IDEA, but I use Eclipse more often. I have no
> >>> objection to the use of Lombok, but suggest the following three points
> :
> >>>
> >>> 1. Could we use Lombok only in `drill-contrib` module?
> >>>
> >>> 2. Could we agree not to use Lombok in common module?
> >>>
> >>> 3. It is best to update the dev documentation to describe these results if
> >>> we continue to use Lombok.
> >>>
> >>> In fact, I have the same idea as Paul, more about balancing choices.
> >>>
> >>> Thanks.
> >>>
> >>>> 2022年1月22日 下午5:34,Paul Rogers  写道:
> >>>>
> >>>> Hi All,
> >>>>
> >>>> I look at any tool as a cost/benefit tradeoff. If Drill were a typical
> >>>> business app, with lots of "data objects", then the hassle of Lomboc
> >>> might
> >>>> be a net win. However, the nature of Drill is that we have very few
> data
> >>>> objects. We have lots of Protobuf objects, or Jackson-serialized
> objects,
> >>>> but not too many data objects of the kind used with object-relational
> >>>> mappers.
> >>>>
> >&

Re: [DISCUSS] Lombok - friend or foe?

2022-01-22 Thread Paul Rogers
Hi All,

I look at any tool as a cost/benefit tradeoff. If Drill were a typical
business app, with lots of "data objects", then the hassle of Lombok might
be a net win. However, the nature of Drill is that we have very few data
objects. We have lots of Protobuf objects, or Jackson-serialized objects,
but not too many data objects of the kind used with object-relational
mappers.

On the other hand, I had to spend an hour or so trying to figure out why
things would not build in Eclipse. Then, more time to figure out how to
install the half-finished Lombok plugin for Eclipse and various other
fiddling.

So, I'd guess, on balance, Lombok has cost, and will continue to cost, more
time than it has saved by avoiding a few getter/setter methods. And, I agree with
Ted: Eclipse (and, I assume, IntelliJ) is pretty quick at generating those
methods.
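For anyone new to the thread, the boilerplate trade-off under discussion looks roughly like this; the class is illustrative, not from the Drill code base, and the Lombok variant is left commented out so the snippet compiles without the dependency:

{noformat}
// With Lombok, one annotation generates the getters, setters, equals/hashCode, toString:
// @lombok.Data
// public class EndpointConfig {
//   private String host;
//   private int port;
// }

// Without Lombok, the same accessors are written (or IDE-generated) by hand:
public class EndpointConfig {
  private String host;
  private int port;

  public String getHost() { return host; }
  public void setHost(String host) { this.host = host; }
  public int getPort() { return port; }
  public void setPort(int port) { this.port = port; }
}
{noformat}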

Since Lombok has a cost, and is not a huge win, KISS suggests we avoid
adding extra dependencies unnecessarily.

That's my 2 cents...

- Paul



On Fri, Jan 21, 2022 at 8:51 AM Ted Dunning  wrote:

> A couple of years ago, I had a dev introduce Lombok into some code without
> me knowing. That let me be a classic naive user.
>
> The result was total confusion on my part. Sooo much code was being
> automagically generated that I couldn't figure out the code and spent a lot
> of time chasing my tail and very little time looking at the crux of the
> code.
>
> My own personal preference is either
>
> - use a language like Julia if you want magic. It's fantastic and all to
> have amazing stuff and coders expect to see it.
>
> - use an IDE to generate the boiler plate and put it into its own little
> annex in the code with the interesting bits near the top of classes. That
> lets debuggers and IDEs that don't understand Lombok function without
> impairing readability much. Concurrent with that, use discipline to not do
> strange things like changing the expected meaning of the boilerplate.
>
> That's my preference, but I wouldn't want to push that preference very
> hard. My own prioritization is on readability of the code by outsiders.
>
>
>
>
> On Fri, Jan 21, 2022 at 2:25 AM James Turton  wrote:
>
> > Hi again Devs
> >
> > This one is simple to describe.  Lombok entered the Drill code base this
> > year, but not everyone feels that Lombok is appropriate for every code
> > base.  To my, fairly limited, understanding the advantage of Lombok is
> > that boilerplate code is reduced while the disadvantage is the
> > deployment of code generation magic that can have untoward effects on
> > build-time tools and IDEs.
> >
> > So here is a chance to opine on Lombok if you'd like to.  My own opinion
> > is very near neutral and goes something like "It burned me a bit once,
> > but hasn't since, and less boilerplate is nice.  I guess it can stay
> > .  I hope I don't regret this one day."
> >
> > Regards
> > James
> >
>


Re: [DISCUSS] Drill 2 and plug-in organisation

2022-01-18 Thread Paul Rogers
Hi James,

My experience might be a bit old. I seem to recall, way back when, we did
try to build some plugins outside of Drill itself and that there were
issues. Maybe it was just the inconvenience of debugging? Perhaps the test
libraries were not available? Development is fastest when you can write a
unit test that fires up Drill, and exercises your plugin. You can then step
through the code, see an error, fix it, and try again in a matter of
seconds. Without that, you have to rebuild your jar, copy it to Drill,
restart Drill, submit a query, and hope to figure out what is wrong when
things blow up.

So, I wonder if we also publish test jars? If not, that would be a big help.

UDFs also have issues since Drill doesn't actually run your code: Drill
copies it. And, unless you know about the magic thingie, Drill won't even
load your UDF. (Have to tell Drill not to load from cache, if I recall.)

To test all this out, just build a demo plugin and demo UDF using the
libraries. If it is smooth sailing, we're good to go. If not, figure out
what's missing and fix it.

Oh, and another issue: class loader isolation. As Drill includes ever more
plugins, dependencies will conflict. That's why Presto/Trino loads plugins
in a separate class loader: Trino may use version 5 of library X, but I
might use 7. With class loader isolation, stuff just works. Without it, one
lives in Maven dependency hell for a while.
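A bare-bones sketch of the class-loader-isolation idea; the parent-loader choice and the method name are assumptions, not how Drill loads plugins today:

{noformat}
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;

public class PluginLoaderSketch {
  // Load each plugin jar in its own class loader so its dependencies
  // (library X version 7) cannot clash with Drill's own (version 5).
  public static ClassLoader isolatedLoader(Path pluginJar) throws Exception {
    URL[] urls = { pluginJar.toUri().toURL() };
    // Parent is the JDK platform loader, so the plugin sees only its own jars
    // plus whatever shared API is explicitly exposed to it.
    return new URLClassLoader(urls, ClassLoader.getPlatformClassLoader());
  }
}
{noformat}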

Thanks,

- Paul


On Tue, Jan 18, 2022 at 12:29 AM James Turton  wrote:

> For my part, I'd forgotten that GitHub does give users the opportunity
> to attach binary distributables to releases.  So my first thought of
> "GitHub would mean using Git repositories to host Jar files" was off the
> mark.
>
> Paul, setting aside the hosting and distribution for a moment, may I ask
> about the statement "ensure plugins can be built outside of the Drill
> repo"?  Released versions of Drill's own libs are already published to
> Maven.  E.g.
>
>
> https://mvnrepository.com/artifact/org.apache.drill.exec/drill-java-exec/1.19.0
>
> Can a plugin writer not create a new project which lists the required
> Drill libs in its pom.xml deps and proceed to build a plugin away from
> the main tree?  Interactive debugging without the Drill main tree should
> even be possible by attaching a debugger to a running embedded Drill
> with the storage plugin deployed to it, or am I wrong here?
>
> On 2022/01/18 00:32, Paul Rogers wrote:
> > Hi Ted,
> >
> > Thanks for the explanation, makes sense.
> >
> > Ideally, the client side would be somewhat agnostic about the repo it
> pulls
> > from. In a corporate setting, it should pull from the "JFrog Repository"
> > that everyone seems to use (but which I know basically nothing.) Oh,
> lord,
> > a plugin architecture for the repo for the plugin architecture?
> >
> > - Paul
> >
> > On Mon, Jan 17, 2022 at 1:46 PM Ted Dunning 
> wrote:
> >
> >> Paul,
> >>
> >> I understood your suggestion.  My point is that publishing to Maven
> >> central is a bit of a pain while publishing by posting to Github is
> nearly
> >> painless.  In particular, because Github inherently produces a
> relatively
> >> difficult to fake hash for each commit, referring to a dependency using
> >> that hash is relatively safe which saves a lot of agony regarding keys
> and
> >> trust.
> >>
> >> Further, Github or any comparable service provides the same "already
> >> exists" benefit as does Maven.
> >>
> >>
> >>
> >> On Mon, Jan 17, 2022 at 1:30 PM Paul Rogers  wrote:
> >>
> >>> Hi Ted,
> >>>
> >>> Well said. Just to be clear, I wasn't suggesting that we use
> >>> Maven-the-build-tool to distribute plugins. Rather, I was simply
> observing
> >>> that building a global repo is a bit of a project and asked, "what
> could we
> >>> use that already exists?" The Python repo? No. The
> Ubuntu/RedHat/whatever
> >>> Linux repos? Maybe. Maven's repo? Why not?
> >>>
> >>> The idea would be that Drill might have a tool that says, "install the
> >>> FooBlaster" plugin. It downloads from a repo (Maven central, say) and
> puts
> >>> the plugin in the proper plugins directory. In a cluster, either it
> does
> >>> that on every node, or the work is done as part of preparing a Docker
> >>> container which is then pushed to every node.
> >>>
> >>> The key thought is just to make the problem simpler by avoiding the
> need
> >>> to create and maintain a Drill-specific repo when we can

Re: [DISCUSS] Drill 2 and plug-in organisation

2022-01-17 Thread Paul Rogers
Hi Ted,

Thanks for the explanation, makes sense.

Ideally, the client side would be somewhat agnostic about the repo it pulls
from. In a corporate setting, it should pull from the "JFrog Repository"
that everyone seems to use (but which I know basically nothing.) Oh, lord,
a plugin architecture for the repo for the plugin architecture?

- Paul

On Mon, Jan 17, 2022 at 1:46 PM Ted Dunning  wrote:

>
> Paul,
>
> I understood your suggestion.  My point is that publishing to Maven
> central is a bit of a pain while publishing by posting to Github is nearly
> painless.  In particular, because Github inherently produces a relatively
> difficult to fake hash for each commit, referring to a dependency using
> that hash is relatively safe which saves a lot of agony regarding keys and
> trust.
>
> Further, Github or any comparable service provides the same "already
> exists" benefit as does Maven.
>
>
>
> On Mon, Jan 17, 2022 at 1:30 PM Paul Rogers  wrote:
>
>> Hi Ted,
>>
>> Well said. Just to be clear, I wasn't suggesting that we use
>> Maven-the-build-tool to distribute plugins. Rather, I was simply observing
>> that building a global repo is a bit of a project and asked, "what could we
>> use that already exists?" The Python repo? No. The Ubuntu/RedHat/whatever
>> Linux repos? Maybe. Maven's repo? Why not?
>>
>> The idea would be that Drill might have a tool that says, "install the
>> FooBlaster" plugin. It downloads from a repo (Maven central, say) and puts
>> the plugin in the proper plugins directory. In a cluster, either it does
>> that on every node, or the work is done as part of preparing a Docker
>> container which is then pushed to every node.
>>
>> The key thought is just to make the problem simpler by avoiding the need
>> to create and maintain a Drill-specific repo when we can barely have enough
>> resources to keep Drill itself afloat.
>>
>> None of this can happen, however, unless we clean up the plugin APIs and
>> ensure plugins can be built outside of the Drill repo. (That means, say,
>> that Drill needs an API library that resides in Maven.)
>>
>> There are probably many ways this has been done. Anyone know of any good
>> examples we can learn from?
>>
>> Thanks,
>>
>> - Paul
>>
>>
>> On Mon, Jan 17, 2022 at 9:40 AM Ted Dunning 
>> wrote:
>>
>>>
>>> I don't think that Maven is a forced move just because Drill is in Java.
>>> It may be a good move, but it isn't a forgone conclusion. For one thing,
>>> the conventions that Maven uses are pretty hard-wired and it may be
>>> difficult to have a reliable deny-list of known problematic plugins.
>>> Publishing to Maven is more of a pain than simply pushing to github.
>>>
>>> The usability here is paramount both for the ultimate Drill user, but
>>> also for the writer of plugins.
>>>
>>>
>>>
>>> On Mon, Jan 17, 2022 at 5:06 AM James Turton  wrote:
>>>
>>>> Thank you Ted and Paul for the feedback.  Since Java is compiled, Maven
>>>> is probably better fit than GitHub for distribution?  If Drillbits can
>>>> write to their jars/3rdparty directory then I can imagine Drill gaining
>>>> the ability to fetch and install plugins itself without too much
>>>> trouble, at least for Drill clusters with Internet access.
>>>> "Sideloading" by downloading from Maven and copying manually would
>>>> always remain possible.
>>>>
>>>> @Paul I'll try to get a little time with you to get some ideas about
>>>> designing a plugin API.
>>>>
>>>> On 2022/01/14 23:20, Paul Rogers wrote:
>>>> > Hi All,
>>>> >
>>>> > James raises an important issue, I've noticed that it used to be easy
>>>> to
>>>> > build and test Drill, now it is a struggle, because of the many odd
>>>> > external dependencies we have introduced. That acts as a big damper on
>>>> > contributions: none of us get paid enough to spend more time fighting
>>>> > builds than developing the code...
>>>> >
>>>> > Ted is right that we need a good way to install plugins. There are two
>>>> > parts. Ted is talking about the high-level part: make it easy to
>>>> point to
>>>> > some repo and use the plugin. Since Drill is Java, the Maven repo
>>>> could be
>>>> > a good mechanism. In-house stuff is often in an internal repo that
>>&g

Re: [DISCUSS] Drill 2 and plug-in organisation

2022-01-17 Thread Paul Rogers
Hi Ted,

Well said. Just to be clear, I wasn't suggesting that we use
Maven-the-build-tool to distribute plugins. Rather, I was simply observing
that building a global repo is a bit of a project and asked, "what could we
use that already exists?" The Python repo? No. The Ubuntu/RedHat/whatever
Linux repos? Maybe. Maven's repo? Why not?

The idea would be that Drill might have a tool that says, "install the
FooBlaster" plugin. It downloads from a repo (Maven central, say) and puts
the plugin in the proper plugins directory. In a cluster, either it does
that on every node, or the work is done as part of preparing a Docker
container which is then pushed to every node.
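To make this concrete, a rough sketch of what such an install tool might do; the repo layout, the names, and the lack of checksum or signature verification are all simplifications:

{noformat}
import java.io.InputStream;
import java.net.URL;
import java.nio.file.*;

public class PluginInstallerSketch {
  // Download a plugin jar from a Maven-style repo and drop it into the local
  // plugins directory. Real code would verify checksums/signatures and repeat
  // this on every node (or bake the result into a Docker image).
  public static void install(String repoBase, String groupPath, String artifact,
                             String version, Path pluginsDir) throws Exception {
    String jar = artifact + "-" + version + ".jar";
    URL url = new URL(repoBase + "/" + groupPath + "/" + artifact + "/" + version + "/" + jar);
    Files.createDirectories(pluginsDir);
    try (InputStream in = url.openStream()) {
      Files.copy(in, pluginsDir.resolve(jar), StandardCopyOption.REPLACE_EXISTING);
    }
  }
}
{noformat}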

The key thought is just to make the problem simpler by avoiding the need to
create and maintain a Drill-specific repo when we can barely have enough
resources to keep Drill itself afloat.

None of this can happen, however, unless we clean up the plugin APIs and
ensure plugins can be built outside of the Drill repo. (That means, say,
that Drill needs an API library that resides in Maven.)

There are probably many ways this has been done. Anyone know of any good
examples we can learn from?

Thanks,

- Paul


On Mon, Jan 17, 2022 at 9:40 AM Ted Dunning  wrote:

>
> I don't think that Maven is a forced move just because Drill is in Java.
> It may be a good move, but it isn't a forgone conclusion. For one thing,
> the conventions that Maven uses are pretty hard-wired and it may be
> difficult to have a reliable deny-list of known problematic plugins.
> Publishing to Maven is more of a pain than simply pushing to github.
>
> The usability here is paramount both for the ultimate Drill user, but also
> for the writer of plugins.
>
>
>
> On Mon, Jan 17, 2022 at 5:06 AM James Turton  wrote:
>
>> Thank you Ted and Paul for the feedback.  Since Java is compiled, Maven
>> is probably better fit than GitHub for distribution?  If Drillbits can
>> write to their jars/3rdparty directory then I can imagine Drill gaining
>> the ability to fetch and install plugins itself without too much
>> trouble, at least for Drill clusters with Internet access.
>> "Sideloading" by downloading from Maven and copying manually would
>> always remain possible.
>>
>> @Paul I'll try to get a little time with you to get some ideas about
>> designing a plugin API.
>>
>> On 2022/01/14 23:20, Paul Rogers wrote:
>> > Hi All,
>> >
>> > James raises an important issue, I've noticed that it used to be easy to
>> > build and test Drill, now it is a struggle, because of the many odd
>> > external dependencies we have introduced. That acts as a big damper on
>> > contributions: none of us get paid enough to spend more time fighting
>> > builds than developing the code...
>> >
>> > Ted is right that we need a good way to install plugins. There are two
>> > parts. Ted is talking about the high-level part: make it easy to point
>> to
>> > some repo and use the plugin. Since Drill is Java, the Maven repo could
>> be
>> > a good mechanism. In-house stuff is often in an internal repo that does
>> > whatever Maven needs.
>> >
>> > The reason that plugins are in the Drill project now is that Drill's
>> "API"
>> > is all of Drill. Plugins can (and some do) access all of Drill though
>> the
>> > fragment context. The API to Calcite and other parts of Drill are wide,
>> and
>> > tend to be tightly coupled with Drill internals. By contrast, other
>> tools,
>> > such as Presto/Trino, have defined very clean APIs that extensions use.
>> In
>> > Druid, everything is integrated via Google Guice and an extension can
>> > replace any part of Druid (though, I'm not convinced that's actually a
>> good
>> > idea.) I'm sure there are others we can learn from.
>> >
>> > So, we need to define a plugin API for Drill. I started down that route
>> a
>> > while back: the first step was to refactor the plugin registry so it is
>> > ready for extensions. The idea was to use the same mechanism for all
>> kinds
>> > of extensions (security, UDFs, metastore, etc.) The next step was to
>> build
>> > something that roughly followed Presto, but that kind of stalled out.
>> >
>> > In terms of ordering, we'd first need to define the plugin API. Then, we
>> > can shift plugins to use that. Once that is done, we can move plugins to
>> > separate projects. (The metastore implementation can also move, if we
>> > want.) Finally, figure out a solution for Ted's suggestion to make it
>> easy
>> > 

Re: [DISCUSS] Drill 2 and plug-in organisation

2022-01-14 Thread Paul Rogers
Hi All,

James raises an important issue, I've noticed that it used to be easy to
build and test Drill, now it is a struggle, because of the many odd
external dependencies we have introduced. That acts as a big damper on
contributions: none of us get paid enough to spend more time fighting
builds than developing the code...

Ted is right that we need a good way to install plugins. There are two
parts. Ted is talking about the high-level part: make it easy to point to
some repo and use the plugin. Since Drill is Java, the Maven repo could be
a good mechanism. In-house stuff is often in an internal repo that does
whatever Maven needs.

The reason that plugins are in the Drill project now is that Drill's "API"
is all of Drill. Plugins can (and some do) access all of Drill though the
fragment context. The API to Calcite and other parts of Drill are wide, and
tend to be tightly coupled with Drill internals. By contrast, other tools,
such as Presto/Trino, have defined very clean APIs that extensions use. In
Druid, everything is integrated via Google Guice and an extension can
replace any part of Druid (though, I'm not convinced that's actually a good
idea.) I'm sure there are others we can learn from.

So, we need to define a plugin API for Drill. I started down that route a
while back: the first step was to refactor the plugin registry so it is
ready for extensions. The idea was to use the same mechanism for all kinds
of extensions (security, UDFs, metastore, etc.) The next step was to build
something that roughly followed Presto, but that kind of stalled out.

In terms of ordering, we'd first need to define the plugin API. Then, we
can shift plugins to use that. Once that is done, we can move plugins to
separate projects. (The metastore implementation can also move, if we
want.) Finally, figure out a solution for Ted's suggestion to make it easy
to grab new extensions. Drill is distributed, so adding a new plugin has to
happen on all nodes, which is a bit more complex than the typical
Julia/Python/R kind of extension.

The reason we're where we're at is that it is the path of least resistance.
Creating a good extension mechanism is hard, but valuable, as Ted noted.

Thanks,

- Paul

On Thu, Jan 13, 2022 at 10:18 PM Ted Dunning  wrote:

> The bigger reason for a separate plug-in world is the enhancement of
> community.
>
> I would recommend looking at the Julia community for examples of
> effective ways to drive plug in structure.
>
> At the core, for any pure julia package, you can simply add a package by
> referring to the github repository where the package is stored. For
> packages that are "registered" (i.e. a path and a checksum is recorded in a
> well known data store), you can add a package by simply naming it without
> knowing the path.  All such plugins are tested by the authors and the
> project records all dependencies with version constraints so that cascading
> additions are easy. The community leaders have made tooling available so
> that you can test your package against a range of versions of Julia by
> pretty simple (to use) Github actions.
>
> The result has been an absolute explosion in the number of pure Julia
> packages.
>
> For packages that include C or Fortran (or whatever) code, there is some
> amazing tooling available that lets you record a build process on any of
> the supported platforms (Linux, LinuxArm, 32 or 64 bit, windows, BSD, OSX
> and so on). WHen you register such a package, it is automagically built on
> all the platforms you indicate and the binary results are checked into a
> central repository known as Yggdrasil.
>
> All of these registration events for different packages are recorded in a
> central registry as I mentioned. That registry is recorded in Github as
> well which makes it easy to propagate changes.
>
>
>
> On Thu, Jan 13, 2022 at 8:45 PM James Turton  wrote:
>
> > Hello dev community
> >
> > Discussions about reorganising the Drill source code to better position
> > the project to support plug-ins for the "long tail" of weird and
> > wonderful systems and data formats have been coming up here and there
> > for a few months, e.g. in https://github.com/apache/drill/pull/2359.
> >
> > A view which I personally share is that adding too large a number and
> > variety of plug-ins to the main tree would create a lethal maintenance
> > burden for developers working there and lead down a road of accumulating
> > technical debt.  The Maven tricks we must employ to harmonise the
> > growing set of dependencies of the main tree to keep it buildable are
> > already enough, as is the size of our distributable and the count of
> > open bug reports.
> >
> >
> > Thus, the idea of splitting out "/contrib" into a new
> > apache/drill-contrib repo after selecting a subset of plugins to remain
> > in apache/drill.  I'll now volunteer a set of criteria to decide whether
> > a plug-in should live in this notional apache/drill-contrib.
> >
> >  1. The plug-in queries an 

Re: [DISCUSS] Per User Access Controls

2022-01-13 Thread Paul Rogers
Hey All,

Other members of the Hadoop Ecosystem rely on external systems to handle
permissions: Ranger or Sentry. There is probably something different in the
AWS world.

As you look into security, you'll see that you need to maintain permissions
on many entities: files, connections, etc. You need different permissions:
read, write, create, etc. In larger groups of people, you need roles: admin
role, sales analyst role, production engineer role. Users map to roles, and
roles take permissions.

Creating this just for Drill is not effective: no one wants to learn a
Drill "Security Store" any more than folks want to learn the "Drill
metastore". Drill is seldom the only tool in a shop: people want to set
permissions in one place, not in each tool. So, we should integrate with
existing tools.

Drill should provide an API, and be prepared to enforce rules. Drill
defines the entities that can be secured, and the available permissions.
Then, it is up to an external system to provide user identity, take tuples
of (user, resource, permission) and return a boolean of whether that user
is authorized or not. MapR, PAM, Hadoop and other systems would be
implemented on top of the Drill permissions API, as would whatever needs you
happen to have.
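A sketch of what that API boundary could look like; the names are invented for illustration only:

{noformat}
// Hypothetical SPI: Drill supplies the (user, resource, permission) tuple,
// and an external system (Ranger, Sentry, PAM, ...) answers yes or no.
public interface AccessController {
  boolean isAuthorized(String user, String resource, Permission permission);

  enum Permission { READ, WRITE, CREATE, ADMIN }
}
{noformat}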

Thanks,

- Paul

On Thu, Jan 13, 2022 at 12:32 PM Curtis Lambert 
wrote:

> This is what we are handling with Vault outside of Drill, combined with
> aliasing. James is tracking some of what you've been finding with the
> credential store but even then we want the single source of auth. We can
> chat with James on the next Drill stand up (and anyone else who wants to
> feel the pain).
>
>
>
> Curtis Lambert
> CTO
> Email: cur...@datadistillr.com
> Phone: + 706-402-0249
>
>
> On Thu, Jan 13, 2022 at 3:29 PM Charles Givre  wrote:
>
> > Hello all,
> > One of the issues we've been dancing around is having per-user access
> > controls in Drill.  As Drill was originally built around the Hadoop
> > ecosystem, the Hadoop based connections make use of user-impersonation
> for
> > per user access controls.  However, a rather glaring deficiency is the
> lack
> > of per-user access controls for connections like JDBC, Mongo, Splunk etc.
> >
> > Recently when I was working on OAuth pull request, it occurred to me that
> > we might be able to slightly extend the credential provider interface to
> > allow for per-user credentials.  Here's what I was thinking...
> >
> > A bit of background:  The credential provider interface is really an
> > abstraction for a HashMap.  Here's my proposal The cred provider
> > interface would store two hashmaps, one for per-user creds and one for
> > global creds.   When a user is authenticated to Drill, when they create a
> > storage plugin connection, the credential provider would associate the
> > creds with their Drill username.  The storage plugins that use credential
> > provider would thus get per-user credentials.
> >
> > If users did not want per-user credentials, they could simply use direct
> > credentials OR use specify that in the credential provider classes.  What
> > do you think?
> >
> > Best,
> > -- C
> >
> >
>


Re: [DISCUSS] Restarting the Arrow Conversation

2022-01-04 Thread Paul Rogers
Hi Ted,

I like where you're going with how to manage the discussion.

Here's a trick that I saw someone do recently: put the design/discussion up as a
PR. Comments are just code review comments, tagged to a specific line. The "er,
never mind" aspect that Ted talks about is handled by pushing a new version of
the doc (if the doc contains the error) or editing a comment (if the comment had
the error). The history of all changes is in the commit history.

As we go off on tangents (Arrow-based API? Modern way to do code gen?), these
can be handled as new documents.

All we need is a place to put this stuff. A "docs" or "design" directory within
the source tree?

Thanks,

- Paul

On Tue, Jan 4, 2022 at 11:15 AM Ted Dunning  wrote:

> Exactly. I very much had in mind an "On the other hand" kind of document.
>
> The super benefit of a non-threaded presentation is that if I advocate
> something stupid due to an oversight on my part, I can go back and edit
> away the stupid statement (since it shouldn't be part of the consensus) and
> tag anybody who might have responded. I might even leave a note saying "You
> might think X, but that isn't so because of Y" to help later readers.
>
> That is all very, very hard to do in threaded discussions.
>
>
>
> On Tue, Jan 4, 2022 at 9:37 AM James Turton  wrote:
>
> > Ah, and I see now that you said as much already.  So a collaboratively
> > edited document?  Wiki pages containing a variety of independent views
> > might turn out something like this collection I suppose
> >
> > https://wiki.c2.com/?GarbageCollection
> >
> > which isn't bad IMHO.
> >
> > On 2022/01/04 16:42, Ted Dunning wrote:
> > > Threading is exactly what I would want to avoid.
> > >
> > >
> > >
> > > On Tue, Jan 4, 2022, 3:58 AM James Turton  > > > wrote:
> > >
> > > Hi all
> > >
> > > GitHub Issues allow a conversation thread with rich formatting so I
> > > propose that we use them for meaty topics like this.  Please use
> the
> > > "Feature Request" issue template for this purpose, and set the
> > issue's
> > > Project field to "Drill 2.0"[1], said project having recently been
> > > created by Charles.  I am busy transcribing the current discussion
> > from
> > > the mailing list and a GitHub PR to just such a new feature request
> > at
> > >
> > > https://github.com/apache/drill/issues/2421
> > > 
> > >
> > > James
> > >
> > > [1] https://github.com/apache/drill/projects/1
> > > 
> > >
> > > On 2022/01/04 09:49, Ted Dunning wrote:
> > >  > I wonder if there isn't a better place for this discussion?
> > >  >
> > >  > As you point out, there are many threads and many of the points
> > > are rather
> > >  > contentious technically. That will make them even harder to
> > > follow in an
> > >  > email thread.
> > >  >
> > >  > We could just use the wiki and format the text in the form of
> > > questions
> > >  > with alternative positions.
> > >  >
> > >  > Or we could use an open google document with similar form.
> > >  >
> > >  > What's the preference here?
> > >  >
> > >
> >
>


Re: [DISCUSS] Restarting the Arrow Conversation

2022-01-03 Thread Paul Rogers
Hi Charles,

The material is rather dense and benefits from the Github formatting. To
preserve it, perhaps we can copy it to a subpage of the Drill 2.0 wiki page.

For now, the link to the discussion is [1]. Since the Wiki is not good for
discussions, let's have that discussion here (if anyone is up to tackling
such a weighty subject.)

Thanks,

- Paul

[1] https://github.com/apache/drill/pull/2412

On Mon, Jan 3, 2022 at 5:15 PM Charles Givre  wrote:

> @Paul,
> Do you mind if I copy the contents of your response to DRILL-8088 to this
> thread?   There's a lot of good info there, and I'd hate to see it get lost.
> -- C
>
> > On Jan 3, 2022, at 7:41 PM, Paul Rogers  wrote:
> >
> > Hi All,
> >
> > Thanks Charles for dredging up that old discussion, your memory is better
> > than mine! And, thanks Ted for that summary of MapR history. As one of
> the
> > "replacement crew" brought in after the original folks left, your
> > description is consistent with my memory of events. Moreover, as we
> looked
> > at what was needed to run Drill in production, an Arrow port was far down
> > on the list: it would not have solved actual customer problems.
> >
> > Before we get too excited about Arrow, I think we should have a
> discussion
> > about what we want in an internal storage format. I added a long (sorry)
> > set of comments in that PR that Charles mentioned that tries to debunk
> the
> > myths that have grown up around using a columnar format as the internal
> > representation for a query engine. (Columnar is great for storage.) The
> > note presents the many issues we've encountered over the years that have
> > caused us to layer ever more code on top of vectors to solve various
> > problems. It also highlights a distributed-systems problem which vectors
> > make far worse.
> >
> > Arrow is meant to be portable, as Ted discussed, but it is still
> columnar,
> > and this is the source of endless problems in an execution engine. So, we
> > want to ask, what is the optimal format for what Drill actually does? I'm
> > now of the opinion that Drill might actually better benefit  from a
> > row-based format, similar to what Impala uses. The notes even paint a
> path
> > forward.
> >
> > Ted's description of the goal for Demio suggests that Arrow might be the
> > right answer for that market. Drill, however, tends to be used to query
> > myriad data sources at scale and as a "query integrator" across systems.
> > This use case has different needs, which may be better served with a
> > row-based format.
> >
> > The upshot is that "value vectors vs. Arrow" is the wrong place to start
> > the discussion. The right place is "what does our many years of
> experience
> > with Drill suggest is the most efficient format for how Drill is actually
> > used?"
> >
> > Note that Drill could have an Arrow-based API independent of the internal
> > format. The quote from Charles explains how we could do that.
> >
> > Thanks,
> >
> > - Paul
> >
> > On Mon, Jan 3, 2022 at 12:54 PM Ted Dunning 
> wrote:
> >
> >> Christian,
> >>
> >> Your thoughts are very helpful. I find Arrow very nice (I use it in
> Agstack
> >> with Julia and Python).
> >>
> >> I don't think anybody is saying that Drill wouldn't be well set with a
> >> switch to Arrow or even just interfaces to Arrow. But it is a lot of
> work
> >> to make it all happen.
> >>
> >>
> >>
> >> On Mon, Jan 3, 2022 at 11:37 AM Z0ltrix  wrote:
> >>
> >>> Hi Charles, Ted, and the others here,
> >>>
> >>> it is very interesting to hear the evolution of Drill, Dremio and Arrow
> >> in
> >>> that context and thank you Charles for restarting that discussion.
> >>>
> >>> I think, and James mentioned this in the PR as well, that Drill could
> >>> benefit from the continues progress, the Arrow project has made since
> its
> >>> separation from Drill. And the arrow Community seems to be large, so i
> >>> assume this goes on and on with improvements, new features, etc. but i
> >> have
> >>> not enough experience in Drill internals to have an Idea in which mass
> of
> >>> refactoring this would lead.
> >>>
> >>> In addition to that, im not aware of the current roadmap of Arrow and
> if
> >>> these would fit into Drills roadmap. Maybe Arrow would go into a
> >> different
> >>> direction tha

Re: [DISCUSS] Restarting the Arrow Conversation

2022-01-03 Thread Paul Rogers
omeone who never worked for MapR or Dremio. This just represents my
> > understanding of events as an outsider, and I could be wrong about some
> or
> > all of this. Please forgive (or correct) any inaccuracies.
> > >
> > > When I first learned of Arrow and the idea of integrating Arrow with
> > Drill, the thing that interested me the most was the ability to move data
> > between platforms without having to serialize/deserialize the data. From
> my
> > understanding, MapR did some research and didn't find a significant
> > performance advantage and hence didn't really pursue the integration. The
> > other side of it was that it would require a significant amount of work
> to
> > refactor major parts of Drill.
> > >
> > > I don't know the internal politics, but this was one of the major
> points
> > of diversion between Dremio and Drill.
> > >
> > > With that said, there was a renewed discussion on the list [2] where
> > Paul Rogers proposed what he described as a "Crude but Effective"
> approach
> > to an Arrow integration.
> > >
> > > This is in the email link but here was a part of Paul's email:
> > >
> > >> Charles, just brainstorming a bit, I think the easiest way to start is
> > to create a simple, stand-alone server that speaks Arrow to the client,
> and
> > uses the native Drill client to speak to Drill. The native Drill client
> > exposes Drill value vectors. One trick would be to convert Drill vectors
> to
> > the Arrow format. I think that data vectors are the same format. Possibly
> > offset vectors. I think Arrow went its own way with null-value (Drill's
> > is-set) vectors. So, some conversion might be a no-op, others might need
> to
> > rewrite a vector. Good thing, this is purely at the vector level, so
> would
> > be easy to write. The next issue is the one that Parth has long pointed
> > out: Drill and Arrow each have their own memory allocators. How could we
> > share a data vector between the two? The simplest initial solution is
> just
> > to copy the data from Drill to Arrow. Slow, but transparent to the
> client.
> > A crude first-approximation of the development steps:
> > >>
> > >> A crude first-approximation of the development steps:
> > >> 1. Create the client shell server.
> > >> 2. Implement the Arrow client protocol. Need some way to accept a
> query
> > and return batches of results.
> > >> 3. Forward the query to Drill using the native Drill client.
> > >> 4. As a first pass, copy vectors from Drill to Arrow and return them
> to
> > the client.
> > >> 5. Then, solve that memory allocator problem to pass data without
> > copying.
> > >
> > > One point that Paul made was that these pieces are fairly discrete and
> > could be implemented without refactoring major components of Drill. Of
> > course, this could be something for Drill 2.0. At a minimum, could we
> take
> > the conversation off of the PR and put it in the email list? ;-)
> > >
> > > Let's discuss... All ideas are welcome!
> > >
> > > Best,
> > > -- C
> > >
> > >
> > > [1]: https://github.com/apache/drill/pull/2412 <
> > https://github.com/apache/drill/pull/2412>
> > > [2]: https://lists.apache.org/thread/hcmygrv8q8jyw8p57fm9qy3vw2kqfr5l
> <
> > https://lists.apache.org/thread/hcmygrv8q8jyw8p57fm9qy3vw2kqfr5l>
> > >
> > >
> > >
> >
> >
>


[jira] [Created] (DRILL-8102) Tests use significant space outside the drill directory

2022-01-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8102:
--

 Summary: Tests use significant space outside the drill directory
 Key: DRILL-8102
 URL: https://issues.apache.org/jira/browse/DRILL-8102
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.19.0
Reporter: Paul Rogers


I use a Linux Mint (Ubuntu) machine in which the root file system has limited 
space, but /user has a large amount of space. My Drill build directory is 
within my home directory in /user. Most tests write to the various target 
folders within the Drill directory, which ensures that each test is isolated, 
and that test files are removed in a {{mvn clean}}.

However, it appears that some tests, perhaps Cassandra, ElasticSearch or Splunk, 
write to directories outside of Drill, perhaps to /tmp, /var, etc. The result 
is that, each time I run the tests, I get low disk-space warnings on my root 
file system. In the worst case, the tests fail due to lack of disk space.

Since it is not clear where the files are written, it is not clear what I 
should clean up, or how I might add a sym link to a location with more space. 
(Yes, I could get a bigger SSD, and rebuild my root file system, but I'm 
lazy...)

As a general rule, all Drill tests should write to a target directory. If that 
is not possible, then clearly state somewhere what directories are used so that 
sufficient space can be provided, and we know where to go clean up files once 
the build runs.
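A minimal sketch of that rule, independent of any test framework; the path names are illustrative:

{noformat}
import java.nio.file.*;

public class TestScratchDirs {
  // Resolve scratch space under the module's target/ directory so that
  // "mvn clean" removes it and nothing lands in /tmp or /var.
  public static Path scratchDir(String testName) throws Exception {
    Path base = Paths.get("target", "test-scratch");
    Files.createDirectories(base);
    return Files.createTempDirectory(base, testName + "-");
  }
}
{noformat}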

Perhaps some of the tests start Docker containers? If so, then, again, it 
should be made clear how much cache space Docker will require.

Another suggestion is to change the build order. Those tests which require 
external resources should occur last, after all the others (UDFs, Syslog, etc.) 
which require only Drill. That way, if failures occur in the external systems, 
we at least know the core Drill modules work.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (DRILL-8101) Resolve the TIMESTAMP madness

2022-01-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8101:
--

 Summary: Resolve the TIMESTAMP madness
 Key: DRILL-8101
 URL: https://issues.apache.org/jira/browse/DRILL-8101
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.19.0
Reporter: Paul Rogers


Drill's TIMESTAMP type tries to be two different things at the same time, 
causing incorrect results when the two interpretations collide.

Drill has the classic DATE and TIME data types. A DATE is just that: a day 
wherever you happen to be. Your birthday goes from midnight to midnight in the 
time zone where you find yourself. If you happen to travel around the world, 
you can make your birthday last almost 48 hours as midnight of your birthday 
starts at the international date line, circles the globe, followed by the 
midnight of the next day.

Similarly, a time is a time where you are. 12:00PM is noon (more-or-less) as 
determined by the sun. 12:00PM occurs once in every time zone every day. Since 
there are many time zones, there are many noons each day.

These are both examples of local time. Most databases combine these two ideas 
to get a DATETIME: a date and time wherever you are.

In our modern world, knowing something occurred on 2022-01-02 12:00:00 is not 
good enough. Did it occur at that time in my time zone or yours? If the event 
is a user login, or a network breach, then it occurred once, at a specific 
time, it did not occur many times: once in each time zone. Hence, machines 
often use UTC time to coordinate.

Unix-like systems also define the idea of a "timestamp", the number of seconds 
(or milliseconds or nanoseconds) since 1970-01-01 00:00:00. This is the time 
reported by Java in the {{System.currentTime()}} function. It is the time most 
often found in machine-generated logs. It may be as a number (ms since the 
epoch) or as an ISO-formatted string.

Thus, users of Drill would expect to find a "timestamp" type that represents a 
UTC timestamp in Unix format. They will be disappointed, however.

Drill's TIMESTAMP type is essentially a DATETIME type: it is a date/time in an 
unspecified timezone and that zone can be whatever you want it to be. UTC? 
Fine. Local? OK. Nairobi? Sure, why not.

This works fine as long as _all_ your data is in the same time zone, and you 
don't need a concept of "now". As described in DRILL-8099 and DRILL-8100, this 
is how the authors of CTAS thought of it: read Parquet data straight into Drill 
with no conversion, then write it back out to JSON with no conversion. Both 
work with UTC, so the result is fine: who cares that the 32-bit number, when in 
Drill, had no implied time zone? It is just a number we read then write. All 
good.

It is even possible to compute the difference of two DATETIMEs with unspecified 
time zone: that's what an INTERVAL does. As long as the times are actually in 
the same zone (UTC, say, or local, or Nairobi), then all is fine.

Everything collapses, however, when someone wants to know, "but how long ago 
was that event"? "Long enough ago that I need to raise the escalation level?" 
Drill has the INTERVAL type to give us the difference, but how do I get "now"? 
Drill has {{CURRENT_TIMESTAMP}}. But, how we have a problem, what timezone is 
that time in? UTC? My local timezone? Nairobi? And, what if my data is UTC but 
{{CURRENT_TIMESTAMP}} is local? Or visa-versa? The whole house of cards comes 
crashing down.

Over the years, this bug has appeared again and again. Sometimes people change 
the logic to assume TIMESTAMP is UTC. Sometimes things are changed to assume 
TIMESTAMP is local time (I've been guilty of this). Sometimes we just punt, and 
require that the machine (or test) run only in UTC, since that's the only place 
the two systems coincide.

But, in fact, I believe that the original designers of Drill meant TIMESTAMP to 
have _no_ timezone: two TIMESTAMP values could be in entirely different 
(unknown) timezones! One can see vestiges of this in the value vector code. It 
seems the original engineers imagined a "TIMESTAMP_WITH_ZONE" type, similar to 
Java's (or Joda's) {{ZonedDateTime}} type. Other bits of code (Parquet) refer 
to a never-built "TIMESTAMPZ" type for a UTC timestamp. When faced with the 
{{CURRENT_TIMESTAMP}} issue, fixes started down the path of saying that 
TIMESTAMP is local time, but this is probably a misunderstanding of the 
original design, forced upon us by the gaps in that original design.

Further, each time we make a change (such as DRILL-8099 and DRILL-8100), we 
change behavior, potentially breaking a kludge that someone found to 
kinda-sorta make things work.
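The ambiguity is easy to demonstrate; a small sketch, with the zones chosen arbitrarily:

{noformat}
import java.time.*;

public class TimestampAmbiguity {
  public static void main(String[] args) {
    // The same wall-clock value, read back as an instant under two different
    // zone assumptions, differs by the zone offset.
    LocalDateTime wallClock = LocalDateTime.of(2022, 1, 2, 12, 0, 0);
    Instant asUtc   = wallClock.toInstant(ZoneOffset.UTC);
    Instant asLocal = wallClock.atZone(ZoneId.of("America/Los_Angeles")).toInstant();
    System.out.println(asUtc);    // 2022-01-02T12:00:00Z
    System.out.println(asLocal);  // 2022-01-02T20:00:00Z
  }
}
{noformat}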

Since computers can't deal with ambiguity the way humans can, we need a 
solution. It is not good enough for you to think "TIMESTAMP is UTC" and me to 
think "TIMESTAMP is local" and for B

[jira] [Created] (DRILL-8100) JSON record writer does not convert Drill local timestamp to UTC

2022-01-01 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8100:
--

 Summary: JSON record writer does not convert Drill local timestamp 
to UTC
 Key: DRILL-8100
 URL: https://issues.apache.org/jira/browse/DRILL-8100
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.19.0
Reporter: Paul Rogers
Assignee: Paul Rogers


Drill follows the old SQL engine convention to store the `TIMESTAMP` type in 
the local time zone. This is, of course, highly awkward in today's age when UTC 
is used as the standard timestamp in most products. However, it is how Drill 
works. (It would be great to add a `UTC_TIMESTAMP` type, but that is another 
topic.)

Each reader or writer that works with files that hold UTC timestamps must 
convert to (reader) or from (writer) Drill's local-time timestamp. Otherwise, 
Drill works correctly only when the server time zone is set to UTC.

The JSON writer does not do the proper conversion, causing tests to fail when 
run in a time zone other than UTC.

{noformat}
  @Override
  public void writeTimestamp(FieldReader reader) throws IOException {
if (reader.isSet()) {
  writeTimestamp(reader.readLocalDateTime());
} else {
  writeTimeNull();
}
  }
{noformat}

Basically, it takes a {{LocalDateTime}} and formats it as if it were in the UTC 
time zone (using the "Z" suffix). This is only valid if the machine is in the UTC 
time zone, which is why the test for this class attempts to force the local time 
zone to UTC, something that most users will not do.
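A sketch of the missing conversion, assuming the java.time imports; {{writeUtcString}} stands in for whatever call the generator actually makes:

{noformat}
// Sketch of the missing step: interpret the LocalDateTime in the server's zone,
// shift to UTC, and only then format with the "Z" suffix.
private void writeTimestamp(LocalDateTime localValue) throws IOException {
  Instant utc = localValue.atZone(ZoneId.systemDefault()).toInstant();
  writeUtcString(DateTimeFormatter.ISO_INSTANT.format(utc));  // hypothetical output call
}
{noformat}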

A consequence of this bug is that "round trip" CTAS will change dates by the 
UTC offset of the machine running the CTAS. In the Pacific time zone, each 
"round trip" subtracts 8 hours from the time. After three round trips, the 
"UTC" date in the Parquet file or JSON will be a day earlier than the original 
data. One might argue that this "feature" is not always helpful.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (DRILL-8099) Parquet record writer does not convert Drill local timestamp to UTC

2021-12-31 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8099:
--

 Summary: Parquet record writer does not convert Drill local 
timestamp to UTC
 Key: DRILL-8099
 URL: https://issues.apache.org/jira/browse/DRILL-8099
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.19.0
Reporter: Paul Rogers
Assignee: Paul Rogers


Drill follows the old SQL engine convention to store the `TIMESTAMP` type in 
the local time zone. This is, of course, highly awkward in today's age when UTC 
is used as the standard timestamp in most products. However, it is how Drill 
works. (It would be great to add a `UTC_TIMESTAMP` type, but that is another 
topic.)

Each reader or writer that works with files that hold UTC timestamps must 
convert to (reader) or from (writer) Drill's local-time timestamp. Otherwise, 
Drill works correctly only when the server time zone is set to UTC.

Now, perhaps we can convince most shops to run their Drill server in UTC, or at 
least set the JVM timezone to UTC. However, this still leaves developers in the 
lurch: if the development machine timezone is not UTC, then some tests fail. In 
particular:

{{TestNestedDateTimeTimestamp.testNestedDateTimeCTASParquet}}

The reason that the above test fails is that the generated Parquet writer code 
assumes (incorrectly) that the Drill timestamp is in UTC and so no conversion 
is needed to write that data into Parquet. In particular, in 
{{ParquetOutputRecordWriter.getNewTimeStampConverter()}}:

{noformat}
reader.read(holder);
consumer.addLong(holder.value);
{noformat}
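A sketch of the kind of shift that is missing; the offset arithmetic is illustrative and the real fix belongs in the generated converter:

{noformat}
reader.read(holder);
// holder.value carries local wall-clock millis; shift by the zone offset so the
// value written to Parquet is true UTC millis. Sketch only; DST edge cases ignored.
long localMillis = holder.value;
long utcMillis = localMillis - java.util.TimeZone.getDefault().getOffset(localMillis);
consumer.addLong(utcMillis);
{noformat}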



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (DRILL-8087) {{TestNestedDateTimeTimestamp.testNestedDateTimeCTASExtendedJson}} assumes time zone

2021-12-19 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8087:
--

 Summary: 
{{TestNestedDateTimeTimestamp.testNestedDateTimeCTASExtendedJson}} assumes time 
zone
 Key: DRILL-8087
 URL: https://issues.apache.org/jira/browse/DRILL-8087
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.19.0
 Environment: 

Reporter: Paul Rogers


Drill's date types follow older SQL engines: dates and times are assumed to be 
in the local time zone. However, most modern applications use UTC timestamps 
to avoid the issues that crop up when using local times in systems that span 
time zones.

The {{TestNestedDateTimeTimestamp.testNestedDateTimeCTASExtendedJson}} unit 
test seems to assume that the test runs in a particular time zone. When run on 
a machine in the Pacific time zone, the test fails:

{noformat}
java.lang.Exception: at position 0 column '`time_map`' mismatched values, 
expected: {"date":"1970-01-11","time":"00:00:03.600","timestamp":"2018-03-23 
17:40:52.123"}(JsonStringHashMap) but received 
{"date":"1970-01-11","time":"00:00:03.600","timestamp":"2018-03-23 
10:40:52.123"}(JsonStringHashMap)

Expected Records near verification failure:
Record Number: 0 { `date_list` : ["1970-01-11"],`date` : 1970-01-11,`time_list` 
: ["00:00:03.600"],`time_map` : 
{"date":"1970-01-11","time":"00:00:03.600","timestamp":"2018-03-23 
17:40:52.123"},`time` : 00:00:03.600,`timestamp_list` : ["2018-03-23 
17:40:52.123"],`timestamp` : 2018-03-23T17:40:52.123, }

Actual Records near verification failure:
Record Number: 0 { `date_list` : ["1970-01-11"],`date` : 1970-01-11,`time_list` 
: ["00:00:03.600"],`time_map` : 
{"date":"1970-01-11","time":"00:00:03.600","timestamp":"2018-03-23 
10:40:52.123"},`time` : 00:00:03.600,`timestamp_list` : ["2018-03-23 
10:40:52.123"],`timestamp` : 2018-03-23T10:40:52.123, }

For query: select * from `ctas_nested_datetime_extended_json` t1 
{noformat}

Notice the time differences: {*}17{*}:40:52.123 (expected), {*}10{*}:40:52.123 
(actual).

Since this test causes the build to fail in my time zone, the test will be 
disabled in my PR. Enable it again when the timezone issue is fixed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (DRILL-8086) Convert the CSV (AKA "compliant text") reader to EVF V2

2021-12-19 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8086:
--

 Summary: Convert the CSV (AKA "compliant text") reader to EVF V2
 Key: DRILL-8086
 URL: https://issues.apache.org/jira/browse/DRILL-8086
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.19.0
Reporter: Paul Rogers
Assignee: Paul Rogers


Work was done some time ago to convert the CSV reader to use EVF V2. Merge that 
work into the master branch.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (DRILL-8085) EVF V2 support in the "Easy" format plugin

2021-12-19 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8085:
--

 Summary: EVF V2 support in the "Easy" format plugin
 Key: DRILL-8085
 URL: https://issues.apache.org/jira/browse/DRILL-8085
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.19.0
Reporter: Paul Rogers
Assignee: Paul Rogers


Add support for EVF V2 to the {{EasyFormatPlugin}} similar to how EVF V1 
support already exists. Provide examples for others to follow.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (DRILL-8084) Scan LIMIT pushdown fails across files

2021-12-19 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8084:
--

 Summary: Scan LIMIT pushdown fails across files
 Key: DRILL-8084
 URL: https://issues.apache.org/jira/browse/DRILL-8084
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.19.0
Reporter: Paul Rogers
Assignee: Paul Rogers


DRILL-7763 apparently added limit pushdowns to the file format plugins, which 
is a nice improvement. Unfortunately, the implementation only works for a scan 
with a single file: the limit is applied to each file independently. The 
correct implementation is to apply the limit to the _scan_, not the 
_file_.

Further, `LIMIT 0` has meaning: it asks to return a schema with no data. 
However, the implementation uses {{maxRecords == 0}} to mean no limit, and a 
bit of code explicitly changes `LIMIT 0` to `LIMIT 1` so that "we read at least 
one file".

Consider an example. Two files, A and B, each of which has 10 records:
 * {{LIMIT 0}}: Obtain the schema from A, read no data from A. Do not open 
B. The current code changes {{LIMIT 0}} to {{LIMIT 1}}, thus returning data.
 * {{LIMIT 1}}: Read one record from A, none from B. (Don't even open B.) 
The current code will read 1 record from A and another from B.
 * {{LIMIT 15}}: Read all 10 records from A, and only 5 from B. The current 
code applies the limit of 15 to both files, thus reading 20 records.

The correct solution is to manage the {{LIMIT}} at the scan level. As each file 
completes, subtract the returned row count from the limit applied to the next 
file.
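In rough Java, the scan-level accounting amounts to the following; the types and the {{readFile()}} helper are illustrative, not the EVF classes:

{noformat}
// Sketch of scan-level LIMIT accounting across files. Names are illustrative.
long applyScanLimit(java.util.List<java.nio.file.Path> files, long limit) {
  long remaining = limit;                         // e.g. 15
  long total = 0;
  for (java.nio.file.Path file : files) {
    if (remaining == 0) {
      break;                                      // don't even open the next file
    }
    long returned = readFile(file, remaining);    // per-file limit = whatever is left
    remaining -= returned;
    total += returned;
  }
  return total;
}
{noformat}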

And, at the file level, there is no need to have each file count its records 
and check the limit on each row read. The "result set loader" already checks 
batch limits: it is the place to check the overall limit.

For this reason, the V2 EVF scan framework has been extended to manage the 
scan-level part, and the "result set loader" has been extended to enforce the 
per-file limit. The result is that readers need do...absolutely nothing; 
{{LIMIT}} pushdown is automatic.

EVF V1 has also been extended, but is less thoroughly tested since the desired 
path is to upgrade all readers to use EVF V2.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (DRILL-8083) HttpdLogBatchReader creates unnecessary empty maps

2021-12-19 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8083:
--

 Summary: HttpdLogBatchReader creates unnecessary empty maps
 Key: DRILL-8083
 URL: https://issues.apache.org/jira/browse/DRILL-8083
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.19.0
Reporter: Paul Rogers


Run the {{TestHTTPDLogReader.testStarRowSet}} test. Set a breakpoint in
{{MapWriter.SingleMapWriter.endWrite}}. Step into the {{super.endWrite()}}
method which will walk the set of child fields. Notice that there are none
for any of the several map fields.

One can see that empty maps are expected in the
{{TestHTTPDLogReader.expectedAllFieldsSchema()}} method.

Maps (i.e. tuples) are not well defined in SQL. Although Drill makes great
efforts to support them, an empty tuple is not well defined even in Drill:
there is nothing one can do with such fields.

Suggestion: don't create a map field if there are to be no members of the
map.

Affected maps:

* {{request_firstline_original_uri_query_$}}
* {{request_firstline_uri_query_$}}
* {{request_referer_last_query_$}}
* {{request_referer_query_$}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: A new developer wiki begins!

2021-11-02 Thread Paul Rogers
Hi All,

Apache projects welcome many contributors. It is very unusual for any
contributor to be named separately from others. For example, we don't name
the original Drill developers, nor do we name the person who tirelessly
worked to write the Drill docs and website. If the Wiki is healthy, many
will contribute over time; it will be awkward to figure out who did more:
more text? More commits? More answering the questions recorded in an edit?

We're on the committer's page: that's consistent with how Apache projects
work.

Thanks!

- Paul

On Tue, Nov 2, 2021 at 6:33 AM James Turton  wrote:

> Hi Charles
>
> When I first took this idea to Paul I proposed that we attribute
> authorship but he declined that bit.  We do have the Git history for the
> wiki, and the lines shown for the last Git commit to affect a page are
> quite visible in the wiki, e.g.
>
> > Paul Rogers edited this page on 27 Apr 2020.
>
> But those will of course fade over time as others add commits.  I did not
> argue the matter, just concluded with "if you ever change your mind, tell
> us and we will add an attribution".  To give you an idea of what Cong's
> table of authors might look like if it was ranked by number of commits,
> here's the output of git shortlog -sn.
>
>752  Paul Rogers
>  8  Mohamed Gelbana
>  1  Boaz Ben-Zvi
>  1  Dobes Vandermeer
>  1  Muhammad Gelbana
>
> The size of Paul's contribution is humbling.  We can still add a page
> with author names (with or without any edit stats) on it, I wouldn't expect
> Paul to object.  He seemed mostly to be saying "it's not necessary for me".
>
>
> On 2021/11/01 02:03, luoc wrote:
>
> That is good advice. I recommend adding a page (or a table list) listing all 
> the wiki contributors. Paul is the founding member.
>
>
> On 1 Nov 2021, at 01:45, Charles Givre  wrote:
>
> This is great!  Can we give @paul-rogers some credit on these pages?  Also 
> I'd really love to merge the existing dev docs in the github repo with the 
> wiki docs.  I'm willing to help with that, time permitting.
> -- C
>
>
> On Oct 30, 2021, at 5:59 AM, luoc   wrote:
>
>
> It cannot get any better than this!
>
>
> On 30 Oct 2021, at 5:39 PM, James Turton  wrote:
>
> I'm delighted to report that the gold mine of developer information that is 
> Paul Rogers' Drill wiki has just formed the basis of a new Drill developer 
> wiki.
> https://github.com/apache/drill/wiki
>
> The community would like to thank Paul for this sizeable and valuable 
> contribution, and for his blessing that we proceed to merge the work under 
> the normal Apache contributor terms.
>
> Our work here is just beginning.  A wiki is never a completed work, but 
> requires ongoing editing from all of us to remain complete and accurate.  
> Let's go on to make it the powerful asset for future Drill developers that it 
> certainly can be.
>
> James
>
>
>
>


[jira] [Resolved] (DRILL-7325) Many operators do not set container record count

2021-04-25 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-7325.

Resolution: Fixed

A number of individual commits fixed problems found in each operator. This 
overall task is now complete.

> Many operators do not set container record count
> 
>
> Key: DRILL-7325
> URL: https://issues.apache.org/jira/browse/DRILL-7325
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>    Reporter: Paul Rogers
>    Assignee: Paul Rogers
>Priority: Major
> Fix For: 1.19.0
>
>
> See DRILL-7324. The following are problems found because some operators fail 
> to set the record count for their containers.
> h4. Scan
> TestComplexTypeReader, on cluster setup, using the PojoRecordReader:
> ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors 
> from ScanBatch
> ScanBatch: Container record count not set
> Reason: ScanBatch never sets the record count of its container (this is a 
> generic issue, not specific to the PojoRecordReader).
> h4. Filter
> {{TestComplexTypeReader.testNonExistentFieldConverting()}}:
> {noformat}
> ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors 
> from FilterRecordBatch
> FilterRecordBatch: Container record count not set
> {noformat}
> h4. Hash Join
> {{TestComplexTypeReader.test_array()}}:
> {noformat}
> ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors 
> from HashJoinBatch
> HashJoinBatch: Container record count not set
> {noformat}
> Occurs on the first batch in which the hash join returns {{OK_NEW_SCHEMA}} 
> with no records.
> h4. Project
> {{TestCsvWithHeaders.testEmptyFile()}} (when the text reader returned empty, 
> schema-only batches):
> {noformat}
> ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors 
> from ProjectRecordBatch
> ProjectRecordBatch: Container record count not set
> {noformat}
> Occurs in {{ProjectRecordBatch.handleNullInput()}}: it sets up the schema but 
> does not set the value count to 0.
> h4. Unordered Receiver
> {{TestCsvWithSchema.testMultiFileSchema()}}:
> {noformat}
> ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors 
> from UnorderedReceiverBatch
> UnorderedReceiverBatch: Container record count not set
> {noformat}
> The problem is that {{RecordBatchLoader.load()}} does not set the container 
> record count.
> h4. Streaming Aggregate
> {{TestJsonReader.testSumWithTypeCase()}}:
> {noformat}
> ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors 
> from StreamingAggBatch
> StreamingAggBatch: Container record count not set
> {noformat}
> The problem is that {{StreamingAggBatch.buildSchema()}} does not set the 
> container record count to 0.
> h4. Limit
> {{TestJsonReader.testDrill_1419()}}:
> {noformat}
> ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors 
> from LimitRecordBatch
> LimitRecordBatch: Container record count not set
> {noformat}
> None of the paths in {{LimitRecordBatch.innerNext()}} set the container 
> record count.
> h4. Union All
> {{TestJsonReader.testKvgenWithUnionAll()}}:
> {noformat}
> ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors 
> from UnionAllRecordBatch
> UnionAllRecordBatch: Container record count not set
> {noformat}
> When {{UnionAllRecordBatch}} calls 
> {{VectorAccessibleUtilities.setValueCount()}}, it did not also set the 
> container count.
> h4. Hash Aggregate
> {{TestJsonReader.drill_4479()}}:
> {noformat}
> ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors 
> from HashAggBatch
> HashAggBatch: Container record count not set
> {noformat}
> Problem is that {{HashAggBatch.buildSchema()}} does not set the container 
> record count to 0 for the first, empty, batch sent for {{OK_NEW_SCHEMA.}}
> h4. And Many More
> It turns out that most operators fail to set one of the many row count 
> variables somewhere in their code path: maybe in the schema setup path, maybe 
> when building a batch along one of the many paths that operators follow. 
> Further, we have multiple row counts that must be set:
> * Values in each vector ({{setValueCount()}}),
> * Row count in the container ({{setRecordCount()}}), which must be the same 
> as the vector value count.
> * Row count in the operator (batch), which is the (possibly filtered) count 
> of records presented to downstream operators. It must be less than or equal 
> to the c

[jira] [Resolved] (DRILL-6953) Merge row set-based JSON reader

2021-04-25 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-6953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-6953.

Resolution: Fixed

Resolved via a series of individual tickets.

> Merge row set-based JSON reader
> ---
>
> Key: DRILL-6953
> URL: https://issues.apache.org/jira/browse/DRILL-6953
> Project: Apache Drill
>  Issue Type: Sub-task
>Affects Versions: 1.15.0
>    Reporter: Paul Rogers
>    Assignee: Paul Rogers
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.19.0
>
>
> The final step in the ongoing "result set loader" saga is to merge the 
> revised JSON reader into master. This reader does three key things:
> * Demonstrates the prototypical "late schema" style of data reading (discover 
> schema while reading).
> * Implements many tricks and hacks to handle schema changes while loading.
> * Shows that, even with all these tricks, the only true solution is to 
> actually have a schema.
> The new JSON reader:
> * Uses an expanded state machine when parsing rather than the complex set of 
> if-statements in the current version.
> * Handles reading a run of nulls before seeing the first data value (as long 
> as the data value shows up in the first record batch).
> * Uses the result-set loader to generate fixed-size batches regardless of the 
> complexity, depth of structure, or width of variable-length fields.
> While the JSON reader itself is helpful, the key contribution is that it 
> shows how to use the entire kit of parts: result set loader, projection 
> framework, and so on. Since the projection framework can handle an external 
> schema, it is also a handy foundation for the ongoing schema project.
> Key work to complete after this merger will be to reconcile actual data with 
> the external schema. For example, if we know a column is supposed to be a 
> VarChar, then read the column as a VarChar regardless of the type JSON itself 
> picks. Or, if a column is supposed to be a Double, then convert Int and 
> String JSON values into Doubles.
> The Row Set framework was designed to allow inserting custom column writers. 
> This would be a great opportunity to do the work needed to create them. Then, 
> use the new JSON framework to allow parsing a JSON field as a specified Drill 
> type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7789) Exchanges are slow on large systems & queries

2020-09-23 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7789:
--

 Summary: Exchanges are slow on large systems & queries
 Key: DRILL-7789
 URL: https://issues.apache.org/jira/browse/DRILL-7789
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.16.0
Reporter: Paul Rogers


A user with a moderate-sized cluster and query has experienced extreme slowness 
in exchanges. Up to 11/12 of the time is spent waiting in one query, and 3/4 of 
the time in another. We suspect that exchanges are somehow 
serializing across the cluster.

Cluster:
 * Drill 1.16 (MapR version)
 * MapR-FS
 * Data stored in an 8 GB Parquet file, unpacks to about 80 GB, 20B records
 * 4 Drillbits
 * Each node has 56 cores, 400 GB of memory
 * Drill queries run with 40 fragments (70% of CPU) and 80 GB of memory

The query is, essentially:

{noformat}
Parquet writer
- Hash Join
  - Scan
  - Window, Sort
  - Window, Sort
  - Hash Join
- Scan
- Scan
{noformat}

In the above, each line represents a fragment boundary. The plan includes mux 
exchanges between the two "lower" scans and the hash join.

The total query  time is 6 hours. Of that, 30 minutes is spent working, the 
other 5.5 hours is spent waiting. (The 30 minutes is obtained by summing the 
"Avg Runtime" column in the profile.)

When checking resource usage with "top", we found that only a small amount of 
CPU was used. We should have seen 4000% (40 cores) but we actually saw just 
around 300-400%. This again indicates that the query spent most of its time 
doing nothing: not using CPU.

In particular the sender spends about 5 hours waiting for the receiver, which 
in turn spends about 5 hours waiting for the sender. This pattern occurs in 
every exchange in the "main" data path (the 20B records.)

As an experiment, the user disabled Mux exchanges. The system became overloaded 
at 40 fragments per node, so parallelism was reduced to 20. Now, the partition 
sender waited for the unordered receiver and vice versa.

The original query incurred spilling. We hypothesized that the spilling caused 
delays which somehow rippled through the DAG. However, the user revised the 
query to eliminate spilling and to reduce the query to just the "bottom" hash 
join. The query ran for an hour, of which 3/4 of the time was again spent with 
senders and receivers waiting for each other.

We have eliminated a number of potential causes:

* System has sufficient memory
* MapRFS file system has plenty of spindles and plenty of I/O capability.
* Network is fast
* No other load on the nodes
* Query was simplified down to the simplest possible: a single join (with 
exchanges)
* If the query is simplified further (scan and write to Parquet, no join), it 
completes in just a few minutes: about as fast as the disk I/O rate.

The query profile does not provide sufficient information to dig further. The 
profile provides aggregate wait times, but does not, say, tell us which 
fragments wait for which other fragments for how long.

We believe that, if the exchange delays are fixed, the query which takes six 
hours should complete in less than a half hour -- even with shuffles, spilling, 
reading from Parquet and writing to Parquet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [VOTE] Release Apache Drill 1.18.0 - RC0

2020-09-02 Thread Paul Rogers
Hi Abhishek,

Downloaded the tar file, installed Drill, cleaned my ZK and poked around in
the UI.

As you noted, you've already run the thousands of unit tests and the test
framework, so no point in trying to repeat that. Our tests, however, don't
cover the UI much at all, so I clicked around on the basics to ensure
things basically work. Seems good.

To catch the odd cases, would be great if someone who uses Drill in
production could try it out. Until then, my vote is +1.

- Paul


On Tue, Sep 1, 2020 at 5:28 PM Abhishek Girish  wrote:

> Thanks Vova!
>
> Hey folks, we need more votes to validate the release. Please give RC0 a
> try.
>
> Special request to PMCs - please vote as we only have 1 binding vote at
> this point. I am fine extending the voting window by a day or two if anyone
> is or plans to work on it soon.
>
> On Tue, Sep 1, 2020 at 12:09 PM Volodymyr Vysotskyi 
> wrote:
>
> > Verified checksums and signatures for binary and source tarballs and for
> > jars published to the maven repo.
> > Run all unit tests on Ubuntu with JDK 8 using tar with sources.
> > Run Drill in embedded mode on Ubuntu, submitted several queries, verified
> > that profiles displayed correctly.
> > Checked JDBC driver using SQuirreL SQL client and custom java client,
> > ensured that it works correctly with the custom authenticator.
> >
> > +1 (binding)
> >
> > Kind regards,
> > Volodymyr Vysotskyi
> >
> >
> > On Mon, Aug 31, 2020 at 1:37 PM Volodymyr Vysotskyi <
> volody...@apache.org>
> > wrote:
> >
> > > Hi all,
> > >
> > > I have looked into the DRILL-7785, and the problem is not in Drill, so
> it
> > > is not a blocker for the release.
> > > For more details please refer to my comment
> > > <
> >
> https://issues.apache.org/jira/browse/DRILL-7785?focusedCommentId=17187629=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17187629
> > >
> > > on this ticket.
> > >
> > > Kind regards,
> > > Volodymyr Vysotskyi
> > >
> > >
> > > On Mon, Aug 31, 2020 at 4:26 AM Abhishek Girish 
> > > wrote:
> > >
> > >> Yup we can certainly include it if RC0 fails. So far I’m inclined to not
> > >> consider it a blocker. I’ve requested Vova and Anton to take a look.
> > >>
> > >> So folks, please continue to test the candidate.
> > >>
> > >> On Sun, Aug 30, 2020 at 6:16 PM Charles Givre  wrote:
> > >>
> > >> > Ok.  Are you looking to include DRILL-7785?  I don't think it's a blocker,
> > >> > but if we find anything with RC0... let's make sure we get it in.
> > >> >
> > >> > -- C
> > >> >
> > >> > > On Aug 30, 2020, at 9:14 PM, Abhishek Girish  wrote:
> > >> > >
> > >> > > Hey Charles,
> > >> > >
> > >> > > I would have liked to. We did get one of the PRs merged after the master
> > >> > > branch was closed as I hadn't made enough progress with the release yet.
> > >> > > But that’s not the case now.
> > >> > >
> > >> > > Unless DRILL-7781 is a release blocker, we should probably skip it. So far,
> > >> > > a lot of effort has gone into getting RC0 ready. So I'm hoping to get this
> > >> > > closed asap.
> > >> > >
> > >> > > Regards,
> > >> > > Abhishek
> > >> > >
> > >> > > On Sun, Aug 30, 2020 at 6:07 PM Charles Givre  wrote:
> > >> > >
> > >> > >> HI Abhishek,
> > >> > >>
> > >> > >> Can we merge DRILL-7781?  We really shouldn't ship something with a simple
> > >> > >> bug like this.
> > >> > >>
> > >> > >> -- C
> > >> > >>
> > >> > >>> On Aug 30, 2020, at 8:40 PM, Abhishek Girish  wrote:
> > >> > >>>
> > >> > >>> Advanced tests from [5] are also complete. All 7500+ tests passed, except
> > >> > >>> for a few relating to known resource issues (drillbit connectivity / OOM
> > >> > >>> /...). Plus a few with the same symptoms as DRILL-7785.
> > >> > >>>
> > >> > >>> On Sun, Aug 30, 2020 at 2:17 PM Abhishek Girish  wrote:
> > >> > >>>
> > >> >  Wanted to share an update on some of the testing I've done from my side:
> > >> > 
> > >> >  All Functional tests from [5] (plus private Customer tests) are complete.
> > >> >  10,000+ tests have passed. However, I did see an issue with Hive ORC tables
> > >> > 

[jira] [Created] (DRILL-7734) Revise the result set reader

2020-05-05 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7734:
--

 Summary: Revise the result set reader
 Key: DRILL-7734
 URL: https://issues.apache.org/jira/browse/DRILL-7734
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Updates to the {{ResultSetReader}} abstractions to make them usable in more 
cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7733) Use streaming for REST JSON queries

2020-05-05 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7733:
--

 Summary: Use streaming for REST JSON queries
 Key: DRILL-7733
 URL: https://issues.apache.org/jira/browse/DRILL-7733
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Several users on the user and dev mail lists have complained about the memory 
overhead when running a REST JSON query: {{http://node:8047/query.json}}. The 
current implementation buffers the entire result set in memory, then lets 
Jersey/Jetty convert the results to JSON. The result is very heavy heap use for 
larger query result sets.

This ticket requests a change to use streaming. As each batch arrives at the 
Screen operator, convert that batch to JSON and directly stream the results to 
the client network connection, much as is done for the native client connection.

For backward compatibility, the form of the JSON must be the same as the 
current API.
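As a rough illustration of the streaming shape (the class below is illustrative 
only, not the actual REST server code), each batch is written and flushed as it 
arrives rather than accumulated on the heap:

{code}
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.OutputStream;
import java.util.List;
import java.util.Map;

// Open the JSON document once, emit each batch as it arrives, close at the end.
// Nothing is buffered across batches.
public class StreamingResultWriter implements AutoCloseable {
  private final ObjectMapper mapper = new ObjectMapper();
  private final JsonGenerator gen;

  public StreamingResultWriter(OutputStream out, List<String> columns) throws Exception {
    gen = mapper.getFactory().createGenerator(out);
    gen.writeStartObject();
    gen.writeArrayFieldStart("columns");      // schema goes out first
    for (String col : columns) {
      gen.writeString(col);
    }
    gen.writeEndArray();
    gen.writeArrayFieldStart("rows");         // rows follow, batch by batch
  }

  // Called once per incoming record batch.
  public void writeBatch(List<Map<String, Object>> batch) throws Exception {
    for (Map<String, Object> row : batch) {
      mapper.writeValue(gen, row);
    }
    gen.flush();                              // push bytes to the client now
  }

  @Override
  public void close() throws Exception {
    gen.writeEndArray();
    gen.writeEndObject();
    gen.close();
  }
}
{code}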



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7729) Use java.time in column accessors

2020-05-04 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7729:
--

 Summary: Use java.time in column accessors
 Key: DRILL-7729
 URL: https://issues.apache.org/jira/browse/DRILL-7729
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Use {{java.time}} classes in the column accessors, except for {{Interval}}, 
which has no {{java.time}} equivalent. Doing so allows us to create a row-set 
version of Drill's JSON writer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Drill with No-SQL [was: Cannot Build Drill "exec/Java Execution Engine"]

2020-05-03 Thread Paul Rogers
Thanks for the update; I hadn't picked up on that bit of confusion about Presto.

I just did a Drill build, seemed to work, thanks for the fix. However, I don't 
know if I had the needed dependency cached, so my build might have worked 
anyway...

Thanks,
- Paul

 

On Sunday, May 3, 2020, 3:09:58 PM PDT, Ted Dunning  
wrote:  
 
 I didn't mention Presto on purpose. It is a fine tool, but the community is
plagued lately by a fork. That can be expected to substantially inhibit
adoption and I think that is just what I have seen. It used to be that
people asked about Presto every other time I was on a call and I haven't
heard even one such question in over a year. The community may recover from
this, but it is hard to say whether they can regain their momentum.

In case anybody wants to sample the confusion, here are the two "official"
homes on github:

https://github.com/prestodb/presto
https://github.com/prestosql/presto

The worst part is that neither fork seems to dominate the other. With the
Hudson/Jenkins fork, at least, Hudson basically died while Jenkins continued
with full momentum. Here, both sides seem to be splitting things much too
evenly.



On Sun, May 3, 2020 at 2:42 PM Paul Rogers 
wrote:

> Hi Tug,
>
> Glad to hear from you again. Ted's summary is pretty good; here's a bit
> more detail.
>
>
> Presto is another alternative which seems to have gained the most traction
> outside of the Cloud ecosystem on the one hand, and the
> Cloudera/HortonWorks ecosystem on the other. Presto does, however, demand
> that you have a schema, which is often an obstacle for many applications.
>
> Most folks I've talked to who tried to use Spark for this use case came
> away disappointed. Unlike Drill (or Presto or Impala), Spark wants to start
> new Java processes for each query. Makes great sense for large, complex
> map/reduce jobs, but is a non-starter for small, interactive queries.
>
> Hive also is trying to be an "uber query layer" and has integrations with
> multiple systems. But, Hive's complexity makes Drill look downright simple
> by comparison. Hive also needs an up-front schema.
>
>
> I've had the opportunity to integrate Drill with two different noSQL
> engines. Getting started is easy, especially if a REST or similar API is
> available. Filter push-down is the next step as otherwise Drill will simply
> suck all data from your DB as if it were a file. We've added some structure
> in the new HTTP reader to make it a bit easier than it used to be to create
> this kind of filter push-down. (The other kind of filter push-down is for
> partition pruning used for files, which you probably won't need.)
>
> Aside from the current MapR repo issues, Drill tends to be much easier to
> build than other systems. Pretty much set up Java and the correct Maven and
> you're good to go. If you run unit tests, there is one additional library
> to install, but the tests themselves tell you exactly what is needed
> when they fail the first time (which is how I learned about it.)
>
>
> After that, performance will point the way. For example, does your DB have
> indexes? If so, then you can leverage the work originally done for MapR-DB
> to convey index information to Calcite so it can pick the best execution
> plan. There are specialized operators for index key lookup as well.
>
> All this will get you the basic one-table scan which is often all that
> no-SQL DBs ever need. (Any structure usually appears within each document,
> rather than as joined table as in the RDBMS world.) However, if your DB
> does need joins, you will need something like Calcite to work out the
> tradeoffs of the various join+filter-push plans possible, especially if
> your DB supports multiple indexes. There is no escaping the plan-time
> complexity of these cases. Calcite is big and complex, but it does give you
> the tools needed to solve these problems.
>
> If your DB is to be used to power dashboards (summaries of logs, time
> series, click streams, sales or whatever), you'll soon find you need to
> provide a caching/aggregation layer to avoid banging on your DB each time
> the dashboard refreshes. (Imagine a 1-week dashboard, updated every minute,
> where only the last hour has new data.) Drill becomes very handy as a way
> of combining data from a mostly-static caching layer (data for the last 6
> days, say) with your live DB (for the last one day, say.)
>
> If you provide a "writer" as well as a "reader", you can use Drill to load
> your DB as well as query it.
>
>
> Happy to share whatever else I might have learned if you can describe your
> goals in a bit more detail.
>
> Thanks,
> - Paul
>
>
>
>    On Sunday, May 3, 2020, 11:25:11 AM PDT, Ted Dunning <
> ted.dunn...@gmai

Drill with No-SQL [was: Cannot Build Drill "exec/Java Execution Engine"]

2020-05-03 Thread Paul Rogers
Hi Tug,

Glad to hear from you again. Ted's summary is pretty good; here's a bit more 
detail.


Presto is another alternative which seems to have gained the most traction 
outside of the Cloud ecosystem on the one hand, and the Cloudera/HortonWorks 
ecosystem on the other. Presto does, however, demand that you have a schema, 
which is often an obstacle for many applications.

Most folks I've talked to who tried to use Spark for this use case came away 
disappointed. Unlike Drill (or Presto or Impala), Spark wants to start new Java 
processes for each query. Makes great sense for large, complex map/reduce jobs, 
but is a non-starter for small, interactive queries.

Hive also is trying to be an "uber query layer" and has integrations with 
multiple systems. But, Hive's complexity makes Drill look downright simple by 
comparison. Hive also needs an up-front schema.


I've had the opportunity to integrate Drill with two different noSQL engines. 
Getting started is easy, especially if a REST or similar API is available. 
Filter push-down is the next step as otherwise Drill will simply suck all data 
from your DB as if it were a file. We've added some structure in the new HTTP 
reader to make it a bit easier than it used to be to create this kind of filter 
push-down. (The other kind of filter push-down is for partition pruning used 
for files, which you probably won't need.)

Aside from the current MapR repo issues, Drill tends to be much easier to build 
than other systems. Pretty much set up Java and the correct Maven and you're 
good to go. If you run unit tests, there is one additional library to install, 
but the tests themselves tell you exactly what is needed when they fail the 
first time (which is how I learned about it.)


After that, performance will point the way. For example, does your DB have 
indexes? If so, then you can leverage the work originally done for MapR-DB to 
convey index information to Calcite so it can pick the best execution plan. 
There are specialized operators for index key lookup as well.

All this will get you the basic one-table scan which is often all that no-SQL 
DBs ever need. (Any structure usually appears within each document, rather than 
as joined table as in the RDBMS world.) However, if your DB does need joins, 
you will need something like Calcite to work out the tradeoffs of the various 
join+filter-push plans possible, especially if your DB supports multiple 
indexes. There is no escaping the plan-time complexity of these cases. Calcite 
is big and complex, but it does give you the tools needed to solve these 
problems.

If your DB is to be used to power dashboards (summaries of logs, time series, 
click streams, sales or whatever), you'll soon find you need to provide a 
caching/aggregation layer to avoid banging on your DB each time the dashboard 
refreshes. (Imagine a 1-week dashboard, updated every minute, where only the 
last hour has new data.) Drill becomes very handy as a way of combining data 
from a mostly-static caching layer (data for the last 6 days, say) with your 
live DB (for the last one day, say.)

If you provide a "writer" as well as a "reader", you can use Drill to load your 
DB as well as query it.


Happy to share whatever else I might have learned if you can describe your 
goals in a bit more detail.

Thanks,
- Paul

 

On Sunday, May 3, 2020, 11:25:11 AM PDT, Ted Dunning 
 wrote:  
 
 The compile problem is a problem with the MapR repo (I think). I have
reported it to the folks who can fix it.

Regarding the generic question, I think that Drill is very much a good
choice for putting a SQL layer on a noSQL database.

It is definitely the case that the community is much broader than it used
to be. A number of companies now use Drill in their products which is
one of the best ways to build long-term community.

There are alternatives, of course. All have trade-offs (because we live in
the world):

- Calcite itself (what Drill uses as a SQL parser and optimizer) can be
used, but you have to provide an execution framework and you wind up with
something that only works for your engine and is unlikely to support
parallel operations. Calcite is used by lots of projects, though, so it
has a very broad base of support.

- Spark SQL is fairly easy to extend (from what I hear from friends) but
the optimizer doesn't deal well with complicated tradeoffs (precisely
because it is fairly simple). You also wind up with the baggage of spark
which could be good or bad. You would get some parallelism, though. I don't
think that Spark SQL handles complex objects, however.

- Postgres has a long history of having odd things grafted onto it. I know
little about this other than seeing the results. Extending Postgres would
not likely give you any parallelism, but there might be a way to support
complex objects through Postgres JSON object support.




On Sun, May 3, 2020 at 11:09 AM Tugdual Grall  wrote:

> Hello
>
> It has been a long time since I used 

[jira] [Created] (DRILL-7728) Drill SPI framework

2020-05-01 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7728:
--

 Summary: Drill SPI framework
 Key: DRILL-7728
 URL: https://issues.apache.org/jira/browse/DRILL-7728
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Provide the basic framework to load an extension in Drill, modelled after the 
Java Service Provider concept. Excludes full class loader isolation for now.
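For reference, a minimal sketch of the Java Service Provider pattern this is 
modelled on; the interface and class names here are made up for illustration and 
are not the actual Drill SPI:

{code}
import java.util.ServiceLoader;

// A provider jar declares its implementation class in
// META-INF/services/DrillExtension; the host then discovers it at runtime.
public interface DrillExtension {
  String name();
}

class ExtensionLoader {
  static void loadAll() {
    for (DrillExtension ext : ServiceLoader.load(DrillExtension.class)) {
      System.out.println("Loaded extension: " + ext.name());
    }
  }
}
{code}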



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7725) Updates to EVF2

2020-04-30 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7725:
--

 Summary: Updates to EVF2
 Key: DRILL-7725
 URL: https://issues.apache.org/jira/browse/DRILL-7725
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Enhancements to the "version 2" of the "Enhanced Vector Framework" to prepare 
for upgrading the text reader to EVF2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7724) Refactor metadata controller batch

2020-04-30 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7724:
--

 Summary: Refactor metadata controller batch
 Key: DRILL-7724
 URL: https://issues.apache.org/jira/browse/DRILL-7724
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


A debugging session revealed opportunities to simplify 
{{MetadataControllerBatch}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7717) Support Mongo extended types in V2 JSON loader

2020-04-26 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7717:
--

 Summary: Support Mongo extended types in V2 JSON loader
 Key: DRILL-7717
 URL: https://issues.apache.org/jira/browse/DRILL-7717
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Drill supports Mongo's extended types in the V1 JSON reader. Add similar 
support to the V2 version.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [QUESTION]: httpclient dependency

2020-04-23 Thread Paul Rogers
Hi All,
I think there may be a bit of confusion. It may be true that some of Drill's 
dependencies now use the newer version of the library 
httpcomponents:httpclient. However, it looks like ES directly depends on the 
older flavor.

We have pom file entries which exclude that old version. As a result, ES can't 
find the older version at run time.

So, maybe two choices: 1) convince ES to use the newer version (how?), or 2) 
retain the older version (don't exclude it.)

Would sure be nice if each storage plugin could run in its own class loader to 
avoid these issues. We're slowly moving in that direction.
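As a rough, generic-Java illustration of that idea (nothing here is an existing 
Drill mechanism; the class is made up), each plugin's jars would get their own 
loader so its dependency versions cannot clash with another plugin's:

import java.net.URL;
import java.net.URLClassLoader;

class PluginIsolation {
  // Loads a plugin class from its own jar set. A real implementation would need
  // a child-first loader and lifecycle handling; this only shows the shape.
  static Object loadPlugin(URL[] pluginJars, String pluginClassName) throws Exception {
    URLClassLoader loader =
        new URLClassLoader(pluginJars, PluginIsolation.class.getClassLoader());
    Class<?> cls = Class.forName(pluginClassName, true, loader);
    return cls.getDeclaredConstructor().newInstance();
  }
}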

Thanks,
- Paul

 

On Thursday, April 23, 2020, 5:35:55 PM PDT, Charles Givre 
 wrote:  
 
 Hi Vova, 
Thanks for the response.  I've been slowly poking at a storage plugin for 
ElasticSearch.[1]  I was going to do some work on it, but after rebasing to the 
latest master, I'm getting errors in my unit tests that were not there before. 

Here's the relevant snippet of the dependency tree:

[INFO] +- org.elasticsearch.client:elasticsearch-rest-client:jar:7.6.2:compile
[INFO] |  +- org.apache.httpcomponents:httpclient:jar:4.5.12:compile
[INFO] |  +- org.apache.httpcomponents:httpcore:jar:4.4.12:compile
[INFO] |  \- org.apache.httpcomponents:httpcore-nio:jar:4.4.12:compile
[INFO] +- org.elasticsearch:elasticsearch-hadoop:jar:7.6.2:compile

Currently, I don't have the dependency excluded or anything like that in the 
pom.xml for the storage plugin so I would assume that the dependency would be 
included, but it doesn't seem to be. Would you have any suggestions as to how 
to fix it?


Thanks,
-- C



Here's the full stack trace:
19:58:51.940 [Time-limited test] DEBUG o.a.d.e.s.e.TestElasticQueries - select 
* from elasticsearch.employee.`developer`
19:58:52.011 [215dd444-5b82-78b6-6222-25db75b0a934:foreman] DEBUG 
o.a.d.e.s.e.ElasticSearchGroupScan - Getting region locations

org.apache.drill.exec.rpc.RpcException: 
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: 
ClassNotFoundException: 
org.apache.commons.httpclient.protocol.ProtocolSocketFactory


Please, refer to logs for more information.

[Error Id: f43ab4ec-d557-45df-a41d-1a0221c3ffdb on 192.168.1.25:31013]

    at org.apache.drill.exec.rpc.RpcException.mapException(RpcException.java:59)
    at 
org.apache.drill.exec.client.DrillClient$ListHoldingResultsListener.getResults(DrillClient.java:881)
    at org.apache.drill.exec.client.DrillClient.runQuery(DrillClient.java:583)
    at org.apache.drill.test.QueryBuilder.results(QueryBuilder.java:331)
    at 
org.apache.drill.test.ClusterFixture$FixtureTestServices.testRunAndReturn(ClusterFixture.java:615)
    at 
org.apache.drill.test.DrillTestWrapper.testRunAndReturn(DrillTestWrapper.java:938)
    at 
org.apache.drill.test.DrillTestWrapper.compareUnorderedResults(DrillTestWrapper.java:533)
    at org.apache.drill.test.DrillTestWrapper.run(DrillTestWrapper.java:172)
    at org.apache.drill.test.TestBuilder.go(TestBuilder.java:145)
    at 
org.apache.drill.exec.store.elasticsearch.TestElasticQueries.testSimpleStarQuery(TestElasticQueries.java:83)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.drill.common.exceptions.UserRemoteException: SYSTEM 
ERROR: ClassNotFoundException: 
org.apache.commons.httpclient.protocol.ProtocolSocketFactory


Please, refer to logs for more information.

[Error Id: f43ab4ec-d557-45df-a41d-1a0221c3ffdb on 192.168.1.25:31013]
    at 
org.apache.drill.exec.rpc.user.QueryResultHandler.resultArrived(QueryResultHandler.java:125)
    at org.apache.drill.exec.rpc.user.UserClient.handle(UserClient.java:422)
    at org.apache.drill.exec.rpc.user.UserClient.handle(UserClient.java:96)
    at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:273)
    at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:243)
    at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
    at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
    at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
    at 

Format plugin configs should be immutable

2020-04-19 Thread Paul Rogers
Hi All,

This is a quick note for any of you who create or work on format plugins in 
Drill. You will see that all existing plugins have been modified so that config 
properties are immutable. This note will explain why.

Drill uses storage and format plugins as keys into an internal map. (That's 
right: the whole plugin config is the key, not just the name.) As you might 
expect, things get out of sync if we change the fields used to compute the key 
hash. So, all storage and format plugin config fields must be immutable.

As it turns out, Drill has a very helpful feature: table functions which allow 
you to set format plugin properties within your SQL query. That code has 
historically required that all your format plugin fields be both public and 
mutable. (The implementation wrote directly to these public fields.)

As you can see, that created a contradiction: internal maps demand immutable 
fields, table functions want mutable fields. DRILL-6168 solves this by using a 
different way (based on JSON serialization) to implement table functions. (That 
change also allows table functions to inherit properties from the base config 
stored in ZK.) This fix allowed us to change all existing format plugin configs 
to have immutable properties.


What this means for you is that, in your new code, please follow the patterns 
in the latest master: please make storage and plugin fields immutable.
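As a rough sketch of the pattern (the class below is made up for illustration, 
not one of our actual plugins): final fields, a Jackson-annotated constructor, 
no setters, and equals/hashCode over every field since the whole config acts as 
a map key.

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import java.util.Objects;

public class ExampleFormatConfig {
  private final String extension;
  private final String delimiter;

  @JsonCreator
  public ExampleFormatConfig(
      @JsonProperty("extension") String extension,
      @JsonProperty("delimiter") String delimiter) {
    this.extension = extension;
    this.delimiter = delimiter;
  }

  public String getExtension() { return extension; }
  public String getDelimiter() { return delimiter; }

  @Override
  public boolean equals(Object o) {
    if (this == o) { return true; }
    if (o == null || getClass() != o.getClass()) { return false; }
    ExampleFormatConfig that = (ExampleFormatConfig) o;
    return Objects.equals(extension, that.extension)
        && Objects.equals(delimiter, that.delimiter);
  }

  @Override
  public int hashCode() { return Objects.hash(extension, delimiter); }
}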

Thanks,
- Paul



[jira] [Created] (DRILL-7711) Add data path, parameter filter pushdown to HTTP plugin

2020-04-18 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7711:
--

 Summary: Add data path, parameter filter pushdown to HTTP plugin
 Key: DRILL-7711
 URL: https://issues.apache.org/jira/browse/DRILL-7711
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Add to the new HTTP plugin two new features:

 * The ability to express a path to the data to avoid having to work with 
complex message objects in SQL.
 * The ability to specify HTTP parameters using filter push-downs from SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7709) CTAS as CSV creates files which the "csv" plugin can't read

2020-04-17 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7709:
--

 Summary: CTAS as CSV creates files which the "csv" plugin can't 
read
 Key: DRILL-7709
 URL: https://issues.apache.org/jira/browse/DRILL-7709
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.17.0
Reporter: Paul Rogers


Change the output format to CSV and create a table:
{noformat}
ALTER SESSION SET `store.format` = 'csv';
CREATE TABLE foo AS ...
 {noformat}

You will end up with a directory "foo" that contains a CSV file: "0_0_0.csv". 
Now, try to query that file:

{noformat}
SELECT * FROM foo
{noformat}

The query will fail, or return incorrect results, because in Drill, the "csv" 
read format is CSV *without* headers. But, on write, "csv" is CSV *with* 
headers.

The (very messy) workaround is to manually rename all the files to use the 
".csvh" suffix, or to create a separate storage plugin config for that target 
with a new "csv" format plugin that does not have headers.

Expected that if I create a file in Drill I should be able to immediately read 
that file without extra hokey-pokey.
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS]: Masking Creds in Query Plans

2020-04-17 Thread Paul Rogers
Hi Charles,

Excellent point. The problem is deeper. Drill serializes plugin configs in the 
query plan which it sends to each worker (Drillbit.) Why? To avoid race 
conditions if you start a query then change the plugin config and thus 
different nodes see different versions of the config.

Masking can't happen in the execution plan or the plan won't work. (I hope your 
password is not actually "***".) So, masking would have to happen in logs 
and in the EXPLAIN PLAN FOR. This would, in turn, require that we have code 
that understands each config well enough to make a copy of the config with the 
credentials masked so we can then serialize the copied plan to JSON. (Or, we'd 
have to edit the JSON after generated.) Both are pretty ugly and not very 
secure.

What we need is some kind of "vault" interface: a config which is a key into a 
vault where Drill itself has been given the key, and the vault returns the 
actual credential value. As a security guy yourself, what would you recommend 
as our target? Should we create a generic API? Is there some system common 
enough on Hadoop systems that we should target that as our reference 
implementation? Also, can you perhaps file a JIRA ticket for this issue?
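To make the shape of that concrete, something along these lines (entirely 
hypothetical; none of this exists in Drill today):

// The plugin config stores only an opaque key; the secret is resolved at
// execution time, so EXPLAIN output and logs can show the key without leaking it.
public interface CredentialVault {
  String resolve(String credentialKey);
}

class VaultBackedJdbcConfig {
  private final String url;
  private final String passwordKey;   // e.g. "jdbc-prod-password", never the secret

  VaultBackedJdbcConfig(String url, String passwordKey) {
    this.url = url;
    this.passwordKey = passwordKey;
  }

  String password(CredentialVault vault) {
    return vault.resolve(passwordKey);
  }
}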

Thanks,
- Paul

 

On Friday, April 17, 2020, 7:34:32 AM PDT, Charles Givre  
wrote:  
 
 Hello all, 
I was thinking about this: if a user were to execute an EXPLAIN PLAN FOR query, 
they get a lot of information about the storage plugin, including in some cases 
creds.
The example below shows a query plan for the JDBC storage plugin.  As you can 
see, the user creds are right there 

I'm wondering would it be advisable or possible to mask the creds in query 
plans so that users can't access this information?  If masking it isn't an 
option, is there some other way to prevent users from seeing this information?  
In a multi-tenant environment, it seems like a rather large security hole. 
Thanks,
-- C


{
  "head" : {
    "version" : 1,
    "generator" : {
      "type" : "ExplainHandler",
      "info" : ""
    },
    "type" : "APACHE_DRILL_PHYSICAL",
    "options" : [ ],
    "queue" : 0,
    "hasResourcePlan" : false,
    "resultMode" : "EXEC"
  },
  "graph" : [ {
    "pop" : "jdbc-scan",
    "@id" : 5,
    "sql" : "SELECT *\nFROM `stats`.`batting`",
    "columns" : [ "`playerID`", "`yearID`", "`stint`", "`teamID`", "`lgID`", 
"`G`", "`AB`", "`R`", "`H`", "`2B`", "`3B`", "`HR`", "`RBI`", "`SB`", "`CS`", 
"`BB`", "`SO`", "`IBB`", "`HBP`", "`SH`", "`SF`", "`GIDP`" ],
    "config" : {
      "type" : "jdbc",
      "driver" : "com.mysql.cj.jdbc.Driver",
      "url" : "jdbc:mysql://localhost:3306/?serverTimezone=EST5EDT",
      "username" : "",
      "password" : "",
      "caseInsensitiveTableNames" : false,
      "sourceParameters" : { },
      "enabled" : true
    },
    "userName" : "",
    "cost" : {
      "memoryCost" : 1.6777216E7,
      "outputRowCount" : 100.0
    }
  }, {
    "pop" : "limit",
    "@id" : 4,
    "child" : 5,
    "first" : 0,
    "last" : 10,
    "initialAllocation" : 100,
    "maxAllocation" : 100,
    "cost" : {
      "memoryCost" : 1.6777216E7,
      "outputRowCount" : 10.0
    }
  }, {
    "pop" : "limit",
    "@id" : 3,

  

[jira] [Created] (DRILL-7708) Downgrade maven from 3.6.3 to 3.6.0

2020-04-17 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7708:
--

 Summary: Downgrade maven from 3.6.3 to 3.6.0
 Key: DRILL-7708
 URL: https://issues.apache.org/jira/browse/DRILL-7708
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


DRILL-7704 upgraded Drill's Maven version to 3.6.3.


As it turns out, I use Ubuntu (Linux Mint) for development. Maven is installed 
as a package using apt-get. Packages can lag behind a bit. The latest maven 
available via apt-get is 3.6.0.


It is a nuisance to install a new version outside the package manager. I 
changed the Maven version in the root pom.xml to 3.6.0 and the build seemed to 
work. Any reason we need the absolute latest version rather than just 3.6.0 or 
later?


The workaround for now is to manually edit the pom.xml file on each checkout, 
then revert the change before commit. This ticket requests to adjust the 
"official" version to 3.6.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [NOTICE] Maven 3.6.3

2020-04-17 Thread Paul Rogers
Hi Arina,

Thanks for keeping us up to date!

As it turns out, I use Ubuntu (Linux Mint) for development. Maven is installed 
as a package using apt-get. Packages can lag behind a bit. The latest maven 
available via apt-get is 3.6.0.

It is a nuisance to install a new version outside the package manager. I 
changed the Maven version in the root pom.xml to 3.6.0 and the build seemed to 
work. Any reason we need the absolute latest version rather than just 3.6.0 or 
later?

The workaround for now is to manually edit the pom.xml file on each checkout, 
then revert the change before commit. Can we maybe adjust the "official" 
version instead?


Thanks,
- Paul

 

On Friday, April 17, 2020, 5:09:49 AM PDT, Arina Ielchiieva 
 wrote:  
 
 Hi all,

Starting from Drill 1.18.0 (and current master from commit 20ad3c9 [1]), Drill 
build will require Maven 3.6.3, otherwise build will fail.
Please make sure you have Maven 3.6.3 installed on your environments. 

[1] 
https://github.com/apache/drill/commit/20ad3c9837e9ada149c246fc7a4ac1fe02de6fe8

Kind regards,
Arina  

[jira] [Resolved] (DRILL-7655) Add Default Schema text box to Edit Query page in query profile

2020-04-15 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-7655.

Resolution: Fixed

Fixed as part of PR #2052.

> Add Default Schema text box to Edit Query page in query profile
> ---
>
> Key: DRILL-7655
> URL: https://issues.apache.org/jira/browse/DRILL-7655
> Project: Apache Drill
>  Issue Type: Task
>Affects Versions: 1.18.0
>Reporter: Vova Vysotskyi
>    Assignee: Paul Rogers
>Priority: Major
> Fix For: Future
>
> Attachments: image-2020-03-21-01-44-15-062.png, 
> image-2020-03-21-01-44-57-172.png, image-2020-03-21-01-45-24-782.png
>
>
> In DRILL-7603 was added functionality to specify default schema for query in 
> Drill Web UI when submitting the query.
> Also, the query may be resubmitted from the profiles page, and for the case 
> when the query was submitted with specified default schema, its resubmission 
> will fail.
> The aim of this Jira is to add Default Schema text box to this page and 
> populate it with schema specified for the specific query if possible.
> !image-2020-03-21-01-44-15-062.png!
>  
> !image-2020-03-21-01-44-57-172.png!
>  
> !image-2020-03-21-01-45-24-782.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7703) Support for 3+D arrays in EVF JSON loader

2020-04-15 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7703:
--

 Summary: Support for 3+D arrays in EVF JSON loader
 Key: DRILL-7703
 URL: https://issues.apache.org/jira/browse/DRILL-7703
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Add support for multiple levels of repeated list to the new EVF-based JSON 
reader.

As work continues on adding the new JSON reader to Drill, running unit tests 
reveals that some test files include lists with three (perhaps more) dimensions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7701) EVF V2 Scan Framework

2020-04-14 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7701:
--

 Summary: EVF V2 Scan Framework
 Key: DRILL-7701
 URL: https://issues.apache.org/jira/browse/DRILL-7701
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Scan framework for the "V2" EVF schema resolution committed in DRILL-7696.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-7685) Case statement marking column as required in parquet metadata

2020-04-12 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-7685.

Resolution: Cannot Reproduce

Tested in Drill 1.18 (snapshot) and found that the provided query works fine. 
Suggested the user try the newer Drill version.

If you still have a problem please reopen this bug and provide another example 
so we can locate and fix the issue, if it still exists in the latest code.

> Case statement marking column as required in parquet metadata
> -
>
> Key: DRILL-7685
> URL: https://issues.apache.org/jira/browse/DRILL-7685
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.16.0
>Reporter: Nitin Pawar
>Assignee: Paul Rogers
>Priority: Minor
>
> We use apache drill for multi step processing.
> In one of the steps we have query as below
> ~create table dfs.tmp.`/t2` as select employee_id, case when department_id is 
> not null then 1 else 2 end as case_output from cp.`employee.json`;~
> This provides output as 
> employee_id: OPTIONAL INT64 R:0 D:1
> case_output: REQUIRED INT32 R:0 D:0
> If we remove the end statement from case it does mark the column as optional.
>  
> We feed this output to covariance function and because of this we get an 
> error like below 
> Error: Missing function implementation: [covariance(BIGINT-OPTIONAL, 
> INT-REQUIRED)]. Full expression: --UNKNOWN EXPRESSION--
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7697) Revise query editor in profile page of web UI

2020-04-12 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7697:
--

 Summary: Revise query editor in profile page of web UI
 Key: DRILL-7697
 URL: https://issues.apache.org/jira/browse/DRILL-7697
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers


Drill has two separate query editors:

* The one displayed from the Query tab
* The one displayed from the Edit Query tab within Profiles

The two editors do basically the same thing, but have evolved as copies that 
have diverged.

* The Query tab editor places the three query types above the query text box, 
while the Profiles version puts the same control below the query text box.
* Similarly the Query tab editor puts the Ctrl+Enter hint above the text box, 
Profiles puts it below.

A first request is to unify the two editors. In particular, move the code to a 
common template file included in both places.

Second, the Profiles editor is a bit redundant.

* Displays a "Cancel Query" button even if the query is completed. Hide this 
button for completed queries. (Since there is a race condition, hide it for 
queries completed at the time the page was created.)
* No need to ask the user for the query type. The profile should include the 
type and the type should be a fixed field in Profiles.
* Similarly, the limit and (in Drill 1.18) the Default Schema should also be 
recorded in the query plan and fixed.

Finally, since system/session options can affect a query, and are part of the 
query plan, show those in the query as well so it can be rerun in the same 
environment in which it originally ran.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-6672) Drill table functions cannot handle "setFoo" accessors

2020-04-11 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-6672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-6672.

Resolution: Not A Problem

Storage and format plugins must be immutable since their entire values are used 
as keys in an internal map (plugin registry and format plugin tables.) So, no 
config should have a "setFoo()" method.

> Drill table functions cannot handle "setFoo" accessors
> --
>
> Key: DRILL-6672
> URL: https://issues.apache.org/jira/browse/DRILL-6672
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.13.0
>    Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Minor
>
> Consider an example format plugin, such as the regex one used in the Drill 
> book. (GitHub reference needed.) We can define the plugin using getters and 
> setters like this:
> {code}
> public class RegexFormatConfig implements FormatPluginConfig {
>   private String regex;
>   private String fields;
>   private String extension;
>   public void setRegex(String regex) { this.regex = regex; }
>   public void setFields(String fields) { this.fields = fields; }
>   public void setExtension(String extension) { this.extension = extension; }
> }
> {code}
> We can then create a plugin configuration using the Drill Web console, the 
> {{bootstrap-storage-plugins.json}} and so on. All work fine.
> Suppose we try to define a configuration using a Drill table function:
> {code}
>   final String sql = "SELECT * FROM table(cp.`regex/simple.log2`\n" +
>   "(type => 'regex',\n" +
>   " extension => 'log2',\n" +
>   " regex => '(dddd)-(dd)-(dd) 
> .*',\n" +
>   " fields => 'a, b, c, d'))";
> {code}
> We get this error:
> {noformat}
> org.apache.drill.common.exceptions.UserRemoteException: PARSE ERROR: 
> can not set value (\d\d\d\d)-(\d\d)-(\d\d) .* to parameter regex: class 
> java.lang.String
> table regex/simple.log2
> parameter regex
> {noformat}
> The reason is that the code that handles table functions only knows how to 
> set public fields; it does not know about the Java Bean getter/setter 
> conventions used by Jackson:
> {code}
> package org.apache.drill.exec.store.dfs;
> ...
> final class FormatPluginOptionsDescriptor {
>   ...
>   FormatPluginConfig createConfigForTable(TableInstance t) {
> ...
> Field field = pluginConfigClass.getField(paramDef.name);
> ...
> }
> field.set(config, param);
>   } catch (IllegalAccessException | NoSuchFieldException | 
> SecurityException e) {
> throw UserException.parseError(e)
> .message("can not set value %s to parameter %s: %s", param, 
> paramDef.name, paramDef.type)
> ...
> {code}
> The only workaround is to make all fields public:
> {code}
> public class RegexFormatConfig implements FormatPluginConfig {
>   public String regex;
>   public String fields;
>   public String extension;
> {code}
> Since public fields are not good practice, please modify the table function 
> mechanism to follow Jackson conventions and allow Java Bean style setters. 
> (Or better, fix DRILL-6673 to allow immutable format objects via the use of a 
> constructor.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7696) EVF v2 Scan Schema Resolution

2020-04-09 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7696:
--

 Summary: EVF v2 Scan Schema Resolution
 Key: DRILL-7696
 URL: https://issues.apache.org/jira/browse/DRILL-7696
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Revises the mechanism EVF uses to resolve the schema for a scan. See PR for 
details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7690) Display (major) operators in fragment title bar in Web UI

2020-04-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7690:
--

 Summary: Display (major) operators in fragment title bar in Web UI
 Key: DRILL-7690
 URL: https://issues.apache.org/jira/browse/DRILL-7690
 Project: Apache Drill
  Issue Type: Improvement
  Components: Web Server
Affects Versions: 1.17.0
Reporter: Paul Rogers


Run a query in the Drill Web Console. View the profile, Query tab. Scroll down 
to the list of fragments. You'll see a gray bar with a title such as

Major Fragment: 02-xx-xx

This section shows the timing of the fragments.

But what is happening in this fragment? To find out, we must scroll way down 
to the lower section, where we see:


02-xx-00 - SINGLE_SENDER
02-xx-01 - SELECTION_VECTOR_REMOVER
02-xx-02 - LIMIT
02-xx-03 - SELECTION_VECTOR_REMOVER
02-xx-04 - TOP_N_SORT
02-xx-05 - UNORDERED_RECEIVER

The result is quite a bit of scroll down/scroll up.

This ticket asks to show the major operators in the fragment title. For 
example, for the above:

Major Fragment: 02-xx-xx (TOP_N_SORT, LIMIT)

The "minor" operators which are omitted (because they are not the focus of the 
fragment) include senders, receivers and the SVR.

Note that the operators should appear in data flow order (bottom to top).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7689) Do not save profiles for trivial queries

2020-04-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7689:
--

 Summary: Do not save profiles for trivial queries
 Key: DRILL-7689
 URL: https://issues.apache.org/jira/browse/DRILL-7689
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers


Drill saves a query profile for every query. Some queries are trivial; there is 
no useful information (for the user) in such queries. Examples include {{ALTER 
SESSION/SYSTEM}}, {{CREATE SCHEMA}}, and other internal commands.

Logic already exists to omit profiles for {{ALTER}} commands, but only if a 
session option is set. No ability exists to omit profiles for the other 
statements.

This ticket asks to:
 * Omit profiles for trivial commands by default. (Part of the task is to 
define the set of trivial commands.)
 * Provide an option to enable such profiles, primarily for use by developers 
when debugging the trivial commands.
 * If no profile is available, show a message to that effect in the Web UI 
where we currently display the profile number. Provide a link to the 
documentation page that explains why there is no profile (and how to use the 
above option to request a profile if needed.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7688) Provide web console option to see non-default options

2020-04-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7688:
--

 Summary: Provide web console option to see non-default options
 Key: DRILL-7688
 URL: https://issues.apache.org/jira/browse/DRILL-7688
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers


The Drill web console has evolved to become quite powerful. The Options page 
has many wonderful improvements over earlier versions. The "Default" button is 
a handy way to see which options have been set, and to reset options to their 
default values.

When testing and troubleshooting, it is helpful to identify those options which 
are not at their default values. Please add a filter at the top of the page for 
"non-default" in addition to the existing topic-based filters.

It may also be useful to add a bit more color to the "Default" button when an 
option is set. At present, the distinction is gray vs. black text, which is 
better than it was. It would be better to have even more contrast so that 
non-default values are easier to see.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7687) Inaccurate memory estimates in hash join

2020-04-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7687:
--

 Summary: Inaccurate memory estimates in hash join
 Key: DRILL-7687
 URL: https://issues.apache.org/jira/browse/DRILL-7687
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.15.0
Reporter: Paul Rogers


See DRILL-7675. In that ticket, we tried to reproduce an OOM case in the 
partition sender. In so doing, we mucked with various parallelization options. 
The query has 2 MB of data, but at one point the query would fail to run 
because the hash join could not obtain enough memory (on a system with 8 GB of 
memory available).

The problem is that the memory calculator sees a worst-case scenario: a row 
with 250+ columns. The hash join estimated it needed something like 650MB of 
memory to perform the join. (That is 650 MB per fragment, and there were 
multiple fragments.) Since there was insufficient memory, and the 
{{drill.exec.hashjoin.fallback.enabled}} option was disabled, the hash join 
failed before it even started.

It would be better to at least try the query. In this case, with 2 MB of data, 
the query succeeds (we had to enable the fallback option to do so).

It would also be better to use the estimated row counts when estimating memory 
use, and perhaps better estimates for the amount of memory needed per row. (The 
data in question has multiple nested map arrays, causing cardinality estimates 
to grow by 5x at each level.)

Perhaps use the "batch sizing" mechanism to detect actual memory use by 
analyzing the incoming batch.

There is no obvious answer. However, the goal is clear: the query should 
succeed if the actual memory needed fits within that available; we should not 
fail proactively based on estimates of needed memory. (This is what the 
{{drill.exec.hashjoin.fallback.enabled}} option does; perhaps it should be on 
by default.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7686) Excessive memory use in partition sender

2020-04-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7686:
--

 Summary: Excessive memory use in partition sender
 Key: DRILL-7686
 URL: https://issues.apache.org/jira/browse/DRILL-7686
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.14.0
Reporter: Paul Rogers


The Partition Sender in Drill is responsible for taking a batch from fragment 
x and sending its rows to all other fragments f1, f2, ... fn. For example, when 
joining, fragment x might read from a portion of a file, hash the join key, and 
partition rows by hash key to the receiving fragments that join rows with that 
same key.

Since Drill is columnar, the sender needs to send a batch of columns to each 
receiver. To be efficient, that batch should contain a reasonable number of 
rows. The current default is 1024.

Drill creates buffers, one per sender, to gather the rows. Thus, each sender 
needs n buffers: one for each receiver.

Because Drill is symmetrical, there are n senders (scans). Since each maintains 
n send buffers, we have a total of n^2 buffers. That is, the amount of memory 
used by the partition sender grows with the square of the degree of parallelism 
for a query.

In addition, as seen in DRILL-7675, the size of the buffers is controlled not 
by Drill, but by the incoming data. The query in DRILL-7675 had a row with 260+ 
fields, some of which were map arrays.

The result is that the query, which processes 2 MB of data, runs out of memory 
when many GB are available. Drill is simply doing the math: n^2 buffers, each 
with 1024 rows, each with 250 fields, many with a cardinality of 5x (or 25x or 
125x, depending on array depth) of the row count. The result is a very large 
memory footprint.
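
For a rough sense of the numbers, here is a back-of-the-envelope sketch (the 
figures are illustrative assumptions drawn from the description above, not 
measured values):

{code:java}
public class PartitionSenderFootprint {
  public static void main(String[] args) {
    // Illustrative assumptions: actual sizes depend on the data and vector types.
    int fragments = 16;        // degree of parallelism (n)
    int rowsPerBuffer = 1024;  // default outgoing batch size
    int fieldsPerRow = 250;    // wide row, as in DRILL-7675
    int valuesPerField = 5;    // cardinality multiplier for nested map arrays
    int bytesPerValue = 8;     // assumed average bytes per value

    long buffers = (long) fragments * fragments;  // n^2 buffers in total
    long bytesPerBuffer =
        (long) rowsPerBuffer * fieldsPerRow * valuesPerField * bytesPerValue;
    long totalBytes = buffers * bytesPerBuffer;

    System.out.printf("%d buffers x ~%d MB each = ~%d GB%n",
        buffers, bytesPerBuffer >> 20, totalBytes >> 30);
    // Prints: 256 buffers x ~9 MB each = ~2 GB -- for only 2 MB of input data.
  }
}
{code}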

There is no simple bug-fix solution: the design is inherently unbounded. This 
ticket asks to develop a new design. Some crude ideas:
 * Use a row-based format for sending to avoid columnar overhead.
 * Send rows as soon as they are available on the sender side; allow the 
receiver to do buffering.
 * If doing buffering, flush rows after x ms to avoid slowing the system. (The 
current approach waits for buffers to fill.)
 * Consolidate buffers on each sending node. (This is the Mux/DeMux approach 
which is in the code, but was never well understood, and has its own 
concurrency and memory ownership problems.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7683) Add "message parsing" to new JSON loader

2020-03-31 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7683:
--

 Summary: Add "message parsing" to new JSON loader
 Key: DRILL-7683
 URL: https://issues.apache.org/jira/browse/DRILL-7683
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Worked on a project that uses the new JSON loader to parse a REST response that 
includes a set of "wrapper" fields around the JSON payload. Example:

{code:json}
{ "status": "ok", "results: [ data here ]}
{code}

To solve this cleanly, added the ability to specify a "message parser" to 
consume JSON tokens up to the start of the data. This parser can be written as 
needed for each different data source.
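
As an illustration of the idea only (this is plain Jackson, not the actual 
Drill JSON structure parser API; the field name "results" is taken from the 
example above), such a parser simply consumes tokens until it reaches the data 
array:

{code:java}
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

public class MessageParserSketch {

  /**
   * Advances the parser past the wrapper fields until it sits on the
   * START_ARRAY token of the named data field; returns false if not found.
   * (This simple sketch does not track nesting depth.)
   */
  static boolean skipToData(JsonParser parser, String dataField) throws Exception {
    JsonToken token;
    while ((token = parser.nextToken()) != null) {
      if (token == JsonToken.FIELD_NAME && dataField.equals(parser.getCurrentName())) {
        return parser.nextToken() == JsonToken.START_ARRAY;
      }
    }
    return false;
  }

  public static void main(String[] args) throws Exception {
    String response = "{ \"status\": \"ok\", \"results\": [ {\"a\": 1}, {\"a\": 2} ] }";
    JsonParser parser = new JsonFactory().createParser(response);
    System.out.println("found data array: " + skipToData(parser, "results"));
  }
}
{code}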

Since this change adds one more parameter to the JSON structure parser, added 
builders to gather the needed parameters rather than making the constructor 
even larger.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7680) Move UDF projects before plugins in contrib

2020-03-31 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7680:
--

 Summary: Move UDF projects before plugins in contrib
 Key: DRILL-7680
 URL: https://issues.apache.org/jira/browse/DRILL-7680
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Several {{contrib}} plugins depend on UDFs for testing. However, the UDFs occur 
after the plugins in build order. This PR reverses the dependencies so that 
UDFs are built before the plugins that want to use them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Excessive Memory Use in Parquet Files (From Drill Slack Channel)

2020-03-24 Thread Paul Rogers
red: One or more nodes ran out of memory while executing the 
query. (null)
org.apache.drill.common.exceptions.UserException: RESOURCE ERROR: One or more 
nodes ran out of memory while executing the query.
null
[Error Id: 67b61fc9-320f-47a1-8718-813843a10ecc ]
    at 
org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:657)
    at 
org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:338)
    at 
org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.drill.exec.exception.OutOfMemoryException: null
    at 
org.apache.drill.exec.vector.complex.AbstractContainerVector.allocateNew(AbstractContainerVector.java:59)
    at 
org.apache.drill.exec.test.generated.PartitionerGen5$OutgoingRecordBatch.allocateOutgoingRecordBatch(PartitionerTemplate.java:380)
    at 
org.apache.drill.exec.test.generated.PartitionerGen5$OutgoingRecordBatch.initializeBatch(PartitionerTemplate.java:400)
    at 
org.apache.drill.exec.test.generated.PartitionerGen5.setup(PartitionerTemplate.java:126)
    at 
org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.createClassInstances(PartitionSenderRootExec.java:263)
    at 
org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.createPartitioner(PartitionSenderRootExec.java:218)
    at 
org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.innerNext(PartitionSenderRootExec.java:188)
    at 
org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:93)
    at 
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:323)
    at 
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:310)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at 
org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:310)
    ... 4 common frames omitted
Now, I'm running this query from a 16 core, 32 GB RAM machine, with heap sized 
at 20 GB, Eden sized at 16 GB (added manually to JAVA_OPTS) and direct memory 
sized at 8 GB.
By querying sys.memory I can confirm all limits apply. At no point throughout 
the query am I nearing the memory limit of the heap, direct memory, or the OS 
itself.





8:25
However, due to the way 
org.apache.drill.exec.vector.complex.AbstractContainerVector.allocateNew is 
implemented
8:27
@Override
  public void allocateNew() throws OutOfMemoryException {
    if (!allocateNewSafe()) {
      throw new OutOfMemoryException();
    }
  }
8:27
The actual exception/error is swallowed, and I have no idea what the cause of 
the failure is
8:28
The data-set itself consists of, say, 15 parquet files, each one weighing in at 
about 100 KB
8:30
but as mentioned earlier, the parquet files are a bit more complex than the 
usual.
8:32
@cgivre @Vova Vysotskyi is there anything I can do or tweak to make this error 
go away?

cgivre  8:40 AM
Hmm...
8:40
This may be a bug.  Can you create an issue on our JIRA board?

Idan Sheinberg  8:43 AM
Sure
8:43
I'll get to it

cgivre  8:44 AM
I'd like for Paul Rogers to see this as I think he was the author of some of 
this.

Idan Sheinberg  8:44 AM
Hmm. I'll keep that in mind

cgivre  8:47 AM
We've been refactoring some of the complex readers as well, so it's possible 
that caused this, but I'm not really sure.
8:47
What version of Drill?

cgivre  9:11 AM
This kind of info is super helpful as we're trying to work out all these 
details.
9:11
Reading schemas on the fly is not trivial, so when we find issues, we do like 
to resolve them

Idan Sheinberg  9:16 AM
This is drill 0.18 -SNAPSHOT as of last month
9:16
U
9:16
I do think I managed to resolve the issue however
9:16
I'm going to run some additional tests and let you know

cgivre  9:16 AM
What did you do?
9:17
You might want to rebase with today's build as well

Idan Sheinberg  9:21 AM
I'll come back with the details in a few moments

cgivre  9:38 AM
Thx
new messages

Idan Sheinberg  9:50 AM
Ok. See, it seems as though it's a combination of a few things.
The data-set in question is still small (as mentioned before), but we are 
setting planner.slice_target  to an extremely low value in order to trigger 
parallelism and speed up parquet parsing by using multiple fragments.
We have 16 cores, 32 GB (C5.4xlarge on AWS), but we set 
planner.width.max_per_node to further increase parallelism. It seems as 
though each fragment is handling parquet parsing on its own, and somehow incurs 
a great burden on the direct memory buffer pool, as I do see 16 GB peaks of 
direct memory usage after lowering the planner.width.max

[jira] [Created] (DRILL-7658) Vector allocateNew() has poor error reporting

2020-03-24 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7658:
--

 Summary: Vector allocateNew() has poor error reporting
 Key: DRILL-7658
 URL: https://issues.apache.org/jira/browse/DRILL-7658
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.17.0
Reporter: Paul Rogers


See posting by Charles on 2020-03-24 on the user and dev lists of a message 
forwarded from another user where a query ran out of memory. Stack trace:

{noformat}
Caused by: org.apache.drill.exec.exception.OutOfMemoryException: null
    at 
org.apache.drill.exec.vector.complex.AbstractContainerVector.allocateNew(AbstractContainerVector.java:59)
    at 
org.apache.drill.exec.test.generated.PartitionerGen5$OutgoingRecordBatch.allocateOutgoingRecordBatch(PartitionerTemplate.
{noformat}

Notice the complete lack of context. The method in question:

{code:java}
  public void allocateNew() throws OutOfMemoryException {
    if (!allocateNewSafe()) {
      throw new OutOfMemoryException();
    }
  }
{code}

A generated implementation of the {{allocateNewSafe()}} method:

{code:java}
  @Override
  public boolean allocateNewSafe() {
    long curAllocationSize = allocationSizeInBytes;
    if (allocationMonitor > 10) {
      curAllocationSize = Math.max(8, curAllocationSize / 2);
      allocationMonitor = 0;
    } else if (allocationMonitor < -2) {
      curAllocationSize = allocationSizeInBytes * 2L;
      allocationMonitor = 0;
    }

    try {
      allocateBytes(curAllocationSize);
    } catch (DrillRuntimeException ex) {
      return false;
    }
    return true;
  }
{code}

Note that the {{allocateNew()}} method is not "safe" (it throws an exception), 
but it does so by discarding the underlying exception. What should happen is 
that the "non-safe" {{allocateNew()}} should call the {{allocateBytes()}} 
method and simply forward the {{DrillRuntimeException}}. It probably does not 
do so because the author wanted to reuse the extra size calcs in 
{{allocateNewSafe()}}.

The solution is to put the calcs and the call to {{allocateBytes()}} in a 
"non-safe" method, and call that entire method from {{allocateNew()}} and 
{{allocateNewSafe()}}.  Or, better, generate {{allocateNew()}} using the above 
code, but have the base class define {{allocateNewSafe()}} as a wrapper.
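
A minimal sketch of that refactoring (names are illustrative; the real base 
class, size calculations, and exception types in Drill differ):

{code:java}
public abstract class BaseValueVectorSketch {

  /** Performs the actual allocation; throws a runtime exception on failure. */
  protected abstract void allocateBytes(long size);

  /** Placeholder for the allocation-monitor size calculations shown above. */
  protected abstract long computeAllocationSize();

  /** "Non-safe" variant: any allocation failure propagates with its context. */
  public void allocateNew() {
    allocateBytes(computeAllocationSize());
  }

  /** "Safe" variant: a thin wrapper that converts the failure to a boolean. */
  public boolean allocateNewSafe() {
    try {
      allocateNew();
      return true;
    } catch (RuntimeException e) {
      return false;
    }
  }
}
{code}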

Note an extra complexity: although the base class provides the method shown 
above, each generated vector also provides:

{code:java}
  @Override
  public void allocateNew() {
if (!allocateNewSafe()) {
  throw new OutOfMemoryException("Failure while allocating buffer.");
}
  }
{code}

This is both redundant and inconsistent (one has a message, the other does 
not).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7640) EVF-based JSON Loader

2020-03-12 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7640:
--

 Summary: EVF-based JSON Loader
 Key: DRILL-7640
 URL: https://issues.apache.org/jira/browse/DRILL-7640
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Builds on the JSON structure parser and several other PRs to provide an 
enhanced, robust mechanism to read JSON data into value vectors via the EVF. 
This is not the JSON reader; rather, it is the "V2" version of the 
{{JsonProcessor}} which does the actual JSON parsing/loading work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7634) Rollup of code cleanup changes

2020-03-10 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7634:
--

 Summary: Rollup of code cleanup changes
 Key: DRILL-7634
 URL: https://issues.apache.org/jira/browse/DRILL-7634
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Pack of cosmetic code cleanup changes accumulated over recent months.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7633) Fixes for union and repeated list accessors

2020-03-09 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7633:
--

 Summary: Fixes for union and repeated list accessors
 Key: DRILL-7633
 URL: https://issues.apache.org/jira/browse/DRILL-7633
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Minor fixes for repeated list and Union type support in column accessors



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7632) Improve user exception formatting

2020-03-09 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7632:
--

 Summary: Improve user exception formatting
 Key: DRILL-7632
 URL: https://issues.apache.org/jira/browse/DRILL-7632
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Modify the user exception to insert a colon between the "context" title and 
value. Old style:

{noformat}
My Context value
{noformat}

Revised:

{noformat}
My Context: value
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

