Hi All,
I agree on not mentioning Arrow integration; it is too early. As Igor says, it
would be a big project, done over many releases. Arrow does highlight the need
for other work that can happen in the shorter term.
A big challenge now is that our "crazy complex type" feature does not really
work. Yes, we have good support for a (single) Map, for (scalar) arrays, and
even arrays of maps. We've added DICT. However, things get crazy when we try to
do a UNION with a LIST of DICT that contains UNIONS of ...
It is VERY hard to reason about that stuff with Drill vectors. The crazy types
would eat up a huge amount of work as we try to convert, since the semantics
are not clear.
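To make the nesting problem concrete, here is a minimal sketch in Python. It is
purely illustrative: TypeNode, describe, depth and union_count are hypothetical
names, not Drill classes, and the model ignores modes (nullable, repeated) and
vector layout entirely. It only shows how quickly a "UNION with a LIST of DICT
that contains UNIONs" grows into something hard to reason about.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TypeNode:
    """Hypothetical type descriptor: a kind plus nested child types."""
    kind: str  # e.g. "INT", "VARCHAR", "DICT", "LIST", "UNION"
    children: List["TypeNode"] = field(default_factory=list)


def describe(t: TypeNode) -> str:
    """Render the type as a nested string such as LIST<INT>."""
    if not t.children:
        return t.kind
    inner = ", ".join(describe(c) for c in t.children)
    return f"{t.kind}<{inner}>"


def depth(t: TypeNode) -> int:
    """Number of nesting levels, counting the node itself."""
    return 1 + max((depth(c) for c in t.children), default=0)


def union_count(t: TypeNode) -> int:
    """How many UNION nodes appear anywhere in the type tree."""
    return int(t.kind == "UNION") + sum(union_count(c) for c in t.children)


# A UNION of INT and a LIST of DICT whose values are themselves a UNION.
inner_union = TypeNode("UNION", [TypeNode("INT"), TypeNode("VARCHAR")])
dict_type = TypeNode("DICT", [TypeNode("VARCHAR"), inner_union])
crazy = TypeNode("UNION", [TypeNode("INT"), TypeNode("LIST", [dict_type])])

print(describe(crazy))
print(depth(crazy), union_count(crazy))
```

Every UNION in such a tree multiplies the cases each operator has to handle,
which is one reason to draw the line at a bounded set of supported shapes.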
So, we'd probably want to stabilize our complex type support. Draw a line at
what we do and do not support. Make it work with vectors. Then, we'll be in a
better place if/when we change internal representations.
In fact, it may make sense to continue on the work we are already doing to add
type support. We've got Metastore and Provided Schema. EVF can work with a
schema. We need to start connecting these pieces.
One way to do that is to add classic type propagation support to the planner.
Calcite must support type analysis since Calcite is used in Hive and Hive
enforces a strict schema. Drill would want to support the current "dynamic"
(runtime) type as well as a schema-provided plan-time type.
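As a rough sketch of what supporting both modes could look like, the Python
below resolves a column's plan-time type from an optional provided schema and
falls back to a "dynamic" marker otherwise. Everything here is an assumption
made for illustration: DYNAMIC, plan_time_type and propagate_add are
hypothetical names, the type names are plain strings, and none of this is
Calcite's or Drill's actual API.

```python
from typing import Dict, Optional

# Hypothetical marker meaning "type is only known at run time".
DYNAMIC = "DYNAMIC"


def plan_time_type(column: str,
                   provided_schema: Optional[Dict[str, str]]) -> str:
    """Return the declared type if a schema provides one, else DYNAMIC."""
    if provided_schema and column in provided_schema:
        return provided_schema[column]
    return DYNAMIC


def propagate_add(left: str, right: str) -> str:
    """Toy type propagation for the expression `left + right`."""
    if DYNAMIC in (left, right):
        # At least one input is unknown at plan time: defer to run time.
        return DYNAMIC
    if left == right:
        return left
    if {left, right} == {"INT", "DOUBLE"}:
        # Widen to the larger numeric type.
        return "DOUBLE"
    raise TypeError(f"cannot add {left} and {right}")


schema = {"a": "INT", "b": "DOUBLE"}
a = plan_time_type("a", schema)   # declared in the schema
b = plan_time_type("b", schema)   # declared in the schema
c = plan_time_type("c", schema)   # not declared, so dynamic
print(propagate_add(a, b), propagate_add(a, c))
```

The point of the sketch is that a query with a full schema gets a fully typed
plan, while a query without one behaves as it does today.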
Between fixing the "crazy type" problem, and supporting an (optional) schema,
we'd resolve quite a few of the nasty ambiguities which would otherwise
complicate any conversion of our internal storage representation.
Point is, maybe we should hold a few more hangouts to work out what we all want
to do next.
Thanks,
- Paul
On Friday, February 7, 2020, 9:56:27 AM PST, Igor Guzenko
<[email protected]> wrote:
Hello Charles,
To be honest, the Arrow-to-Drill comparison is very time consuming, and
implementation plus testing would be far more time consuming still. Among the
other blockers to the migration is the high risk of introducing new bugs. I'm
not trying to say that Arrow is unstable. Rather, the risk comes from my lack
of knowledge of all the low-level details of Drill's operators. Also, the
Drill code contains a lot of very specific fixes for a wide variety of
problems, and it is very easy to introduce a breaking change without being
aware of why a fix was made.
I had a conversation with Arina and we decided to put the migration on hold
indefinitely. Currently, Bohdan and I are focused on learning more
about the internals of Drill operators (codegen & batch execution) and on
implementing INTERSECT/EXCEPT operators. Prior to any changes, we are
planning to create a design document and discuss implementation
approaches with the community.
Kind Regards,
Igor
On Fri, Feb 7, 2020 at 7:07 PM Charles Givre <[email protected]> wrote:
> Hi Igor,
> Thanks for your response. Regarding the release, my observation was that
> the last release was REALLY complicated and took a very long time because
> it was very large. My thought is that going forward, we can get more
> releases out the door if we do smaller, more frequent releases. This way
> we get more value into the hands of our users faster. I'm not tied to any
> date, but I'd like to shoot for quarterly releases rather than
> semi-annual. I'm happy to make the projected timeline more vague...
> Something like:
>
> "We are aspiring for more frequent releases and are aiming for early Q2
> for the next release"
>
>
> @Vova, @Volodymyr, @Arina, what are your thoughts?
>
> On another note, I realized this after I wrote it, but I didn't mention
> any of the conversation about Arrow. Do you think we should mention that
> as well?
> -- C
>
>
>
> > On Feb 7, 2020, at 11:15 AM, Igor Guzenko <[email protected]> wrote:
> >
> > Hello Charles,
> >
> > Thank you very much for gathering all the information necessary for the
> > report. Everything looks good to me except one thing I'm not sure about. I
> > have some doubts about the new planned release date; it does not seem very
> > realistic to me at the moment. I'm not a manager, but I think it could be
> > useful to have a clearly documented release strategy and to define an
> > approximate list of features and fixes to be included. Sorry if this is not
> > strictly related to the board report topic; I just wanted to share my thoughts.
> >
> > Thanks,
> > Igor
> >
> > On Fri, Feb 7, 2020 at 4:54 PM Charles Givre <[email protected]> wrote:
> >
> >> Hello all,
> >> Here is the draft Apache Board report which is due on Wednesday. Could
> >> everyone please take a look and send me comments by Tuesday?
> >> Thanks!
> >> -- C
> >>
> >>
> >>
> >> ## Description:
> >> The mission of Drill is the creation and maintenance of software related to
> >> Schema-free SQL Query Engine for Apache Hadoop, NoSQL and Cloud Storage
> >>
> >> ## Issues:
> >> Nothing significant to report.
> >>
> >> ## Membership Data:
> >> Apache Drill was founded 2014-11-18 (5 years ago)
> >> There are currently 56 committers and 26 PMC members in this project.
> >> The Committer-to-PMC ratio is roughly 7:4.
> >>
> >> Community changes, past quarter:
> >> - Bohdan Kazydub was added to the PMC on 2020-01-28
> >> - Igor Guzenko was added to the PMC on 2019-12-12
> >> - Denys Ordynskiy was added as committer on 2019-12-26
> >>
> >> ## Project Activity:
> >> Drill 1.17, which contains a significant number of bugfixes and
> >> improvements, was released on 2019-12-26
> >> (https://drill.apache.org/docs/apache-drill-1-17-0-release-notes/).
> >>
> >> The Drill Community had a Hangout meeting and will be working towards a
> >> number of strategic goals:
> >> 1. Increase the size of the community.
> >> 2. Reduce obstacles to use, such as by improving the documentation and website.
> >> 3. Work on publicity.
> >>
> >> We have averaged about two releases per year. Going forward, we will try
> >> for smaller releases more frequently. Our next release is targeted for end
> >> of March.
> >>
> >> Interesting work underway:
> >> - Storage plugins for Apache Druid, Apache Cassandra, Elasticsearch, and
> >> general HTTP/REST.
> >> - Significant code improvements to facilitate storage and format plugin
> >> development.
> >> - Integrations with Docker and K8s.
> >> - Documentation improvements to include website re-work.
> >>
> >>
> >>
> >> ## Community Health:
> >> - [email protected] had a 35% increase in traffic in the past quarter
> >> (2169 emails compared to 1606)
> >> - [email protected] had a 97% increase in traffic in the past quarter
> >> (231 emails compared to 117)
> >> - 129 issues opened in JIRA, past quarter (28% increase)
> >> - 99 issues closed in JIRA, past quarter (15% increase)
> >> - 100 commits in the past quarter (78% increase)
> >> - 16 code contributors in the past quarter (6% increase)
> >> - 74 PRs opened on GitHub, past quarter (29% increase)
> >> - 75 PRs closed on GitHub, past quarter (15% increase)
> >>
> >>
> >>
>
>