Re: Arrow Rust roadmapping [was Re: [Gandiva] Representing logical query plans in protobuf]

Neville Dipale Sat, 05 Jan 2019 13:21:29 -0800

Hi Wes,

I'm aware of what you've said about the amount of work that leadership on OSS
projects takes. As for the time aspect, one just has to look at a
contributor's local timezone to see the hours and days that they work.


To be proactive, I'll hash together a rough roadmap for Rust, and share
it on the mailing list when it's ready. Where I need guidance on features,
I'll put in enough research so I don't spend other contributors' time on
open-ended problems.

Thanks for responding, I really appreciate it.

Neville

On Sat, 5 Jan 2019 at 23:03, Wes McKinney <[email protected]> wrote:

> hi Neville,
>
> On Sat, Jan 5, 2019 at 2:37 PM Neville Dipale <[email protected]>
> wrote:
> >
> > Hi Andy & Wes,
> >
> > Apologies if I go off-topic a bit, I hope my thoughts are related though.
> >
> > I'm a new contributor to Arrow, but I've been using and following it since
> > the feather days. I'm interested in contributing to Rust, as that aligns
> > more with my day job(s).
> >
> > I think we (rather, the Rust contributors) can benefit from direction via a
> > roadmap for a few releases (or for 2019), so that new contributors can find
> > it easier to add value.
> >
> > What I've observed so far (on Rust and sometimes other languages) is that
> > although there's the grand goal that exists (e.g. Ursa Labs' intentions), a
> > lot of work scheduling is haphazard. For example, a lot of JIRAs are opened
> > by developers, and then a PR is submitted not long after. Bug reports are
> > exceptions. This phenomenon, even if minor, makes it difficult for someone
> > to pick up work and contribute.
> >
> > I would propose the following for Rust and other less-maturely-supported
> > languages like C#:
> >
> > 1. We look at gaps relative to python/cpp feature-wise, and create JIRAs
> > for functionality that doesn't yet exist. For example, Rust doesn't have
> > date/time support (I created a JIRA a few weeks ago)
> > 2. For some of these features where more effort is required, provide some
> > rough outline of what needs to be done.
> > 3. For components/features that are common across languages (CSV), agree on
> > an overall design that languages can abide by as far as possible. CPP might
> > be the template, but it's already likely that Go and Rust are doing their
> > own thing, which might lead to inconsistent UX to Arrow users down the
> > line. Such a design might already exist, but I haven't seen anything yet.
> > This can include creating common test data, as is being done with Parquet.
> >
>
> I don't mean to dismiss this concern, but leadership (what you are
> asking for) in any kind of software project (whether open source or
> not) is a _lot_ of work. If there is not an individual with the time
> and space to effectively be a "product manager" then this work
> generally does not happen, and development gets done on an ad hoc
> basis based on what features people need to build the applications
> they are working on.
>
> One of the roles I've played over the last ~3 years in the project is
> the chief JIRA wrangler for the C++ and Python implementations.
> According to
>
> https://cwiki.apache.org/confluence/display/ARROW/JIRA+Health+Dashboard
>
> I have created almost 1500 JIRAs. If you want this work to happen for
> Rust or one of the other implementations, generally either you will
> have to do it, or you will have to find someone to compensate to do
> it. Otherwise there may be a volunteer who will step up, but there's
> no guarantee.
>
> > Beyond in-memory rep, computing kernels are the hot thing on Arrow right
> > now, with Gandiva being the crown jewel. We currently have *array_ops* in
> > Rust, where Andy's been adding some operations (sum, add, mul, etc.).
> >
> > 4. I think we need some explicit decision-making on whether to continue
> > this route, which might not be mutually exclusive to future Gandiva
> > bindings (based on Wes' comments on what Gandiva's role is).
>
> Apache projects are effectively do-ocracies that operate on the basis
> of consensus. It is up to the contributors of the subcomponents to
> self-manage what gets built and what does not get built. Much
> consensus may be "lazy" (where absence of opinions implies consent).
> It is a good idea to discuss objectives and requirements on the
> mailing list so there is a written record of what the consensus was.
> In the absence of "yay" arguments from contributors, ultimately the
> decisions get made by the people doing the work.
>
> > 5. If *array_ops* is the way to go, we could define the types of ops we
> > want to support. This could be as easy as looking at what Pandas or Spark
> > support. Having a growing suite of functions could encourage users to build
> > on Arrow like DataFusion is doing. This would also help Andy push the SQL
> > parser and DF's query engine (per goals and roadmap) while other people do
> > the grunt-work of various functions.
> >
> > I believe the above could make it clearer for newbies like me, to
> > contribute more to Arrow, and give us a better sense of what we can and
> > can't do with Arrow in our daily applications.
>
> It would be helpful to have a development roadmap for Rust, or for the
> other language implementations. With luck someone will be able to
> volunteer to take on this work.
>
> - Wes
>
> >
> > Thanks
> > Neville
> >
> >
> > On Sat, 5 Jan 2019 at 20:29, Andy Grove <[email protected]> wrote:
> >
> > > Wes,
> > >
> > > That makes sense.
> > >
> > > I'll create a fresh PR to add a new protobuf under the Rust module for now
> > > (even though this won't be Rust specific).
> > >
> > > Thanks,
> > >
> > > Andy.
> > >
> > >
> > > On Sat, Jan 5, 2019 at 9:19 AM Wes McKinney <[email protected]> wrote:
> > >
> > > > hey Andy,
> > > >
> > > > I replied on GitHub and then saw your e-mail thread.
> > > >
> > > > The Gandiva library as it stands right now is not a query engine or an
> > > > execution engine, properly speaking. It is a subgraph compiler for
> > > > creating accelerated expressions for use inside another execution or
> > > > query engine, like it is being used now in Dremio.
> > > >
> > > > For this reason I am -1 on adding logical query plan definitions to
> > > > Gandiva until a more rigorous design effort takes place to decide
> > > > where to build an actual query/execution engine (which includes file /
> > > > dataset scanners, projections, joins, aggregates, filters, etc.) in
> > > > C++. My preference is to start building a from-the-ground-up system
> > > > that will depend on Gandiva to compile expressions during execution.
> > > > Among other things, I don't think it is necessarily a good idea to
> > > > require a query engine to depend on LLVM, so tight coupling to an
> > > > LLVM-based component may not be desirable.
> > > >
> > > > In the meantime, if you want to start creating an (experimental)
> > > > Protobuf / Flatbuffer definition to define a general query execution
> > > > plan (that lives outside Gandiva for the time being) to assist with
> > > > building a query engine in Rust, I think that is fine, but I want to
> > > > make sure we are being deliberate and layering the project components
> > > > in a good way.
> > > >
> > > > - Wes
> > > >
> > > > On Sat, Jan 5, 2019 at 8:15 AM Andy Grove <[email protected]> wrote:
> > > > >
> > > > > I have created a PR to start a discussion around representing logical
> > > > > query plans in Gandiva (ARROW-4163).
> > > > >
> > > > > https://github.com/apache/arrow/pull/3319
> > > > >
> > > > > I think that adding the various steps such as projection, selection,
> > > > > sort, and so on is fairly simple and not contentious. The harder part
> > > > > is how we represent data sources since this likely has different
> > > > > meanings to different use cases. My thought is that we can register
> > > > > data sources by name (similar to CREATE EXTERNAL TABLE in Hadoop) or
> > > > > tie this into the IPC meta-data somehow so we can pass memory
> > > > > addresses and schema information.
> > > > >
> > > > > I would love to hear others' thoughts on this.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Andy.
> > > >
> > >
>
