Re: [DISCUSS] State of the Arrow Project 2022

Jacob Wujciak Sat, 07 Jan 2023 15:23:51 -0800

+1 to the existing suggestions, this is such a great thread, great start to
the year!
A theme from this thread I would like to pick up is "community ux": The
community overview on the arrow page is quite small and afaik the different
sync calls and zulip are not documented anywhere, so we should improve that
(ideally with more graphics too :D). With the number of different arrow
subprojects continuing to grow it might be nice to have a sort of
'community overview' that helps people new to arrow orient themselves in
and explore the arrow ecosystem. Ideally this would also be accessible from
github as I think there are a number of people that enter the arrow world
purely through gh.


I think for the roadmaps we could use projects in the respective repos. For
reference, the official GH roadmap does this too [1].

I have opened a PR with a PR template [2]; please review and add feedback
on the wording. I adapted the existing Rust templates and the hint that gets
posted when the title is malformed.

[1]: https://github.com/orgs/github/projects/4247
[2]: https://github.com/apache/arrow/pull/15250

On Sat, Jan 7, 2023 at 8:53 PM Andrew Lamb <al...@influxdata.com> wrote:

> We have used pull request templates in the various rust projects to good
> effect: most PRs clearly describe what they are doing and why.
>
> For your reference, they are at arrow-rs[1] and arrow-datafusion[2].
>
> [1] https://raw.githubusercontent.com/apache/arrow-rs/master/.github/pull_request_template.md
> [2] https://raw.githubusercontent.com/apache/arrow-datafusion/master/.github/pull_request_template.md
>
> On Fri, Jan 6, 2023 at 11:18 PM Will Jones <will.jones...@gmail.com>
> wrote:
>
> > Thanks, Kevin.
> >
> > > Documenting a process for determining who should be included on a code
> > > review would be helpful.
> > >
> >
> > That's a good idea. We have a docs page directed at contributors, but I'm
> > not sure how many people have read it [1]. This would be a good addition to
> > it. (There's also a good guide on reviewing contributions [2].) I also like
> > the idea of pull request templates, and it seems like if we provide a link
> > in the template to this overview, more of our contributors would read the
> > guide. I have created an issue for this [3].
> >
> > Also +1 on more diagrams. I've created a couple recently (for example [4])
> > and hope to make more.
> >
> > [1] https://arrow.apache.org/docs/developers/overview.html
> > [2] https://arrow.apache.org/docs/developers/reviewing.html
> > [3] https://github.com/apache/arrow/issues/15232
> > [4] https://arrow.apache.org/docs/format/Glossary.html#term-table
> >
> > On Fri, Jan 6, 2023 at 12:26 PM Kevin Gurney <kgur...@mathworks.com>
> > wrote:
> >
> > > Thank you for starting this discussion, Andrew!
> > >
> > > Fiona, Sreehari, and I thought a bit about this, and I've summarized some
> > > of our thoughts below.
> > >
> > > Continue:
> > >
> > > 1. +1 to Will's suggestion about roadmaps for sub-projects. This is
> > > something that would be helpful for the MATLAB interface, for example. We
> > > would also be interested in the possibility of exploring a MATLAB sync call
> > > if it would be of interest to other community members.
> > >
> > > 2. Continue focusing on building an inclusive developer community. Finish
> > > the work required to rename the master branch to main. Consider running
> > > automated checks on pull requests using a tool like alex [1] to prevent use
> > > of inappropriate language and terminology.
> > >
> > > Start:
> > >
> > > 1. Add more visuals and diagrams to the documentation. It can be pretty
> > > overwhelming for new community members to look at the in-depth Arrow C++
> > > documentation and be able to quickly get a high-level understanding of how
> > > the various data structures (e.g. buffer, array, chunked array, record
> > > batch, table, field, schema, data type, etc.) relate to one another. Having
> > > more visuals with clear labels that show the relationship between these key
> > > concepts would be very helpful. This also applies to other parts of the
> > > documentation, like the CI systems (e.g. crossbow), which have a lot of
> > > moving parts.
> > >
> > > 2. Use pull request templates. This would hopefully make it easier for
> > > both new and existing contributors to describe their changes in a focused
> > > and clear way to others. For example, when making pull requests related to
> > > the MATLAB interface, we've been trying to follow a fairly consistent
> > > pattern for pull request descriptions which includes sections like
> > > "Overview", "Implementation", "Testing", "Future Directions", "Notes", etc.
> > >
> > > Stop:
> > >
> > > 1. +1 to Andrew's point about the reliance on a small number of core
> > > contributors for code reviews. Documenting a process for determining who
> > > should be included on a code review would be helpful.
> > >
> > > [1] https://github.com/get-alex/alex
> > >
> > > ________________________________
> > > From: Dewey Dunnington <de...@voltrondata.com.INVALID>
> > > Sent: Tuesday, January 3, 2023 2:33 PM
> > > To: dev@arrow.apache.org <dev@arrow.apache.org>
> > > Subject: Re: [DISCUSS] State of the Arrow Project 2022
> > >
> > > First, a +1000 on Will's blog post! [1]
> > >
> > > Continue:
> > >
> > > Building tools that benefit users of all languages, with particular kudos
> > > to ADBC for providing an ABI-stable way to write database drivers that can
> > > be used by practitioners in C++, Ruby, Python, Java, Go, and (soon!) R.
> > >
> > > Start:
> > >
> > > I wonder if this is the year that we can find a way to write compute
> > > functions in such a way that separate implementations don't have to exist
> > > for C++, Go, and Rust (and maybe others I don't know about).
> > >
> > > Stop:
> > >
> > > Will's comment that we should stop building data scientist-facing tools
> > > under the Arrow name struck a particular chord with me...the R package is
> > > very much data scientist facing and we have a rather large disjoint between
> > > the technical capacity of our users and the technical capacity required to
> > > contribute to the package (e.g., maintaining a development Arrow C++
> > > install). The types of things we have to do to make RecordBatchReader,
> > > Arrays, Buffer, RecordBatch and Table structures available to R users and
> > > the types of things we have to do to provide an Acero dplyr backend are
> > > vastly different.
> > >
> > > [1] https://www.datawill.io/posts/apache-arrow-2022-reflection/
> > >
> > > On Thu, Dec 29, 2022 at 4:09 PM Jacob Wujciak
> > > <ja...@voltrondata.com.invalid>
> > > wrote:
> > >
> > > > This is a great idea, I will add some thoughts later but just wanted to
> > > > quickly add that the Zulip Chat [1] was recently switched to allow anyone
> > > > to register without the need for an invite link!
> > > > [1]: https://ursalabs.zulipchat.com/
> > > >
> > > >
> > > > On Wed, Dec 28, 2022 at 11:27 PM Will Jones <will.jones...@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks for suggesting this Andrew.
> > > > >
> > > > > I just uploaded a blog post with my thoughts in long form [1]. Here are
> > > > > some suggestions pulled from that:
> > > > >
> > > > > Continue:
> > > > >
> > > > > I hope we will continue prioritizing updating the spec for new array
> > > > > formats. [2] I think this is very important for avoiding fragmentation and
> > > > > may even open opportunities for consolidation in the C++ ecosystem.
> > > > >
> > > > > +1 on additional improvements for documentation, examples, no-invite chats.
> > > > > I am particularly keen on seeing evangelism for our protocols; existing
> > > > > ones like C Data Interface aren't nearly as widely known as they ought to
> > > > > be and I'm excited for new ones like ADBC.
> > > > >
> > > > > Start:
> > > > >
> > > > > Find ways for each subproject to publicly develop a clear roadmap.
> > > > > Otherwise by default these discussions happen in private, either between
> > > > > individual ICs or within corporate environments. Some subprojects, such as
> > > > > Acero, could likely use their own sync call to help facilitate this, even if
> > > > > on a slower cadence than the main biweekly call.
> > > > >
> > > > > Also, other sync calls might consider adapting to the sync call note style
> > > > > used in the Rust projects, where all notes are in one google doc [3] rather
> > > > > than spread across main mailing list threads. That seems like a format that
> > > > > would make it easy for new contributors to catch up on the major focuses of
> > > > > the project.
> > > > >
> > > > > Stop:
> > > > >
> > > > > Don't create end-user (e.g. data scientist) facing tools under the name
> > > > > Arrow; prefer keeping separate brand identities for those tools and keeping
> > > > > arrow libraries as developer-facing libraries.
> > > > >
> > > > > [1] https://www.datawill.io/posts/apache-arrow-2022-reflection/
> > > > > [2] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> > > > > [3] https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#heading=h.qkuvi08gk4qa
> > > > >
> > > > > On Mon, Dec 26, 2022 at 10:12 AM Andrew Lamb <al...@influxdata.com>
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I am very excited and honored to help steer the Arrow Project this year as
> > > > > > Arrow PMC Chair.
> > > > > >
> > > > > > Something Kou suggested, and the PMC thought would be valuable, is to have
> > > > > > a small retrospective about the state of the project and where we want to
> > > > > > take it. I would like to try doing so via a “state of the project” type
> > > > > > discussion on this mailing list, inspired by an example from Apache Calcite
> > > > > > [1].
> > > > > >
> > > > > > I welcome any / all comments on the following topics: What things /
> > > > > > activities, if any, do you think the Apache Arrow Community should:
> > > > > >
> > > > > > 1. Continue
> > > > > > 2. Start
> > > > > > 3. Stop
> > > > > >
> > > > > > My thoughts are below.
> > > > > >
> > > > > > Andrew
> > > > > >
> > > > > > [1] https://lists.apache.org/thread/tx8gw3vxc4kwfzjs6q2gqwgywnsm1zbf
> > > > > >
> > > > > > Continue:
> > > > > >
> > > > > > I hope we can continue to encourage and support community growth, focused
> > > > > > especially on supporting the sub projects and their leadership. I also
> > > > > > would like to continue and grow the outward facing evangelism about the
> > > > > > project with blog posts and presentations.
> > > > > >
> > > > > > Start:
> > > > > >
> > > > > > Lower the barrier to contributors and accepting those contributions even
> > > > > > more, especially for casual contributors. The move to github issues from
> > > > > > JIRA I see as one example of lowering this barrier (by reducing the
> > > > > > required account maintenance). I would love to see additional improvements
> > > > > > in areas like documentation, examples, no-invite-needed chat, etc.
> > > > > >
> > > > > > Stop:
> > > > > >
> > > > > > It would be nice to stop (reduce) the reliance on the relatively small
> > > > > > number of core contributors for code review. I don’t have any particular
> > > > > > insight on how to accomplish this, and suspect we will always have less
> > > > > > review capacity than we would like, but it would be nice to encourage the
> > > > > > growth.
> > > > > >
> > > > >
> > > >
> > >
> >
>
