+1

On Thu, Jun 10, 2021, 23:38 Antoine Pitrou <anto...@python.org> wrote:

> Sounds good enough to me.
>
> On 10/06/2021 at 23:35, Wes McKinney wrote:
> > I hate to reopen this can of worms again, but here is my effort to
> > synthesize feedback:
> >
> > "Apache Arrow is a multi-language toolbox for accelerated data
> > interchange and in-memory processing."
> >
> > On Thu, Jun 10, 2021 at 12:37 PM Dominik Moritz <domor...@apache.org> wrote:
> >>
> >> I thought there were some good suggestions in this thread. @Wes, did you
> >> find a description you liked?
> >>
> >> On May 18, 2021 at 06:24:47, Adam Hooper <a...@adamhooper.com> wrote:
> >>
> >>> Poll question: why did you choose Arrow?
> >>>
> >>> Personally: I researched Arrow because it's a spec for IPC. (My requirement
> >>> was: "wrap computations in a separate process.") I chose Arrow for its
> >>> community and ecosystem -- in other words, because my peers chose it.
> >>>
> >>> I happen to use the compute kernel and Parquet capabilities every day; but
> >>> they did not sway me at all. I would choose Arrow if it were nothing but
> >>> this spec and this community. (I chose HTML, after all.)
> >>>
> >>> I see the *code* as one enormous proof that the *spec* is good, and as a
> >>> collection of examples and best practices.
> >>>
> >>> ... so a great pitch to me would be: "Apache Arrow is a data format and
> >>> toolbox for efficient in-memory processing."
> >>>
> >>> Enjoy life,
> >>> Adam
> >>>
> >>> On Tue, May 18, 2021 at 2:38 AM Aldrin <akmon...@ucsc.edu.invalid> wrote:
> >>>
> >>> "Apache Arrow is a data processing library that also provides a uniform,
> >>> efficient interface for data systems."
> >>>
> >>> This probably still isn't quite right, I imagine the bit about "for data
> >>> systems" needs some addition (maybe "for transport between data systems")?
> >>>
> >>> My primary motivators:
> >>>
> >>> - "A data processing library":
> >>>   - Arrow provides many language bindings, but ultimately they're all
> >>>     part of the same "library ecosystem", which I think is fine to
> >>>     capture in "library"
> >>>   - A main goal of arrow is for processing to be fast, whatever that
> >>>     processing may be
> >>> - "uniform, efficient interface for data systems":
> >>>   - Arrow provides (or tries to) a cohesive ("uniform") interface for
> >>>     data processing (although it has several APIs to do this)
> >>>   - Also, IMO, a motivation for arrow was a format and library to
> >>>     facilitate processing, but that provided functions and
> >>>     interfaces to easily translate into optimized data formats used by
> >>>     disparate data systems (cassandra, hadoop, etc.).
> >>>   - Arrow tries to be transparently zero-copy, which is part of the
> >>>     interface for efficiency
> >>> - Arrow certainly has a data format, but that format is the crux of the
> >>>   interface (IMO). However, it also makes using other formats easy (via
> >>>   filesystem API and parquet reader/writers, etc.). So, focusing on the
> >>>   data format seems unnecessary in such a terse description.
> >>>
> >>> Aldrin Montana
> >>> Computer Science PhD Student
> >>> UC Santa Cruz
> >>>
> >>> On Mon, May 17, 2021 at 5:07 PM Weston Pace <weston.p...@gmail.com> wrote:
> >>>
> >>>> I'd avoid the word "structured" as it is somewhat ill-defined.
> >>>>
> >>>> On Mon, May 17, 2021 at 12:37 PM Mauricio Vargas
> >>>> <mauri...@ursacomputing.com> wrote:
> >>>>>
> >>>>> more marketed:
> >>>>> How about: "Apache Arrow is a format and language-agnostic library
> >>>>> focused on efficient sharing and processing of structured data."
> >>>>>
> >>>>> On Mon, May 17, 2021 at 6:25 PM Micah Kornfield <emkornfi...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> How about: "Apache Arrow is a collection of specifications, cross
> >>>>>> language libraries and applications focused on efficient sharing and
> >>>>>> processing of structured data."
> >>>>>>
> >>>>>> On Mon, May 17, 2021 at 3:06 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >>>>>>
> >>>>>>> On Mon, May 17, 2021 at 4:58 PM Weston Pace <weston.p...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> “Apache Arrow is a format and compute kernel for in-memory data”
> >>>>>>>>
> >>>>>>>> I like this but no one ever knows what "in-memory" means (or they just
> >>>>>>>> think 'data is always in memory'). How about...
> >>>>>>>>
> >>>>>>>> "Apache Arrow is a format and compute kernel for zero-copy processing
> >>>>>>>> and sharing of data."
> >>>>>>>>
> >>>>>>>> or...
> >>>>>>>>
> >>>>>>>> "Apache Arrow is a format and compute kernel for processing and
> >>>>>>>> sharing data without serialization overhead."
> >>>>>>>
> >>>>>>> A few issues with this:
> >>>>>>>
> >>>>>>> * Multiple PL aspect unclear (is a single piece of software, or
> >>>>>>>   multiple pieces of software?)
> >>>>>>> * Development platform aspect unclear
> >>>>>>>
> >>>>>>> I see that some people don't like the word "platform". Some people
> >>>>>>> come to this project and want to find an end-to-end application,
> >>>>>>> rather than a developer toolkit that they can use to build
> >>>>>>> applications. Perhaps we should be more explicit and use
> >>>>>>> "computational development toolkit" instead of "platform".
> >>>>>>>
> >>>>>>>> Although marshalling[1] would probably be a more precise word it is
> >>>>>>>> not as well known.
> >>>>>>>>
> >>>>>>>> [1] https://en.wikipedia.org/wiki/Marshalling_(computer_science)
> >>>>>>>>
> >>>>>>>> On Mon, May 17, 2021 at 9:36 AM Mauricio Vargas
> >>>>>>>> <mauri...@ursacomputing.com> wrote:
> >>>>>>>>>
> >>>>>>>>> a few ideas
> >>>>>>>>>
> >>>>>>>>> github.com/apache/arrow - Apache Arrow is an efficient library for
> >>>>>>>>> big data processing and sharing
> >>>>>>>>>
> >>>>>>>>> github.com/apache/arrow - Apache Arrow is a computational tool for
> >>>>>>>>> processing, storing and sharing large datasets
> >>>>>>>>>
> >>>>>>>>> github.com/apache/arrow - Apache Arrow is a fast and simple library
> >>>>>>>>> for big data analytics
> >>>>>>>>>
> >>>>>>>>> *github.com/apache/arrow - Apache Arrow is a powerful workhorse for
> >>>>>>>>> analytic operations on modern hardware*
> >>>>>>>>>
> >>>>>>>>> On Mon, May 17, 2021 at 3:13 PM Julian Hyde <jhyde.apa...@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Alright, well, whatever it is, it must fit into one breath. If the
> >>>>>>>>>> high-concept pitch is successful, people will stick around for the
> >>>>>>>>>> full pitch.
> >>>>>>>>>>
> >>>>>>>>>> Words such as “platform” and “enable” are noise. You say “platform”,
> >>>>>>>>>> they start to say “what exactly do you mean by platform”, the
> >>>>>>>>>> elevator doors open, and they’re gone.
> >>>>>>>>>>
> >>>>>>>>>> “Apache Arrow is a format and compute kernel for in-memory data”
> >>>>>>>>>>
> >>>>>>>>>>> On May 17, 2021, at 12:03 PM, Eduardo Ponce <edponc...@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> One more suggestion for the bucket:
> >>>>>>>>>>> "Apache Arrow is a computational platform for efficient in-memory
> >>>>>>>>>>> data representation and processing."
> >>>>>>>>>>>
> >>>>>>>>>>> On Mon, May 17, 2021 at 2:49 PM Wes McKinney <wesmck...@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> I think less is better in the description, but unfortunately the
> >>>>>>>>>>>> association of Arrow as being "just a data format" has been
> >>>>>>>>>>>> actively harmful in some ways to community growth. We have a data
> >>>>>>>>>>>> format, yes, but we are also creating a computational platform to
> >>>>>>>>>>>> go hand-in-hand with the data format to make it easier to build
> >>>>>>>>>>>> fast applications that use the data format. So the description
> >>>>>>>>>>>> needs to capture both of these ideas.
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Mon, May 17, 2021 at 12:15 PM Julian Hyde <jhyde.apa...@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I think that the “cross-language development platform for” is
> >>>>>>>>>>>>> noise. (I’m sure that JPEG developers think that JPEG is a
> >>>>>>>>>>>>> “cross-language development platform” too. But it isn’t. It is an
> >>>>>>>>>>>>> image format.)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> “Apache Arrow is a data format for efficient in-memory processing.”
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I’ll note that in marketing speak, we are developing a
> >>>>>>>>>>>>> high-concept pitch [1] here. Every company needs a name, a brand,
> >>>>>>>>>>>>> a high-concept pitch, and a 3- or 4-sentence description. But
> >>>>>>>>>>>>> every Apache project needs these too. It’s worth spending the
> >>>>>>>>>>>>> time on the description, also, and then using them in all the
> >>>>>>>>>>>>> places that we describe Arrow.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Julian
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> [1] https://www.growthink.com/content/whats-your-high-concept-pitch
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On May 17, 2021, at 7:38 AM, Eduardo Ponce <edponc...@gmail.com>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I agree with Nate's and Brian's suggestions, but would like to
> >>>>>>>>>>>>>> add that we can make it a one-liner for more conciseness and
> >>>>>>>>>>>>>> consistency with other Apache projects.
> >>>>>>>>>>>>>> Apologies if it seems I am going around the suggestions loop
> >>>>>>>>>>>>>> again.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> "Apache Arrow is a cross-language development platform enabling
> >>>>>>>>>>>>>> efficient in-memory data processing and transport."
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Mon, May 17, 2021 at 10:11 AM Brian Hulette <bhule...@apache.org>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thank you for bringing this up Dominik. I sampled some of the
> >>>>>>>>>>>>>>> descriptions for other Apache projects I frequent, the ones
> >>>>>>>>>>>>>>> with a meaningful description have a single sentence:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> github.com/apache/spark - Apache Spark - A unified analytics
> >>>>>>>>>>>>>>> engine for large-scale data processing
> >>>>>>>>>>>>>>> github.com/apache/beam - Apache Beam is a unified programming
> >>>>>>>>>>>>>>> model for Batch and Streaming
> >>>>>>>>>>>>>>> github.com/apache/avro - Apache Avro is a data serialization
> >>>>>>>>>>>>>>> system
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Several others (Flink, Hadoop, ...) just have "[Mirror of]
> >>>>>>>>>>>>>>> Apache <name>" as the description.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> +1 for Nate's suggestion "Apache Arrow is a cross-language
> >>>>>>>>>>>>>>> development platform for in-memory data. It enables systems to
> >>>>>>>>>>>>>>> process and transport data more efficiently."
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Mon, May 17, 2021 at 5:23 AM Wes McKinney <wesmck...@gmail.com>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> It's probably best for description to limit mentions of
> >>>>>>>>>>>>>>>> specific features. There are some high level features
> >>>>>>>>>>>>>>>> mentioned in the description now ("computational libraries and
> >>>>>>>>>>>>>>>> zero-copy streaming messaging and interprocess
> >>>>>>>>>>>>>>>> communication"), but now in 2021 since the project has grown
> >>>>>>>>>>>>>>>> so much, it could leave people with a limited view of what
> >>>>>>>>>>>>>>>> they might find here.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Mon, May 17, 2021 at 12:14 AM Mauricio Vargas
> >>>>>>>>>>>>>>>> <mauri...@ursacomputing.com> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> How about
> >>>>>>>>>>>>>>>>> 'Apache Arrow is a cross-language development platform for
> >>>>>>>>>>>>>>>>> in-memory data.
> >>>>>>>>>>>>>>>>> It enables systems to process and transport data efficiently,
> >>>>>>>>>>>>>>>>> providing a simple and fast library for partitioning of large
> >>>>>>>>>>>>>>>>> tables'?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Sorry the delay, long election day
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Sun, May 16, 2021, 2:27 PM Nate Bauernfeind
> >>>>>>>>>>>>>>>>> <natebauernfe...@deephaven.io> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Suggestion: faster -> more efficiently
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> "Apache Arrow is a cross-language development platform for
> >>>>>>>>>>>>>>>>>> in-memory data. It enables systems to process and transport
> >>>>>>>>>>>>>>>>>> data more efficiently."
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Sun, May 16, 2021 at 11:35 AM Wes McKinney
> >>>>>>>>>>>>>>>>>> <wesmck...@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Here's what's there now:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> "Apache Arrow is a cross-language development platform for
> >>>>>>>>>>>>>>>>>>> in-memory data. It specifies a standardized
> >>>>>>>>>>>>>>>>>>> language-independent columnar memory format for flat and
> >>>>>>>>>>>>>>>>>>> hierarchical data, organized for efficient analytic
> >>>>>>>>>>>>>>>>>>> operations on modern hardware. It also provides
> >>>>>>>>>>>>>>>>>>> computational libraries and zero-copy streaming messaging
> >>>>>>>>>>>>>>>>>>> and interprocess communication…"
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> How about something shorter like
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> "Apache Arrow is a cross-language development platform for
> >>>>>>>>>>>>>>>>>>> in-memory data. It enables systems to process and transport
> >>>>>>>>>>>>>>>>>>> data faster."
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Suggestions / refinements from others welcome
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Sat, May 15, 2021 at 9:12 PM Dominik Moritz
> >>>>>>>>>>>>>>>>>>> <domor...@cmu.edu> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Super minor issue but could someone make the description
> >>>>>>>>>>>>>>>>>>>> on GitHub shorter?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> GitHub puts the description into the title of the page and
> >>>>>>>>>>>>>>>>>>>> makes it hard to find it in URL autocomplete.
> >>>
> >>> --
> >>> Adam Hooper
> >>> +1-514-882-9694
> >>> http://adamhooper.com