Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Wes McKinney Sat, 10 Jul 2021 04:26:05 -0700

The process for updating the website is described on

https://incubator.apache.org/guides/website.html


It looks like you need to add the new entries to the index.xml file
and then trigger a website build (which should be triggered by changes
to SVN, but if not you can trigger one manually through Jenkins).

After the new IP clearance pages are visible you should send an IP
clearance lazy consensus vote to [email protected] like

https://lists.apache.org/thread.html/r319b85f0f24f9b0529865387ccfe1b2a00a16f394a48144ba25c3225%40%3Cgeneral.incubator.apache.org%3E

On Sat, Jul 10, 2021 at 7:48 AM Jorge Cardoso Leitão
<[email protected]> wrote:
>
> Thanks a lot Wes,
>
> I am not sure how to proceed from here:
>
> 1. How do we generate the HTML from the XML? I.e.
> https://incubator.apache.org/ip-clearance/arrow-rust-ballista.html
> 2. How do I trigger the process to start? Can I just email the
> incubator with the proposal?
>
> Best,
> Jorge
>
>
>
> On Mon, Jul 5, 2021 at 10:38 AM Wes McKinney <[email protected]> wrote:
>
> > Great, thanks for the update and pushing this forward. Let us know if
> > you need help with anything.
> >
> > On Sun, Jul 4, 2021 at 8:26 PM Jorge Cardoso Leitão
> > <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > Wes and Neils,
> > >
> > > > Thank you for your feedback and offer. I have created the two .xml
> > > > reports:
> > > >
> > > > http://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/arrow-rust-experimental-arrow.xml
> > > >
> > > > http://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/arrow-rust-experimental-parquet.xml
> > >
> > > > I based them on the report for Ballista. I also requested, on the PRs
> > > > [1,2], clarification of each contributor's contributions to each.
> > >
> > > Best,
> > > Jorge
> > >
> > > [1] https://github.com/apache/arrow-experimental-rs-arrow2/pull/1
> > > [2] https://github.com/apache/arrow-experimental-rs-parquet2/pull/1
> > >
> > >
> > >
> > > On Mon, Jun 7, 2021 at 11:55 PM Wes McKinney <[email protected]> wrote:
> > >
> > > > On Sun, Jun 6, 2021 at 1:47 AM Jorge Cardoso Leitão
> > > > <[email protected]> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > > Thanks a lot for your feedback. I agree with all the arguments put
> > > > > > forward, including Andrew's point about the large change.
> > > > > >
> > > > > > I tried a gradual approach 4 months ago, but it was really difficult
> > > > > > and I gave up. I estimate that the work involved is half the work of
> > > > > > writing parquet2 and arrow2 in the first place. The internal
> > > > > > dependency on ArrayData (the main culprit of the unsafe code) in
> > > > > > arrow-rs is so prevalent that all core components need to be
> > > > > > rewritten from scratch (IPC, FFI, IO, array/transform/*, compute,
> > > > > > SIMD). I personally do not have the motivation to do it, though.
> > > > >
> > > > > > Jed, the public API changes are small for end users. A typical
> > > > > > migration is [1]. I agree that we can further reduce the change-set
> > > > > > by keeping legacy interfaces available.
> > > > >
> > > > > Andy, on my machine, the current benchmarks on query 1 yield:
> > > > >
> > > > > type, master (ms), PR [2] for arrow2+parquet2 (ms)
> > > > > memory (-m): 332.9, 239.6
> > > > > load (the initial time in -m with --format parquet): 5286.0, 3043.0
> > > > > parquet format: 1316.1, 930.7
> > > > > tbl format: 5297.3, 5383.1
> > > > >
> > > > > > i.e. I am observing some improvements. Queries with joins are still
> > > > > > slower. The pruning of parquet groups and pages based on stats is
> > > > > > not there yet; I am working on it.
> > > > >
> > > > > > I agree that this should go through IP clearance. I will start this
> > > > > > process. My thinking would be to create two empty repos on apache/*,
> > > > > > and create 2 PRs from the main branches of each of my repos to those
> > > > > > repos, and only merge them once IP is cleared. Would that be a
> > > > > > reasonable process, Wes?
> > > >
> > > > This sounds plenty fine to me — I'm happy to assist with the IP
> > > > clearance process having done it several times in the past. I don't
> > > > have an opinion about the names, but having experimental- in the name
> > > > sounds in line with the previous discussion we had about this.
> > > >
> > > > > Names: arrow-experimental-rs2 and arrow-experimental-rs-parquet2, or?
> > > > >
> > > > > Best,
> > > > > Jorge
> > > > >
> > > > > [1]
> > > > >
> > > >
> > https://github.com/apache/arrow-datafusion/pull/68/files#diff-2ec0d66fd16c73ff72a23d40186944591e040507c731228ad70b4e168e2a4660
> > > > > [2] https://github.com/apache/arrow-datafusion/pull/68
> > > > >
> > > > >
> > > > > > On Fri, May 28, 2021 at 5:22 AM Josh Taylor <[email protected]> wrote:
> > > > >
> > > > > > > I played around with it. For my use case I really like the new way
> > > > > > > of writing CSVs; it's much more obvious. I love the
> > > > > > > `read_stream_metadata` function as well.
> > > > > > >
> > > > > > > I'm seeing a very slight speed improvement (~8ms) on my end. I read
> > > > > > > a bunch of files in a directory and spit out a CSV, so the
> > > > > > > bottleneck is the parsing of lots of files, but it's pretty quick
> > > > > > > per file.
> > > > > >
> > > > > > old:
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_0 120224 bytes took 1ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_1 123144 bytes took 1ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_10 17127928 bytes took 159ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_11 17127144 bytes took 160ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_12 17130352 bytes took 158ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_13 17128544 bytes took 158ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_14 17128664 bytes took 158ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_15 17128328 bytes took 158ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_16 17129288 bytes took 158ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_17 17131056 bytes took 158ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_18 17130344 bytes took 158ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_19 17128432 bytes took 160ms
> > > > > >
> > > > > > new:
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_0 120224 bytes took 1ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_1 123144 bytes took 1ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_10 17127928 bytes took 157ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_11 17127144 bytes took 152ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_12 17130352 bytes took 154ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_13 17128544 bytes took 153ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_14 17128664 bytes took 154ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_15 17128328 bytes took 153ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_16 17129288 bytes took 152ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_17 17131056 bytes took 153ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_18 17130344 bytes took 155ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_19 17128432 bytes took 153ms
> > > > > >
> > > > > > > I'm going to chunk the dirs to speed up the reads and throw it into
> > > > > > > a par iter.
> > > > > >
> > > > > > > On Fri, 28 May 2021 at 09:09, Josh Taylor <[email protected]> wrote:
> > > > > >
> > > > > > > Hi!
> > > > > > >
> > > > > > > > I've been using arrow/arrow-rs for a while now. My use case is to
> > > > > > > > parse Arrow streaming files and convert them into CSV.
> > > > > > > >
> > > > > > > > Rust has been an absolutely fantastic tool for this. The
> > > > > > > > performance is outstanding and I have had no issues using it for
> > > > > > > > my use case.
> > > > > > > >
> > > > > > > > I would be happy to test out the branch and let you know what the
> > > > > > > > performance is like, as I was going to improve the current
> > > > > > > > implementation that I have for the CSV writer, which takes a while
> > > > > > > > for bigger datasets (multi-GB).
> > > > > > >
> > > > > > > Josh
> > > > > > >
> > > > > > >
> > > > > > > > On Thu, 27 May 2021 at 22:49, Jed Brown <[email protected]> wrote:
> > > > > > >
> > > > > > >> Andy Grove <[email protected]> writes:
> > > > > > >> >
> > > > > > > >> > Looking at this purely from the DataFusion/Ballista point of view,
> > > > > > > >> > what I would be interested in is having a branch of DF that uses
> > > > > > > >> > arrow2; once that branch has all tests passing and can run queries
> > > > > > > >> > with performance that is at least as good as the original arrow
> > > > > > > >> > crate, we cut over.
> > > > > > > >> >
> > > > > > > >> > However, for developers using the arrow APIs directly, I don't see
> > > > > > > >> > an easy path. We either try and gradually PR the changes in (which
> > > > > > > >> > seems really hard given that there are significant changes to APIs
> > > > > > > >> > and internal data structures) or we port some portion of the
> > > > > > > >> > existing tests over to arrow2 and then make that the official crate
> > > > > > > >> > once all tests pass.
> > > > > > >>
> > > > > > > >> How feasible would it be to make a legacy module in arrow2 that would
> > > > > > > >> enable (some large subset of) existing arrow users to try arrow2
> > > > > > > >> after adjusting their use statements? (That is, implement the
> > > > > > > >> public-facing legacy interfaces in terms of arrow2's new, safe
> > > > > > > >> interface.) This would make it easier to test with
> > > > > > > >> DataFusion/Ballista and external users of the current arrow crate,
> > > > > > > >> then cut over and let those packages update incrementally from
> > > > > > > >> legacy to modern arrow2.
> > > > > > > >>
> > > > > > > >> I think it would be okay to tolerate some performance degradation
> > > > > > > >> when working through these legacy interfaces, so long as there was
> > > > > > > >> confidence that modernizing the callers would recover the
> > > > > > > >> performance (as tests have been showing).
> > > > > > >>
> > > > > > >
> > > > > >
> > > >
> >
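
For readers of the archive, Jed's suggestion above (a legacy module whose
public interface is implemented in terms of the new, safe one) could look
roughly like the sketch below. Everything in it is hypothetical: the `modern`
and `legacy` modules, the `Int32Array` types, and all method names are
stand-ins invented for illustration and are not the arrow or arrow2 APIs.

```rust
/// Hypothetical stand-in for a new, safe array type (NOT arrow2's real API).
mod modern {
    #[derive(Debug, Clone, PartialEq)]
    pub struct Int32Array {
        values: Vec<i32>,
        validity: Vec<bool>,
    }

    impl Int32Array {
        /// Build an array from optional values; `None` marks a null slot.
        pub fn from_options(items: &[Option<i32>]) -> Self {
            Self {
                values: items.iter().map(|&v| v.unwrap_or(0)).collect(),
                validity: items.iter().map(|v| v.is_some()).collect(),
            }
        }

        pub fn len(&self) -> usize {
            self.values.len()
        }

        /// Safe accessor: `None` for null slots and for out-of-bounds indices.
        pub fn get(&self, i: usize) -> Option<i32> {
            match self.validity.get(i) {
                Some(true) => Some(self.values[i]),
                _ => None,
            }
        }
    }
}

/// Hypothetical legacy facade: old-style method names, implemented purely by
/// delegating to the safe `modern` type, so callers only adjust `use` paths.
mod legacy {
    use super::modern;

    pub struct Int32Array(modern::Int32Array);

    impl Int32Array {
        pub fn from(items: Vec<Option<i32>>) -> Self {
            Self(modern::Int32Array::from_options(&items))
        }

        pub fn len(&self) -> usize {
            self.0.len()
        }

        /// True for null slots (and, in this sketch, for out-of-bounds indices).
        pub fn is_null(&self, i: usize) -> bool {
            self.0.get(i).is_none()
        }

        /// Old-style accessor: panics when out of bounds, returns 0 for nulls.
        pub fn value(&self, i: usize) -> i32 {
            assert!(i < self.0.len(), "index out of bounds");
            self.0.get(i).unwrap_or(0)
        }

        /// Escape hatch for callers migrating incrementally to the new type.
        pub fn into_modern(self) -> modern::Int32Array {
            self.0
        }
    }
}
```

The facade holds no unsafe code of its own; any extra cost comes from the
delegation, which matches the trade-off Jed describes of tolerating some
overhead until callers move to the modern interface.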
