I played around with it. For my use case I really like the new way of
writing CSVs; it's much more obvious. I love the `read_stream_metadata`
function as well.

I'm seeing a very slight speed improvement (~8ms) on my end. My use case
reads a bunch of files in a directory and spits out a CSV, so the
bottleneck is parsing lots of files, though each individual file is
pretty quick.
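
For context, the per-file loop is roughly the following. This is only a
sketch: the CSV output step is elided, and the exact signatures of
`read_stream_metadata` / `StreamReader` depend on the arrow2 version.

use std::fs::File;
use std::io::BufReader;
use std::time::Instant;

use arrow2::io::ipc::read::{read_stream_metadata, StreamReader};

// Read one Arrow stream file, decode every batch, and print a timing line.
fn dump_file(path: &std::path::Path) -> Result<(), Box<dyn std::error::Error>> {
    let bytes = std::fs::metadata(path)?.len();
    let start = Instant::now();

    let mut reader = BufReader::new(File::open(path)?);
    let metadata = read_stream_metadata(&mut reader)?;
    let stream = StreamReader::new(reader, metadata);

    for batch in stream {
        let _batch = batch?; // CSV serialization of each batch would go here
    }

    println!(
        "{} {} bytes took {}ms",
        path.display(),
        bytes,
        start.elapsed().as_millis()
    );
    Ok(())
}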

old:
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_0 120224 bytes took 1ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_1 123144 bytes took 1ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_10 17127928 bytes took 159ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_11 17127144 bytes took 160ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_12 17130352 bytes took 158ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_13 17128544 bytes took 158ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_14 17128664 bytes took 158ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_15 17128328 bytes took 158ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_16 17129288 bytes took 158ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_17 17131056 bytes took 158ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_18 17130344 bytes took 158ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_19 17128432 bytes took 160ms

new:
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_0 120224 bytes took 1ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_1 123144 bytes took 1ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_10 17127928 bytes took 157ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_11 17127144 bytes took 152ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_12 17130352 bytes took 154ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_13 17128544 bytes took 153ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_14 17128664 bytes took 154ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_15 17128328 bytes took 153ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_16 17129288 bytes took 152ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_17 17131056 bytes took 153ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_18 17130344 bytes took 155ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_19 17128432 bytes took 153ms

I'm going to chunk the dirs to speed up the reads and throw it into a par
iter.
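
Roughly what I have in mind for that pass, as a sketch (rayon assumed;
`dump_file` is the per-file routine sketched above; the chunk size is an
arbitrary placeholder):

use rayon::prelude::*;

// Walk the staging dir and convert its files in parallel, one slice of
// paths per rayon task.
fn convert_dir(dir: &std::path::Path) -> std::io::Result<()> {
    let mut paths: Vec<_> = std::fs::read_dir(dir)?
        .filter_map(|entry| entry.ok().map(|e| e.path()))
        .collect();
    paths.sort();

    paths.par_chunks(64).for_each(|chunk| {
        for path in chunk {
            if let Err(err) = dump_file(path) {
                eprintln!("failed on {}: {}", path.display(), err);
            }
        }
    });

    Ok(())
}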

On Fri, 28 May 2021 at 09:09, Josh Taylor <joshuatayl...@gmail.com> wrote:

> Hi!
>
> I've been using arrow/arrow-rs for a while now, my use case is to parse
> Arrow streaming files and convert them into CSV.
>
> Rust has been an absolute fantastic tool for this, the performance is
> outstanding and I have had no issues using it for my use case.
>
> I would be happy to test out the branch and let you know what the
> performance is like, as I was going to improve the current implementation
> that I have for the CSV writer, as it takes a while for bigger datasets
> (multi-GB).
>
> Josh
>
>
> On Thu, 27 May 2021 at 22:49, Jed Brown <j...@jedbrown.org> wrote:
>
>> Andy Grove <andygrov...@gmail.com> writes:
>> >
>> > Looking at this purely from the DataFusion/Ballista point of view, what
>> > I would be interested in would be having a branch of DF that uses
>> > arrow2 and once that branch has all tests passing and can run queries
>> > with performance that is at least as good as the original arrow crate,
>> > then cut over.
>> >
>> > However, for developers using the arrow APIs directly, I don't see an
>> > easy path. We either try and gradually PR the changes in (which seems
>> > really hard given that there are significant changes to APIs and
>> > internal data structures) or we port some portion of the existing tests
>> > over to arrow2 and then make that the official crate once all tests
>> > pass.
>>
>> How feasible would it be to make a legacy module in arrow2 that would
>> enable (some large subset of) existing arrow users to try arrow2 after
>> adjusting their use statements? (That is, implement the public-facing
>> legacy interfaces in terms of arrow2's new, safe interface.) This would
>> make it easier to test with DataFusion/Ballista and external users of the
>> current arrow crate, then cut over and let those packages update
>> incrementally from legacy to modern arrow2.
>>
>> I think it would be okay to tolerate some performance degradation when
>> working through these legacy interfaces, so long as there was confidence
>> that modernizing the callers would recover the performance (as tests have
>> been showing).
>>
>
