Hi Ismaël,

Apologies for the double post.
> Avro is quite conservative about new features but we have support for experimental features [2], so backing the format with Arrow could be one. The only issue I see from the Java side is introducing the Arrow dependencies.

I think reducing dependencies is a good goal. Arrow's Java integration with Avro [1] lives in a separate module and hooks into lower-level Avro APIs. If there is interest in experimentation, it would be great to get this library into a better state (and if there is interest in long-term maintainership in the Avro community, I for one would be happy to help facilitate this).

[1] https://github.com/apache/arrow/tree/master/java/adapter/avro/src/main/java/org/apache/arrow

On Mon, Nov 1, 2021 at 7:37 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

>> I am in awe that the 'extra step' of moving from a row to columnar in-memory representation has so little overhead, or maybe we can only discover this with more complex schemas.
>
> I read Jorge's original e-mail too quickly and didn't realize there were links to the benchmarks attached. It looks like the benchmarks have been updated to have a string and an int column (before there was only a string column populated with "foo", did I get that right Jorge?). This raises two points:
>
> 1. The initial test really was more column->column rather than row->column (but again apologies if I misread). I think this is still a good result with regards to memory allocation, and I can imagine that the transposition is not necessarily too expensive.
>
> 2. While Avro->Arrow might yield faster parsing, we should be careful to benchmark how consumers are going to use the APIs we provide. I imagine for DataFusion it would be a net win to have a native Avro->Arrow parser. But for consumers that require row-based iteration, we need to ensure an optimized path from Arrow->native language bindings as well. As an example, my team at work recently benchmarked two scenarios: 1. Parsing to Python dicts per row using fastavro. 2. Parsing to Arrow and then converting to Python dicts. We found that for primitive-type data, #1 was actually faster than #2. I think a large component of this is having to go through Arrow C++'s Scalar objects first, which I'm working on addressing, but it is a consideration for how and what APIs are potentially exposed.
>
> As I said before, I'm in favor of seeing transformers/parsers that go from Avro to Arrow, regardless of any performance wins. Performance wins would certainly be a nice benefit :)
>
> Cheers,
> Micah
>
> On Monday, November 1, 2021, Ismaël Mejía <ieme...@gmail.com> wrote:
>
>> +d...@avro.apache.org
>>
>> Hello,
>>
>> Adding dev@avro for awareness.
>>
>> Thanks Jorge for exploring/reporting this. This is an exciting development. I am not aware of any work on the Avro side on optimizations of the in-memory representation, so any improvements there could be great. (The comment by Micah about boxing for Java is definitely one, and there could be more.) I am in awe that the 'extra step' of moving from a row to columnar in-memory representation has so little overhead, or maybe we can only discover this with more complex schemas.
>>
>> The Java implementation serializes to an array of Objects [1] (like Python). Any needed changes to support a different in-memory representation should be reasonably easy to plug in; this should be an internal detail that hopefully does not leak through the user APIs.
>> Avro is quite conservative about new features but we have support for experimental features [2], so backing the format with Arrow could be one. The only issue I see from the Java side is introducing the Arrow dependencies. Avro has fought a long battle to get rid of most of its dependencies to simplify downstream use.
>>
>> For Rust, since the Rust APIs are not yet considered stable and dependencies could be less of an issue, I suppose we have 'carte blanche' to back it internally with Arrow, especially if it brings performance advantages.
>>
>> There are some benchmarks of a Python version backed by the Rust implementation that are faster than fastavro [3], so we could be onto something. Note that the Apache Python version is really slow because it is pure Python, but having a version backed by the Rust one (and the Arrow in-memory improvements) could be a nice project, especially if improved by Arrow.
>>
>> Ismaël
>>
>> [1] https://github.com/apache/avro/blob/a1fce29d9675b4dd95dfee9db32cc505d0b2227c/lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java#L223
>> [2] https://cwiki.apache.org/confluence/display/AVRO/Experimental+features+in+Avro
>> [3] https://ep2018.europython.eu/media/conference/slides/how-to-write-rust-instead-of-c-and-get-away-with-it-yes-its-a-python-talk.pdf
>>
>> On Mon, Nov 1, 2021 at 3:36 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >
>> > Hi Jorge,
>> >
>> >> The results are a bit surprising: reading 2^20 rows of 3 byte strings is ~6x faster than the official Avro Rust implementation and ~20x faster vs "fastavro"
>> >
>> > This sentence is a little bit hard to parse. Is a row made of 3 strings, or of 1 string consisting of 3 bytes? Was the example hard-coded? A lot of the complexity of parsing Avro is the schema evolution rules; I haven't looked at whether the canonical implementations do any optimization for the happy case when the reader and writer schemas are the same.
>> >
>> > There is a "Java Avro -> Arrow" implementation checked in, but it is somewhat broken today (I filed an issue on this a while ago); it delegates parsing to/from the Avro Java library. I also think there might be faster implementations than the canonical ones (I seem to recall a JIT version for Java, for example, and fastavro is another). For both Java and Python I'd imagine there would be some decent speed improvements simply by avoiding the "boxing" task of moving language primitive types to native memory.
>> >
>> > I was planning (and still might get to it sometime in 2022) to have a C++ parser for Avro. Wes cross-posted this to the Avro mailing list when I thought I had time to work on it a couple of years ago, and I don't recall any response to it. The Rust Avro library I believe was also just recently adopted/donated into the Apache Avro project.
>> >
>> > Avro seems to be pretty common, so having the ability to convert to and from it is, I think, generally valuable.
>> >
>> > Cheers,
>> > Micah
>> >
>> > On Sun, Oct 31, 2021 at 12:26 PM Daniël Heres <danielhe...@gmail.com> wrote:
>> >>
>> >> Rust allows you to easily swap the global allocator to e.g. mimalloc or snmalloc, even without the library supporting a change of allocator. In my experience this indeed helps with allocation-heavy code (I have seen changes of up to 30%).
>> >>
>> >> Best regards,
>> >> Daniël
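Inline note on the allocator suggestion above: in Rust the swap really is just a couple of lines. A minimal sketch, assuming the mimalloc crate has been added to the benchmark's Cargo.toml (the version below is only a placeholder):

// Cargo.toml (assumed): mimalloc = "0.1"
use mimalloc::MiMalloc;

// Route every heap allocation in the binary, including those made inside
// the Avro and arrow2 readers, through mimalloc instead of the system allocator.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // run the same Avro read benchmark here; only the allocator changes
}

Swapping the allocator does not reduce how many allocations the row-based readers perform, only the cost of each one, so it should help separate allocator overhead from the allocation-count effect discussed further down.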
>> >> On Sun, Oct 31, 2021, 18:15 Adam Lippai <a...@rigo.sk> wrote:
>> >>
>> >> > Hi Jorge,
>> >> >
>> >> > Just an idea: do the Avro libs support different allocators? Maybe using a different one (e.g. mimalloc) would yield more similar results by working around the fragmentation you described.
>> >> >
>> >> > This wouldn't change the fact that they are relatively slow; however, it could allow a better apples-to-apples comparison and thus better CPU profiling and understanding of the nuances.
>> >> >
>> >> > Best regards,
>> >> > Adam Lippai
>> >> >
>> >> > On Sun, Oct 31, 2021, 17:42 Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
>> >> >
>> >> > > Hi,
>> >> > >
>> >> > > I am reporting back a conclusion that I recently arrived at when adding support for reading Avro to Arrow.
>> >> > >
>> >> > > Avro is a storage format that does not have an associated in-memory format. In Rust, the official implementation deserializes to an enum, in Python to a vector of Objects, and I suspect in Java to an equivalent vector of objects. The important aspect is that all of them use fragmented memory regions (as opposed to what we do with e.g. one uint8 buffer for StringArray).
>> >> > >
>> >> > > I benchmarked reading to Arrow vs reading via the official Avro implementations. The results are a bit surprising: reading 2^20 rows of 3 byte strings is ~6x faster than the official Avro Rust implementation and ~20x faster vs "fastavro", a C implementation with bindings for Python (pip install fastavro), all with a different slope (see graph below or the numbers and code used here [1]).
>> >> > >
>> >> > > [image: avro_read.png]
>> >> > >
>> >> > > I found this a bit surprising because we need to read row by row and perform a transpose of the data (from rows to columns), which is usually expensive. Furthermore, reading strings can't be optimized that much after all.
>> >> > >
>> >> > > To investigate the root cause, I drilled down into the flamegraphs for both the official Avro Rust implementation and the arrow2 implementation: the majority of the time in the Avro implementation is spent allocating individual strings (to build the [str] equivalents); the majority of the time in arrow2 is equally divided between zigzag decoding (to get the length of the item), reallocs, and utf8 validation.
>> >> > >
>> >> > > My hypothesis is that the difference in performance is unrelated to a particular implementation of Arrow or Avro, but due to the general concept of reading to [str] vs Arrow. Specifically, the item-by-item allocation strategy is far worse than what we do in Arrow with a single region which we reallocate from time to time with exponential growth. On some architectures we even benefit from the __memmove_avx_unaligned_erms instruction that makes it even cheaper to reallocate.
>> >> > >
>> >> > > Has anyone else performed such benchmarks or played with Avro -> Arrow and found supporting or opposing findings to this hypothesis?
>> >> > >
>> >> > > If this hypothesis holds (e.g. with a similar result against the Java implementation of Avro), it IMO puts Arrow as a strong candidate for the default format for Avro implementations to deserialize into when using it in-memory, which could benefit both projects?
>> >> > >
>> >> > > Best,
>> >> > > Jorge
>> >> > >
>> >> > > [1] https://github.com/DataEngineeringLabs/arrow2-benches
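To make Jorge's hypothesis concrete for anyone who hasn't looked at the benchmark code [1], here is a minimal, self-contained Rust sketch of the two allocation patterns being compared. It is illustrative only (plain std, not the avro-rs or arrow2 code); the function names are made up for the example:

// Toy comparison of the two layouts for 2^20 short strings:
// one String per row vs one contiguous values buffer plus offsets,
// which is the Arrow-style layout for variable-length strings.

fn read_as_rows(n: usize) -> Vec<String> {
    // One heap allocation per value, like the [str]-equivalent representations.
    (0..n).map(|_| "foo".to_string()).collect()
}

fn read_as_columns(n: usize) -> (Vec<u8>, Vec<i32>) {
    // A single values buffer and an offsets buffer; both grow geometrically,
    // so allocations are amortized over many values.
    let mut values = Vec::<u8>::new();
    let mut offsets = Vec::<i32>::with_capacity(n + 1);
    offsets.push(0);
    for _ in 0..n {
        values.extend_from_slice(b"foo");
        offsets.push(values.len() as i32);
    }
    (values, offsets)
}

fn main() {
    let n = 1 << 20;
    let rows = read_as_rows(n);
    let (values, offsets) = read_as_columns(n);
    assert_eq!(rows.len(), n);
    assert_eq!(offsets.len(), n + 1);
    assert_eq!(values.len(), 3 * n);
}

The row variant performs on the order of 2^20 small allocations, while the columnar variant only pays for the amortized reallocations of two buffers, which is the effect the flamegraphs above point at.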