Hi Ismaël,

Apologies for the double post.
> Avro is quite conservative about new features but we have support for experimental features [2], so backing the format with Arrow could be one. The only issue I see from the Java side is introducing the Arrow dependencies.

I think reducing dependencies is a good goal. Arrow's Java integration with Avro [1] lives in a separate module and hooks into lower-level Avro APIs. If there is interest in experimentation, it would be great to get this library into a better state (and if there is interest in long-term maintainership in the Avro community, I for one would be happy to help facilitate this).

[1] https://github.com/apache/arrow/tree/master/java/adapter/avro/src/main/java/org/apache/arrow

On Mon, Nov 1, 2021 at 7:37 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

>> I am in awe that the 'extra step' of moving from a row to columnar in-memory representation has so little overhead, or maybe we can only discover this with more complex schemas.
>
> I read Jorge's original e-mail too quickly and didn't realize there were links to the benchmarks attached. It looks like the benchmarks have been updated to have a string and an int column (before there was only a string column populated with "foo", did I get that right Jorge?). This raises two points:
>
> 1. The initial test really was more column->column rather than row->column (but again apologies if I misread). I think this is still a good result with regards to memory allocation, and I can imagine that the transposition is not necessarily too expensive.
>
> 2. While Avro->Arrow might yield faster parsing, we should be careful to benchmark how consumers are going to use the APIs we provide. I imagine for DataFusion it would be a net win to have a native Avro->Arrow parser. But for consumers that require row-based iteration, we need to ensure an optimized path from Arrow->native language bindings as well. As an example, my team at work recently benchmarked two scenarios: 1. Parsing to Python dicts per row using fastavro. 2. Parsing to Arrow and then converting to Python dicts. We found that for primitive-type data, #1 was actually faster than #2. I think a large component of this is having to go through Arrow C++'s Scalar objects first, which I'm working on addressing, but it is a consideration for how and what APIs are potentially exposed.
>
> As I said before, I'm in favor of seeing transformers/parsers that go from Avro to Arrow, regardless of any performance wins. Performance wins would certainly be a nice benefit :)
>
> Cheers,
> Micah
>
> On Monday, November 1, 2021, Ismaël Mejía <ieme...@gmail.com> wrote:
>
>> +d...@avro.apache.org
>>
>> Hello,
>>
>> Adding dev@avro for awareness.
>>
>> Thanks Jorge for exploring/reporting this. This is an exciting development. I am not aware of any work on the Avro side on optimizations of the in-memory representation, so any improvements there could be great. (The comment by Micah about boxing for Java is definitely one, and there could be more.) I am in awe that the 'extra step' of moving from a row to columnar in-memory representation has so little overhead, or maybe we can only discover this with more complex schemas.
>>
>> The Java implementation serializes to an array of Objects [1] (like Python). Any needed changes to support a different in-memory representation should be reasonably easy to plug in; this should be an internal detail that hopefully does not leak through the user APIs.
>> Avro is quite conservative about new features but we have support for experimental features [2], so backing the format with Arrow could be one. The only issue I see from the Java side is introducing the Arrow dependencies. Avro has fought a long battle to get rid of most of its dependencies to simplify downstream use.
>>
>> For Rust, since the Rust APIs are not yet considered stable and dependencies could be less of an issue, I suppose we have 'carte blanche' to back it internally with Arrow, especially if it brings performance advantages.
>>
>> There are some benchmarks of a Python version backed by the Rust implementation that are faster than fastavro [3], so we could be onto something. Note that the Apache Python version is really slow because it is pure Python, but having a version backed by the Rust one (and the Arrow in-memory improvements) could be a nice project, especially if improved by Arrow.
>>
>> Ismaël
>>
>> [1] https://github.com/apache/avro/blob/a1fce29d9675b4dd95dfee9db32cc505d0b2227c/lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java#L223
>> [2] https://cwiki.apache.org/confluence/display/AVRO/Experimental+features+in+Avro
>> [3] https://ep2018.europython.eu/media/conference/slides/how-to-write-rust-instead-of-c-and-get-away-with-it-yes-its-a-python-talk.pdf
>>
>> On Mon, Nov 1, 2021 at 3:36 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >
>> > Hi Jorge,
>> >
>> >> The results are a bit surprising: reading 2^20 rows of 3 byte strings is ~6x faster than the official Avro Rust implementation and ~20x faster vs "fastavro"
>> >
>> > This sentence is a little bit hard to parse. Is a row made of 3 strings, or of 1 string consisting of 3 bytes? Was the example hard-coded? A lot of the complexity of parsing Avro is the schema evolution rules; I haven't looked at whether the canonical implementations do any optimization for the happy case when the reader and writer schemas are the same.
>> >
>> > There is a "Java Avro -> Arrow" implementation checked in, but it is somewhat broken today (I filed an issue on this a while ago); it delegates parsing to/from the Avro Java library. I also think there might be faster implementations than the canonical ones (I seem to recall a JIT version for Java, for example, and fastavro is another). For both Java and Python I'd imagine there would be some decent speed improvements simply by avoiding the "boxing" task of moving language primitive types to native memory.
>> >
>> > I was planning (and still might get to it sometime in 2022) to have a C++ parser for Avro. Wes cross-posted this to the Avro mailing list when I thought I had time to work on it a couple of years ago, and I don't recall any response to it. The Rust Avro library I believe was also just recently adopted/donated into the Apache Avro project.
>> >
>> > Avro seems to be pretty common, so having the ability to convert to and from it is, I think, generally valuable.
>> >
>> > Cheers,
>> > Micah
>> >
>> > On Sun, Oct 31, 2021 at 12:26 PM Daniël Heres <danielhe...@gmail.com> wrote:
>> >>
>> >> Rust allows you to easily swap the global allocator to e.g. mimalloc or snmalloc, even without the library supporting a change of allocator. In my experience this indeed helps with allocation-heavy code (I have seen changes of up to 30%).
>> >>
>> >> Best regards,
>> >> Daniël
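Inline note on the allocator suggestion above: in Rust the swap really is just a couple of lines. A minimal sketch, assuming the mimalloc crate has been added to the benchmark's Cargo.toml (the version below is only a placeholder):

// Cargo.toml (assumed): mimalloc = "0.1"
use mimalloc::MiMalloc;

// Route every heap allocation in the binary, including those made inside
// the Avro and arrow2 readers, through mimalloc instead of the system allocator.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // run the same Avro read benchmark here; only the allocator changes
}

Swapping the allocator does not reduce how many allocations the row-based readers perform, only the cost of each one, so it should help separate allocator overhead from the allocation-count effect discussed further down.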
>> >> On Sun, Oct 31, 2021, 18:15 Adam Lippai <a...@rigo.sk> wrote:
>> >>
>> >> > Hi Jorge,
>> >> >
>> >> > Just an idea: do the Avro libs support different allocators? Maybe using a different one (e.g. mimalloc) would yield more similar results by working around the fragmentation you described.
>> >> >
>> >> > This wouldn't change the fact that they are relatively slow; however, it could allow a better apples-to-apples comparison and thus better CPU profiling and understanding of the nuances.
>> >> >
>> >> > Best regards,
>> >> > Adam Lippai
>> >> >
>> >> > On Sun, Oct 31, 2021, 17:42 Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
>> >> >
>> >> > > Hi,
>> >> > >
>> >> > > I am reporting back a conclusion that I recently arrived at when adding support for reading Avro to Arrow.
>> >> > >
>> >> > > Avro is a storage format that does not have an associated in-memory format. In Rust, the official implementation deserializes to an enum, in Python to a vector of Objects, and I suspect in Java to an equivalent vector of objects. The important aspect is that all of them use fragmented memory regions (as opposed to what we do with e.g. one uint8 buffer for StringArray).
>> >> > >
>> >> > > I benchmarked reading to Arrow vs reading via the official Avro implementations. The results are a bit surprising: reading 2^20 rows of 3 byte strings is ~6x faster than the official Avro Rust implementation and ~20x faster vs "fastavro", a C implementation with bindings for Python (pip install fastavro), all with a different slope (see graph below or the numbers and code used here [1]).
>> >> > >
>> >> > > [image: avro_read.png]
>> >> > >
>> >> > > I found this a bit surprising because we need to read row by row and perform a transpose of the data (from rows to columns), which is usually expensive. Furthermore, reading strings can't be optimized that much after all.
>> >> > >
>> >> > > To investigate the root cause, I drilled down into the flamegraphs for both the official Avro Rust implementation and the arrow2 implementation: the majority of the time in the Avro implementation is spent allocating individual strings (to build the [str] equivalents); the majority of the time in arrow2 is equally divided between zigzag decoding (to get the length of the item), reallocs, and utf8 validation.
>> >> > >
>> >> > > My hypothesis is that the difference in performance is unrelated to a particular implementation of Arrow or Avro, but due to the general concept of reading to [str] vs Arrow. Specifically, the item-by-item allocation strategy is far worse than what we do in Arrow with a single region which we reallocate from time to time with exponential growth. On some architectures we even benefit from the __memmove_avx_unaligned_erms instruction that makes it even cheaper to reallocate.
>> >> > >
>> >> > > Has anyone else performed such benchmarks or played with Avro -> Arrow and found supporting or opposing findings to this hypothesis?
>> >> > >
>> >> > > If this hypothesis holds (e.g. with a similar result against the Java implementation of Avro), it IMO puts Arrow as a strong candidate for the default format for Avro implementations to deserialize into when using it in-memory, which could benefit both projects?
>> >> > >
>> >> > > Best,
>> >> > > Jorge
>> >> > >
>> >> > > [1] https://github.com/DataEngineeringLabs/arrow2-benches
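To make Jorge's hypothesis concrete for anyone who hasn't looked at the benchmark code [1], here is a minimal, self-contained Rust sketch of the two allocation patterns being compared. It is illustrative only (plain std, not the avro-rs or arrow2 code); the function names are made up for the example:

// Toy comparison of the two layouts for 2^20 short strings:
// one String per row vs one contiguous values buffer plus offsets,
// which is the Arrow-style layout for variable-length strings.

fn read_as_rows(n: usize) -> Vec<String> {
    // One heap allocation per value, like the [str]-equivalent representations.
    (0..n).map(|_| "foo".to_string()).collect()
}

fn read_as_columns(n: usize) -> (Vec<u8>, Vec<i32>) {
    // A single values buffer and an offsets buffer; both grow geometrically,
    // so allocations are amortized over many values.
    let mut values = Vec::<u8>::new();
    let mut offsets = Vec::<i32>::with_capacity(n + 1);
    offsets.push(0);
    for _ in 0..n {
        values.extend_from_slice(b"foo");
        offsets.push(values.len() as i32);
    }
    (values, offsets)
}

fn main() {
    let n = 1 << 20;
    let rows = read_as_rows(n);
    let (values, offsets) = read_as_columns(n);
    assert_eq!(rows.len(), n);
    assert_eq!(offsets.len(), n + 1);
    assert_eq!(values.len(), 3 * n);
}

The row variant performs on the order of 2^20 small allocations, while the columnar variant only pays for the amortized reallocations of two buffers, which is the effect the flamegraphs above point at.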