> Could you expand on what exactly you mean by this?
>
> Still a bit blurry on the best-practices behind sending
> the Arrow response in Flight and seems like an important point.

My understanding is that VectorSchemaRoot was designed to act as a buffer
in various parts of a data processing pipeline: roughly one per processing
node, with data flowing from one VSR to another. This is why
Flight/VectorLoader/VectorUnloader all take a single Root and load it on
each "next()" method call. I believe the intent is for these types of
pipelines to minimize garbage collection/memory churn (something hard to
do with bags of Java objects, but sometimes convenience trumps
performance).

> Where would be the best place to post this?
>
> I was thinking about GitHub issues but I am GitHub-centric,
> not sure if JIRA or mailing list would be better.

> FWIW, I filed an RFC issue here, along with a prototype implementation and
> sample usage + console output code:
> https://github.com/apache/arrow/issues/12618

Different languages have different preferences. For Java, generally for
early design something like a Google Doc to allow for initial comments is
useful. Since this is already a GitHub issue I'll try to take a closer
look. A couple of things I noticed just browsing:

1. It looks like you use String.getBytes() to extract String values into
   Arrow. This is brittle because the no-argument form depends on the
   platform default charset; we should explicitly pass UTF-8 encoding.
2. We haven't established an official policy, but we have been supporting
   LTS versions in Java that aren't EOL, so I don't think we should be
   using Java 14 features to implement this.
3. Having a complete mapping of expected Java Type -> Arrow type might
   reduce churn on code reviews (and is something people are likely
   fairly opinionated about). Along these lines, would you expect this to
   be extensible to allow "plugin" conversions?

> we are currently working to hide some of the lower level gRPC details,
> which may not be so cumbersome in other language implementations.

I think a decent amount of effort has been put into abstracting gRPC (and
maybe even protobuf) to support other transports. I think both Yibo and
David have been prototyping on this to support things like UCX (some
discussion has happened on the ML).

Cheers,
Micah

On Sun, Mar 13, 2022 at 11:46 PM Andrew Lamb <al...@influxdata.com> wrote:

> It may be only tangentially related but as the Rust implementation works
> on arrow Flight (e.g. [1]) we are also working to make the API easier to
> work with. In the Rust case, however, we are currently working to hide some
> of the lower level gRPC details, which may not be so cumbersome in other
> language implementations.
>
> Andrew
>
> [1] https://github.com/apache/arrow-rs/pull/1386
>
> On Sun, Mar 13, 2022 at 7:15 PM Gavin Ray <ray.gavi...@gmail.com> wrote:
>
>> FWIW, I filed an RFC issue here, along with a prototype implementation and
>> sample usage + console output code:
>>
>> https://github.com/apache/arrow/issues/12618
>>
>> On Sun, Mar 13, 2022 at 10:43 AM Gavin Ray <ray.gavi...@gmail.com> wrote:
>>
>> >> Generally, the preferred pattern is one VectorSchemaRoot that
>> >> gets reloaded each time. So an API like
>> >> "df.loadVectorSchemaRoot(root)" probably makes more sense but we
>> >> can iterate on this.
>> >
>> > Could you expand on what exactly you mean by this?
>> >
>> > Still a bit blurry on the best-practices behind sending
>> > the Arrow response in Flight and seems like an important point.
>> >
>> >> ... creating a new contrib module that maps
>> >> from java objects (just like there are JDBC and Avro ones) seems
>> >> worthwhile. If you are interested in contributing something like
>> >> this I think a short design doc would be worth-while.
>> >
>> > Where would be the best place to post this?
>> >
>> > I was thinking about GitHub issues but I am GitHub-centric,
>> > not sure if JIRA or mailing list would be better.
>> >
>> > Thanks, Micah!
>> >
>> > On Sun, Mar 13, 2022 at 12:46 AM Micah Kornfield <emkornfi...@gmail.com>
>> > wrote:
>> >
>> >> Hi Gavin,
>> >>
>> >> > Just curious whether there is any interest/intention of possibly
>> >> > making a higher level API around the basic FlightSQL one?
>> >>
>> >> IIUC, I don't think this is an issue with Flight but one with generic
>> >> conversion between data into Arrow. I don't think anyone is actively
>> >> working on something like this, but creating a new contrib module that
>> >> maps from java objects (just like there are JDBC and Avro ones) seems
>> >> worthwhile. If you are interested in contributing something like this
>> >> I think a short design doc would be worth-while.
>> >>
>> >> > VectorSchemaRoot root = df.toVectorSchemaRoot();
>> >> > listener.setVectorSchemaRoot(root);
>> >> > listener.sendVectorSchemaRootContents();
>> >>
>> >> A small nit. Generally, the preferred pattern is one VectorSchemaRoot
>> >> that gets reloaded each time. So an API like
>> >> "df.loadVectorSchemaRoot(root)" probably makes more sense but we can
>> >> iterate on this. This wasn't commonly understood when some of the
>> >> other contrib modules were developed.
>> >>
>> >> Cheers,
>> >> Micah
>> >>
>> >> On Sat, Mar 12, 2022 at 12:15 PM Gavin Ray <ray.gavi...@gmail.com> wrote:
>> >>
>> >> > While trying to implement and introduce the idea of adopting
>> >> > FlightSQL, the largest challenge was the API itself
>> >> >
>> >> > I know it's meant to be low-level. But I found that most of the
>> >> > development time was in code to convert to/from
>> >> > row-based data (IE Map<String, Object>) and Java types, and columnar
>> >> > data + Arrow types.
>> >> >
>> >> > I'm likely in the minority position here -- I know that Arrow and
>> >> > FlightSQL users are largely looking at transferring large volumes
>> >> > of data and servicing OLAP-type workloads
>> >> > But the thing that excites me most about FlightSQL, isn't its
>> >> > performance (always nice to have), but that it's a
>> >> > language-agnostic standard for data access.
>> >> >
>> >> > That has broad implications -- for all kinds of data-access
>> >> > workloads and business usecases.
>> >> >
>> >> > The challenge is that in trying to advocate for it, when presenting a
>> >> > proof-of-concept,
>> >> > rather than what a developer might expect to see, something like:
>> >> >
>> >> > // FlightSQL handler code
>> >> > List<Map<String, Object>> results = ....;
>> >> > results.add(Map.of("id", 1, "name", "Person 1"));
>> >> > return results;
>> >> >
>> >> > A significant portion of the code is in Arrow-specific implementation
>> >> > details:
>> >> > creating a VectorSchemaRoot, FieldVector, de-serializing the results
>> >> > on the client, etc.
>> >> >
>> >> > Just curious whether there is any interest/intention of possibly
>> >> > making a higher level API around the basic FlightSQL one?
>> >> > Maybe something closer to the traditional notion of a row-based
>> >> > "DataFrame" or "Table", like:
>> >> >
>> >> > DataFrame df = new DataFrame();
>> >> > df.addColumn("id", ArrowTypes.Int);
>> >> > df.addColumn("name", ArrowTypes.VarChar);
>> >> > df.addRow(Map.of("id", 1, "name", "Person 1"));
>> >> > VectorSchemaRoot root = df.toVectorSchemaRoot();
>> >> > listener.setVectorSchemaRoot(root);
>> >> > listener.sendVectorSchemaRootContents();
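
For readers of the archive, the two points from the reply at the top of this thread (one buffer that gets reloaded for each batch, and passing UTF-8 explicitly when turning Strings into bytes) can be illustrated with a minimal plain-Java sketch. It deliberately does not use the Arrow API: `Batch` is a hypothetical stand-in for a VectorSchemaRoot, `row` is an illustrative helper, and the `id`/`name` columns are borrowed from the example in the thread. It assumes Java 9+ for `List.of` and `Map.of`.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

public class ReusableBatchDemo {

    /** Hypothetical stand-in for a VectorSchemaRoot: column arrays allocated once. */
    public static final class Batch {
        private final int[] ids;
        private final byte[][] names;
        private int rowCount;

        public Batch(int capacity) {
            this.ids = new int[capacity];
            this.names = new byte[capacity][];
        }

        /** Pivots row-based data into the existing columns, overwriting the previous batch. */
        public void load(List<Map<String, Object>> rows) {
            if (rows.size() > ids.length) {
                throw new IllegalArgumentException("batch larger than capacity");
            }
            rowCount = rows.size();
            for (int i = 0; i < rowCount; i++) {
                Map<String, Object> row = rows.get(i);
                ids[i] = (Integer) row.get("id");
                // Explicit charset: the no-argument getBytes() uses the platform default.
                // Note: this sketch still allocates one byte[] per string value.
                names[i] = ((String) row.get("name")).getBytes(StandardCharsets.UTF_8);
            }
        }

        public int rowCount() { return rowCount; }
        public int id(int row) { return ids[row]; }
        public String name(int row) { return new String(names[row], StandardCharsets.UTF_8); }
        public int nameByteLength(int row) { return names[row].length; }
    }

    /** Illustrative helper that builds one row in the Map<String, Object> shape from the thread. */
    public static Map<String, Object> row(int id, String name) {
        return Map.of("id", id, "name", name);
    }

    public static void main(String[] args) {
        Batch batch = new Batch(1024); // created once and reused for every page of results
        List<List<Map<String, Object>>> pages = List.of(
                List.of(row(1, "Person 1"), row(2, "Person 2")),
                List.of(row(3, "Zo\u00eb")));
        for (List<Map<String, Object>> page : pages) {
            batch.load(page); // reload the same buffers instead of building a new object
            for (int i = 0; i < batch.rowCount(); i++) {
                System.out.println(batch.id(i) + " " + batch.name(i)
                        + " (" + batch.nameByteLength(i) + " bytes)");
            }
        }
    }
}
```

In a real handler the `load` step would presumably write into Arrow vectors owned by a single root, which is what an API shaped like `df.loadVectorSchemaRoot(root)` suggests; the sketch only shows the ownership pattern, not Arrow's memory management.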