Re: [JS] Exploring usage of apache arrow at my company for complex table rendering

Diana Clarke Sat, 27 Feb 2021 09:56:34 -0800

Speaking of Arrow's JS implementation, there's one small (+2 −2) JS
pull request in the queue that could use a review.


    ARROW-11706: [JS] Better BigInt compatibility check
    https://issues.apache.org/jira/browse/ARROW-11706
    https://github.com/apache/arrow/pull/9110

Have a great weekend, folks!

--diana

On Fri, Feb 26, 2021 at 7:34 PM Weston Pace <[email protected]> wrote:
>
> I used Arrow for this purpose in the past.  I don't have much to add
> but just a few thoughts off the top of my head...
>
> * The line between data and metadata can be blurry - For most
> measurements we were able to store the "expected distribution" as
> metadata (e.g. this measurement should have an expected value of 10
> +/- 3) and that could be used for drawing limit lines.  For some
> measurements however the common practice in place was to store the
> upper/lower limit as separate columns because they often changed
> depending on the various independent variables.  In that case the same
> "concept" (limit) might be stored in data or metadata.
>
> * Distinction between "data" and a "chart" - For us, we introduced a
> separate representation called the "chart" between the data and the
> rendering layer.  So, using the limit-line example above, if we wanted
> to plot a histogram of some column then we would create a bar chart
> from the column.  This bar chart itself was also an array of numbers
> but, since these arrays were much smaller (one per bin, hard limit to
> bin count in the thousands based on # of pixels in display), and the
> structure was much more deeply nested, we ended up just using JSON for
> charts.  The "limit" metadata belonged to the data and it was
> translated into a vertical line element as part of the chart.
>
> * Processing layer - For us it was too expensive to send the data
> across the Internet for display.  So the conversion from data -> chart
> happened within the datacenter, close to the actual data.  The JS UI was
> simply responsible for chart -> pixels (well, SVG).  It sounds like
> you plan on doing the processing in JS.  This can work, I'm just
> tossing out alternatives to think about.  You can even have a hybrid
> model where some initial filtering happens in the datacenter and then
> chart calculation / rendering happens in JS.
>
> * Expressions for group/split - Arrow expressions / compute are
> starting to become available (and more work is being done on in-arrow
> query engines).  These can be very helpful for things like grouping or
> splitting.  For example, if you want to plot two line charts, one for
> model X and one for model Y then you can define your split using
> expressions.  Unfortunately, these are pretty big features and I don't
> think they are in the JS library.  However, the existing C++/Rust work
> could serve as examples for how you might want to tackle this.  You
> will need a fair amount of compute to go from data to chart
> (histograms, averages, standard deviations, etc.).  In my case I used
> pandas pretty extensively for this since the Arrow compute features
> didn't exist yet.  There are some JS libraries for this (e.g. d3) so
> you can probably investigate that avenue as well.
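
The data -> chart step described in the bullets above can be sketched in a few lines of TypeScript. This is only an illustration under stated assumptions: the `Column` and `BarChart` shapes, the `expected_value` / `tolerance` metadata keys, and the default cap of 2000 bins are made up here and are not part of Arrow JS or of the system described. The sketch bins a numeric column, caps the bin count, and turns the limit metadata into vertical-line elements of a JSON-friendly chart. `splitBy` shows one simple way a split (for example, by model) might be done without an expression engine.

```typescript
// Hypothetical shapes for illustration; none of these names come from Arrow JS.
interface Column {
  name: string;
  values: Float64Array;
  // Arrow-style custom metadata: string keys mapped to string values.
  metadata: Map<string, string>;
}

interface BarChart {
  kind: "bar";
  binEdges: number[]; // length is counts.length + 1
  counts: number[];
  verticalLines: { x: number; label: string }[];
}

// Bin a numeric column into equal-width bins, never more than `maxBins`.
function histogram(col: Column, requestedBins: number, maxBins = 2000): BarChart {
  const bins = Math.max(1, Math.min(requestedBins, maxBins));
  const counts: number[] = new Array<number>(bins).fill(0);
  const binEdges: number[] = [];
  if (col.values.length > 0) {
    let min = Infinity;
    let max = -Infinity;
    for (let i = 0; i < col.values.length; i++) {
      if (col.values[i] < min) min = col.values[i];
      if (col.values[i] > max) max = col.values[i];
    }
    const width = max > min ? (max - min) / bins : 1;
    for (let i = 0; i < col.values.length; i++) {
      // The maximum value falls on the last edge, so clamp it into the last bin.
      const bin = Math.min(bins - 1, Math.floor((col.values[i] - min) / width));
      counts[bin] += 1;
    }
    for (let i = 0; i <= bins; i++) binEdges.push(min + i * width);
  }

  // Translate the "expected distribution" metadata into chart elements.
  const verticalLines: { x: number; label: string }[] = [];
  const expected = col.metadata.get("expected_value"); // assumed key
  const tolerance = col.metadata.get("tolerance"); // assumed key
  if (expected !== undefined && tolerance !== undefined) {
    verticalLines.push({ x: Number(expected) - Number(tolerance), label: "lower limit" });
    verticalLines.push({ x: Number(expected) + Number(tolerance), label: "upper limit" });
  }
  return { kind: "bar", binEdges, counts, verticalLines };
}

// Split a value column by a parallel key column (e.g. one series per model).
function splitBy(keys: string[], values: Float64Array): Map<string, number[]> {
  const groups = new Map<string, number[]>();
  for (let i = 0; i < values.length; i++) {
    const group = groups.get(keys[i]);
    if (group === undefined) groups.set(keys[i], [values[i]]);
    else group.push(values[i]);
  }
  return groups;
}

const measurement: Column = {
  name: "thickness",
  values: new Float64Array([7, 9, 10, 10, 11, 12, 14]),
  metadata: new Map<string, string>([["expected_value", "10"], ["tolerance", "3"]]),
};
const chart = histogram(measurement, 4);
const byModel = splitBy(["X", "Y", "X", "Y", "X", "Y", "X"], measurement.values);
console.log(JSON.stringify(chart)); // counts: [1, 3, 2, 1], limit lines at 7 and 13
console.log(Array.from(byModel.keys())); // [ 'X', 'Y' ]
```

The chart object is small and plain, so it serializes to JSON regardless of how large the underlying column is, which matches the split between data and chart described above.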
>
> On Fri, Feb 26, 2021 at 12:05 PM Paul Taylor <[email protected]> wrote:
> >
> > Hi Michael,
> >
> > The answer to your question about metadata will likely be
> > application-specific.
> >
> > For small amounts of metadata (i.e. communicating a bounding box of
> > included geometry), there isn't much room for optimization, so a string
> > could be fine.
> >
> > For larger amounts of metadata (or other constraints, like if the metadata
> > needs to be constantly modified independent of the data), custom encodings
> > or a second service and/or arrow table of the metadata could be the way to
> > go.
> >
> > The metadata keys/values are UTF-8 strings, so nothing should prevent you
> > from stuffing a base64-encoded protobuf in there.
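
As a rough sketch of the base64 idea above: binary bytes can be turned into a string value, stored under a metadata key, and decoded again on the other side. The `Metadata` type below is only a stand-in for Arrow's string-to-string custom metadata, the `render.proto` key is invented, and the three payload bytes are assumed to come from some protobuf encoder that is not shown. `btoa` and `atob` are the standard globals available in browsers and recent Node versions.

```typescript
// Stand-in for the custom metadata on an Arrow Schema or Field:
// string keys mapped to string values.
type Metadata = Map<string, string>;

// Encode raw bytes as a base64 string so they fit in a string-valued slot.
function bytesToBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

// Decode a base64 string back into the original bytes.
function base64ToBytes(text: string): Uint8Array {
  const binary = atob(text);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}

// Hypothetical payload: bytes produced elsewhere by a protobuf encoder.
const payload = new Uint8Array([0x08, 0x96, 0x01]);

const metadata: Metadata = new Map<string, string>();
metadata.set("render.proto", bytesToBase64(payload)); // key name is made up

const encoded = metadata.get("render.proto") ?? "";
const roundTripped = base64ToBytes(encoded);
console.log(encoded); // CJYB
console.log(Array.from(roundTripped)); // [ 8, 150, 1 ]
```

Base64 emits four characters for every three input bytes, so the string is about a third larger than the binary payload. That cost is negligible for small metadata and is one reason a separate table or service may suit larger amounts.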
> >
> > As for whether the library is maintained -- yes it is, but lately I've only
> > had time to work on bug fixes or features required to maintain parity with
> > the spec and other libs.
> >
> > I will be using Arrow JS in my work again soon, and that could justify more
> > "quality of life" improvements again, but without other maintainers jumping
> > in to contribute or needing it for my work, those things don't get done.
> >
> > I'd be happy to do a call with you or your team to give a short overview
> > and introduction to the JS lib. You can also email me directly or reach me
> > in the #arrow-js channel on the-asf.slack.com with any questions.
> >
> > Best,
> > Paul
> >
> > On Fri, Feb 26, 2021 at 1:47 PM Michael Lavina <[email protected]>
> > wrote:
> >
> > > Hey Neal,
> > >
> > > Thanks for the response and I am glad I am using this correctly. I have
> > > never really used email servers so hopefully this works.
> > >
> > > That’s exactly what I was thinking of doing: creating a standard
> > > metadata schema built on top of Apache Arrow with some predefined user
> > > types.
> > >
> > > I guess I was just wondering if I was trying to use a screwdriver as a
> > > hammer. It can work because we are using the metadata, which could be
> > > anything, but maybe, like you said, we should be creating a separate
> > > standard entirely for defining the schema to render tables instead of
> > > defining it within Arrow.
> > >
> > > Does it defeat the value of Arrow if we are sending the data using buffers
> > > and streams alongside a giant string of stringified metadata, when I could
> > > maybe define the metadata in protobuf binary separately?
> > >
> > > In addition, with all these visualization tools, I was curious whether
> > > someone has already developed a standard metadata for Arrow to help with
> > > rendering: stuff like how to denote grouping of data, relationships
> > > between columns, and hidden information.
> > >
> > > -Michael
> > >
> > > From: Neal Richardson <[email protected]>
> > > Date: Friday, February 26, 2021 at 1:38 PM
> > > To: dev <[email protected]>
> > > Subject: Re: [JS] Exploring usage of apache arrow at my company for
> > > complex table rendering
> > > The Arrow IPC specification allows for custom metadata in both the Schema
> > > and the individual Fields:
> > >
> > > https://arrow.apache.org/docs/format/Columnar.html#schema-message
> > >
> > > Might that work for you? Another alternative would be to track your
> > > metadata in a separate object outside of the Arrow data.
> > >
> > > Neal
> > >
> > > On Fri, Feb 26, 2021 at 5:02 AM Michael Lavina <[email protected]> wrote:
> > >
> > > > Hello Everyone,
> > > >
> > > >
> > > >
> > > > Some background: my name is Michael and I work at FactSet, which, if you
> > > > use Arrow, you may have heard of because one of our architects gave a
> > > > talk on using Arrow and Dremio.
> > > >
> > > >
> > > >
> > > > https://hello.dremio.com/eliminate-data-transfer-bottlenecks-with-apache-arrow-flight.html?utm_medium=social-free&utm_source=linkedin&utm_term=na&utm_content=na&utm_campaign=eliminate-data-transfer-bottlenecks-with-apache-arrow-flight
> > > >
> > > >
> > > >
> > > > His team has decided to use Arrow as a tabular data interchange format.
> > > > Other teams are doing other things. We are working on standardizing our
> > > > tabular data interchange format at our company.
> > > >
> > > >
> > > >
> > > > We have our own open-sourced columnar based schema defined in protobuf.
> > > >
> > > > https://github.com/factset/stachschema
> > > >
> > > >
> > > >
> > > > We looked into Apache Arrow a few years ago, but decided not to use it
> > > > because it was not mature enough at the time and we had two specific
> > > > requirements:
> > > >
> > > > 1) We needed this data not just for analytics but for rendering as well,
> > > > and rendering requires a lot more complicated information, such as
> > > > understanding the type of data and the relationships between data, e.g.
> > > > grouping.
> > > >
> > > > 2) We need SDKs that support TypeScript/JavaScript in both the browser
> > > > and Node, and that support both creating and consuming Arrow.
> > > >
> > > >
> > > >
> > > > Now that Apache Arrow is more mature and has stabilized (i.e. the schema
> > > > and SDKs are post-1.x), we are looking into it again.
> > > >
> > > >
> > > >
> > > >    1. We are thinking of defining specific metadata, in a similar way to
> > > >    what we do for STACH, that lets us define some rendering-specific
> > > >    information, e.g. adding a metadata entry called isHidden to a Field
> > > >    in the Schema to denote whether we should render the data column or
> > > >    not.
> > > >    2. It seems like there is a well-developed JavaScript SDK that we can
> > > >    use. I am still reading the source code and the Observable articles
> > > >    to truly understand how it works.
> > > >       1. I read in one of the issues that the JS library might be out of
> > > >       sync, so do people know how actively that repo is maintained?
> > > >       2. If there is work that needs to be done, I think we would be able
> > > >       to help if we had some help getting started with understanding
> > > >       that repo.
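
To make the isHidden idea in item 1 concrete, here is a minimal sketch. It assumes a simplified `FieldLike` shape standing in for an Arrow Field (a name plus string-to-string metadata); the `isHidden` key comes from the message above, but encoding it as the strings "true" / "false" is an assumption made here, since metadata values are strings rather than booleans.

```typescript
// Simplified stand-in for an Arrow Field: a name plus string-to-string metadata.
interface FieldLike {
  name: string;
  metadata: Map<string, string>;
}

// Keep only the fields a renderer should draw. A field with no isHidden
// entry is treated as visible (an assumption of this sketch).
function visibleFields(fields: FieldLike[]): FieldLike[] {
  return fields.filter((field) => field.metadata.get("isHidden") !== "true");
}

const fields: FieldLike[] = [
  { name: "personId", metadata: new Map<string, string>([["isHidden", "true"]]) },
  { name: "fullName", metadata: new Map<string, string>() },
  { name: "revenue", metadata: new Map<string, string>([["isHidden", "false"]]) },
];

const visibleNames = visibleFields(fields).map((field) => field.name);
console.log(visibleNames); // [ 'fullName', 'revenue' ]
```

The hidden column still travels with the data, so the renderer can use it for lookups without drawing it.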
> > > >
> > > >
> > > >
> > > > If possible, we would be interested in continuing to chat about the above
> > > > ideas, getting more information about whether Apache Arrow is right for
> > > > the job, and learning whether other people are already discussing or
> > > > using Arrow for rendering in addition to analytics.
> > > >
> > > >
> > > >
> > > > To clarify what I mean by existing rendering technologies: I know things
> > > > like Falcon and Perspective exist, but those seem to be for basic
> > > > rendering of simple tables. I mean to create a superset of Arrow by
> > > > defining metadata that allows for complex nested headers and nested
> > > > rows, something like the image below. You can then imagine even more
> > > > data attached, such as descriptions of the data and its relationships to
> > > > other data on the page. You can imagine that in the dataset there is some
> > > > `personId` that is set not to be rendered. This personId can then be used
> > > > to gather more information in another API call if you wanted to render a
> > > > tooltip with, say, some bio information. In short, rendered tables
> > > > require a lot more information than just the data. Does it make sense to
> > > > build this upon Arrow?
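
One way the nested-header idea might be expressed purely through per-field metadata is sketched below. Everything here is hypothetical: the `headerPath` key, its "/"-separated format, and the `HeaderNode` shape are invented for illustration and are not an existing Arrow or STACH convention. Each field carries its position in the header hierarchy as a string, and the renderer folds those paths into a tree.

```typescript
// Simplified stand-in for an Arrow Field: a name plus string-to-string metadata.
interface FieldLike {
  name: string;
  metadata: Map<string, string>;
}

// One cell in a nested table header; leaf nodes point at a data column.
interface HeaderNode {
  label: string;
  children: HeaderNode[];
  column?: string;
}

// Fold each field's "headerPath" (assumed key, "/"-separated) into a tree.
// Fields without a path become top-level headers named after the field.
function buildHeaderTree(fields: FieldLike[]): HeaderNode {
  const root: HeaderNode = { label: "", children: [] };
  for (let i = 0; i < fields.length; i++) {
    const field = fields[i];
    const labels = (field.metadata.get("headerPath") ?? field.name).split("/");
    let node = root;
    for (let j = 0; j < labels.length; j++) {
      let child = node.children.find((candidate) => candidate.label === labels[j]);
      if (child === undefined) {
        child = { label: labels[j], children: [] };
        node.children.push(child);
      }
      node = child;
    }
    node.column = field.name;
  }
  return root;
}

const headerTree = buildHeaderTree([
  { name: "q1Revenue", metadata: new Map<string, string>([["headerPath", "Financials/Revenue/Q1"]]) },
  { name: "q2Revenue", metadata: new Map<string, string>([["headerPath", "Financials/Revenue/Q2"]]) },
  { name: "fullName", metadata: new Map<string, string>() },
]);
console.log(JSON.stringify(headerTree, null, 2));
```

A flat string per field keeps the Arrow data untouched, at the cost of the renderer having to rebuild the hierarchy; a separate schema document, as suggested earlier in the thread, would be the alternative.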
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > -Thanks
> > > >
> > > > Michael
> > > >
> > > >
> > > >
> > >
