I looked at the updated SPIP and I think the reduced scope sounds better. From the Spark Summit, it seemed like there was a lot of interest in columnar processing, and this would be a good starting point to enable that. It would be great to hear some other people's input too.
Bryan

On Tue, Apr 30, 2019 at 7:21 AM Bobby Evans <bo...@apache.org> wrote:

I wanted to give everyone a heads up that I have updated the SPIP at https://issues.apache.org/jira/browse/SPARK-27396. Please take a look and add any comments you might have to the JIRA. I reduced the scope of the SPIP to just the non-controversial parts. In the background, I will be trying to work with the Arrow community to get some form of guarantee about the stability of the standard. That should hopefully unblock stable APIs so end users can write columnar UDFs in Scala/Java and, ideally, get efficient Arrow-based batch data transfers to external tools as well.

Thanks,

Bobby

On Tue, Apr 23, 2019 at 12:32 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:

Just as a note here, if the goal is for the format not to change, why not make that explicit in a versioning policy? You can always include a format version number and say that future versions may increment the number, but this specific version will always be readable in some specific way. You could also put a timeline on how long old version numbers will be recognized in the official libraries (e.g., 3 years).

Matei

On Apr 22, 2019, at 6:36 AM, Bobby Evans <reva...@gmail.com> wrote:

Yes, it is technically possible for the layout to change. No, it is not going to happen. It is already baked into several different official libraries which are widely used, not just for holding and processing the data, but also for transferring the data between the various implementations. There would have to be a really serious reason to force an incompatible change at this point. So in the worst case, we can version the layout and bake that into the API that exposes the internal layout of the data.
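[Editor's note: the versioning idea discussed above can be made concrete with a minimal sketch. This is plain Python, not Spark or Arrow code, and every name in it is hypothetical; it only illustrates the policy of tagging each serialized batch with a format version that readers check before decoding.]

```python
import struct

# Hypothetical illustration of a versioned columnar batch format:
# each batch starts with a format version, and a reader accepts
# only versions it knows how to decode, so old data stays readable
# while unknown future layouts fail fast.

FORMAT_VERSION = 2
SUPPORTED_VERSIONS = {1, 2}  # old version numbers remain recognized

def write_batch(values):
    """Serialize a list of 32-bit ints, prefixed with version and count."""
    header = struct.pack("<II", FORMAT_VERSION, len(values))
    body = b"".join(struct.pack("<i", v) for v in values)
    return header + body

def read_batch(data):
    """Decode a batch, rejecting format versions this reader predates."""
    version, count = struct.unpack_from("<II", data, 0)
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported format version {version}")
    return list(struct.unpack_from(f"<{count}i", data, 8))
```

A round trip such as `read_batch(write_batch([1, 2, 3]))` returns `[1, 2, 3]`, while a batch stamped with an unknown version raises an error instead of being misread.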
That way, code that wants to program against a Java API can do so using the API that Spark provides. Those who want to interface with something that expects data in Arrow format will already have to know which version of the format it was programmed against, and in the worst case, if the layout does change, we can support the new layout if needed.

On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler <cutl...@gmail.com> wrote:

The Arrow data format is not yet stable, meaning there are no guarantees on backwards/forwards compatibility. Once version 1.0 is released, it will have those guarantees, but it's hard to say when that will be. The remaining work to get there can be seen at https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone. So yes, there is a risk that exposing Spark data as Arrow could cause an issue if it is handled by a different version that is not compatible. That being said, changes to the format are not taken lightly and are kept backwards compatible when possible. I think it would be fair to mark the APIs exposing Arrow data as experimental for the time being, and to clearly state in the docs which version must be used to be compatible. Also, adding features like this and SPARK-24579 will probably help adoption of Arrow and accelerate a 1.0 release. Adding the Arrow dev list to CC.

Bryan

On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:

Okay, that makes sense, but is the Arrow data format stable? If not, we risk breakage when Arrow changes in the future and some libraries using this feature begin to use the new Arrow code.

Matei

On Apr 20, 2019, at 1:39 PM, Bobby Evans <reva...@gmail.com> wrote:

I want to be clear that this SPIP is not proposing to expose Arrow APIs/classes through any Spark APIs.
SPARK-24579 is doing that, and because of the overlap between the two SPIPs I scaled this one back to concentrate just on the columnar processing aspects. Sorry for the confusion; I didn't update the JIRA description clearly enough when we adjusted it during the discussion on the JIRA. As part of the columnar processing, we plan on providing Arrow-formatted data, but it will be exposed through a Spark-owned API.

On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:

FYI, I'd also be concerned about exposing the Arrow API or format as a public API if it's not yet stable. Is stabilization of the API and format coming soon on the roadmap there? Maybe someone can work with the Arrow community to make that happen.

We've been bitten lots of times by API changes forced by external libraries, even when those were widely popular. For example, we used Guava's Optional for a while, which changed at some point, and we also had issues with Protobuf and Scala itself (especially how Scala's APIs appear in Java). API breakage might not be as serious in dynamic languages like Python, where you can often keep compatibility with old behaviors, but it really hurts in Java and Scala.

The problem is especially bad for us because of two aspects of how Spark is used:

1) Spark is used for production data transformation jobs that people need to keep running for a long time. Nobody wants to make changes to a job that's been working fine and computing something correctly for years just to get a bug fix from the latest Spark release or whatever. It's much better if they can upgrade Spark without editing every job.

2) Spark is often used as "glue" to combine data processing code in other libraries, and these might start to require different versions of our dependencies.
For example, the Guava class exposed in Spark became a problem when third-party libraries started requiring a new version of Guava: those new libraries just couldn't work with Spark. Protobuf was especially bad because some users wanted to read data stored as Protobufs (or in a format that uses Protobuf inside), so they needed a different version of the library in their main data processing code.

If there were some guarantee that this stuff would remain backward-compatible, we'd be in a much better place. It's not that hard to keep a storage format backward-compatible: just document the format and extend it only in ways that don't break the meaning of old data (for example, add new version numbers or field types that are read in a different way). It's a bit harder for a Java API, but maybe Spark could just expose byte arrays directly and work on those if the API is not guaranteed to stay stable (that is, we'd still use our own classes to manipulate the data internally, and end users could use the Arrow library if they want it).

Matei

On Apr 20, 2019, at 8:38 AM, Bobby Evans <reva...@gmail.com> wrote:

I think you misunderstood the point of this SPIP. I responded to your comments in the SPIP JIRA.

On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng <men...@gmail.com> wrote:

I posted my comment in the JIRA. Main concerns here:

1. Exposing third-party Java APIs in Spark is risky. Arrow might have a 1.0 release someday.
2. ML/DL systems that can benefit from columnar format are mostly in Python.
3. Simple operations, though they benefit from vectorization, might not be worth the data exchange overhead.

So would an improved Pandas UDF API be good enough? For example, SPARK-26412 (a UDF that takes an iterator of Arrow batches).
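[Editor's note: the iterator-of-batches UDF style mentioned above can be sketched without Spark at all. In the real SPARK-26412 design the UDF receives Arrow record batches; in this hedged sketch plain Python lists stand in for batches so the example is self-contained, and the function and variable names are invented for illustration.]

```python
# Sketch of an iterator-of-batches UDF: the function sees the whole
# stream of batches, so expensive one-time setup (e.g. loading a model)
# is amortized across all batches instead of repeated per batch.

def scale_udf(batches):
    """Takes an iterator of batches and yields transformed batches."""
    model_scale = 10  # stand-in for one-time setup done once per stream
    for batch in batches:
        yield [v * model_scale for v in batch]

batches = [[1, 2], [3, 4]]
result = list(scale_udf(iter(batches)))
# result == [[10, 20], [30, 40]]
```

The per-stream (rather than per-row or per-batch) initialization is the main efficiency argument for this API shape.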
Sorry that I should have joined the discussion earlier! Hope it is not too late :)

On Fri, Apr 19, 2019 at 1:20 PM <tcon...@gmail.com> wrote:

+1 (non-binding) for better columnar data processing support.

From: Jules Damji <dmat...@comcast.net>
Sent: Friday, April 19, 2019 12:21 PM
To: Bryan Cutler <cutl...@gmail.com>
Cc: Dev <dev@spark.apache.org>
Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

+1 (non-binding)

Sent from my iPhone

Pardon the dumb thumb typos :)

On Apr 19, 2019, at 10:30 AM, Bryan Cutler <cutl...@gmail.com> wrote:

+1 (non-binding)

On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <jl...@apache.org> wrote:

+1 (non-binding). Looking forward to seeing better support for processing columnar data.

Jason

On Tue, Apr 16, 2019 at 10:38 AM Tom Graves <tgraves...@yahoo.com.invalid> wrote:

Hi everyone,

I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for extended Columnar Processing Support. The proposal is to extend the support to allow for more columnar processing.

You can find the full proposal in the JIRA at https://issues.apache.org/jira/browse/SPARK-27396. There was also a DISCUSS thread on the dev mailing list.

Please vote as early as you can; I will leave the vote open until next Monday (the 22nd), 2pm CST to give people plenty of time.
[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...

Thanks!

Tom Graves

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org