Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Tom Graves Mon, 22 Apr 2019 08:28:20 -0700

 
Since there is still discussion and Spark Summit is this week, I'm going to 
extend the vote until Friday the 26th.
Tom

On Monday, April 22, 2019, 8:44:00 AM CDT, Bobby Evans <[email protected]> wrote:
 
 Yes, it is technically possible for the layout to change.  No, it is not going 
to happen.  It is already baked into several different official libraries which 
are widely used, not just for holding and processing the data, but also for 
transfer of the data between the various implementations.  There would have to 
be a really serious reason to force an incompatible change at this point.  So 
in the worst case, we can version the layout and bake that into the API that 
exposes the internal layout of the data.  That way, code that wants to program 
against a Java API can do so using the API that Spark provides.  Those who want 
to interface with something that expects the data in Arrow format will already 
have to know what version of the format it was programmed against, and in the 
worst case, if the layout does change, we can support the new layout if needed.
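
To make the "version the layout" idea concrete, here is a minimal sketch in 
Python. Everything in it is hypothetical: the ColumnBuffer class, the 
read_int32_column function, and both layout versions are made up for 
illustration and are not part of the SPIP, Spark, or Arrow. It only shows a 
buffer that carries its layout version, so a reader can pick the matching 
decoder or refuse a version it does not know.

```python
import struct
from dataclasses import dataclass
from typing import List

# Hypothetical layout versions, for illustration only (not real Arrow/Spark ids).
SUPPORTED_LAYOUT_VERSIONS = {1, 2}


@dataclass(frozen=True)
class ColumnBuffer:
    """Raw column bytes tagged with the layout version they follow."""
    layout_version: int
    data: bytes


def read_int32_column(buf: ColumnBuffer) -> List[int]:
    """Decode an int32 column according to the buffer's declared layout."""
    if buf.layout_version not in SUPPORTED_LAYOUT_VERSIONS:
        # Fail loudly instead of misreading bytes from an unknown layout.
        raise ValueError(f"unsupported layout version: {buf.layout_version}")
    if buf.layout_version == 1:
        # Assumed v1: little-endian int32 values packed back to back.
        count = len(buf.data) // 4
        return list(struct.unpack(f"<{count}i", buf.data))
    # Assumed v2: a 4-byte little-endian count, followed by the values.
    (count,) = struct.unpack_from("<I", buf.data, 0)
    return list(struct.unpack_from(f"<{count}i", buf.data, 4))
```

A consumer written against version 1 keeps working unchanged, and support for 
a new layout is added as one more branch rather than as a breaking change.
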
On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler <[email protected]> wrote:


The Arrow data format is not yet stable, meaning there are no guarantees on 
backwards/forwards compatibility. Once version 1.0 is released, it will have 
those guarantees but it's hard to say when that will be. The remaining work to 
get there can be seen at 
https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone.
So yes, it is a risk that exposing Spark data as Arrow could cause an issue if 
handled by a different version that is not compatible. That being said, changes 
to the format are not taken lightly and are backwards compatible when possible. I 
think it would be fair to mark the APIs exposing Arrow data as experimental for 
the time being, and clearly state the version that must be used to be 
compatible in the docs. Also, adding features like this and SPARK-24579 will 
probably help adoption of Arrow and accelerate a 1.0 release. Adding the Arrow 
dev list to CC.
Bryan

On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia <[email protected]> wrote:

Okay, that makes sense, but is the Arrow data format stable? If not, we risk 
breakage when Arrow changes in the future and some libraries using this feature 
begin to use the new Arrow code.

Matei

> On Apr 20, 2019, at 1:39 PM, Bobby Evans <[email protected]> wrote:
> 
> I want to be clear that this SPIP is not proposing exposing Arrow 
> APIs/Classes through any Spark APIs.  SPARK-24579 is doing that, and because 
> of the overlap between the two SPIPs I scaled this one back to concentrate 
> just on the columnar processing aspects. Sorry for the confusion as I didn't 
> update the JIRA description clearly enough when we adjusted it during the 
> discussion on the JIRA.  As part of the columnar processing, we plan on 
> providing arrow formatted data, but that will be exposed through a Spark 
> owned API.
> 
> On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia <[email protected]> wrote:
> FYI, I’d also be concerned about exposing the Arrow API or format as a public 
> API if it’s not yet stable. Is stabilization of the API and format coming 
> soon on the roadmap there? Maybe someone can work with the Arrow community to 
> make that happen.
> 
> We’ve been bitten lots of times by API changes forced by external libraries 
> even when those were widely popular. For example, we used Guava’s Optional 
> for a while, which changed at some point, and we also had issues with 
> Protobuf and Scala itself (especially how Scala’s APIs appear in Java). API 
> breakage might not be as serious in dynamic languages like Python, where you 
> can often keep compatibility with old behaviors, but it really hurts in Java 
> and Scala.
> 
> The problem is especially bad for us because of two aspects of how Spark is 
> used:
> 
> 1) Spark is used for production data transformation jobs that people need to 
> keep running for a long time. Nobody wants to make changes to a job that’s 
> been working fine and computing something correctly for years just to get a 
> bug fix from the latest Spark release or whatever. It’s much better if they 
> can upgrade Spark without editing every job.
> 
> 2) Spark is often used as “glue” to combine data processing code in other 
> libraries, and these might start to require different versions of our 
> dependencies. For example, the Guava class exposed in Spark became a problem 
> when third-party libraries started requiring a new version of Guava: those 
> new libraries just couldn’t work with Spark. Protobuf was especially bad 
> because some users wanted to read data stored as Protobufs (or in a format 
> that uses Protobuf inside), so they needed a different version of the library 
> in their main data processing code.
> 
> If there was some guarantee that this stuff would remain backward-compatible, 
> we’d be in a much better place. It’s not that hard to keep a storage format 
> backward-compatible: just document the format and extend it only in ways that 
> don’t break the meaning of old data (for example, add new version numbers or 
> field types that are read in a different way). It’s a bit harder for a Java 
> API, but maybe Spark could just expose byte arrays directly and work on those 
> if the API is not guaranteed to stay stable (that is, we’d still use our own 
> classes to manipulate the data internally, and end users could use the Arrow 
> library if they want it).
> 
> Matei
> 
> > On Apr 20, 2019, at 8:38 AM, Bobby Evans <[email protected]> wrote:
> > 
> > I think you misunderstood the point of this SPIP. I responded to your 
> > comments in the SPIP JIRA.
> > 
> > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng <[email protected]> wrote:
> > I posted my comment in the JIRA. Main concerns here:
> > 
> > 1. Exposing third-party Java APIs in Spark is risky. Arrow might have a 1.0 
> > release someday.
> > 2. ML/DL systems that can benefit from a columnar format are mostly in 
> > Python.
> > 3. Simple operations, though they benefit from vectorization, might not be 
> > worth the data exchange overhead.
> > 
> > So would an improved Pandas UDF API be good enough? For example, 
> > SPARK-26412 (UDF that takes an iterator of Arrow batches).
> > 
> > Sorry that I didn't join the discussion earlier! Hope it is not too late :)
> > 
> > On Fri, Apr 19, 2019 at 1:20 PM <[email protected]> wrote:
> > +1 (non-binding) for better columnar data processing support.
> > 
> >  
> > 
> > From: Jules Damji <[email protected]> 
> > Sent: Friday, April 19, 2019 12:21 PM
> > To: Bryan Cutler <[email protected]>
> > Cc: Dev <[email protected]>
> > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar 
> > Processing Support
> > 
> >  
> > 
> > + (non-binding)
> > 
> > Sent from my iPhone
> > 
> > Pardon the dumb thumb typos :)
> > 
> > 
> > On Apr 19, 2019, at 10:30 AM, Bryan Cutler <[email protected]> wrote:
> > 
> > +1 (non-binding)
> > 
> >  
> > 
> > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <[email protected]> wrote:
> > 
> > +1 (non-binding).  Looking forward to seeing better support for processing 
> > columnar data.
> > 
> >  
> > 
> > Jason
> > 
> >  
> > 
> > On Tue, Apr 16, 2019 at 10:38 AM Tom Graves <[email protected]> 
> > wrote:
> > 
> > Hi everyone,
> > 
> >  
> > 
> > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for extended 
> > Columnar Processing Support.  The proposal is to extend the support to 
> > allow for more columnar processing.
> > 
> >  
> > 
> > You can find the full proposal in the jira at: 
> > https://issues.apache.org/jira/browse/SPARK-27396. There was also a DISCUSS 
> > thread in the dev mailing list.
> > 
> >  
> > 
> > Please vote as early as you can. I will leave the vote open until next 
> > Monday (the 22nd), 2pm CST, to give people plenty of time.
> > 
> >  
> > 
> > [ ] +1: Accept the proposal as an official SPIP
> > 
> > [ ] +0
> > 
> > [ ] -1: I don't think this is a good idea because ...
> > 
> >  
> > 
> >  
> > 
> > Thanks!
> > 
> > Tom Graves
> > 
> 


