I looked at the updated SPIP and I think the reduced scope sounds better. From the Spark Summit, it seemed like there was a lot of interest in columnar processing, and this would be a good starting point to enable that. It would be great to hear some other people's input too.
Bryan

On Tue, Apr 30, 2019 at 7:21 AM Bobby Evans <bo...@apache.org> wrote:

I wanted to give everyone a heads up that I have updated the SPIP at https://issues.apache.org/jira/browse/SPARK-27396. Please take a look and add any comments you might have to the JIRA. I reduced the scope of the SPIP to just the non-controversial parts. In the background, I will be trying to work with the Arrow community to get some form of guarantee about the stability of the standard. That should hopefully unblock stable APIs so end users can write columnar UDFs in Scala/Java and, ideally, get efficient Arrow-based batch data transfers to external tools as well.

Thanks,

Bobby

On Tue, Apr 23, 2019 at 12:32 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:

Just as a note here, if the goal is for the format not to change, why not make that explicit in a versioning policy? You can always include a format version number and say that future versions may increment the number, but this specific version will always be readable in some specific way. You could also put a timeline on how long old version numbers will be recognized in the official libraries (e.g., 3 years).

Matei

On Apr 22, 2019, at 6:36 AM, Bobby Evans <reva...@gmail.com> wrote:

Yes, it is technically possible for the layout to change. No, it is not going to happen. It is already baked into several different official libraries which are widely used, not just for holding and processing the data, but also for transferring the data between the various implementations. There would have to be a really serious reason to force an incompatible change at this point. So in the worst case, we can version the layout and bake that into the API that exposes the internal layout of the data.
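[Editor's note: the versioning idea discussed above can be made concrete with a minimal sketch. This is plain Python, not Spark or Arrow code, and every name in it is hypothetical; it only illustrates the policy of tagging each serialized batch with a format version that readers check before decoding.]

```python
import struct

# Hypothetical illustration of a versioned columnar batch format:
# each batch starts with a format version, and a reader accepts
# only versions it knows how to decode, so old data stays readable
# while unknown future layouts fail fast.

FORMAT_VERSION = 2
SUPPORTED_VERSIONS = {1, 2}  # old version numbers remain recognized

def write_batch(values):
    """Serialize a list of 32-bit ints, prefixed with version and count."""
    header = struct.pack("<II", FORMAT_VERSION, len(values))
    body = b"".join(struct.pack("<i", v) for v in values)
    return header + body

def read_batch(data):
    """Decode a batch, rejecting format versions this reader predates."""
    version, count = struct.unpack_from("<II", data, 0)
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported format version {version}")
    return list(struct.unpack_from(f"<{count}i", data, 8))
```

A round trip such as `read_batch(write_batch([1, 2, 3]))` returns `[1, 2, 3]`, while a batch stamped with an unknown version raises an error instead of being misread.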
That way, code that wants to program against a Java API can do so using the API that Spark provides. Those who want to interface with something that expects data in Arrow format will already have to know which version of the format it was programmed against, and in the worst case, if the layout does change, we can support the new layout if needed.

On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler <cutl...@gmail.com> wrote:

The Arrow data format is not yet stable, meaning there are no guarantees on backwards/forwards compatibility. Once version 1.0 is released, it will have those guarantees, but it's hard to say when that will be. The remaining work to get there can be seen at https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone. So yes, there is a risk that exposing Spark data as Arrow could cause an issue if it is handled by a different version that is not compatible. That being said, changes to the format are not taken lightly and are kept backwards compatible when possible. I think it would be fair to mark the APIs exposing Arrow data as experimental for the time being, and to clearly state in the docs which version must be used to be compatible. Also, adding features like this and SPARK-24579 will probably help adoption of Arrow and accelerate a 1.0 release. Adding the Arrow dev list to CC.

Bryan

On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:

Okay, that makes sense, but is the Arrow data format stable? If not, we risk breakage when Arrow changes in the future and some libraries using this feature begin to use the new Arrow code.

Matei

On Apr 20, 2019, at 1:39 PM, Bobby Evans <reva...@gmail.com> wrote:

I want to be clear that this SPIP is not proposing to expose Arrow APIs/classes through any Spark APIs.
SPARK-24579 is doing that, and because of the overlap between the two SPIPs I scaled this one back to concentrate just on the columnar processing aspects. Sorry for the confusion; I didn't update the JIRA description clearly enough when we adjusted it during the discussion on the JIRA. As part of the columnar processing, we plan on providing Arrow-formatted data, but it will be exposed through a Spark-owned API.

On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:

FYI, I'd also be concerned about exposing the Arrow API or format as a public API if it's not yet stable. Is stabilization of the API and format coming soon on the roadmap there? Maybe someone can work with the Arrow community to make that happen.

We've been bitten lots of times by API changes forced by external libraries, even when those were widely popular. For example, we used Guava's Optional for a while, which changed at some point, and we also had issues with Protobuf and Scala itself (especially how Scala's APIs appear in Java). API breakage might not be as serious in dynamic languages like Python, where you can often keep compatibility with old behaviors, but it really hurts in Java and Scala.

The problem is especially bad for us because of two aspects of how Spark is used:

1) Spark is used for production data transformation jobs that people need to keep running for a long time. Nobody wants to make changes to a job that's been working fine and computing something correctly for years just to get a bug fix from the latest Spark release or whatever. It's much better if they can upgrade Spark without editing every job.

2) Spark is often used as "glue" to combine data processing code in other libraries, and these might start to require different versions of our dependencies.
For example, the Guava class exposed in Spark became a problem when third-party libraries started requiring a new version of Guava: those new libraries just couldn't work with Spark. Protobuf was especially bad because some users wanted to read data stored as Protobufs (or in a format that uses Protobuf inside), so they needed a different version of the library in their main data processing code.

If there were some guarantee that this stuff would remain backward-compatible, we'd be in a much better place. It's not that hard to keep a storage format backward-compatible: just document the format and extend it only in ways that don't break the meaning of old data (for example, add new version numbers or field types that are read in a different way). It's a bit harder for a Java API, but maybe Spark could just expose byte arrays directly and work on those if the API is not guaranteed to stay stable (that is, we'd still use our own classes to manipulate the data internally, and end users could use the Arrow library if they want it).

Matei

On Apr 20, 2019, at 8:38 AM, Bobby Evans <reva...@gmail.com> wrote:

I think you misunderstood the point of this SPIP. I responded to your comments in the SPIP JIRA.

On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng <men...@gmail.com> wrote:

I posted my comment in the JIRA. Main concerns here:

1. Exposing third-party Java APIs in Spark is risky. Arrow might have a 1.0 release someday.
2. ML/DL systems that can benefit from columnar format are mostly in Python.
3. Simple operations, though they benefit from vectorization, might not be worth the data exchange overhead.

So would an improved Pandas UDF API be good enough? For example, SPARK-26412 (a UDF that takes an iterator of Arrow batches).
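[Editor's note: the iterator-of-batches UDF style mentioned above can be sketched without Spark at all. In the real SPARK-26412 design the UDF receives Arrow record batches; in this hedged sketch plain Python lists stand in for batches so the example is self-contained, and the function and variable names are invented for illustration.]

```python
# Sketch of an iterator-of-batches UDF: the function sees the whole
# stream of batches, so expensive one-time setup (e.g. loading a model)
# is amortized across all batches instead of repeated per batch.

def scale_udf(batches):
    """Takes an iterator of batches and yields transformed batches."""
    model_scale = 10  # stand-in for one-time setup done once per stream
    for batch in batches:
        yield [v * model_scale for v in batch]

batches = [[1, 2], [3, 4]]
result = list(scale_udf(iter(batches)))
# result == [[10, 20], [30, 40]]
```

The per-stream (rather than per-row or per-batch) initialization is the main efficiency argument for this API shape.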
Sorry that I should have joined the discussion earlier! Hope it is not too late :)

On Fri, Apr 19, 2019 at 1:20 PM <tcon...@gmail.com> wrote:

+1 (non-binding) for better columnar data processing support.

From: Jules Damji <dmat...@comcast.net>
Sent: Friday, April 19, 2019 12:21 PM
To: Bryan Cutler <cutl...@gmail.com>
Cc: Dev <dev@spark.apache.org>
Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

+1 (non-binding)

Sent from my iPhone

Pardon the dumb thumb typos :)

On Apr 19, 2019, at 10:30 AM, Bryan Cutler <cutl...@gmail.com> wrote:

+1 (non-binding)

On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <jl...@apache.org> wrote:

+1 (non-binding). Looking forward to seeing better support for processing columnar data.

Jason

On Tue, Apr 16, 2019 at 10:38 AM Tom Graves <tgraves...@yahoo.com.invalid> wrote:

Hi everyone,

I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for extended Columnar Processing Support. The proposal is to extend the support to allow for more columnar processing.

You can find the full proposal in the JIRA at https://issues.apache.org/jira/browse/SPARK-27396. There was also a DISCUSS thread on the dev mailing list.

Please vote as early as you can; I will leave the vote open until next Monday (the 22nd), 2pm CST to give people plenty of time.
[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...

Thanks!

Tom Graves

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org