Ok, I'm cancelling the vote for now then and we will make some updates to the
SPIP to try to clarify.
Tom
On Monday, April 22, 2019, 1:07:25 PM CDT, Reynold Xin
<[email protected]> wrote:
"if others think it would be helpful, we can cancel this vote, update the SPIP
to clarify exactly what I am proposing, and then restart the vote after we have
gotten more agreement on what APIs should be exposed"
That'd be very useful. At least I was confused by what the SPIP was about. No
point voting on something when there is still a lot of confusion about what it
is.
On Mon, Apr 22, 2019 at 10:58 AM, Bobby Evans <[email protected]> wrote:
Xiangrui Meng,
I provided some examples in the original discussion thread.
https://lists.apache.org/thread.html/f7cdc2cbfb1dafa001422031ff6a3a6dc7b51efc175327b0bbfe620e@%3Cdev.spark.apache.org%3E
But the concrete use case that we have is GPU-accelerated ETL on Spark,
primarily as data preparation and feature engineering for ML tools like
XGBoost, which by the way exposes a Spark-specific Scala API, not just a
Python one. We built a proof of concept and saw decent performance gains.
Enough gains to more than pay for the added cost of a GPU, with the
potential for even better performance in the future. With that proof of
concept, we were able to make all of the processing columnar end-to-end for
many queries, so there really weren't any data conversion costs to
overcome, but we did want the design flexible enough to include a
cost-based optimizer.
It looks like there is some confusion around this SPIP, especially in how it
relates to features in other SPIPs around data exchange between different
systems. I didn't want to update the text of this SPIP while it was under
an active vote, but if others think it would be helpful, we can cancel this
vote, update the SPIP to clarify exactly what I am proposing, and then
restart the vote after we have gotten more agreement on what APIs should be
exposed.
Thanks,
Bobby
On Mon, Apr 22, 2019 at 10:49 AM Xiangrui Meng <[email protected]> wrote:
Per Robert's comment on the JIRA, ETL is the main use case for the SPIP. I
think the SPIP should list a concrete ETL use case (from the POC?) that can
benefit from this *public Java/Scala API*, does *vectorization*, and
significantly *boosts the performance* even with data conversion overhead.
The current mid-term success (Pandas UDF) doesn't match the purpose of the
SPIP, and it can be done without exposing any public APIs.
Depending on how much benefit it brings, we might agree that a public
Java/Scala API is needed. Then we might want to step slightly into how. I
saw four options mentioned in the JIRA and discussion threads:
1. Expose `Array[Byte]` in Arrow format. Let the user decode it using an
Arrow library.
2. Expose `ArrowRecordBatch`. It makes Spark expose third-party APIs.
3. Expose `ColumnarBatch` and make it Arrow-compatible, which is also used
by Spark internals. It makes it hard for us to change Spark internals in
the future.
4. Expose something like `SparkRecordBatch` that is Arrow-compatible and
maintain conversion between the internal `ColumnarBatch` and
`SparkRecordBatch`. It might cause conversion overhead in the future if our
internals diverge from Arrow.
Note that both 3 and 4 will make many APIs public in order to be
Arrow-compatible. So we should really give concrete ETL cases to prove that
it is important for us to do so.
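To make option 4 concrete, here is a rough Scala sketch of the shape such an
API might take. Every name below is hypothetical, only to illustrate a
Spark-owned, Arrow-compatible wrapper; it is not an actual proposal for
signatures:

    // Hypothetical option-4 shape: a stable, Spark-owned batch API that is
    // Arrow-compatible in layout but exposes no Arrow classes directly.
    trait SparkColumnVector {
      def isNullAt(rowId: Int): Boolean
      def getInt(rowId: Int): Int // one typed accessor per supported type
    }

    trait SparkRecordBatch {
      def numRows: Int
      def numCols: Int
      def column(i: Int): SparkColumnVector
    }

    // Spark internals would maintain the conversion, e.g. something like:
    // def toSparkRecordBatch(internal: ColumnarBatch): SparkRecordBatch

The conversion boundary is where the overhead mentioned above would show up
if the internal layout ever diverged from Arrow.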
On Mon, Apr 22, 2019 at 8:27 AM Tom Graves <[email protected]> wrote:
Since there is still discussion and Spark Summit is this week, I'm going to
extend the vote until Friday the 26th.
Tom
On Monday, April 22, 2019, 8:44:00 AM CDT, Bobby Evans <[email protected]>
wrote:
Yes, it is technically possible for the layout to change. No, it is not
going to happen. It is already baked into several different official
libraries which are widely used, not just for holding and processing the
data, but also for transferring the data between the various
implementations. There would have to be a really serious reason to force an
incompatible change at this point. So in the worst case, we can version the
layout and bake that into the API that exposes the internal layout of the
data. That way, code that wants to program against a Java API can do so
using the API that Spark provides; those who want to interface with
something that expects the data in Arrow format will already have to know
what version of the format it was programmed against, and in the worst
case, if the layout does change, we can support the new layout if needed.
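As a rough sketch of what "version the layout and bake that into the API"
could look like in Scala (all names here are made up for illustration, not
an actual Spark API):

    // Hypothetical: the API declares which Arrow format version the raw
    // buffers follow, so consumers can check compatibility before decoding.
    object ArrowLayoutVersion extends Enumeration {
      val V0_14, V1_0 = Value
    }

    trait VersionedColumnarData {
      // Arrow format version the underlying buffers conform to.
      def layoutVersion: ArrowLayoutVersion.Value
      // Raw buffers in the declared layout; decode with a matching Arrow lib.
      def buffers: Array[java.nio.ByteBuffer]
    }

A caller targeting a specific Arrow release would check layoutVersion first
and fail fast on a mismatch instead of misreading the buffers.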
On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler <[email protected]> wrote:
The Arrow data format is not yet stable, meaning there are no guarantees on
backwards/forwards compatibility. Once version 1.0 is released, it will
have those guarantees, but it's hard to say when that will be. The remaining
work to get there can be seen at
https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone.
So yes, it is a risk that exposing Spark data as Arrow could cause an issue
if handled by a different version that is not compatible. That being said,
changes to the format are not taken lightly and are backwards compatible
when possible. I think it would be fair to mark the APIs exposing Arrow data
as experimental for the time being, and clearly state the version that must
be used to be compatible in the docs. Also, adding features like this and
SPARK-24579 will probably help adoption of Arrow and accelerate a 1.0
release. Adding the Arrow dev list to CC.
Bryan
On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia <[email protected]> wrote:
Okay, that makes sense, but is the Arrow data format stable? If not, we risk
breakage when Arrow changes in the future and some libraries using this
feature begin to use the new Arrow code.
Matei
On Apr 20, 2019, at 1:39 PM, Bobby Evans <[email protected]> wrote:
I want to be clear that this SPIP is not proposing exposing Arrow
APIs/Classes through any Spark APIs. SPARK-24579 is doing that, and because
of the overlap between the two SPIPs I scaled this one back to concentrate
just on the columnar processing aspects. Sorry for the confusion, as I
didn't update the JIRA description clearly enough when we adjusted it
during the discussion on the JIRA. As part of the columnar processing, we
plan on providing Arrow-formatted data, but that will be exposed through a
Spark-owned API.
On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia <[email protected]>
wrote:
FYI, I’d also be concerned about exposing the Arrow API or format as a
public API if it’s not yet stable. Is stabilization of the API and format
coming soon on the roadmap there? Maybe someone can work with the Arrow
community to make that happen.
We’ve been bitten lots of times by API changes forced by external
libraries, even when those were widely popular. For example, we used Guava’s
Optional for a while, which changed at some point, and we also had issues
with Protobuf and Scala itself (especially how Scala’s APIs appear in
Java). API breakage might not be as serious in dynamic languages like
Python, where you can often keep compatibility with old behaviors, but it
really hurts in Java and Scala.
The problem is especially bad for us because of two aspects of how
Spark is used:
1) Spark is used for production data transformation jobs that people
need to keep running for a long time. Nobody wants to make changes to a job
that’s been working fine and computing something correctly for years just
to get a bug fix from the latest Spark release or whatever. It’s much
better if they can upgrade Spark without editing every job.
2) Spark is often used as “glue” to combine data processing code in
other libraries, and these might start to require different versions of our
dependencies. For example, the Guava class exposed in Spark became a
problem when third-party libraries started requiring a new version of
Guava: those new libraries just couldn’t work with Spark. Protobuf was
especially bad because some users wanted to read data stored as Protobufs
(or in a format that uses Protobuf inside), so they needed a different
version of the library in their main data processing code.
If there was some guarantee that this stuff would remain
backward-compatible, we’d be in a much better place. It’s not that hard to
keep a storage format backward-compatible: just document the format and
extend it only in ways that don’t break the meaning of old data (for
example, add new version numbers or field types that are read in a
different way). It’s a bit harder for a Java API, but maybe Spark could
just expose byte arrays directly and work on those if the API is not
guaranteed to stay stable (that is, we’d still use our own classes to
manipulate the data internally, and end users could use the Arrow library
if they want it).
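A minimal Scala sketch of that byte-array option, assuming Spark would hand
out each batch serialized in the Arrow IPC stream format (the trait and
method names are hypothetical):

    // Hypothetical: Spark exposes only serialized bytes, never Arrow
    // classes, so users can decode with whatever Arrow version they
    // bundle themselves.
    trait ArrowBytesBatch {
      // One record batch, serialized in the Arrow IPC stream format.
      def toArrowIpcBytes(): Array[Byte]
    }

    // Caller side (pseudo-usage): decode with the user's own Arrow library,
    // e.g. org.apache.arrow.vector.ipc.ArrowStreamReader over the bytes.

This keeps Arrow entirely out of Spark’s public signatures, at the cost of
a serialization step per batch.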
Matei
On Apr 20, 2019, at 8:38 AM, Bobby Evans <[email protected]> wrote:
I think you misunderstood the point of this SPIP. I responded to your
comments in the SPIP JIRA.
On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng <[email protected]> wrote:
I posted my comment in the JIRA. Main concerns here:
1. Exposing third-party Java APIs in Spark is risky. Arrow might have a
1.0 release someday.
2. ML/DL systems that can benefit from columnar format are mostly in
Python.
3. Simple operations, though they benefit from vectorization, might not be
worth the data exchange overhead.
So would an improved Pandas UDF API be good enough? For example,
SPARK-26412 (a UDF that takes an iterator of Arrow batches).
Sorry, I should have joined the discussion earlier! Hope it is not too
late :)
On Fri, Apr 19, 2019 at 1:20 PM <[email protected]> wrote:
+1 (non-binding) for better columnar data processing support.
From: Jules Damji <[email protected]>
Sent: Friday, April 19, 2019 12:21 PM
To: Bryan Cutler <[email protected]>
Cc: Dev <[email protected]>
Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
Columnar Processing Support
+1 (non-binding)
Sent from my iPhone
Pardon the dumb thumb typos :)
On Apr 19, 2019, at 10:30 AM, Bryan Cutler <[email protected]> wrote:
+1 (non-binding)
On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <[email protected]> wrote:
+1 (non-binding). Looking forward to seeing better support for
processing columnar data.
Jason
On Tue, Apr 16, 2019 at 10:38 AM Tom Graves
<[email protected]> wrote:
Hi everyone,
I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
extended Columnar Processing Support. The proposal is to extend the support
to allow for more columnar processing.
You can find the full proposal in the jira at:
https://issues.apache.org/jira/browse/SPARK-27396. There was also a
DISCUSS thread in the dev mailing list.
Please vote as early as you can; I will leave the vote open until
next Monday (the 22nd), 2pm CST, to give people plenty of time.
[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...
Thanks!
Tom Graves
---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]