Re: data source v2 online meetup

2018-02-02 Thread Jacek Laskowski
Hi Reynold,

In general, that is a very good idea to get the community engaged (even if
most people would just listen / hide in the dark like myself). I know of no
other open source project, at the ASF or elsewhere, where such an initiative
has even been tried. Kudos for the idea!

Regards,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Wed, Jan 31, 2018 at 11:35 PM, Reynold Xin  wrote:

> Data source v2 API is one of the larger changes in Spark 2.3, and
> whatever has already been committed is only the first version, and we'd
> need more work post-2.3 to improve and stabilize it.
>
> I think at this point we should stop making changes to it in branch-2.3,
> and instead focus on using the existing API and getting feedback for 2.4.
> Would people be interested in doing an online hangout to discuss this,
> perhaps in the month of Feb?
>
> It'd be more productive if people attending the hangout have tried the API
> by implementing some new sources or porting an existing source over.
>
>
>


Re: data source v2 online meetup

2018-02-01 Thread Reynold Xin
Still would be good to join. We can also do an additional one in March to
give people more time.


On Thu, Feb 1, 2018 at 3:59 PM, Russell Spitzer 
wrote:

> I can try to do a quick scratch implementation to see how the connector
> fits in, but we are in the middle of release land so I don't have the
> amount of time I really need to think about this. I'd be glad to join any
> hangout to discuss everything though.
>
> On Thu, Feb 1, 2018 at 11:15 AM Ryan Blue  wrote:
>
>> We don't mind updating Iceberg when the API improves. We are fully aware
>> that this is a very early implementation and will change. My hope is that
>> the community is receptive to our suggestions.
>>
>> A good example of an area with friction is filter and projection
>> push-down. The implementation for DSv2 isn't based on what the other read
>> paths do; it is brand new and mostly untested. I don't really understand
>> why DSv2 introduced a new code path, when reusing existing code for this
>> ended up being smaller and working for more cases (see my comments on
>> #20476 <https://github.com/apache/spark/pull/20476>). I understand
>> wanting to fix parts of push-down, just not why it is a good idea to mix
>> that substantial change into an unrelated API update. This is one area
>> where, I hope, our suggestion to get DSv2 working well and redesign
>> push-down as a parallel effort is heard.
>>
>> I also see a few areas where the integration of DSv2 conflicts with what
>> I understand to be design principles of the catalyst optimizer. The fact
>> that it should use immutable nodes in plans is mostly settled, but there
>> are other examples. The approach of the new push-down implementation fights
>> against the principle of small rules that don't need to process the entire
>> plan tree. I think this makes the component brittle, and I'd like to
>> understand the rationale for going with this design. I'd love to see a
>> design document that covers why this is a necessary choice (but again,
>> separately).
>>
>> rb
>>
>> On Thu, Feb 1, 2018 at 9:10 AM, Felix Cheung 
>> wrote:
>>
>>> +1 hangout
>>>
>>> --
>>> *From:* Xiao Li 
>>> *Sent:* Wednesday, January 31, 2018 10:46:26 PM
>>> *To:* Ryan Blue
>>> *Cc:* Reynold Xin; dev; Wenchen Fen; Russell Spitzer
>>> *Subject:* Re: data source v2 online meetup
>>>
>>> Hi, Ryan,
>>>
>>> wow, your Iceberg already used data source V2 API! That is pretty cool!
>>> I am just afraid these new APIs are not stable. We might deprecate or
>>> change some data source v2 APIs in the next version (2.4). Sorry for the
>>> inconvenience it might introduce.
>>>
>>> Thanks for your feedback always,
>>>
>>> Xiao
>>>
>>>
>>> 2018-01-31 15:54 GMT-08:00 Ryan Blue :
>>>
>>>> Thanks for suggesting this, I think it's a great idea. I'll definitely
>>>> attend and can talk about the changes that we've made to DataSourceV2 to
>>>> enable our new table format, Iceberg
>>>> <https://github.com/Netflix/iceberg#about-iceberg>.
>>>>
>>>> On Wed, Jan 31, 2018 at 2:35 PM, Reynold Xin 
>>>> wrote:
>>>>
>>>>> Data source v2 API is one of the larger changes in Spark 2.3, and
>>>>> whatever has already been committed is only the first version, and we'd
>>>>> need more work post-2.3 to improve and stabilize it.
>>>>>
>>>>> I think at this point we should stop making changes to it in
>>>>> branch-2.3, and instead focus on using the existing API and getting
>>>>> feedback for 2.4. Would people be interested in doing an online hangout to
>>>>> discuss this, perhaps in the month of Feb?
>>>>>
>>>>> It'd be more productive if people attending the hangout have tried the
>>>>> API by implementing some new sources or porting an existing source over.
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>


Re: data source v2 online meetup

2018-02-01 Thread Russell Spitzer
I can try to do a quick scratch implementation to see how the connector
fits in, but we are in the middle of release land so I don't have the
amount of time I really need to think about this. I'd be glad to join any
hangout to discuss everything though.

On Thu, Feb 1, 2018 at 11:15 AM Ryan Blue  wrote:

> We don't mind updating Iceberg when the API improves. We are fully aware
> that this is a very early implementation and will change. My hope is that
> the community is receptive to our suggestions.
>
> A good example of an area with friction is filter and projection
> push-down. The implementation for DSv2 isn't based on what the other read
> paths do; it is brand new and mostly untested. I don't really understand
> why DSv2 introduced a new code path, when reusing existing code for this
> ended up being smaller and working for more cases (see my comments on #20476
> <https://github.com/apache/spark/pull/20476>). I understand wanting to
> fix parts of push-down, just not why it is a good idea to mix that
> substantial change into an unrelated API update. This is one area where, I
> hope, our suggestion to get DSv2 working well and redesign push-down as a
> parallel effort is heard.
>
> I also see a few areas where the integration of DSv2 conflicts with what I
> understand to be design principles of the catalyst optimizer. The fact that
> it should use immutable nodes in plans is mostly settled, but there are
> other examples. The approach of the new push-down implementation fights
> against the principle of small rules that don't need to process the entire
> plan tree. I think this makes the component brittle, and I'd like to
> understand the rationale for going with this design. I'd love to see a
> design document that covers why this is a necessary choice (but again,
> separately).
>
> rb
>
> On Thu, Feb 1, 2018 at 9:10 AM, Felix Cheung 
> wrote:
>
>> +1 hangout
>>
>> --
>> *From:* Xiao Li 
>> *Sent:* Wednesday, January 31, 2018 10:46:26 PM
>> *To:* Ryan Blue
>> *Cc:* Reynold Xin; dev; Wenchen Fen; Russell Spitzer
>> *Subject:* Re: data source v2 online meetup
>>
>> Hi, Ryan,
>>
>> wow, your Iceberg already used data source V2 API! That is pretty cool! I
>> am just afraid these new APIs are not stable. We might deprecate or change
>> some data source v2 APIs in the next version (2.4). Sorry for the
>> inconvenience it might introduce.
>>
>> Thanks for your feedback always,
>>
>> Xiao
>>
>>
>> 2018-01-31 15:54 GMT-08:00 Ryan Blue :
>>
>>> Thanks for suggesting this, I think it's a great idea. I'll definitely
>>> attend and can talk about the changes that we've made to DataSourceV2 to
>>> enable our new table format, Iceberg
>>> <https://github.com/Netflix/iceberg#about-iceberg>.
>>>
>>> On Wed, Jan 31, 2018 at 2:35 PM, Reynold Xin 
>>> wrote:
>>>
>>>> Data source v2 API is one of the larger changes in Spark 2.3, and
>>>> whatever has already been committed is only the first version, and we'd
>>>> need more work post-2.3 to improve and stabilize it.
>>>>
>>>> I think at this point we should stop making changes to it in
>>>> branch-2.3, and instead focus on using the existing API and getting
>>>> feedback for 2.4. Would people be interested in doing an online hangout to
>>>> discuss this, perhaps in the month of Feb?
>>>>
>>>> It'd be more productive if people attending the hangout have tried the
>>>> API by implementing some new sources or porting an existing source over.
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: data source v2 online meetup

2018-02-01 Thread Ryan Blue
We don't mind updating Iceberg when the API improves. We are fully aware
that this is a very early implementation and will change. My hope is that
the community is receptive to our suggestions.

A good example of an area with friction is filter and projection push-down.
The implementation for DSv2 isn't based on what the other read paths do; it
is brand new and mostly untested. I don't really understand why DSv2
introduced a new code path, when reusing existing code for this ended up
being smaller and working for more cases (see my comments on #20476
<https://github.com/apache/spark/pull/20476>). I understand wanting to fix
parts of push-down, just not why it is a good idea to mix that substantial
change into an unrelated API update. This is one area where, I hope, our
suggestion to get DSv2 working well and redesign push-down as a parallel
effort is heard.

I also see a few areas where the integration of DSv2 conflicts with what I
understand to be design principles of the catalyst optimizer. The fact that
it should use immutable nodes in plans is mostly settled, but there are
other examples. The approach of the new push-down implementation fights
against the principle of small rules that don't need to process the entire
plan tree. I think this makes the component brittle, and I'd like to
understand the rationale for going with this design. I'd love to see a
design document that covers why this is a necessary choice (but again,
separately).
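
As a rough illustration of the two principles above (immutable plan nodes
and small, local rules), here is a minimal toy sketch in Python. It is not
Spark or Catalyst code: the `Scan`, `Filter`, and `Project` classes, the
`transform_up` helper, and both rules are hypothetical names invented for
this example, and it assumes the source can evaluate every pushed filter
itself.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


# Hypothetical, immutable plan nodes (frozen dataclasses cannot be mutated).
@dataclass(frozen=True)
class Scan:
    table: str
    pushed_filters: Tuple[str, ...] = ()
    columns: Optional[Tuple[str, ...]] = None  # None means "all columns"


@dataclass(frozen=True)
class Filter:
    condition: str
    child: object


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: object


def transform_up(plan, rule):
    """Apply `rule` to every node bottom-up, rebuilding nodes instead of mutating."""
    if isinstance(plan, (Filter, Project)):
        plan = replace(plan, child=transform_up(plan.child, rule))
    return rule(plan)


def push_filter(node):
    """Small rule: only matches a Filter sitting directly on top of a Scan."""
    if isinstance(node, Filter) and isinstance(node.child, Scan):
        scan = node.child
        return replace(scan, pushed_filters=scan.pushed_filters + (node.condition,))
    return node


def push_projection(node):
    """Small rule: only matches a Project sitting directly on top of a Scan."""
    if isinstance(node, Project) and isinstance(node.child, Scan):
        return replace(node.child, columns=node.columns)
    return node


plan = Project(("id",), Filter("ts > 10", Scan("events")))
optimized = transform_up(transform_up(plan, push_filter), push_projection)
print(optimized)
```

Each rule inspects only a node and its direct child; the generic traversal,
not the rule, is what walks the tree. The original `plan` is left untouched
because every rewrite returns a new node.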

rb

On Thu, Feb 1, 2018 at 9:10 AM, Felix Cheung 
wrote:

> +1 hangout
>
> --
> *From:* Xiao Li 
> *Sent:* Wednesday, January 31, 2018 10:46:26 PM
> *To:* Ryan Blue
> *Cc:* Reynold Xin; dev; Wenchen Fen; Russell Spitzer
> *Subject:* Re: data source v2 online meetup
>
> Hi, Ryan,
>
> wow, your Iceberg already used data source V2 API! That is pretty cool! I
> am just afraid these new APIs are not stable. We might deprecate or change
> some data source v2 APIs in the next version (2.4). Sorry for the
> inconvenience it might introduce.
>
> Thanks for your feedback always,
>
> Xiao
>
>
> 2018-01-31 15:54 GMT-08:00 Ryan Blue :
>
>> Thanks for suggesting this, I think it's a great idea. I'll definitely
>> attend and can talk about the changes that we've made to DataSourceV2 to
>> enable our new table format, Iceberg
>> <https://github.com/Netflix/iceberg#about-iceberg>.
>>
>> On Wed, Jan 31, 2018 at 2:35 PM, Reynold Xin  wrote:
>>
>>> Data source v2 API is one of the larger changes in Spark 2.3, and
>>> whatever has already been committed is only the first version, and we'd
>>> need more work post-2.3 to improve and stabilize it.
>>>
>>> I think at this point we should stop making changes to it in branch-2.3,
>>> and instead focus on using the existing API and getting feedback for 2.4.
>>> Would people be interested in doing an online hangout to discuss this,
>>> perhaps in the month of Feb?
>>>
>>> It'd be more productive if people attending the hangout have tried the
>>> API by implementing some new sources or porting an existing source over.
>>>
>>>
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: data source v2 online meetup

2018-02-01 Thread Felix Cheung
+1 hangout


From: Xiao Li 
Sent: Wednesday, January 31, 2018 10:46:26 PM
To: Ryan Blue
Cc: Reynold Xin; dev; Wenchen Fen; Russell Spitzer
Subject: Re: data source v2 online meetup

Hi, Ryan,

wow, your Iceberg already used data source V2 API! That is pretty cool! I am 
just afraid these new APIs are not stable. We might deprecate or change some 
data source v2 APIs in the next version (2.4). Sorry for the inconvenience it 
might introduce.

Thanks for your feedback always,

Xiao


2018-01-31 15:54 GMT-08:00 Ryan Blue:
Thanks for suggesting this, I think it's a great idea. I'll definitely attend 
and can talk about the changes that we've made to DataSourceV2 to enable our new
table format, Iceberg <https://github.com/Netflix/iceberg#about-iceberg>.

On Wed, Jan 31, 2018 at 2:35 PM, Reynold Xin wrote:
Data source v2 API is one of the larger changes in Spark 2.3, and whatever
has already been committed is only the first version, and we'd need more
work post-2.3 to improve and stabilize it.

I think at this point we should stop making changes to it in branch-2.3, and 
instead focus on using the existing API and getting feedback for 2.4. Would 
people be interested in doing an online hangout to discuss this, perhaps in the 
month of Feb?

It'd be more productive if people attending the hangout have tried the API by 
implementing some new sources or porting an existing source over.





--
Ryan Blue
Software Engineer
Netflix



Re: data source v2 online meetup

2018-01-31 Thread Xiao Li
Hi, Ryan,

Wow, Iceberg already uses the data source V2 API! That is pretty cool! I
am just afraid these new APIs are not stable. We might deprecate or change
some data source v2 APIs in the next version (2.4). Sorry for the
inconvenience this might cause.

Thanks, as always, for your feedback,

Xiao


2018-01-31 15:54 GMT-08:00 Ryan Blue :

> Thanks for suggesting this, I think it's a great idea. I'll definitely
> attend and can talk about the changes that we've made to DataSourceV2 to
> enable our new table format, Iceberg
> <https://github.com/Netflix/iceberg#about-iceberg>.
>
> On Wed, Jan 31, 2018 at 2:35 PM, Reynold Xin  wrote:
>
>> Data source v2 API is one of the larger changes in Spark 2.3, and
>> whatever has already been committed is only the first version, and we'd
>> need more work post-2.3 to improve and stabilize it.
>>
>> I think at this point we should stop making changes to it in branch-2.3,
>> and instead focus on using the existing API and getting feedback for 2.4.
>> Would people be interested in doing an online hangout to discuss this,
>> perhaps in the month of Feb?
>>
>> It'd be more productive if people attending the hangout have tried the
>> API by implementing some new sources or porting an existing source over.
>>
>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: data source v2 online meetup

2018-01-31 Thread Ryan Blue
Thanks for suggesting this, I think it's a great idea. I'll definitely
attend and can talk about the changes that we've made to DataSourceV2 to
enable our new table format, Iceberg
<https://github.com/Netflix/iceberg#about-iceberg>.

On Wed, Jan 31, 2018 at 2:35 PM, Reynold Xin  wrote:

> Data source v2 API is one of the larger changes in Spark 2.3, and
> whatever has already been committed is only the first version, and we'd
> need more work post-2.3 to improve and stabilize it.
>
> I think at this point we should stop making changes to it in branch-2.3,
> and instead focus on using the existing API and getting feedback for 2.4.
> Would people be interested in doing an online hangout to discuss this,
> perhaps in the month of Feb?
>
> It'd be more productive if people attending the hangout have tried the API
> by implementing some new sources or porting an existing source over.
>
>
>


-- 
Ryan Blue
Software Engineer
Netflix