[GitHub] gatorsmile commented on a change in pull request #163: Announce the schedule of 2019 Spark+AI summit at SF

2018-12-19 Thread GitBox
gatorsmile commented on a change in pull request #163: Announce the schedule of 
2019 Spark+AI summit at SF
URL: https://github.com/apache/spark-website/pull/163#discussion_r243158425
 
 

 ##
 File path: site/sitemap.xml
 ##
 @@ -139,657 +139,661 @@
 
 
 
-  https://spark.apache.org/releases/spark-release-2-4-0.html
+  
http://localhost:4000/news/spark-ai-summit-apr-2019-agenda-posted.html
 
 Review comment:
   Will update them later. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Noisy spark-website notifications

2018-12-19 Thread Reynold Xin
I added my comment there too!

On Wed, Dec 19, 2018 at 7:26 PM, Hyukjin Kwon < gurwls...@gmail.com > wrote:

> 
> Yea, that's a bit noisy .. I would just completely disable it to be
> honest. I failed https://issues.apache.org/jira/browse/INFRA-17469 before. I would
> appreciate if there would be more inputs there :-)
> 
> 
> On Thu, Dec 20, 2018 at 11:22 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> 
> 
>> I'd prefer it if we disabled all git notifications for spark-website.
>> Folks who want to stay on top of what's happening with the site can simply
>> watch the repo on GitHub ( https://github.com/apache/spark-website ) , no?
>> 
>> 
>> On Wed, Dec 19, 2018 at 10:00 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>> 
>> 
>>> +1, at least it should only send one email when a PR is merged.
>>> 
>>> 
>>> On Thu, Dec 20, 2018 at 10:58 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>> 
>>> 
 Can we somehow disable these new email alerts coming through for the Spark
 website repo?
 
 On Wed, Dec 19, 2018 at 8:25 PM GitBox <g...@apache.org> wrote:
 
 
> ueshin commented on a change in pull request #163: Announce the schedule
> of 2019 Spark+AI summit at SF
> URL: https://github.com/apache/spark-website/pull/163#discussion_r243130975
> 
> 
> 
> 
>  ##
>  File path: site/sitemap.xml
>  ##
>  @@ -139,657 +139,661 @@
>  
>  
>  
> -   https://spark.apache.org/releases/spark-release-2-4-0.html
> +   http://localhost:4000/news/spark-ai-summit-apr-2019-agenda-posted.html
> 
> 
> 
>  Review comment:
>    Still remaining `localhost:4000` in this file.
> 
> 
> This is an automated message from the Apache Git Service.
> To respond to the message, please log on GitHub and use the
> URL above to go to the specific comment.
> 
> For queries about this service, please contact Infrastructure at:
> us...@infra.apache.org
> 
> 
> With regards,
> Apache Git Services
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
 
 
 
>>> 
>>> 
>> 
>> 
> 
>

Re: Noisy spark-website notifications

2018-12-19 Thread Hyukjin Kwon
Yea, that's a bit noisy .. I would just completely disable it to be honest.
I failed https://issues.apache.org/jira/browse/INFRA-17469 before. I would
appreciate if there would be more inputs there :-)

On Thu, Dec 20, 2018 at 11:22 AM, Nicholas Chammas wrote:

> I'd prefer it if we disabled all git notifications for spark-website.
> Folks who want to stay on top of what's happening with the site can simply 
> watch
> the repo on GitHub , no?
>
> On Wed, Dec 19, 2018 at 10:00 PM Wenchen Fan  wrote:
>
>> +1, at least it should only send one email when a PR is merged.
>>
>> On Thu, Dec 20, 2018 at 10:58 AM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Can we somehow disable these new email alerts coming through for the
>>> Spark website repo?
>>>
>>> On Wed, Dec 19, 2018 at 8:25 PM GitBox  wrote:
>>>
 ueshin commented on a change in pull request #163: Announce the
 schedule of 2019 Spark+AI summit at SF
 URL:
 https://github.com/apache/spark-website/pull/163#discussion_r243130975



  ##
  File path: site/sitemap.xml
  ##
  @@ -139,657 +139,661 @@
  
  
  
 -  https://spark.apache.org/releases/spark-release-2-4-0.html
 
 +  
 http://localhost:4000/news/spark-ai-summit-apr-2019-agenda-posted.html
 

  Review comment:
Still remaining `localhost:4000` in this file.

 
 This is an automated message from the Apache Git Service.
 To respond to the message, please log on GitHub and use the
 URL above to go to the specific comment.

 For queries about this service, please contact Infrastructure at:
 us...@infra.apache.org


 With regards,
 Apache Git Services

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Wenchen Fan
So you agree with my proposal that we should follow RDBMS/SQL standard
regarding the behavior?

> pass the default through to the underlying data source

This is one way to implement the behavior.

On Thu, Dec 20, 2018 at 11:12 AM Ryan Blue  wrote:

> I don't think we have to change the syntax. Isn't the right thing (for
> option 1) to pass the default through to the underlying data source?
> Sources that don't support defaults would throw an exception.
>
> On Wed, Dec 19, 2018 at 6:29 PM Wenchen Fan  wrote:
>
>> The standard ADD COLUMN SQL syntax is: ALTER TABLE table_name ADD COLUMN
>> column_name datatype [DEFAULT value];
>>
>> If the DEFAULT statement is not specified, then the default value is
>> null. If we are going to change the behavior and say the default value is
>> decided by the underlying data source, we should use a new SQL syntax(I
>> don't have a proposal in mind), instead of reusing the existing syntax, to
>> be SQL compatible.
>>
>> Personally I don't like re-invent wheels. It's better to just implement
>> the SQL standard ADD COLUMN command, which means the default value is
>> decided by the end-users.
>>
>> On Thu, Dec 20, 2018 at 12:43 AM Ryan Blue  wrote:
>>
>>> Wenchen, can you give more detail about the different ADD COLUMN syntax?
>>> That sounds confusing to end users to me.
>>>
>>> On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan  wrote:
>>>
 Note that the design we make here will affect both data source
 developers and end-users. It's better to provide reliable behaviors to
 end-users, instead of asking them to read the spec of the data source and
 know which value will be used for missing columns, when they write data.

 If we do want to go with the "data source decides default value"
 approach, we should create a new SQL syntax for ADD COLUMN, as its behavior
 is different from the SQL standard ADD COLUMN command.

 On Wed, Dec 19, 2018 at 10:58 PM Russell Spitzer <
 russell.spit...@gmail.com> wrote:

> I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want
> 2 is for a unified way of dealing with missing columns? I feel like that
> probably should be left up to the underlying datasource implementation. 
> For
> example if you have missing columns with a database the Datasource can
> choose a value based on the Database's metadata if such a thing exists, I
> don't think Spark should really have a this level of detail but I've also
> missed out on all of these meetings (sorry it's family dinner time :) ) so
> I may be missing something.
>
> So my tldr is, Let a datasource report whether or not missing columns
> are OK and let the Datasource deal with the missing data based on it's
> underlying storage.
>
> On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan 
> wrote:
>
>> I agree that we should not rewrite existing parquet files when a new
>> column is added, but we should also try out best to make the behavior 
>> same
>> as RDBMS/SQL standard.
>>
>> 1. it should be the user who decides the default value of a column,
>> by CREATE TABLE, or ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
>> 2. When adding a new column, the default value should be effective
>> for all the existing data, and newly written data.
>> 3. When altering an existing column and change the default value, it
>> should be effective for newly written data only.
>>
>> A possible implementation:
>> 1. a columnn has 2 default values: the initial one and the latest one.
>> 2. when adding a column with a default value, set both the initial
>> one and the latest one to this value. But do not update existing data.
>> 3. when reading data, fill the missing column with the initial
>> default value
>> 4. when writing data, fill the missing column with the latest default
>> value
>> 5. when altering a column to change its default value, only update
>> the latest default value.
>>
>> This works because:
>> 1. new files will be written with the latest default value, nothing
>> we need to worry about at read time.
>> 2. old files will be read with the initial default value, which
>> returns expected result.
>>
>> On Wed, Dec 19, 2018 at 8:39 AM Ryan Blue 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> This thread is a follow-up to a discussion that we started in the
>>> DSv2 community sync last week.
>>>
>>> The problem I’m trying to solve is that the format I’m using DSv2 to
>>> integrate supports schema evolution. Specifically, adding a new optional
>>> column so that rows without that column get a default value (null for
>>> Iceberg). The current validation rule for an append in DSv2 fails a 
>>> write
>>> if it is missing a column, so adding a column to an existing table will
>>> cause currently-scheduled jobs that insert data to 

Re: Noisy spark-website notifications

2018-12-19 Thread Nicholas Chammas
I'd prefer it if we disabled all git notifications for spark-website. Folks
who want to stay on top of what's happening with the site can simply watch
the repo on GitHub , no?

On Wed, Dec 19, 2018 at 10:00 PM Wenchen Fan  wrote:

> +1, at least it should only send one email when a PR is merged.
>
> On Thu, Dec 20, 2018 at 10:58 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Can we somehow disable these new email alerts coming through for the
>> Spark website repo?
>>
>> On Wed, Dec 19, 2018 at 8:25 PM GitBox  wrote:
>>
>>> ueshin commented on a change in pull request #163: Announce the schedule
>>> of 2019 Spark+AI summit at SF
>>> URL:
>>> https://github.com/apache/spark-website/pull/163#discussion_r243130975
>>>
>>>
>>>
>>>  ##
>>>  File path: site/sitemap.xml
>>>  ##
>>>  @@ -139,657 +139,661 @@
>>>  
>>>  
>>>  
>>> -  https://spark.apache.org/releases/spark-release-2-4-0.html
>>> +  
>>> http://localhost:4000/news/spark-ai-summit-apr-2019-agenda-posted.html
>>> 
>>>
>>>  Review comment:
>>>Still remaining `localhost:4000` in this file.
>>>
>>> 
>>> This is an automated message from the Apache Git Service.
>>> To respond to the message, please log on GitHub and use the
>>> URL above to go to the specific comment.
>>>
>>> For queries about this service, please contact Infrastructure at:
>>> us...@infra.apache.org
>>>
>>>
>>> With regards,
>>> Apache Git Services
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Ryan Blue
I don't think we have to change the syntax. Isn't the right thing (for
option 1) to pass the default through to the underlying data source?
Sources that don't support defaults would throw an exception.
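
To make that concrete, here is a small, purely hypothetical Python sketch of what
"passing the default through" could look like (none of these class or method names
exist in Spark's DSv2 API; they are stand-ins for illustration only):

```python
# Hypothetical sketch only -- these names are NOT an existing DSv2 API.
# Idea of option 1: the user-declared default is handed to the source when the
# column is added; a source that cannot store defaults rejects the operation.

class UnsupportedDefaultError(Exception):
    pass

class ParquetLikeSource:
    """Stores defaults in table metadata; existing files are never rewritten."""
    def __init__(self):
        self.column_defaults = {}

    def add_column(self, name, data_type, default=None):
        self.column_defaults[name] = default

class NoDefaultSource:
    """A source with no place to record a default."""
    def add_column(self, name, data_type, default=None):
        if default is not None:
            raise UnsupportedDefaultError(
                "source cannot store a default for column %r" % name)

ParquetLikeSource().add_column("c", "int", default=0)       # accepted
try:
    NoDefaultSource().add_column("c", "int", default=0)     # rejected
except UnsupportedDefaultError as err:
    print(err)
```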

On Wed, Dec 19, 2018 at 6:29 PM Wenchen Fan  wrote:

> The standard ADD COLUMN SQL syntax is: ALTER TABLE table_name ADD COLUMN
> column_name datatype [DEFAULT value];
>
> If the DEFAULT statement is not specified, then the default value is null.
> If we are going to change the behavior and say the default value is decided
> by the underlying data source, we should use a new SQL syntax(I don't have
> a proposal in mind), instead of reusing the existing syntax, to be SQL
> compatible.
>
> Personally I don't like re-invent wheels. It's better to just implement
> the SQL standard ADD COLUMN command, which means the default value is
> decided by the end-users.
>
> On Thu, Dec 20, 2018 at 12:43 AM Ryan Blue  wrote:
>
>> Wenchen, can you give more detail about the different ADD COLUMN syntax?
>> That sounds confusing to end users to me.
>>
>> On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan  wrote:
>>
>>> Note that the design we make here will affect both data source
>>> developers and end-users. It's better to provide reliable behaviors to
>>> end-users, instead of asking them to read the spec of the data source and
>>> know which value will be used for missing columns, when they write data.
>>>
>>> If we do want to go with the "data source decides default value"
>>> approach, we should create a new SQL syntax for ADD COLUMN, as its behavior
>>> is different from the SQL standard ADD COLUMN command.
>>>
>>> On Wed, Dec 19, 2018 at 10:58 PM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
 I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want 2
 is for a unified way of dealing with missing columns? I feel like that
 probably should be left up to the underlying datasource implementation. For
 example if you have missing columns with a database the Datasource can
 choose a value based on the Database's metadata if such a thing exists, I
 don't think Spark should really have a this level of detail but I've also
 missed out on all of these meetings (sorry it's family dinner time :) ) so
 I may be missing something.

 So my tldr is, Let a datasource report whether or not missing columns
 are OK and let the Datasource deal with the missing data based on it's
 underlying storage.

 On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan 
 wrote:

> I agree that we should not rewrite existing parquet files when a new
> column is added, but we should also try out best to make the behavior same
> as RDBMS/SQL standard.
>
> 1. it should be the user who decides the default value of a column, by
> CREATE TABLE, or ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
> 2. When adding a new column, the default value should be effective for
> all the existing data, and newly written data.
> 3. When altering an existing column and change the default value, it
> should be effective for newly written data only.
>
> A possible implementation:
> 1. a columnn has 2 default values: the initial one and the latest one.
> 2. when adding a column with a default value, set both the initial one
> and the latest one to this value. But do not update existing data.
> 3. when reading data, fill the missing column with the initial default
> value
> 4. when writing data, fill the missing column with the latest default
> value
> 5. when altering a column to change its default value, only update the
> latest default value.
>
> This works because:
> 1. new files will be written with the latest default value, nothing we
> need to worry about at read time.
> 2. old files will be read with the initial default value, which
> returns expected result.
>
> On Wed, Dec 19, 2018 at 8:39 AM Ryan Blue 
> wrote:
>
>> Hi everyone,
>>
>> This thread is a follow-up to a discussion that we started in the
>> DSv2 community sync last week.
>>
>> The problem I’m trying to solve is that the format I’m using DSv2 to
>> integrate supports schema evolution. Specifically, adding a new optional
>> column so that rows without that column get a default value (null for
>> Iceberg). The current validation rule for an append in DSv2 fails a write
>> if it is missing a column, so adding a column to an existing table will
>> cause currently-scheduled jobs that insert data to start failing. 
>> Clearly,
>> schema evolution shouldn't break existing jobs that produce valid data.
>>
>> To fix this problem, I suggested option 1: adding a way for Spark to
>> check whether to fail when an optional column is missing. Other
>> contributors in the sync thought that Spark should go with option 2:
>> Spark’s schema should have 

Re: Noisy spark-website notifications

2018-12-19 Thread Wenchen Fan
+1, at least it should only send one email when a PR is merged.

On Thu, Dec 20, 2018 at 10:58 AM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Can we somehow disable these new email alerts coming through for the Spark
> website repo?
>
> On Wed, Dec 19, 2018 at 8:25 PM GitBox  wrote:
>
>> ueshin commented on a change in pull request #163: Announce the schedule
>> of 2019 Spark+AI summit at SF
>> URL:
>> https://github.com/apache/spark-website/pull/163#discussion_r243130975
>>
>>
>>
>>  ##
>>  File path: site/sitemap.xml
>>  ##
>>  @@ -139,657 +139,661 @@
>>  
>>  
>>  
>> -  https://spark.apache.org/releases/spark-release-2-4-0.html
>> +  
>> http://localhost:4000/news/spark-ai-summit-apr-2019-agenda-posted.html
>> 
>>
>>  Review comment:
>>Still remaining `localhost:4000` in this file.
>>
>> 
>> This is an automated message from the Apache Git Service.
>> To respond to the message, please log on GitHub and use the
>> URL above to go to the specific comment.
>>
>> For queries about this service, please contact Infrastructure at:
>> us...@infra.apache.org
>>
>>
>> With regards,
>> Apache Git Services
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Noisy spark-website notifications

2018-12-19 Thread Reynold Xin
I think there is an infra ticket open for it right now.

On Wed, Dec 19, 2018 at 6:58 PM Nicholas Chammas 
wrote:

> Can we somehow disable these new email alerts coming through for the Spark
> website repo?
>
> On Wed, Dec 19, 2018 at 8:25 PM GitBox  wrote:
>
>> ueshin commented on a change in pull request #163: Announce the schedule
>> of 2019 Spark+AI summit at SF
>> URL:
>> https://github.com/apache/spark-website/pull/163#discussion_r243130975
>>
>>
>>
>>  ##
>>  File path: site/sitemap.xml
>>  ##
>>  @@ -139,657 +139,661 @@
>>  
>>  
>>  
>> -  https://spark.apache.org/releases/spark-release-2-4-0.html
>> +  
>> http://localhost:4000/news/spark-ai-summit-apr-2019-agenda-posted.html
>> 
>>
>>  Review comment:
>>Still remaining `localhost:4000` in this file.
>>
>> 
>> This is an automated message from the Apache Git Service.
>> To respond to the message, please log on GitHub and use the
>> URL above to go to the specific comment.
>>
>> For queries about this service, please contact Infrastructure at:
>> us...@infra.apache.org
>>
>>
>> With regards,
>> Apache Git Services
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Noisy spark-website notifications

2018-12-19 Thread Nicholas Chammas
Can we somehow disable these new email alerts coming through for the Spark
website repo?

On Wed, Dec 19, 2018 at 8:25 PM GitBox  wrote:

> ueshin commented on a change in pull request #163: Announce the schedule
> of 2019 Spark+AI summit at SF
> URL:
> https://github.com/apache/spark-website/pull/163#discussion_r243130975
>
>
>
>  ##
>  File path: site/sitemap.xml
>  ##
>  @@ -139,657 +139,661 @@
>  
>  
>  
> -  https://spark.apache.org/releases/spark-release-2-4-0.html
> +  
> http://localhost:4000/news/spark-ai-summit-apr-2019-agenda-posted.html
> 
>
>  Review comment:
>Still remaining `localhost:4000` in this file.
>
> 
> This is an automated message from the Apache Git Service.
> To respond to the message, please log on GitHub and use the
> URL above to go to the specific comment.
>
> For queries about this service, please contact Infrastructure at:
> us...@infra.apache.org
>
>
> With regards,
> Apache Git Services
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[GitHub] ueshin commented on a change in pull request #163: Announce the schedule of 2019 Spark+AI summit at SF

2018-12-19 Thread GitBox
ueshin commented on a change in pull request #163: Announce the schedule of 
2019 Spark+AI summit at SF
URL: https://github.com/apache/spark-website/pull/163#discussion_r243132369
 
 

 ##
 File path: site/sitemap.xml
 ##
 @@ -139,657 +139,661 @@
 
 
 
-  https://spark.apache.org/releases/spark-release-2-4-0.html
+  
http://localhost:4000/news/spark-ai-summit-apr-2019-agenda-posted.html
 
 Review comment:
   @gatorsmile oops, I mean, there are a lot of `localhost:4000` in this file.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[GitHub] gatorsmile commented on issue #163: Announce the schedule of 2019 Spark+AI summit at SF

2018-12-19 Thread GitBox
gatorsmile commented on issue #163: Announce the schedule of 2019 Spark+AI 
summit at SF
URL: https://github.com/apache/spark-website/pull/163#issuecomment-448825575
 
 
   Thanks! Merged to master.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[GitHub] ueshin commented on a change in pull request #163: Announce the schedule of 2019 Spark+AI summit at SF

2018-12-19 Thread GitBox
ueshin commented on a change in pull request #163: Announce the schedule of 
2019 Spark+AI summit at SF
URL: https://github.com/apache/spark-website/pull/163#discussion_r243130975
 
 

 ##
 File path: site/sitemap.xml
 ##
 @@ -139,657 +139,661 @@
 
 
 
-  https://spark.apache.org/releases/spark-release-2-4-0.html
+  
http://localhost:4000/news/spark-ai-summit-apr-2019-agenda-posted.html
 
 Review comment:
   Still remaining `localhost:4000` in this file.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[GitHub] gatorsmile commented on a change in pull request #163: Announce the schedule of 2019 Spark+AI summit at SF

2018-12-19 Thread GitBox
gatorsmile commented on a change in pull request #163: Announce the schedule of 
2019 Spark+AI summit at SF
URL: https://github.com/apache/spark-website/pull/163#discussion_r243128948
 
 

 ##
 File path: site/mailing-lists.html
 ##
 @@ -12,7 +12,7 @@
 
   
 
-https://spark.apache.org/community.html; />
+http://localhost:4000/community.html; />
 
 Review comment:
   Need to fix this.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[GitHub] gatorsmile commented on issue #163: Announce the schedule of 2019 Spark+AI summit at SF

2018-12-19 Thread GitBox
gatorsmile commented on issue #163: Announce the schedule of 2019 Spark+AI 
summit at SF
URL: https://github.com/apache/spark-website/pull/163#issuecomment-448815820
 
 
   cc @rxin @yhuai @cloud-fan @srowen 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[GitHub] gatorsmile opened a new pull request #163: Announce the schedule of Spark+AI summit at SF 2019

2018-12-19 Thread GitBox
gatorsmile opened a new pull request #163: Announce the schedule of Spark+AI 
summit at SF 2019
URL: https://github.com/apache/spark-website/pull/163
 
 
   ![screen shot 2018-12-19 at 4 59 12 pm](https://user-images.githubusercontent.com/11567269/50257364-d76e4900-03af-11e9-9690-3de0a87917ef.png)
   ![screen shot 2018-12-19 at 4 59 02 pm](https://user-images.githubusercontent.com/11567269/50257365-d806df80-03af-11e9-9dff-fabc08bb64b5.png)
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



SPARK-26415: Mark StreamSinkProvider and StreamSourceProvider as stable

2018-12-19 Thread Grant Henke
Hello Spark Developers,

Dongjoon Hyun suggested that I send an email to the dev list pointing to my
suggested change.

Jira: https://issues.apache.org/jira/browse/SPARK-26415
Pull request: https://github.com/apache/spark/pull/23354

For convenience I will post the commit message here:

This change marks the StreamSinkProvider and StreamSourceProvider
> interfaces as stable so that it can be relied on for compatibility for all
> of
> Spark 2.x.
>
> These interfaces have been available since Spark 2.0.0 and unchanged
> since Spark 2.1.0. Additionally the Kafka integration has been using it
> since Spark 2.1.0.
>
> Because structured streaming general availability was announced in
> Spark 2.2.0, I suspect there are other third-party integrations using it
> already as well.
>

For additional context, I would like to use the  StreamSinkProvider
interface in
Apache Kudu but do not feel comfortable committing the change without first
knowing the interface is stable.

I think leaving these interfaces as experimental was likely an oversight
given all structured streaming was marked GA, but still wanted the
annotation
change to be sure.

I appreciate your feedback and review.

Thank you,
Grant

-- 
Grant Henke
Software Engineer | Cloudera
gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke


SPARk-25299: Updates As Of December 19, 2018

2018-12-19 Thread Matt Cheah
Hi everyone,

 

Earlier this year, we proposed SPARK-25299, proposing the idea of using other 
storage systems for persisting shuffle files. Since that time, we have been 
continuing to work on prototypes for this project. In the interest of 
increasing transparency into our work, we have created a progress report 
document where you may find a summary of the work we have been doing, as well 
as links to our prototypes on Github. We would ask that anyone who is very 
familiar with the inner workings of Spark’s shuffle could provide feedback and 
comments on our work thus far. We welcome any further discussion in this space. 
You may comment in this e-mail thread or by commenting on the progress report 
document.

 

Looking forward to hearing from you. Thanks,

 

-Matt Cheah





Updated proposal: Consistent timestamp types in Hadoop SQL engines

2018-12-19 Thread Zoltan Ivanfi
Dear All,

I would like to thank every reviewer of the consistent timestamps
proposal[1] for their time and valuable comments. Based on your
feedback, I have updated the proposal. The changes include
clarifications, fixes and other improvements as summarized at the end
of the document, in the Changelog section[2].

Another category of changes is declaring some topics as out-of-scope
with the intention to keep the scope under control. While these topics
are worth discussing, I suggest doing that in follow-up efforts. I
think it is easier to reach decisions in bite-sized chunks and the
proposal in its current form is already near the limit that is a
convenient read in a single sitting.

Please take a look at the updated proposal. I'm looking forward to
further feedback and suggestions.

Thanks,

Zoltan

[1] 
https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q/edit
[2] 
https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q/edit#heading=h.b90toonzuv1y

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [build system] jenkins master needs reboot, temporary downtime

2018-12-19 Thread Reynold Xin
Thanks for taking care of this, Shane!

On Wed, Dec 19, 2018 at 9:45 AM, shane knapp < skn...@berkeley.edu > wrote:

> 
> master is back up and building.
> 
> On Wed, Dec 19, 2018 at 9:31 AM shane knapp <skn...@berkeley.edu> wrote:
> 
> 
>> the jenkins process seems to be wedged again, and i think we're going to
>> hit it w/the reboot hammer, rather than just killing/restarting the
>> master.
>> 
>> 
>> this should take at most 30 mins, and i'll send an all-clear when it's
>> done.
>> 
>> 
>> --
>> Shane Knapp
>> 
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> 
>> https://rise.cs.berkeley.edu
>> 
>> 
> 
> 
> 
> 
> --
> Shane Knapp
> 
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> 
> https://rise.cs.berkeley.edu
>

Re: [build system] jenkins master needs reboot, temporary downtime

2018-12-19 Thread shane knapp
master is back up and building.

On Wed, Dec 19, 2018 at 9:31 AM shane knapp  wrote:

> the jenkins process seems to be wedged again, and i think we're going to
> hit it w/the reboot hammer, rather than just killing/restarting the master.
>
> this should take at most 30 mins, and i'll send an all-clear when it's
> done.
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] jenkins master needs reboot, temporary downtime

2018-12-19 Thread shane knapp
the jenkins process seems to be wedged again, and i think we're going to
hit it w/the reboot hammer, rather than just killing/restarting the master.

this should take at most 30 mins, and i'll send an all-clear when it's done.

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Parse xmlrdd with pyspark

2018-12-19 Thread Anshul Sachdeva
Hello Team,

I am trying to parse an xml with spark xml library, I am reading xml from
web service using python requests module in a variable then I need to parse
it before storing into target table.

I like to do this without saving a file somewhere and then load it.

I know in Java , I have used xmlreader class to parse an xml rdd.

But I am new to python , not sure how I can do the same in pyspark.

Any lead will be appreciable.

Thanks
Ansh
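
One possible approach (a sketch, not the spark-xml library): since the XML payload is
already held in a Python string, it can be parsed with the standard library and turned
into a DataFrame without writing a file first. The sample payload and tag names below
are made up:

```python
import xml.etree.ElementTree as ET
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("xml-from-string").getOrCreate()

# Pretend this string came back from the web service via the requests module.
xml_payload = ("<records>"
               "<record><id>1</id><name>alpha</name></record>"
               "<record><id>2</id><name>beta</name></record>"
               "</records>")

def parse_records(xml_string):
    """Yield one Row per <record> element in the XML string."""
    root = ET.fromstring(xml_string)
    for rec in root.findall("record"):
        yield Row(id=int(rec.findtext("id")), name=rec.findtext("name"))

rows = spark.sparkContext.parallelize([xml_payload]).flatMap(parse_records)
spark.createDataFrame(rows).show()
```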


Re: barrier execution mode with DataFrame and dynamic allocation

2018-12-19 Thread Xiangrui Meng
(don't know why your email ends with ".invalid")

On Wed, Dec 19, 2018 at 9:13 AM Xiangrui Meng  wrote:

>
>
> On Wed, Dec 19, 2018 at 7:34 AM Ilya Matiach 
> wrote:
> >
> > [Note: I sent this earlier but it looks like the email was blocked
> because I had another email group on the CC line]
> >
> > Hi Spark Dev,
> >
> > I would like to use the new barrier execution mode introduced in spark
> 2.4 with LightGBM in the spark package mmlspark but I ran into some issues
> and I had a couple questions.
> >
> > Currently, the LightGBM distributed learner tries to figure out the
> number of cores on the cluster and then does a coalesce and a
> mapPartitions, and inside the mapPartitions we do a NetworkInit (where the
> address:port of all workers needs to be passed in the constructor) and pass
> the data in-memory to the native layer of the distributed lightgbm learner.
> >
> >
> >
> > With barrier execution mode, I think the code would become much more
> robust.  However, there are several issues that I am running into when
> trying to move my code over to the new barrier execution mode scheduler:
> >
> > Does not support dynamic allocation – however, I think it would be
> convenient if it restarted the job when the number of workers has decreased
> and allowed the dev to decide whether to restart the job if the number of
> workers increased
>
> How does mmlspark handle dynamic allocation? Do you have a watch thread on
> the driver to restart the job if there are more workers? And when the
> number of workers decrease, can training continue without driver involved?
>
> > Does not work with DataFrame or Dataset API, but I think it would be
> much more convenient if it did
>
> DataFrame/Dataset do not have APIs to let users scan through the entire
> partition. The closest is Pandas UDF, which scans data per batch. I'm
> thinking about the following:
>
> If we change Pandas UDF to take an iterator of record batches (instead of
> a single batch), and per contract we say this iterator will iterate through
> the entire partition. So you only need to do NetworkInit once.
>
> > How does barrier execution mode deal with #partitions > #tasks?  If the
> number of partitions is larger than the number of “tasks” or workers, can
> barrier execution mode automatically coalesce the dataset to have #
> partitions == # tasks?
>
> It will hang there and print warning messages. We didn't assume user code
> can correctly handle dynamic worker sizes.
>
> > It would be convenient to be able to get network information about all
> other workers in the cluster that are in the same barrier execution, eg the
> host address and some task # or identifier of all workers
>
> See getTaskInfos() at
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.BarrierTaskContext
> .
>
> We also provide a barrier() method there to assist simple coordination
> among workers.
>
> >
> >
> >
> > I would love to hear more about this new feature – also I had trouble
> finding documentation (JIRA:
> https://issues.apache.org/jira/browse/SPARK-24374, High level design:
> https://www.slideshare.net/hadoop/the-zoo-expands?qid=b2efbd75-97af-4f71-9add-abf84970eaef==_search=1),
> are there any good examples of spark packages that have moved to use the
> new barrier execution mode in spark 2.4?
>
> Databricks (which I'm an employee of) implemented HorovodRunner, which
> fully utilizes barrier execution mode. There is also a
> work-in-process open-source integration of Horovod/PySpark from Horovod
> author. Doing distributed deep learning training was the main use case
> considered in the design.
>
> Shall we have an offline meeting or open a JIRA to discuss more details
> about integrating mmlspark w/ barrier execution mode?
>
> >
> >
> >
> > Thank you, Ilya
>
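
For reference, a plain-Python sketch of the proposed contract (the iterator form was a
proposal at the time, not an existing Spark 2.4 API; the function and column names are
made up): the UDF receives an iterator over all record batches of a partition, so
one-time setup such as NetworkInit happens once per partition.

```python
import pandas as pd

def train_partition(batches):
    """Hypothetical per-partition UDF: consumes an iterator of pandas batches."""
    network_initialized = True      # stand-in for a one-time NetworkInit(...)
    rows_seen = 0
    for batch in batches:           # each batch is a pandas.DataFrame
        rows_seen += len(batch)
        yield pd.DataFrame({"rows_seen": [rows_seen]})

# Local illustration with two in-memory "record batches":
batches = iter([pd.DataFrame({"x": [1, 2]}), pd.DataFrame({"x": [3]})])
print(pd.concat(train_partition(batches)))
```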


Re: barrier execution mode with DataFrame and dynamic allocation

2018-12-19 Thread Xiangrui Meng
On Wed, Dec 19, 2018 at 7:34 AM Ilya Matiach 
wrote:
>
> [Note: I sent this earlier but it looks like the email was blocked
because I had another email group on the CC line]
>
> Hi Spark Dev,
>
> I would like to use the new barrier execution mode introduced in spark
2.4 with LightGBM in the spark package mmlspark but I ran into some issues
and I had a couple questions.
>
> Currently, the LightGBM distributed learner tries to figure out the
number of cores on the cluster and then does a coalesce and a
mapPartitions, and inside the mapPartitions we do a NetworkInit (where the
address:port of all workers needs to be passed in the constructor) and pass
the data in-memory to the native layer of the distributed lightgbm learner.
>
>
>
> With barrier execution mode, I think the code would become much more
robust.  However, there are several issues that I am running into when
trying to move my code over to the new barrier execution mode scheduler:
>
> Does not support dynamic allocation – however, I think it would be
convenient if it restarted the job when the number of workers has decreased
and allowed the dev to decide whether to restart the job if the number of
workers increased

How does mmlspark handle dynamic allocation? Do you have a watch thread on
the driver to restart the job if there are more workers? And when the
number of workers decrease, can training continue without driver involved?

> Does not work with DataFrame or Dataset API, but I think it would be much
more convenient if it did

DataFrame/Dataset do not have APIs to let users scan through the entire
partition. The closest is Pandas UDF, which scans data per batch. I'm
thinking about the following:

If we change Pandas UDF to take an iterator of record batches (instead of a
single batch), and per contract we say this iterator will iterate through
the entire partition. So you only need to do NetworkInit once.

> How does barrier execution mode deal with #partitions > #tasks?  If the
number of partitions is larger than the number of “tasks” or workers, can
barrier execution mode automatically coalesce the dataset to have #
partitions == # tasks?

It will hang there and print warning messages. We didn't assume user code
can correctly handle dynamic worker sizes.

> It would be convenient to be able to get network information about all
other workers in the cluster that are in the same barrier execution, eg the
host address and some task # or identifier of all workers

See getTaskInfos() at
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.BarrierTaskContext
.

We also provide a barrier() method there to assist simple coordination
among workers.

>
>
>
> I would love to hear more about this new feature – also I had trouble
finding documentation (JIRA:
https://issues.apache.org/jira/browse/SPARK-24374, High level design:
https://www.slideshare.net/hadoop/the-zoo-expands?qid=b2efbd75-97af-4f71-9add-abf84970eaef==_search=1),
are there any good examples of spark packages that have moved to use the
new barrier execution mode in spark 2.4?

Databricks (which I'm an employee of) implemented HorovodRunner, which
fully utilizes barrier execution mode. There is also a
work-in-process open-source integration of Horovod/PySpark from Horovod
author. Doing distributed deep learning training was the main use case
considered in the design.

Shall we have an offline meeting or open a JIRA to discuss more details
about integrating mmlspark w/ barrier execution mode?

>
>
>
> Thank you, Ilya
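
For reference, a minimal PySpark sketch of the Spark 2.4 barrier-mode APIs mentioned
above (getTaskInfos() and barrier()); the training itself is elided, and a local run
needs at least as many free cores as partitions:

```python
from pyspark import BarrierTaskContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("barrier-sketch").getOrCreate()

def init_and_train(iterator):
    ctx = BarrierTaskContext.get()
    peers = [info.address for info in ctx.getTaskInfos()]  # host:port of every task
    ctx.barrier()        # all tasks in the stage wait here before continuing
    # ... a hypothetical NetworkInit(peers) and the actual training would go here ...
    yield (ctx.partitionId(), peers, sum(1 for _ in iterator))

rdd = spark.sparkContext.parallelize(range(8), 2)
print(rdd.barrier().mapPartitions(init_and_train).collect())
```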


Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Ryan Blue
Wenchen, can you give more detail about the different ADD COLUMN syntax?
That sounds confusing to end users to me.

On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan  wrote:

> Note that the design we make here will affect both data source developers
> and end-users. It's better to provide reliable behaviors to end-users,
> instead of asking them to read the spec of the data source and know which
> value will be used for missing columns, when they write data.
>
> If we do want to go with the "data source decides default value" approach,
> we should create a new SQL syntax for ADD COLUMN, as its behavior is
> different from the SQL standard ADD COLUMN command.
>
> On Wed, Dec 19, 2018 at 10:58 PM Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
>> I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want 2
>> is for a unified way of dealing with missing columns? I feel like that
>> probably should be left up to the underlying datasource implementation. For
>> example if you have missing columns with a database the Datasource can
>> choose a value based on the Database's metadata if such a thing exists, I
>> don't think Spark should really have a this level of detail but I've also
>> missed out on all of these meetings (sorry it's family dinner time :) ) so
>> I may be missing something.
>>
>> So my tldr is, Let a datasource report whether or not missing columns are
>> OK and let the Datasource deal with the missing data based on it's
>> underlying storage.
>>
>> On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan  wrote:
>>
>>> I agree that we should not rewrite existing parquet files when a new
>>> column is added, but we should also try out best to make the behavior same
>>> as RDBMS/SQL standard.
>>>
>>> 1. it should be the user who decides the default value of a column, by
>>> CREATE TABLE, or ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
>>> 2. When adding a new column, the default value should be effective for
>>> all the existing data, and newly written data.
>>> 3. When altering an existing column and change the default value, it
>>> should be effective for newly written data only.
>>>
>>> A possible implementation:
>>> 1. a columnn has 2 default values: the initial one and the latest one.
>>> 2. when adding a column with a default value, set both the initial one
>>> and the latest one to this value. But do not update existing data.
>>> 3. when reading data, fill the missing column with the initial default
>>> value
>>> 4. when writing data, fill the missing column with the latest default
>>> value
>>> 5. when altering a column to change its default value, only update the
>>> latest default value.
>>>
>>> This works because:
>>> 1. new files will be written with the latest default value, nothing we
>>> need to worry about at read time.
>>> 2. old files will be read with the initial default value, which returns
>>> expected result.
>>>
>>> On Wed, Dec 19, 2018 at 8:39 AM Ryan Blue 
>>> wrote:
>>>
 Hi everyone,

 This thread is a follow-up to a discussion that we started in the DSv2
 community sync last week.

 The problem I’m trying to solve is that the format I’m using DSv2 to
 integrate supports schema evolution. Specifically, adding a new optional
 column so that rows without that column get a default value (null for
 Iceberg). The current validation rule for an append in DSv2 fails a write
 if it is missing a column, so adding a column to an existing table will
 cause currently-scheduled jobs that insert data to start failing. Clearly,
 schema evolution shouldn't break existing jobs that produce valid data.

 To fix this problem, I suggested option 1: adding a way for Spark to
 check whether to fail when an optional column is missing. Other
 contributors in the sync thought that Spark should go with option 2:
 Spark’s schema should have defaults and Spark should handle filling in
 defaults the same way across all sources, like other databases.

 I think we agree that option 2 would be ideal. The problem is that it
 is very hard to implement.

 A source might manage data stored in millions of immutable Parquet
 files, so adding a default value isn’t possible. Spark would need to fill
 in defaults for files written before the column was added at read time (it
 could fill in defaults in new files at write time). Filling in defaults at
 read time would require Spark to fill in defaults for only some of the
 files in a scan, so Spark would need different handling for each task
 depending on the schema of that task. Tasks would also be required to
 produce a consistent schema, so a file without the new column couldn’t be
 combined into a task with a file that has the new column. This adds quite a
 bit of complexity.

 Other sources may not need Spark to fill in the default at all. A JDBC
 source would be capable of filling in the default values itself, 

Re: Spark-optimized Shuffle (SOS) any update?

2018-12-19 Thread Ilan Filonenko
Recently, the community has actively been working on this. The JIRA to
follow is:
https://issues.apache.org/jira/browse/SPARK-25299. A group of various
companies including Bloomberg and Palantir are in the works of a WIP
solution that implements a varied version of Option #5 (which is elaborated
upon in the google doc linked in the JIRA summary).

On Wed, Dec 19, 2018 at 5:20 AM  wrote:

> Hi everyone,
> we are facing same problems as Facebook had, where shuffle service is
> a bottleneck. For now we solved that with large task size (2g) to reduce
> shuffle I/O.
>
> I saw very nice presentation from Brian Cho on Optimizing shuffle I/O at
> large scale[1]. It is a implementation of white paper[2].
> Brian Cho at the end of the lecture kindly mentioned about plans to
> contribute it back to Spark[3]. I checked mailing list and spark JIRA and
> didn't find any ticket on this topic.
>
> Please, does anyone has a contact on someone from Facebook who could know
> more about this? Or are there some plans to bring similar optimization to
> Spark?
>
> [1] https://databricks.com/session/sos-optimizing-shuffle-i-o
> [2] https://haoyuzhang.org/publications/riffle-eurosys18.pdf
> [3]
> https://image.slidesharecdn.com/5brianchoerginseyfe-180613004126/95/sos-optimizing-shuffle-io-with-brian-cho-and-ergin-seyfe-30-638.jpg?cb=1528850545
>


barrier execution mode with DataFrame and dynamic allocation

2018-12-19 Thread Ilya Matiach
[Note: I sent this earlier but it looks like the email was blocked because I 
had another email group on the CC line]
Hi Spark Dev,
I would like to use the new barrier execution mode introduced in spark 2.4 with 
LightGBM in the spark package mmlspark but I ran into some issues and I had a 
couple questions.
Currently, the LightGBM distributed learner tries to figure out the number of 
cores on the cluster and then does a coalesce and a mapPartitions, and inside 
the mapPartitions we do a NetworkInit (where the address:port of all workers 
needs to be passed in the constructor) and pass the data in-memory to the 
native layer of the distributed lightgbm learner.

With barrier execution mode, I think the code would become much more robust.  
However, there are several issues that I am running into when trying to move my 
code over to the new barrier execution mode scheduler:

  1.  Does not support dynamic allocation - however, I think it would be 
convenient if it restarted the job when the number of workers has decreased and 
allowed the dev to decide whether to restart the job if the number of workers 
increased
  2.  Does not work with DataFrame or Dataset API, but I think it would be much 
more convenient if it did
  3.  How does barrier execution mode deal with #partitions > #tasks?  If the 
number of partitions is larger than the number of "tasks" or workers, can 
barrier execution mode automatically coalesce the dataset to have # partitions 
== # tasks?
  4.  It would be convenient to be able to get network information about all 
other workers in the cluster that are in the same barrier execution, eg the 
host address and some task # or identifier of all workers

I would love to hear more about this new feature - also I had trouble finding 
documentation (JIRA: 
https://issues.apache.org/jira/browse/SPARK-24374,
 High level design: 
https://www.slideshare.net/hadoop/the-zoo-expands?qid=b2efbd75-97af-4f71-9add-abf84970eaef==_search=1),
 are there any good examples of spark packages that have moved to use the new 
barrier execution mode in spark 2.4?

Thank you, Ilya


Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Wenchen Fan
Note that the design we make here will affect both data source developers
and end-users. It's better to provide reliable behaviors to end-users,
instead of asking them to read the spec of the data source and know which
value will be used for missing columns, when they write data.

If we do want to go with the "data source decides default value" approach,
we should create a new SQL syntax for ADD COLUMN, as its behavior is
different from the SQL standard ADD COLUMN command.

On Wed, Dec 19, 2018 at 10:58 PM Russell Spitzer 
wrote:

> I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want 2 is
> for a unified way of dealing with missing columns? I feel like that
> probably should be left up to the underlying datasource implementation. For
> example if you have missing columns with a database the Datasource can
> choose a value based on the Database's metadata if such a thing exists, I
> don't think Spark should really have a this level of detail but I've also
> missed out on all of these meetings (sorry it's family dinner time :) ) so
> I may be missing something.
>
> So my tldr is, Let a datasource report whether or not missing columns are
> OK and let the Datasource deal with the missing data based on it's
> underlying storage.
>
> On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan  wrote:
>
>> I agree that we should not rewrite existing parquet files when a new
>> column is added, but we should also try out best to make the behavior same
>> as RDBMS/SQL standard.
>>
>> 1. it should be the user who decides the default value of a column, by
>> CREATE TABLE, or ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
>> 2. When adding a new column, the default value should be effective for
>> all the existing data, and newly written data.
>> 3. When altering an existing column and change the default value, it
>> should be effective for newly written data only.
>>
>> A possible implementation:
>> 1. a columnn has 2 default values: the initial one and the latest one.
>> 2. when adding a column with a default value, set both the initial one
>> and the latest one to this value. But do not update existing data.
>> 3. when reading data, fill the missing column with the initial default
>> value
>> 4. when writing data, fill the missing column with the latest default
>> value
>> 5. when altering a column to change its default value, only update the
>> latest default value.
>>
>> This works because:
>> 1. new files will be written with the latest default value, nothing we
>> need to worry about at read time.
>> 2. old files will be read with the initial default value, which returns
>> expected result.
>>
>> On Wed, Dec 19, 2018 at 8:39 AM Ryan Blue 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> This thread is a follow-up to a discussion that we started in the DSv2
>>> community sync last week.
>>>
>>> The problem I’m trying to solve is that the format I’m using DSv2 to
>>> integrate supports schema evolution. Specifically, adding a new optional
>>> column so that rows without that column get a default value (null for
>>> Iceberg). The current validation rule for an append in DSv2 fails a write
>>> if it is missing a column, so adding a column to an existing table will
>>> cause currently-scheduled jobs that insert data to start failing. Clearly,
>>> schema evolution shouldn't break existing jobs that produce valid data.
>>>
>>> To fix this problem, I suggested option 1: adding a way for Spark to
>>> check whether to fail when an optional column is missing. Other
>>> contributors in the sync thought that Spark should go with option 2:
>>> Spark’s schema should have defaults and Spark should handle filling in
>>> defaults the same way across all sources, like other databases.
>>>
>>> I think we agree that option 2 would be ideal. The problem is that it is
>>> very hard to implement.
>>>
>>> A source might manage data stored in millions of immutable Parquet
>>> files, so adding a default value isn’t possible. Spark would need to fill
>>> in defaults for files written before the column was added at read time (it
>>> could fill in defaults in new files at write time). Filling in defaults at
>>> read time would require Spark to fill in defaults for only some of the
>>> files in a scan, so Spark would need different handling for each task
>>> depending on the schema of that task. Tasks would also be required to
>>> produce a consistent schema, so a file without the new column couldn’t be
>>> combined into a task with a file that has the new column. This adds quite a
>>> bit of complexity.
>>>
>>> Other sources may not need Spark to fill in the default at all. A JDBC
>>> source would be capable of filling in the default values itself, so Spark
>>> would need some way to communicate the default to that source. If the
>>> source had a different policy for default values (write time instead of
>>> read time, for example) then behavior could still be inconsistent.
>>>
>>> I think that this complexity probably isn’t worth 

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Russell Spitzer
I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want 2 is
for a unified way of dealing with missing columns? I feel like that
probably should be left up to the underlying datasource implementation. For
example if you have missing columns with a database the Datasource can
choose a value based on the Database's metadata if such a thing exists, I
don't think Spark should really have this level of detail but I've also
missed out on all of these meetings (sorry it's family dinner time :) ) so
I may be missing something.

So my tldr is, Let a datasource report whether or not missing columns are
OK and let the Datasource deal with the missing data based on its
underlying storage.
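
To make the suggestion concrete, a small hypothetical sketch (these names are not an
existing DSv2 API) of a source reporting whether it can handle missing columns and
filling them from its own metadata:

```python
# Hypothetical sketch only -- not an existing Data Source V2 API.

class JdbcLikeSource:
    def __init__(self, column_defaults):
        # e.g. defaults discovered from the database's own catalog/metadata
        self.column_defaults = column_defaults

    def supports_missing_columns(self):
        return True

    def fill_missing(self, row, table_columns):
        return {c: row.get(c, self.column_defaults.get(c)) for c in table_columns}

def validate_append(source, write_columns, table_columns):
    missing = [c for c in table_columns if c not in write_columns]
    if missing and not source.supports_missing_columns():
        raise ValueError("cannot append: missing columns %s" % missing)

src = JdbcLikeSource({"c": 0})
validate_append(src, ["a"], ["a", "c"])           # passes: the source handles 'c'
print(src.fill_missing({"a": 1}, ["a", "c"]))     # {'a': 1, 'c': 0}
```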

On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan  wrote:

> I agree that we should not rewrite existing parquet files when a new
> column is added, but we should also try out best to make the behavior same
> as RDBMS/SQL standard.
>
> 1. it should be the user who decides the default value of a column, by
> CREATE TABLE, or ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
> 2. When adding a new column, the default value should be effective for all
> the existing data, and newly written data.
> 3. When altering an existing column and change the default value, it
> should be effective for newly written data only.
>
> A possible implementation:
> 1. a columnn has 2 default values: the initial one and the latest one.
> 2. when adding a column with a default value, set both the initial one and
> the latest one to this value. But do not update existing data.
> 3. when reading data, fill the missing column with the initial default
> value
> 4. when writing data, fill the missing column with the latest default value
> 5. when altering a column to change its default value, only update the
> latest default value.
>
> This works because:
> 1. new files will be written with the latest default value, nothing we
> need to worry about at read time.
> 2. old files will be read with the initial default value, which returns
> expected result.
>
> On Wed, Dec 19, 2018 at 8:39 AM Ryan Blue 
> wrote:
>
>> Hi everyone,
>>
>> This thread is a follow-up to a discussion that we started in the DSv2
>> community sync last week.
>>
>> The problem I’m trying to solve is that the format I’m using DSv2 to
>> integrate supports schema evolution. Specifically, adding a new optional
>> column so that rows without that column get a default value (null for
>> Iceberg). The current validation rule for an append in DSv2 fails a write
>> if it is missing a column, so adding a column to an existing table will
>> cause currently-scheduled jobs that insert data to start failing. Clearly,
>> schema evolution shouldn't break existing jobs that produce valid data.
>>
>> To fix this problem, I suggested option 1: adding a way for Spark to
>> check whether to fail when an optional column is missing. Other
>> contributors in the sync thought that Spark should go with option 2:
>> Spark’s schema should have defaults and Spark should handle filling in
>> defaults the same way across all sources, like other databases.
>>
>> I think we agree that option 2 would be ideal. The problem is that it is
>> very hard to implement.
>>
>> A source might manage data stored in millions of immutable Parquet files,
>> so adding a default value isn’t possible. Spark would need to fill in
>> defaults for files written before the column was added at read time (it
>> could fill in defaults in new files at write time). Filling in defaults at
>> read time would require Spark to fill in defaults for only some of the
>> files in a scan, so Spark would need different handling for each task
>> depending on the schema of that task. Tasks would also be required to
>> produce a consistent schema, so a file without the new column couldn’t be
>> combined into a task with a file that has the new column. This adds quite a
>> bit of complexity.
>>
>> Other sources may not need Spark to fill in the default at all. A JDBC
>> source would be capable of filling in the default values itself, so Spark
>> would need some way to communicate the default to that source. If the
>> source had a different policy for default values (write time instead of
>> read time, for example) then behavior could still be inconsistent.
>>
>> I think that this complexity probably isn’t worth consistency in default
>> values across sources, if that is even achievable.
>>
>> In the sync we thought it was a good idea to send this out to the larger
>> group to discuss. Please reply with comments!
>>
>> rb
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>


Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Wenchen Fan
I agree that we should not rewrite existing Parquet files when a new column
is added, but we should also try our best to make the behavior the same as
the RDBMS/SQL standard.

1. it should be the user who decides the default value of a column, by
CREATE TABLE, or ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
2. When adding a new column, the default value should be effective for all
the existing data, and newly written data.
3. When altering an existing column to change its default value, it should
be effective for newly written data only.

A possible implementation:
1. a column has 2 default values: the initial one and the latest one.
2. when adding a column with a default value, set both the initial one and
the latest one to this value. But do not update existing data.
3. when reading data, fill the missing column with the initial default value
4. when writing data, fill the missing column with the latest default value
5. when altering a column to change its default value, only update the
latest default value.

This works because:
1. new files will be written with the latest default value, so there is
nothing we need to worry about at read time.
2. old files will be read with the initial default value, which returns
the expected result.
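
To make the bookkeeping concrete, here is a minimal Scala sketch of the idea
above. The names (ColumnDefaults, fillOnRead, fillOnWrite) are invented for
illustration and are not part of any existing Spark API; the only point is
that reads consult the initial default while writes consult the latest one.

// Hypothetical sketch only: illustrates the "initial vs. latest default" idea.
case class ColumnDefaults(
    name: String,
    initialDefault: Any,  // fixed when the column is added; used for old files
    latestDefault: Any)   // updated by ALTER COLUMN; used for new writes

object DefaultFilling {
  // Read path: a file written before the column existed has no value for it,
  // so fill in the default that was in effect when the column was added.
  def fillOnRead(row: Map[String, Any], col: ColumnDefaults): Map[String, Any] =
    if (row.contains(col.name)) row else row + (col.name -> col.initialDefault)

  // Write path: rows that omit the column are materialized with the latest
  // default, so newly written files never need filling at read time.
  def fillOnWrite(row: Map[String, Any], col: ColumnDefaults): Map[String, Any] =
    if (row.contains(col.name)) row else row + (col.name -> col.latestDefault)
}

// Example: column "status" added with default "unknown", later altered to "active".
object DefaultFillingDemo extends App {
  val status = ColumnDefaults("status", initialDefault = "unknown", latestDefault = "active")
  println(DefaultFilling.fillOnRead(Map("id" -> 1), status))  // Map(id -> 1, status -> unknown)
  println(DefaultFilling.fillOnWrite(Map("id" -> 2), status)) // Map(id -> 2, status -> active)
}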

On Wed, Dec 19, 2018 at 8:39 AM Ryan Blue  wrote:

> Hi everyone,
>
> This thread is a follow-up to a discussion that we started in the DSv2
> community sync last week.
>
> The problem I’m trying to solve is that the format I’m using DSv2 to
> integrate supports schema evolution. Specifically, adding a new optional
> column so that rows without that column get a default value (null for
> Iceberg). The current validation rule for an append in DSv2 fails a write
> if it is missing a column, so adding a column to an existing table will
> cause currently-scheduled jobs that insert data to start failing. Clearly,
> schema evolution shouldn't break existing jobs that produce valid data.
>
> To fix this problem, I suggested option 1: adding a way for Spark to check
> whether to fail when an optional column is missing. Other contributors in
> the sync thought that Spark should go with option 2: Spark’s schema should
> have defaults and Spark should handle filling in defaults the same way
> across all sources, like other databases.
>
> I think we agree that option 2 would be ideal. The problem is that it is
> very hard to implement.
>
> A source might manage data stored in millions of immutable Parquet files,
> so adding a default value isn’t possible. Spark would need to fill in
> defaults for files written before the column was added at read time (it
> could fill in defaults in new files at write time). Filling in defaults at
> read time would require Spark to fill in defaults for only some of the
> files in a scan, so Spark would need different handling for each task
> depending on the schema of that task. Tasks would also be required to
> produce a consistent schema, so a file without the new column couldn’t be
> combined into a task with a file that has the new column. This adds quite a
> bit of complexity.
>
> Other sources may not need Spark to fill in the default at all. A JDBC
> source would be capable of filling in the default values itself, so Spark
> would need some way to communicate the default to that source. If the
> source had a different policy for default values (write time instead of
> read time, for example) then behavior could still be inconsistent.
>
> I think that this complexity probably isn’t worth consistency in default
> values across sources, if that is even achievable.
>
> In the sync we thought it was a good idea to send this out to the larger
> group to discuss. Please reply with comments!
>
> rb
> --
> Ryan Blue
> Software Engineer
> Netflix
>
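
For comparison, here is a rough sketch of one possible shape of "option 1"
from the quoted mail (let the write-side validation ask the source before
failing on a missing optional column). The trait and method names below are
invented for illustration and are not part of the DSv2 API.

// Hypothetical capability: the source declares it can handle a missing column.
trait AcceptsMissingColumns {
  def canFillMissingColumn(column: String): Boolean
}

object WriteValidation {
  // Simplified stand-in for the append validation rule: a missing optional
  // column is only an error if the source cannot fill it in itself.
  def validate(
      tableColumns: Seq[(String, Boolean)],  // (column name, isOptional)
      writeColumns: Set[String],
      source: AcceptsMissingColumns): Either[String, Unit] = {
    val missing = tableColumns.collect {
      case (name, optional) if !writeColumns.contains(name) &&
          !(optional && source.canFillMissingColumn(name)) => name
    }
    if (missing.isEmpty) Right(())
    else Left(s"Cannot write: missing columns ${missing.mkString(", ")}")
  }
}

// Example: the table has an optional "status" column the job does not produce.
object WriteValidationDemo extends App {
  val source = new AcceptsMissingColumns {
    def canFillMissingColumn(column: String): Boolean = true  // e.g. fills nulls itself
  }
  println(WriteValidation.validate(
    tableColumns = Seq("id" -> false, "status" -> true),
    writeColumns = Set("id"),
    source = source))  // Right(())
}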


Spark-optimized Shuffle (SOS) any update?

2018-12-19 Thread marek-simunek

Hi everyone,


    we are facing the same problems Facebook had, where the shuffle service
is a bottleneck. For now we have worked around it with a large task size (2g)
to reduce shuffle I/O.
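
The exact settings behind that workaround are not stated above; assuming it
means larger input splits per map task (fewer map tasks, hence fewer and
larger shuffle files), it would look roughly like this in spark-shell style:

import org.apache.spark.sql.SparkSession

// Assumed knobs only; the original setup may use different settings or values.
val spark = SparkSession.builder()
  .appName("large-task-shuffle-workaround")
  // read ~2 GB of input per map task instead of the 128 MB default
  .config("spark.sql.files.maxPartitionBytes", 2L * 1024 * 1024 * 1024)
  // optionally reduce the reducer count so each shuffle partition is larger
  .config("spark.sql.shuffle.partitions", 100L)
  .getOrCreate()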

I saw a very nice presentation from Brian Cho on optimizing shuffle I/O at
large scale [1]. It is an implementation of the white paper [2].
At the end of the talk, Brian Cho kindly mentioned plans to contribute it
back to Spark [3]. I checked the mailing list and the Spark JIRA and didn't
find any ticket on this topic.

Does anyone have a contact at Facebook who might know more about this? Or
are there plans to bring a similar optimization to Spark?

[1] https://databricks.com/session/sos-optimizing-shuffle-i-o
[2] https://haoyuzhang.org/publications/riffle-eurosys18.pdf
[3] https://image.slidesharecdn.com/5brianchoerginseyfe-180613004126/95/sos-optimizing-shuffle-io-with-brian-cho-and-ergin-seyfe-30-638.jpg?cb=1528850545

Re: Decimals with negative scale

2018-12-19 Thread Marco Gaido
That is feasible; the main point is that negative scales were not really
meant to be there in the first place, so forbidding them was simply
forgotten, and they are something which the DBs we are drawing our
inspiration from for decimals (mainly SQLServer) do not support.
Honestly, my opinion on this topic is:
 - let's add support for negative scales in the operations (I already have
a PR out for that, https://github.com/apache/spark/pull/22450);
 - let's reduce our usage of DECIMAL in favor of DOUBLE when parsing
literals, as done by Hive, Presto, DB2, ...; this way the number of cases in
which we deal with negative scales is anyway small (and we do not have
issues with data sources which don't support them).
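
As a small standalone illustration of how a negative scale arises and why
forbidding it makes 1e36 * 1e36 overflow a 38-digit decimal, using plain
java.math.BigDecimal (no Spark API involved):

import java.math.BigDecimal

object NegativeScaleDemo extends App {
  val a = new BigDecimal("1e36")   // unscaled value 1, precision 1, scale -36
  val b = a.multiply(a)            // 1e72: unscaled value 1, precision 1, scale -72
  println(s"a: precision=${a.precision}, scale=${a.scale}")
  println(s"b: precision=${b.precision}, scale=${b.scale}")

  // With the scale clamped to be non-negative, the same value needs 73 digits
  // of precision (a 1 followed by 72 zeros), which no longer fits in a
  // 38-digit decimal such as DECIMAL(38, 0).
  println(s"b at scale 0: precision=${b.setScale(0).precision}")  // 73
}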

Thanks,
Marco


On Tue, 18 Dec 2018 at 19:08 Reynold Xin 
wrote:

> So why can't we just do validation to fail sources that don't support
> negative scales? This way, we don't need to break backward compatibility
> in any way and it becomes a strict improvement.
>
>
> On Tue, Dec 18, 2018 at 8:43 AM, Marco Gaido 
> wrote:
>
>> This is at analysis time.
>>
>> On Tue, 18 Dec 2018, 17:32 Reynold Xin wrote:
>>> Is this an analysis time thing or a runtime thing?
>>>
>>> On Tue, Dec 18, 2018 at 7:45 AM Marco Gaido 
>>> wrote:
>>>
 Hi all,

 as you may remember, there was a design doc to support operations
 involving decimals with negative scales. After the discussion in the design
 doc, now the related PR is blocked because for 3.0 we have another option
 which we can explore, i.e. forbidding negative scales. This is probably a
 cleaner solution, as most likely we didn't want negative scales, but it is
 a breaking change, so we wanted to check the opinion of the community.

 Getting to the topic, here are the 2 options:
 *- Forbidding negative scales*
   Pros: many sources do not support negative scales (so they can create
 issues); they were something which was not considered possible in the
 initial implementation, so we get to a more stable situation.
   Cons: some operations which were supported earlier won't work anymore.
 E.g., since our max precision is 38, if the scale cannot be negative,
 1e36 * 1e36 would cause an overflow, while it now works fine (producing a
 decimal with negative scale); it is basically impossible to create a config
 which controls the behavior.

  *- Handling negative scales in operations*
   Pros: no regressions; we support all the operations we supported on
 2.x.
   Cons: negative scales can cause issues in other situations, e.g. when
 saving to a data source which doesn't support them.

 Looking forward to hearing your thoughts,
 Thanks.
 Marco

>>>
>