[GitHub] gatorsmile commented on a change in pull request #163: Announce the schedule of 2019 Spark+AI summit at SF

2018-12-19 Thread GitBox
gatorsmile commented on a change in pull request #163: Announce the schedule of 2019 Spark+AI summit at SF URL: https://github.com/apache/spark-website/pull/163#discussion_r243158425 ## File path: site/sitemap.xml ## @@ -139,657 +139,661 @@ - https://spark.apache.or

Re: Noisy spark-website notifications

2018-12-19 Thread Reynold Xin
I added my comment there too! On Wed, Dec 19, 2018 at 7:26 PM, Hyukjin Kwon < gurwls...@gmail.com > wrote: > > Yea, that's a bit noisy .. I would just completely disable it to be > honest. I failed https://issues.apache.org/jira/browse/INFRA-17469

Re: Noisy spark-website notifications

2018-12-19 Thread Hyukjin Kwon
Yea, that's a bit noisy .. I would just completely disable it to be honest. I failed https://issues.apache.org/jira/browse/INFRA-17469 before. I would appreciate it if there were more input there :-) On Thu, Dec 20, 2018 at 11:22 AM, Nicholas Chammas wrote: > I'd prefer it if we disabled all git noti

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Wenchen Fan
So you agree with my proposal that we should follow RDBMS/SQL standard regarding the behavior? > pass the default through to the underlying data source This is one way to implement the behavior. On Thu, Dec 20, 2018 at 11:12 AM Ryan Blue wrote: > I don't think we have to change the syntax. Isn

Re: Noisy spark-website notifications

2018-12-19 Thread Nicholas Chammas
I'd prefer it if we disabled all git notifications for spark-website. Folks who want to stay on top of what's happening with the site can simply watch the repo on GitHub, no? On Wed, Dec 19, 2018 at 10:00 PM Wenchen Fan wrote: > +1, at least it should on

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Ryan Blue
I don't think we have to change the syntax. Isn't the right thing (for option 1) to pass the default through to the underlying data source? Sources that don't support defaults would throw an exception. On Wed, Dec 19, 2018 at 6:29 PM Wenchen Fan wrote: > The standard ADD COLUMN SQL syntax is: AL

Re: Noisy spark-website notifications

2018-12-19 Thread Wenchen Fan
+1, at least it should only send one email when a PR is merged. On Thu, Dec 20, 2018 at 10:58 AM Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Can we somehow disable these new email alerts coming through for the Spark > website repo? > > On Wed, Dec 19, 2018 at 8:25 PM GitBox wrote: >

Re: Noisy spark-website notifications

2018-12-19 Thread Reynold Xin
I think there is an infra ticket open for it right now. On Wed, Dec 19, 2018 at 6:58 PM Nicholas Chammas wrote: > Can we somehow disable these new email alerts coming through for the Spark > website repo? > > On Wed, Dec 19, 2018 at 8:25 PM GitBox wrote: > >> ueshin commented on a change in pul

Noisy spark-website notifications

2018-12-19 Thread Nicholas Chammas
Can we somehow disable these new email alerts coming through for the Spark website repo? On Wed, Dec 19, 2018 at 8:25 PM GitBox wrote: > ueshin commented on a change in pull request #163: Announce the schedule > of 2019 Spark+AI summit at SF > URL: > https://github.com/apache/spark-website/pull/

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Wenchen Fan
The standard ADD COLUMN SQL syntax is: ALTER TABLE table_name ADD COLUMN column_name datatype [DEFAULT value]; If the DEFAULT statement is not specified, then the default value is null. If we are going to change the behavior and say the default value is decided by the underlying data source, we sh
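The rule described above (no DEFAULT clause means the default is null) can be tried out in a small sketch. This uses Python's built-in sqlite3 module as a stand-in RDBMS, not Spark or any Spark data source; the table name `t` and the columns `a` and `b` are hypothetical and chosen only for illustration.

```python
import sqlite3


def add_column_demo():
    """Show how ADD COLUMN behaves with and without a DEFAULT clause."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER)")
    conn.execute("INSERT INTO t VALUES (1)")

    # No DEFAULT clause: the column's default is NULL.
    conn.execute("ALTER TABLE t ADD COLUMN a INTEGER")
    # Explicit DEFAULT clause: the user decides the default value.
    conn.execute("ALTER TABLE t ADD COLUMN b INTEGER DEFAULT 42")

    # A new row that omits both added columns picks up their defaults.
    conn.execute("INSERT INTO t (id) VALUES (2)")

    rows = conn.execute("SELECT id, a, b FROM t ORDER BY id").fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in add_column_demo():
        print(row)
```

In SQLite, the row that existed before the ALTER TABLE statements also reads NULL for `a` and 42 for `b`, without the stored data being rewritten. Whether a given Spark data source can offer the same behavior is exactly what this thread is discussing.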

[GitHub] ueshin commented on a change in pull request #163: Announce the schedule of 2019 Spark+AI summit at SF

2018-12-19 Thread GitBox
ueshin commented on a change in pull request #163: Announce the schedule of 2019 Spark+AI summit at SF URL: https://github.com/apache/spark-website/pull/163#discussion_r243132369 ## File path: site/sitemap.xml ## @@ -139,657 +139,661 @@ - https://spark.apache.org/re

[GitHub] gatorsmile commented on issue #163: Announce the schedule of 2019 Spark+AI summit at SF

2018-12-19 Thread GitBox
gatorsmile commented on issue #163: Announce the schedule of 2019 Spark+AI summit at SF URL: https://github.com/apache/spark-website/pull/163#issuecomment-448825575 Thanks! Merged to master. This is an automated message from

[GitHub] ueshin commented on a change in pull request #163: Announce the schedule of 2019 Spark+AI summit at SF

2018-12-19 Thread GitBox
ueshin commented on a change in pull request #163: Announce the schedule of 2019 Spark+AI summit at SF URL: https://github.com/apache/spark-website/pull/163#discussion_r243130975 ## File path: site/sitemap.xml ## @@ -139,657 +139,661 @@ - https://spark.apache.org/re

[GitHub] gatorsmile commented on a change in pull request #163: Announce the schedule of 2019 Spark+AI summit at SF

2018-12-19 Thread GitBox
gatorsmile commented on a change in pull request #163: Announce the schedule of 2019 Spark+AI summit at SF URL: https://github.com/apache/spark-website/pull/163#discussion_r243128948 ## File path: site/mailing-lists.html ## @@ -12,7 +12,7 @@ -https://spark.

[GitHub] gatorsmile commented on issue #163: Announce the schedule of 2019 Spark+AI summit at SF

2018-12-19 Thread GitBox
gatorsmile commented on issue #163: Announce the schedule of 2019 Spark+AI summit at SF URL: https://github.com/apache/spark-website/pull/163#issuecomment-448815820 cc @rxin @yhuai @cloud-fan @srowen This is an automated mes

[GitHub] gatorsmile opened a new pull request #163: Announce the schedule of Spark+AI summit at SF 2019

2018-12-19 Thread GitBox
gatorsmile opened a new pull request #163: Announce the schedule of Spark+AI summit at SF 2019 URL: https://github.com/apache/spark-website/pull/163 ![screen shot 2018-12-19 at 4 59 12 pm](https://user-images.githubusercontent.com/11567269/50257364-d76e4900-03af-11e9-9690-3de0a87917ef.png)

SPARK-26415: Mark StreamSinkProvider and StreamSourceProvider as stable

2018-12-19 Thread Grant Henke
Hello Spark Developers, Dongjoon Hyun suggested that I send an email to the dev list pointing to my suggested change. Jira: https://issues.apache.org/jira/browse/SPARK-26415 Pull request: https://github.com/apache/spark/pull/23354 For convenience I will post the commit message here: This change

Re: SPARK-25299: Updates As Of December 19, 2018

2018-12-19 Thread John Zhuge
Matt, appreciate the update! On Wed, Dec 19, 2018 at 10:51 AM Matt Cheah wrote: > Hi everyone, > > > > Earlier this year, we proposed SPARK-25299 > , proposing the idea > of using other storage systems for persisting shuffle files. Since that >

SPARK-25299: Updates As Of December 19, 2018

2018-12-19 Thread Matt Cheah
Hi everyone, Earlier this year, we proposed SPARK-25299, proposing the idea of using other storage systems for persisting shuffle files. Since that time, we have been continuing to work on prototypes for this project. In the interest of increasing transparency into our work, we have created

Updated proposal: Consistent timestamp types in Hadoop SQL engines

2018-12-19 Thread Zoltan Ivanfi
Dear All, I would like to thank every reviewer of the consistent timestamps proposal[1] for their time and valuable comments. Based on your feedback, I have updated the proposal. The changes include clarifications, fixes and other improvements as summarized at the end of the document, in the Chang

Re: [build system] jenkins master needs reboot, temporary downtime

2018-12-19 Thread Reynold Xin
Thanks for taking care of this, Shane! On Wed, Dec 19, 2018 at 9:45 AM, shane knapp < skn...@berkeley.edu > wrote: > > master is back up and building. > > On Wed, Dec 19, 2018 at 9:31 AM shane knapp < skn...@berkeley.edu > wrote: > > >> the jenkins process seems to

Re: [build system] jenkins master needs reboot, temporary downtime

2018-12-19 Thread shane knapp
master is back up and building. On Wed, Dec 19, 2018 at 9:31 AM shane knapp wrote: > the jenkins process seems to be wedged again, and i think we're going to > hit it w/the reboot hammer, rather than just killing/restarting the master. > > this should take at most 30 mins, and i'll send an all-c

[build system] jenkins master needs reboot, temporary downtime

2018-12-19 Thread shane knapp
the jenkins process seems to be wedged again, and i think we're going to hit it w/the reboot hammer, rather than just killing/restarting the master. this should take at most 30 mins, and i'll send an all-clear when it's done. -- Shane Knapp UC Berkeley EECS Research / RISELab Staff Technical Lea

Parse xmlrdd with pyspark

2018-12-19 Thread Anshul Sachdeva
Hello Team, I am trying to parse XML with the Spark XML library. I read the XML from a web service into a variable using the Python requests module, and then I need to parse it before storing it in the target table. I would like to do this without first saving a file somewhere and then loading it. I know in Java, I have used

Re: barrier execution mode with DataFrame and dynamic allocation

2018-12-19 Thread Xiangrui Meng
(don't know why your email ends with ".invalid") On Wed, Dec 19, 2018 at 9:13 AM Xiangrui Meng wrote: > > > On Wed, Dec 19, 2018 at 7:34 AM Ilya Matiach > wrote: > > > > [Note: I sent this earlier but it looks like the email was blocked > because I had another email group on the CC line] > > >

Re: barrier execution mode with DataFrame and dynamic allocation

2018-12-19 Thread Xiangrui Meng
On Wed, Dec 19, 2018 at 7:34 AM Ilya Matiach wrote: > > [Note: I sent this earlier but it looks like the email was blocked because I had another email group on the CC line] > > Hi Spark Dev, > > I would like to use the new barrier execution mode introduced in spark 2.4 with LightGBM in the spark p

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Ryan Blue
Wenchen, can you give more detail about the different ADD COLUMN syntax? That sounds confusing to end users to me. On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan wrote: > Note that the design we make here will affect both data source developers > and end-users. It's better to provide reliable behav

Re: Spark-optimized Shuffle (SOS) any update?

2018-12-19 Thread Ilan Filonenko
Recently, the community has actively been working on this. The JIRA to follow is: https://issues.apache.org/jira/browse/SPARK-25299. A group of companies, including Bloomberg and Palantir, is working on a WIP solution that implements a variation of Option #5 (which is elaborated up

barrier execution mode with DataFrame and dynamic allocation

2018-12-19 Thread Ilya Matiach
[Note: I sent this earlier but it looks like the email was blocked because I had another email group on the CC line] Hi Spark Dev, I would like to use the new barrier execution mode introduced in spark 2.4

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Wenchen Fan
Note that the design we make here will affect both data source developers and end-users. It's better to provide reliable behaviors to end-users, instead of asking them to read the spec of the data source and know which value will be used for missing columns, when they write data. If we do want to

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Russell Spitzer
I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want 2 is for a unified way of dealing with missing columns? I feel like that probably should be left up to the underlying datasource implementation. For example if you have missing columns with a database the Datasource can choose a

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Wenchen Fan
I agree that we should not rewrite existing parquet files when a new column is added, but we should also try our best to make the behavior the same as the RDBMS/SQL standard. 1. it should be the user who decides the default value of a column, by CREATE TABLE, or ALTER TABLE ADD COLUMN, or ALTER TABLE ALTE

Spark-optimized Shuffle (SOS) any update?

2018-12-19 Thread marek-simunek
Hi everyone, we are facing the same problems as Facebook had, where the shuffle service is a bottleneck. For now we solved that with a large task size (2g) to reduce shuffle I/O. I saw a very nice presentation from Brian Cho on Optimizing shuffle I/O at large scale[1]. It is an implementation of white

Re: Decimals with negative scale

2018-12-19 Thread Marco Gaido
That is feasible. The main point is that negative scales were not really meant to be there in the first place, so they are something we forgot to forbid, and something that the DBs we draw our inspiration from for decimals (mainly SQL Server) do not support. Honestly, my opinion
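For readers unfamiliar with the term, here is a minimal sketch of what a negative scale means. It uses Python's standard decimal module rather than Spark's decimal type, and the helper name `precision_and_scale` is hypothetical. A decimal is treated as an unscaled integer times 10 to the power of minus the scale, so a negative scale means the stored digits are followed by implied zeros.

```python
from decimal import Decimal


def precision_and_scale(d):
    """Return (precision, scale) for a finite Decimal.

    precision = number of stored digits
    scale     = minus the exponent, so value = unscaled * 10 ** (-scale)
    """
    _sign, digits, exponent = d.as_tuple()
    return len(digits), -exponent


if __name__ == "__main__":
    # Ordinary case: 123.45 stores five digits, two after the decimal point.
    print(precision_and_scale(Decimal("123.45")))
    # Negative scale: 1.23E+5 stores only the digits 123, with exponent 3.
    print(precision_and_scale(Decimal("1.23E+5")))
```

The second value equals 123000 but carries only three digits of precision, which is the kind of representation under discussion in this thread.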