[GitHub] spark pull request #21422: [Spark-24376][doc] Summary: compiling spark with scala-2.10 should use the -P parameter instead of -D

2018-05-29 Thread gentlewangyu
Github user gentlewangyu closed the pull request at:

https://github.com/apache/spark/pull/21422


---




[GitHub] spark issue #21422: [Spark-24376][doc] Summary: compiling spark with scala-2.10 should use the -P parameter instead of -D

2018-05-29 Thread gentlewangyu
Github user gentlewangyu commented on the issue:

https://github.com/apache/spark/pull/21422
  
@HyukjinKwon @jerryshao OK, thanks.


---




[GitHub] spark issue #21422: [Spark-24376][doc] Summary: compiling spark with scala-2.10 should use the -P parameter instead of -D

2018-05-24 Thread gentlewangyu
Github user gentlewangyu commented on the issue:

https://github.com/apache/spark/pull/21422
  
@HyukjinKwon thanks for your answer!
I have tested it with the command `build/mvn clean package install deploy -DskipTests -Pscala-2.10`:


![image](https://user-images.githubusercontent.com/24363196/40529828-0d6aa72c-6029-11e8-869b-5440a38814b2.png)



---




[GitHub] spark pull request #21422: [Spark-24376][doc] Summary: compiling spark with scala-2.10 should use the -P parameter instead of -D

2018-05-24 Thread gentlewangyu
Github user gentlewangyu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21422#discussion_r190541918
  
--- Diff: docs/building-spark.md ---
@@ -92,10 +92,10 @@ like ZooKeeper and Hadoop itself.
 ./build/mvn -Pmesos -DskipTests clean package
 
 ## Building for Scala 2.10
-To produce a Spark package compiled with Scala 2.10, use the `-Dscala-2.10` property:
+To produce a Spark package compiled with Scala 2.10, use the `-Pscala-2.10` property:
 
 ./dev/change-scala-version.sh 2.10
-./build/mvn -Pyarn -Dscala-2.10 -DskipTests clean package
+./build/mvn -Pyarn -scala-2.10 -DskipTestsP clean package
--- End diff --

Sorry, it's `-Pscala-2.10`.
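
For background, the standard Maven semantics (how Spark's own pom wires the scala-2.10 profile is a separate question): `-Pname` explicitly activates the profile `name`, while `-Dname` only sets a property, which triggers a profile solely when that profile's `<activation>` block keys on the property:

    # explicit profile activation
    ./build/mvn -Pscala-2.10 -DskipTests clean package
    # property only: effective only if the profile activates on this property
    ./build/mvn -Dscala-2.10 -DskipTests clean package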


---




[GitHub] spark issue #21422: [Spark-24376][doc] Summary: compiling spark with scala-2.10 should use the -P parameter instead of -D

2018-05-24 Thread gentlewangyu
Github user gentlewangyu commented on the issue:

https://github.com/apache/spark/pull/21422
  
I have verified this patch.


---




[GitHub] spark pull request #21422: [Spark-24376][doc] Summary: compiling spark with scala-2.10 should use the -P parameter instead of -D

2018-05-24 Thread gentlewangyu
GitHub user gentlewangyu opened a pull request:

https://github.com/apache/spark/pull/21422

[Spark-24376][doc] Summary: compiling spark with scala-2.10 should use the -P parameter instead of -D



## What changes were proposed in this pull request?

Update `docs/building-spark.md` so that the Scala 2.10 build instructions use the `-Pscala-2.10` profile flag rather than the `-Dscala-2.10` property.



## How was this patch tested?

Built manually with `./build/mvn clean package install deploy -DskipTests -Pscala-2.10` (see the review discussion above).

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gentlewangyu/spark SPARK-24376

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21422.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21422


commit d5167ed3dd3ce87a8904c2444eaf6c839ea541d3
Author: wangyu 
Date:   2018-05-24T08:06:47Z

git config --global user.email "wan...@cmss.chinamobile.com"

git config --global user.name "gentlewangyu"




---




[GitHub] spark pull request #21421: Myfeature

2018-05-24 Thread gentlewangyu
Github user gentlewangyu closed the pull request at:

https://github.com/apache/spark/pull/21421


---




[GitHub] spark pull request #21421: Myfeature

2018-05-24 Thread gentlewangyu
GitHub user gentlewangyu opened a pull request:

https://github.com/apache/spark/pull/21421

Myfeature

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gentlewangyu/spark myfeature

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21421.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21421


commit 9949fed1c45865b6e5e8ebe610789c5fb9546052
Author: Corey Woodfield 
Date:   2017-07-19T22:21:38Z

[SPARK-21333][DOCS] Removed invalid joinTypes from javadoc of 
Dataset#joinWith

## What changes were proposed in this pull request?

Two invalid join types were mistakenly listed in the javadoc for joinWith, in the Dataset class. I presume these were copied from the javadoc of join, but since joinWith returns a Dataset\<(T, U)\>, left_semi and left_anti are invalid, as they only return values from one of the datasets, instead of from both.

## How was this patch tested?

I ran the following code:
```
public static void main(String[] args) {
    SparkSession spark = new SparkSession(new SparkContext("local[*]", "Test"));
    Dataset<Row> one = spark.createDataFrame(
        Arrays.asList(new Bean(1), new Bean(2), new Bean(3), new Bean(4), new Bean(5)),
        Bean.class);
    Dataset<Row> two = spark.createDataFrame(
        Arrays.asList(new Bean(4), new Bean(5), new Bean(6), new Bean(7), new Bean(8), new Bean(9)),
        Bean.class);

    // Exercise joinWith with every join type; the last two (left_semi and
    // left_anti) throw, since joinWith must return pairs from both sides.
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "inner").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "cross").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "outer").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "full").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "full_outer").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "left").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_outer").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "right").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "right_outer").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_semi").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_anti").show(); } catch (Exception e) { e.printStackTrace(); }
}
```
which tests all the different join types; the last two (left_semi and left_anti) threw exceptions, while the same code using join instead of joinWith ran fine. The Bean class was just a Java bean with a single int field, x; a minimal sketch follows below.
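
For reference, a minimal sketch of such a Bean class (an assumption; the thread does not show the original):

```
import java.io.Serializable;

// A plain Java bean with a single int field, as described above.
public class Bean implements Serializable {
    private int x;

    public Bean() {}
    public Bean(int x) { this.x = x; }

    public int getX() { return x; }
    public void setX(int x) { this.x = x; }
}
```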

Author: Corey Woodfield 

Closes #18462 from coreywoodfield/master.

(cherry picked from commit 8cd9cdf17a7a4ad6f2eecd7c4b388ca363c20982)
Signed-off-by: gatorsmile 

commit 88dccda393bc79dc6032f71b6acf8eb2b4b152be
Author: Dhruve Ashar 
Date:   2017-07-21T19:03:46Z

[SPARK-21243][CORE] Limit no. of map outputs in a shuffle fetch

For configurations with external shuffle enabled, we have observed that if a very large no. of blocks are being fetched from a remote host, it puts the NM under extra pressure and can crash it. This change introduces a configuration, `spark.reducer.maxBlocksInFlightPerAddress`, to limit the no. of map outputs being fetched from a given remote address. The changes applied here are applicable for both scenarios - when external shuffle is enabled as well as disabled.

Ran the job with the default configuration which does not change the existing behavior and ran it with few configurations of lower values -10,20,50,100. The job ran fine and there is no change in the output. (I will update the metrics related to NM in some time.)
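
As a sketch of how a job could set this limit (class name and the value 100 are illustrative; only the configuration key comes from the patch):

```
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ShuffleFetchLimit {
    public static void main(String[] args) {
        // Cap the number of map-output blocks fetched from any single
        // remote address; the default is effectively unlimited.
        SparkConf conf = new SparkConf()
            .setMaster("local[*]")
            .setAppName("ShuffleFetchLimit")
            .set("spark.reducer.maxBlocksInFlightPerAddress", "100");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... run a shuffle-heavy job here ...
        sc.stop();
    }
}
```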

[GitHub] spark pull request #21419: Branch 2.2

2018-05-24 Thread gentlewangyu
Github user gentlewangyu closed the pull request at:

https://github.com/apache/spark/pull/21419


---




[GitHub] spark pull request #21419: Branch 2.2

2018-05-23 Thread gentlewangyu
GitHub user gentlewangyu opened a pull request:

https://github.com/apache/spark/pull/21419

Branch 2.2

## What changes were proposed in this pull request?

compiling spark with scala-2.10 should use the -P parameter instead of -D

## How was this patch tested?

 ./build/mvn -Pyarn -Pscala-2.10 -DskipTests clean package


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/spark branch-2.2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21419.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21419


commit 9949fed1c45865b6e5e8ebe610789c5fb9546052
Author: Corey Woodfield 
Date:   2017-07-19T22:21:38Z

[SPARK-21333][DOCS] Removed invalid joinTypes from javadoc of 
Dataset#joinWith

## What changes were proposed in this pull request?

Two invalid join types were mistakenly listed in the javadoc for joinWith, in the Dataset class. I presume these were copied from the javadoc of join, but since joinWith returns a Dataset\<(T, U)\>, left_semi and left_anti are invalid, as they only return values from one of the datasets, instead of from both.

## How was this patch tested?

I ran the following code:
```
public static void main(String[] args) {
    SparkSession spark = new SparkSession(new SparkContext("local[*]", "Test"));
    Dataset<Row> one = spark.createDataFrame(
        Arrays.asList(new Bean(1), new Bean(2), new Bean(3), new Bean(4), new Bean(5)),
        Bean.class);
    Dataset<Row> two = spark.createDataFrame(
        Arrays.asList(new Bean(4), new Bean(5), new Bean(6), new Bean(7), new Bean(8), new Bean(9)),
        Bean.class);

    // Exercise joinWith with every join type; the last two (left_semi and
    // left_anti) throw, since joinWith must return pairs from both sides.
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "inner").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "cross").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "outer").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "full").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "full_outer").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "left").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_outer").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "right").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "right_outer").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_semi").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_anti").show(); } catch (Exception e) { e.printStackTrace(); }
}
```
which tests all the different join types; the last two (left_semi and left_anti) threw exceptions, while the same code using join instead of joinWith ran fine. The Bean class was just a Java bean with a single int field, x.

Author: Corey Woodfield 

Closes #18462 from coreywoodfield/master.

(cherry picked from commit 8cd9cdf17a7a4ad6f2eecd7c4b388ca363c20982)
Signed-off-by: gatorsmile 

commit 88dccda393bc79dc6032f71b6acf8eb2b4b152be
Author: Dhruve Ashar 
Date:   2017-07-21T19:03:46Z

[SPARK-21243][CORE] Limit no. of map outputs in a shuffle fetch

For configurations with external shuffle enabled, we have observed that if a very large no. of blocks are being fetched from a remote host, it puts the NM under extra pressure and can crash it. This change introduces a configuration, `spark.reducer.maxBlocksInFlightPerAddress`, to limit the no. of map outputs being fetched from a given remote address. The changes applied here are applicable for both scenarios - when external shuffle is enabled as well as disabled.

Ran the job with the default configuration which does not change the 
existing behavior and ran it with few configurations of lower values 
-10,20,50,100. The job ran fine and there is no change in the output. (I will 
update the metrics related to NM in some time.)

[GitHub] spark pull request #21418: Branch 2.2

2018-05-23 Thread gentlewangyu
Github user gentlewangyu closed the pull request at:

https://github.com/apache/spark/pull/21418


---




[GitHub] spark pull request #21417: Branch 2.0

2018-05-23 Thread gentlewangyu
Github user gentlewangyu closed the pull request at:

https://github.com/apache/spark/pull/21417


---




[GitHub] spark pull request #21418: Branch 2.2

2018-05-23 Thread gentlewangyu
GitHub user gentlewangyu opened a pull request:

https://github.com/apache/spark/pull/21418

Branch 2.2

## What changes were proposed in this pull request?

compiling spark with scala-2.10 should use the -P parameter instead of -D

## How was this patch tested?


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/spark branch-2.2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21418.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21418


commit 9949fed1c45865b6e5e8ebe610789c5fb9546052
Author: Corey Woodfield 
Date:   2017-07-19T22:21:38Z

[SPARK-21333][DOCS] Removed invalid joinTypes from javadoc of 
Dataset#joinWith

## What changes were proposed in this pull request?

Two invalid join types were mistakenly listed in the javadoc for joinWith, in the Dataset class. I presume these were copied from the javadoc of join, but since joinWith returns a Dataset\<(T, U)\>, left_semi and left_anti are invalid, as they only return values from one of the datasets, instead of from both.

## How was this patch tested?

I ran the following code:
```
public static void main(String[] args) {
    SparkSession spark = new SparkSession(new SparkContext("local[*]", "Test"));
    Dataset<Row> one = spark.createDataFrame(
        Arrays.asList(new Bean(1), new Bean(2), new Bean(3), new Bean(4), new Bean(5)),
        Bean.class);
    Dataset<Row> two = spark.createDataFrame(
        Arrays.asList(new Bean(4), new Bean(5), new Bean(6), new Bean(7), new Bean(8), new Bean(9)),
        Bean.class);

    // Exercise joinWith with every join type; the last two (left_semi and
    // left_anti) throw, since joinWith must return pairs from both sides.
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "inner").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "cross").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "outer").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "full").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "full_outer").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "left").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_outer").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "right").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "right_outer").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_semi").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_anti").show(); } catch (Exception e) { e.printStackTrace(); }
}
```
which tests all the different join types; the last two (left_semi and left_anti) threw exceptions, while the same code using join instead of joinWith ran fine. The Bean class was just a Java bean with a single int field, x.

Author: Corey Woodfield 

Closes #18462 from coreywoodfield/master.

(cherry picked from commit 8cd9cdf17a7a4ad6f2eecd7c4b388ca363c20982)
Signed-off-by: gatorsmile 

commit 88dccda393bc79dc6032f71b6acf8eb2b4b152be
Author: Dhruve Ashar 
Date:   2017-07-21T19:03:46Z

[SPARK-21243][CORE] Limit no. of map outputs in a shuffle fetch

For configurations with external shuffle enabled, we have observed that if a very large no. of blocks are being fetched from a remote host, it puts the NM under extra pressure and can crash it. This change introduces a configuration, `spark.reducer.maxBlocksInFlightPerAddress`, to limit the no. of map outputs being fetched from a given remote address. The changes applied here are applicable for both scenarios - when external shuffle is enabled as well as disabled.

Ran the job with the default configuration which does not change the 
existing behavior and ran it with few configurations of lower values 
-10,20,50,100. The job ran fine and there is no change in the output. (I will 
update the metrics related to NM in some time.)

Author: Dhruve Ashar 

Closes #18487 from dhruv

[GitHub] spark pull request #21417: Branch 2.0

2018-05-23 Thread gentlewangyu
GitHub user gentlewangyu opened a pull request:

https://github.com/apache/spark/pull/21417

Branch 2.0

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/spark branch-2.0

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21417.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21417


commit 050b8177e27df06d33a6f6f2b3b6a952b0d03ba6
Author: cody koeninger 
Date:   2016-10-12T22:22:06Z

[SPARK-17782][STREAMING][KAFKA] alternative eliminate race condition of 
poll twice

## What changes were proposed in this pull request?

Alternative approach to https://github.com/apache/spark/pull/15387

Author: cody koeninger 

Closes #15401 from koeninger/SPARK-17782-alt.

(cherry picked from commit f9a56a153e0579283160519065c7f3620d12da3e)
Signed-off-by: Shixiong Zhu 

commit 5903dabc57c07310573babe94e4f205bdea6455f
Author: Brian Cho 
Date:   2016-10-13T03:43:18Z

[SPARK-16827][BRANCH-2.0] Avoid reporting spill metrics as shuffle metrics

## What changes were proposed in this pull request?

Fix a bug where spill metrics were being reported as shuffle metrics. 
Eventually these spill metrics should be reported (SPARK-3577), but separate 
from shuffle metrics. The fix itself basically reverts the line to what it was 
in 1.6.

## How was this patch tested?

Cherry-picked from master (#15347)

Author: Brian Cho 

Closes #15455 from dafrista/shuffle-metrics-2.0.

commit ab00e410c6b1d7dafdfabcea1f249c78459b94f0
Author: Burak Yavuz 
Date:   2016-10-13T04:40:45Z

[SPARK-17876] Write StructuredStreaming WAL to a stream instead of 
materializing all at once

## What changes were proposed in this pull request?

The CompactibleFileStreamLog materializes the whole metadata log in memory 
as a String. This can cause issues when there are lots of files that are being 
committed, especially during a compaction batch.
You may come across stacktraces that look like:
```
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.lang.StringCoding.encode(StringCoding.java:350)
at java.lang.String.getBytes(String.java:941)
at 
org.apache.spark.sql.execution.streaming.FileStreamSinkLog.serialize(FileStreamSinkLog.scala:127)

```
The safer way is to write to an output stream so that we don't have to 
materialize a huge string.
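
A sketch of that idea (illustrative names, not Spark's actual implementation): serialize each entry straight to the stream so no single huge String is ever built:

```
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class StreamingLogWriter {
    // Write each metadata entry as it is serialized, so memory use stays
    // proportional to one entry rather than to the whole log.
    static void serialize(List<String> entries, OutputStream out) throws IOException {
        for (String entry : entries) {
            out.write(entry.getBytes(StandardCharsets.UTF_8));
            out.write('\n');
        }
        out.flush();
    }
}
```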

## How was this patch tested?

Existing unit tests

Author: Burak Yavuz 

Closes #15437 from brkyvz/ser-to-stream.

(cherry picked from commit edeb51a39d76d64196d7635f52be1b42c7ec4341)
Signed-off-by: Shixiong Zhu 

commit d38f38a093b4dff32c686675d93ab03e7a8f4908
Author: buzhihuojie 
Date:   2016-10-13T05:51:54Z

minor doc fix for Row.scala

## What changes were proposed in this pull request?

minor doc fix for "getAnyValAs" in class Row

## How was this patch tested?

None.

(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Author: buzhihuojie 

Closes #15452 from david-weiluo-ren/minorDocFixForRow.

(cherry picked from commit 7222a25a11790fa9d9d1428c84b6f827a785c9e8)
Signed-off-by: Reynold Xin 

commit d7fa3e32421c73adfa522adfeeb970edd4c22eb3
Author: Shixiong Zhu 
Date:   2016-10-13T20:31:50Z

[SPARK-17834][SQL] Fetch the earliest offsets manually in KafkaSource 
instead of counting on KafkaConsumer

## What changes were proposed in this pull request?

Because `KafkaConsumer.poll(0)` may update the partition offsets, this PR 
just calls `seekToBeginning` to manually set the earliest offsets for the 
KafkaSource initial offsets.
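
A minimal sketch of the manual-seek approach with the plain Kafka consumer API (topic, broker, and settings are illustrative; assumes a kafka-clients version where `seekToBeginning` takes a collection):

```
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class EarliestOffsets {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> parts =
                Collections.singletonList(new TopicPartition("topic", 0));
            consumer.assign(parts);
            // Seek explicitly rather than relying on poll(0), which may
            // itself move the positions and so races with the read.
            consumer.seekToBeginning(parts);
            for (TopicPartition tp : parts) {
                System.out.println(tp + " earliest offset = " + consumer.position(tp));
            }
        }
    }
}
```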

## How was this patch tested?

Existing tests.

Author: Shixiong Zhu 

Closes #15397 from zsxwing/SPARK-17834.

(cherry picked from commit 08eac356095c7faa2b19d52f2fb0cbc47eb7d1d1)
Signed-off-by: Shixiong Zhu 

commit c53b8374911e801ed98c1436c384f0aef076eaab
Author: Davies Liu 
Date:   2016-10-14T21:45:20Z

[SPARK-17863][SQL] should not add column into Distinct

## What changes were proposed in this pull request?

We are trying to resolve the attribute in sort by pulling up some column 
for grandchild into ch