[GitHub] dongjoon-hyun commented on a change in pull request #169: Add Apache Spark 2.2.3 release news and update links

2019-01-14 Thread GitBox
dongjoon-hyun commented on a change in pull request #169: Add Apache Spark 
2.2.3 release news and update links
URL: https://github.com/apache/spark-website/pull/169#discussion_r247789724
 
 

 ##
 File path: js/downloads.js
 ##
 @@ -36,7 +36,8 @@ addRelease("2.4.0", new Date("11/02/2018"), packagesV9, 
true, true);
 addRelease("2.3.2", new Date("09/24/2018"), packagesV8, true, true);
 addRelease("2.3.1", new Date("06/08/2018"), packagesV8, true, true);
 
 Review comment:
   Shall I make another PR to remove this?





[GitHub] felixcheung commented on a change in pull request #169: Add Apache Spark 2.2.3 release news and update links

2019-01-14 Thread GitBox
felixcheung commented on a change in pull request #169: Add Apache Spark 2.2.3 
release news and update links
URL: https://github.com/apache/spark-website/pull/169#discussion_r247789120
 
 

 ##
 File path: js/downloads.js
 ##
 @@ -36,7 +36,8 @@ addRelease("2.4.0", new Date("11/02/2018"), packagesV9, 
true, true);
 addRelease("2.3.2", new Date("09/24/2018"), packagesV8, true, true);
 addRelease("2.3.1", new Date("06/08/2018"), packagesV8, true, true);
 
 Review comment:
   just checked - I could remove it from dist/release later - after this is 
changed to false.
   





[GitHub] felixcheung commented on a change in pull request #169: Add Apache Spark 2.2.3 release news and update links

2019-01-14 Thread GitBox
felixcheung commented on a change in pull request #169: Add Apache Spark 2.2.3 
release news and update links
URL: https://github.com/apache/spark-website/pull/169#discussion_r247788371
 
 

 ##
 File path: js/downloads.js
 ##
 @@ -36,7 +36,8 @@ addRelease("2.4.0", new Date("11/02/2018"), packagesV9, 
true, true);
 addRelease("2.3.2", new Date("09/24/2018"), packagesV8, true, true);
 addRelease("2.3.1", new Date("06/08/2018"), packagesV8, true, true);
 
 Review comment:
  though I tested a few mirror links for 2.3.1, and it seems like it is still replicated on the mirrors.





[GitHub] dongjoon-hyun commented on a change in pull request #169: Add Apache Spark 2.2.3 release news and update links

2019-01-14 Thread GitBox
dongjoon-hyun commented on a change in pull request #169: Add Apache Spark 
2.2.3 release news and update links
URL: https://github.com/apache/spark-website/pull/169#discussion_r247787955
 
 

 ##
 File path: js/downloads.js
 ##
 @@ -36,7 +36,8 @@ addRelease("2.4.0", new Date("11/02/2018"), packagesV9, 
true, true);
 addRelease("2.3.2", new Date("09/24/2018"), packagesV8, true, true);
 addRelease("2.3.1", new Date("06/08/2018"), packagesV8, true, true);
 
 Review comment:
   Yes, but `2.3.1` is still on the mirror.





[GitHub] felixcheung commented on a change in pull request #169: Add Apache Spark 2.2.3 release news and update links

2019-01-14 Thread GitBox
felixcheung commented on a change in pull request #169: Add Apache Spark 2.2.3 
release news and update links
URL: https://github.com/apache/spark-website/pull/169#discussion_r247787593
 
 

 ##
 File path: js/downloads.js
 ##
 @@ -36,7 +36,8 @@ addRelease("2.4.0", new Date("11/02/2018"), packagesV9, 
true, true);
 addRelease("2.3.2", new Date("09/24/2018"), packagesV8, true, true);
 addRelease("2.3.1", new Date("06/08/2018"), packagesV8, true, true);
 
 Review comment:
   btw, this should be `packagesV8, true, false);` I think?





[GitHub] dongjoon-hyun closed pull request #169: Add Apache Spark 2.2.3 release news and update links

2019-01-14 Thread GitBox
dongjoon-hyun closed pull request #169: Add Apache Spark 2.2.3 release news and 
update links
URL: https://github.com/apache/spark-website/pull/169
 
 
   

As this is a foreign pull request (from a fork), the diff has been
sent to your commit mailing list, commits@spark.apache.org





[GitHub] dongjoon-hyun commented on issue #169: Add Apache Spark 2.2.3 release news and update links

2019-01-14 Thread GitBox
dongjoon-hyun commented on issue #169: Add Apache Spark 2.2.3 release news and 
update links
URL: https://github.com/apache/spark-website/pull/169#issuecomment-454294453
 
 
   Thank you, @felixcheung and @HyukjinKwon .





[GitHub] HyukjinKwon commented on a change in pull request #169: Add Apache Spark 2.2.3 release news and update links

2019-01-14 Thread GitBox
HyukjinKwon commented on a change in pull request #169: Add Apache Spark 2.2.3 
release news and update links
URL: https://github.com/apache/spark-website/pull/169#discussion_r247783477
 
 

 ##
 File path: site/mailing-lists.html
 ##
 @@ -12,7 +12,7 @@
 
   
 
-https://spark.apache.org/community.html"; />
+http://localhost:4000/community.html"; />
 
 Review comment:
  Yea, IIRC, this happens when Jekyll is run with `serve`.





[GitHub] dongjoon-hyun commented on a change in pull request #169: Add Apache Spark 2.2.3 release news and update links

2019-01-14 Thread GitBox
dongjoon-hyun commented on a change in pull request #169: Add Apache Spark 
2.2.3 release news and update links
URL: https://github.com/apache/spark-website/pull/169#discussion_r247783418
 
 

 ##
 File path: js/downloads.js
 ##
 @@ -36,7 +36,8 @@ addRelease("2.4.0", new Date("11/02/2018"), packagesV9, 
true, true);
 addRelease("2.3.2", new Date("09/24/2018"), packagesV8, true, true);
 addRelease("2.3.1", new Date("06/08/2018"), packagesV8, true, true);
 addRelease("2.3.0", new Date("02/28/2018"), packagesV8, true, false);
-addRelease("2.2.2", new Date("07/02/2018"), packagesV8, true, true);
+addRelease("2.2.3", new Date("01/11/2018"), packagesV8, true, true);
 
 Review comment:
   oh.. Thanks.





[GitHub] felixcheung commented on a change in pull request #169: Add Apache Spark 2.2.3 release news and update links

2019-01-14 Thread GitBox
felixcheung commented on a change in pull request #169: Add Apache Spark 2.2.3 
release news and update links
URL: https://github.com/apache/spark-website/pull/169#discussion_r247782800
 
 

 ##
 File path: js/downloads.js
 ##
 @@ -36,7 +36,8 @@ addRelease("2.4.0", new Date("11/02/2018"), packagesV9, 
true, true);
 addRelease("2.3.2", new Date("09/24/2018"), packagesV8, true, true);
 addRelease("2.3.1", new Date("06/08/2018"), packagesV8, true, true);
 addRelease("2.3.0", new Date("02/28/2018"), packagesV8, true, false);
-addRelease("2.2.2", new Date("07/02/2018"), packagesV8, true, true);
+addRelease("2.2.3", new Date("01/11/2018"), packagesV8, true, true);
 
 Review comment:
   2018 -> 2019 ;)
   





[GitHub] felixcheung commented on a change in pull request #169: Add Apache Spark 2.2.3 release news and update links

2019-01-14 Thread GitBox
felixcheung commented on a change in pull request #169: Add Apache Spark 2.2.3 
release news and update links
URL: https://github.com/apache/spark-website/pull/169#discussion_r247783039
 
 

 ##
 File path: site/js/downloads.js
 ##
 @@ -36,7 +36,8 @@ addRelease("2.4.0", new Date("11/02/2018"), packagesV9, 
true, true);
 addRelease("2.3.2", new Date("09/24/2018"), packagesV8, true, true);
 addRelease("2.3.1", new Date("06/08/2018"), packagesV8, true, true);
 addRelease("2.3.0", new Date("02/28/2018"), packagesV8, true, false);
-addRelease("2.2.2", new Date("07/02/2018"), packagesV8, true, true);
+addRelease("2.2.3", new Date("01/11/2018"), packagesV8, true, true);
 
 Review comment:
   2018 -> 2019





[GitHub] felixcheung commented on a change in pull request #169: Add Apache Spark 2.2.3 release news and update links

2019-01-14 Thread GitBox
felixcheung commented on a change in pull request #169: Add Apache Spark 2.2.3 
release news and update links
URL: https://github.com/apache/spark-website/pull/169#discussion_r247783110
 
 

 ##
 File path: site/mailing-lists.html
 ##
 @@ -12,7 +12,7 @@
 
   
 
-https://spark.apache.org/community.html"; />
+http://localhost:4000/community.html"; />
 
 Review comment:
   this shouldn't change?





[GitHub] dongjoon-hyun commented on issue #169: Add Apache Spark 2.2.3 release news and update links

2019-01-14 Thread GitBox
dongjoon-hyun commented on issue #169: Add Apache Spark 2.2.3 release news and 
update links
URL: https://github.com/apache/spark-website/pull/169#issuecomment-454289763
 
 
   Hi, @srowen .
   Could you review this PR?





[GitHub] dongjoon-hyun opened a new pull request #169: Add Apache Spark 2.2.3 release news and update links

2019-01-14 Thread GitBox
dongjoon-hyun opened a new pull request #169: Add Apache Spark 2.2.3 release 
news and update links
URL: https://github.com/apache/spark-website/pull/169
 
 
  This PR adds or updates the following.
   
   **1. News**
   https://user-images.githubusercontent.com/9700541/51164265-a7548180-1859-11e9-8660-4fc5d2609d4d.png
   
   **2. Release Note**
   https://user-images.githubusercontent.com/9700541/51164272-aae80880-1859-11e9-99bc-cf8933bef57f.png
   
   **3. Download**
   https://user-images.githubusercontent.com/9700541/51164281-ade2f900-1859-11e9-93c6-1e97e593f8cc.png
   
   
   In addition, the release link (https://s.apache.org/spark-2.2.3) was already created for this, and the 2.2.2 artifacts will be removed from the mirrors tomorrow.





[spark] Diff for: [GitHub] asfgit closed pull request #23533: [CORE][MINOR] Fix some typos about MemoryMode

2019-01-14 Thread GitBox
diff --git 
a/common/unsafe/src/main/java/org/apache/spark/unsafe/array/LongArray.java 
b/common/unsafe/src/main/java/org/apache/spark/unsafe/array/LongArray.java
index 2cd39bd60c2ac..305cc1c5d1115 100644
--- a/common/unsafe/src/main/java/org/apache/spark/unsafe/array/LongArray.java
+++ b/common/unsafe/src/main/java/org/apache/spark/unsafe/array/LongArray.java
@@ -23,7 +23,7 @@
 /**
  * An array of long values. Compared with native JVM arrays, this:
  * 
- *   supports using both in-heap and off-heap memory
+ *   supports using both on-heap and off-heap memory
  *   has no bound checking, and thus can crash the JVM process when assert 
is turned off
  * 
  */
diff --git 
a/common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryLocation.java
 
b/common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryLocation.java
index 74ebc87dc978c..897b8a2b7ec50 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryLocation.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryLocation.java
@@ -21,7 +21,7 @@
 
 /**
  * A memory location. Tracked either by a memory address (with off-heap 
allocation),
- * or by an offset from a JVM object (in-heap allocation).
+ * or by an offset from a JVM object (on-heap allocation).
  */
 public class MemoryLocation {
 
diff --git a/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java 
b/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java
index 28b646ba3c951..1d9391845be5f 100644
--- a/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java
+++ b/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java
@@ -85,9 +85,9 @@
   /**
* Similar to an operating system's page table, this array maps page numbers 
into base object
* pointers, allowing us to translate between the hashtable's internal 
64-bit address
-   * representation and the baseObject+offset representation which we use to 
support both in- and
+   * representation and the baseObject+offset representation which we use to 
support both on- and
* off-heap addresses. When using an off-heap allocator, every entry in this 
map will be `null`.
-   * When using an in-heap allocator, the entries in this map will point to 
pages' base objects.
+   * When using an on-heap allocator, the entries in this map will point to 
pages' base objects.
* Entries are added to this map as new data pages are allocated.
*/
   private final MemoryBlock[] pageTable = new MemoryBlock[PAGE_TABLE_SIZE];
@@ -102,7 +102,7 @@
   private final long taskAttemptId;
 
   /**
-   * Tracks whether we're in-heap or off-heap. For off-heap, we short-circuit 
most of these methods
+   * Tracks whether we're on-heap or off-heap. For off-heap, we short-circuit 
most of these methods
* without doing any masking or lookups. Since this branching should be 
well-predicted by the JIT,
* this extra layer of indirection / abstraction hopefully shouldn't be too 
expensive.
*/





[spark] branch master updated: [CORE][MINOR] Fix some typos about MemoryMode

2019-01-14 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a77505d  [CORE][MINOR] Fix some typos about MemoryMode
a77505d is described below

commit a77505d4d3e5db4413d7fa76610265792d949c64
Author: SongYadong 
AuthorDate: Tue Jan 15 14:40:00 2019 +0800

[CORE][MINOR] Fix some typos about MemoryMode

## What changes were proposed in this pull request?

Fix typos in comments by replacing "in-heap" with "on-heap".

## How was this patch tested?

Existing Tests.

Closes #23533 from SongYadong/typos_inheap_to_onheap.

Authored-by: SongYadong 
Signed-off-by: Hyukjin Kwon 
---
 .../src/main/java/org/apache/spark/unsafe/array/LongArray.java  | 2 +-
 .../main/java/org/apache/spark/unsafe/memory/MemoryLocation.java| 2 +-
 core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java   | 6 +++---
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/unsafe/array/LongArray.java 
b/common/unsafe/src/main/java/org/apache/spark/unsafe/array/LongArray.java
index 2cd39bd..305cc1c 100644
--- a/common/unsafe/src/main/java/org/apache/spark/unsafe/array/LongArray.java
+++ b/common/unsafe/src/main/java/org/apache/spark/unsafe/array/LongArray.java
@@ -23,7 +23,7 @@ import org.apache.spark.unsafe.memory.MemoryBlock;
 /**
  * An array of long values. Compared with native JVM arrays, this:
  * 
- *   supports using both in-heap and off-heap memory
+ *   supports using both on-heap and off-heap memory
  *   has no bound checking, and thus can crash the JVM process when assert 
is turned off
  * 
  */
diff --git 
a/common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryLocation.java
 
b/common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryLocation.java
index 74ebc87..897b8a2 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryLocation.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryLocation.java
@@ -21,7 +21,7 @@ import javax.annotation.Nullable;
 
 /**
  * A memory location. Tracked either by a memory address (with off-heap 
allocation),
- * or by an offset from a JVM object (in-heap allocation).
+ * or by an offset from a JVM object (on-heap allocation).
  */
 public class MemoryLocation {
 
diff --git a/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java 
b/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java
index 28b646b..1d93918 100644
--- a/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java
+++ b/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java
@@ -85,9 +85,9 @@ public class TaskMemoryManager {
   /**
* Similar to an operating system's page table, this array maps page numbers 
into base object
* pointers, allowing us to translate between the hashtable's internal 
64-bit address
-   * representation and the baseObject+offset representation which we use to 
support both in- and
+   * representation and the baseObject+offset representation which we use to 
support both on- and
* off-heap addresses. When using an off-heap allocator, every entry in this 
map will be `null`.
-   * When using an in-heap allocator, the entries in this map will point to 
pages' base objects.
+   * When using an on-heap allocator, the entries in this map will point to 
pages' base objects.
* Entries are added to this map as new data pages are allocated.
*/
   private final MemoryBlock[] pageTable = new MemoryBlock[PAGE_TABLE_SIZE];
@@ -102,7 +102,7 @@ public class TaskMemoryManager {
   private final long taskAttemptId;
 
   /**
-   * Tracks whether we're in-heap or off-heap. For off-heap, we short-circuit 
most of these methods
+   * Tracks whether we're on-heap or off-heap. For off-heap, we short-circuit 
most of these methods
* without doing any masking or lookups. Since this branching should be 
well-predicted by the JIT,
* this extra layer of indirection / abstraction hopefully shouldn't be too 
expensive.
*/
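
The patch above only touches comments, but the on-heap/off-heap distinction it fixes is user-visible: TaskMemoryManager works in MemoryMode.ON_HEAP by default and in MemoryMode.OFF_HEAP only when off-heap memory is explicitly enabled. A minimal sketch of enabling it through the standard spark.memory.offHeap.* settings; the local master, app name, and size below are illustrative assumptions, not part of the commit:

```scala
import org.apache.spark.sql.SparkSession

object OffHeapSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")                              // illustrative: run locally
      .appName("offheap-sketch")
      .config("spark.memory.offHeap.enabled", "true")  // use off-heap (MemoryMode.OFF_HEAP) execution memory
      .config("spark.memory.offHeap.size", "512m")     // required when off-heap is enabled
      .getOrCreate()

    spark.range(0L, 1000000L).selectExpr("sum(id)").show()
    spark.stop()
  }
}
```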





[spark] Diff for: [GitHub] cloud-fan closed pull request #23543: [SPARK-25935][SQL] Allow null rows for bad records from JSON/CSV parsers

2019-01-14 Thread GitBox
diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R 
b/R/pkg/tests/fulltests/test_sparkSQL.R
index a1805f57b1dcf..88f2286219525 100644
--- a/R/pkg/tests/fulltests/test_sparkSQL.R
+++ b/R/pkg/tests/fulltests/test_sparkSQL.R
@@ -1694,7 +1694,7 @@ test_that("column functions", {
 
   # check for unparseable
   df <- as.DataFrame(list(list("a" = "")))
-  expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]]$a, NA)
+  expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]], NA)
 
   # check if array type in string is correctly supported.
   jsonArr <- "[{\"name\":\"Bob\"}, {\"name\":\"Alice\"}]"
diff --git a/docs/sql-migration-guide-upgrade.md 
b/docs/sql-migration-guide-upgrade.md
index fce0b9a5f86a0..5d3d4c6ece39d 100644
--- a/docs/sql-migration-guide-upgrade.md
+++ b/docs/sql-migration-guide-upgrade.md
@@ -17,8 +17,6 @@ displayTitle: Spark SQL Upgrading Guide
 
   - Since Spark 3.0, the `from_json` functions supports two modes - 
`PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The 
default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` 
did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing 
of malformed JSON records. For example, the JSON string `{"a" 1}` with the 
schema `a INT` is converted to `null` by previous versions but Spark 3.0 
converts it to `Row(null)`.
 
-  - In Spark version 2.4 and earlier, the `from_json` function produces 
`null`s for JSON strings and JSON datasource skips the same independently of 
its mode if there is no valid root JSON token in its input (` ` for example). 
Since Spark 3.0, such input is treated as a bad record and handled according to 
specified mode. For example, in the `PERMISSIVE` mode the ` ` input is 
converted to `Row(null, null)` if specified schema is `key STRING, value INT`.
-
   - The `ADD JAR` command previously returned a result set with the single 
value 0. It now returns an empty result set.
 
   - In Spark version 2.4 and earlier, users can create map values with map 
type key via built-in function like `CreateMap`, `MapFromArrays`, etc. Since 
Spark 3.0, it's not allowed to create map values with map type key with these 
built-in functions. Users can still read map values with map type key from data 
source or Java/Scala collections, though they are not very useful.
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
index e0cab537ce1c6..3403349c8974e 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
@@ -548,23 +548,15 @@ case class JsonToStructs(
   s"Input schema ${nullableSchema.catalogString} must be a struct, an 
array or a map.")
   }
 
-  @transient
-  private lazy val castRow = nullableSchema match {
-case _: StructType => (row: InternalRow) => row
-case _: ArrayType => (row: InternalRow) => row.getArray(0)
-case _: MapType => (row: InternalRow) => row.getMap(0)
-  }
-
   // This converts parsed rows to the desired output by the given schema.
-  private def convertRow(rows: Iterator[InternalRow]) = {
-if (rows.hasNext) {
-  val result = rows.next()
-  // JSON's parser produces one record only.
-  assert(!rows.hasNext)
-  castRow(result)
-} else {
-  throw new IllegalArgumentException("Expected one row from JSON parser.")
-}
+  @transient
+  lazy val converter = nullableSchema match {
+case _: StructType =>
+  (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
+case _: ArrayType =>
+  (rows: Iterator[InternalRow]) => if (rows.hasNext) 
rows.next().getArray(0) else null
+case _: MapType =>
+  (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getMap(0) 
else null
   }
 
   val nameOfCorruptRecord = 
SQLConf.get.getConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD)
@@ -600,7 +592,7 @@ case class JsonToStructs(
 copy(timeZoneId = Option(timeZoneId))
 
   override def nullSafeEval(json: Any): Any = {
-convertRow(parser.parse(json.asInstanceOf[UTF8String]))
+converter(parser.parse(json.asInstanceOf[UTF8String]))
   }
 
   override def inputTypes: Seq[AbstractDataType] = StringType :: Nil
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
index 3f245e1400fa1..8cf758e26e29b 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
@@ -399,7 +399,7 @@ class JacksonParser(
 // a null first token is equivalent to testing for input.trim.isEmpty
 // but it works on any token strea

[spark] Diff for: [GitHub] cloud-fan closed pull request #23325: [SPARK-26376][SQL] Skip inputs without tokens by JSON datasource

2019-01-14 Thread GitBox
diff --git a/docs/sql-migration-guide-upgrade.md 
b/docs/sql-migration-guide-upgrade.md
index 0fcdd420bcfe3..8cb7ee78b00d2 100644
--- a/docs/sql-migration-guide-upgrade.md
+++ b/docs/sql-migration-guide-upgrade.md
@@ -17,7 +17,7 @@ displayTitle: Spark SQL Upgrading Guide
 
   - Since Spark 3.0, the `from_json` functions supports two modes - 
`PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The 
default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` 
did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing 
of malformed JSON records. For example, the JSON string `{"a" 1}` with the 
schema `a INT` is converted to `null` by previous versions but Spark 3.0 
converts it to `Row(null)`.
 
-  - In Spark version 2.4 and earlier, the `from_json` function produces 
`null`s for JSON strings and JSON datasource skips the same independently of 
its mode if there is no valid root JSON token in its input (` ` for example). 
Since Spark 3.0, such input is treated as a bad record and handled according to 
specified mode. For example, in the `PERMISSIVE` mode the ` ` input is 
converted to `Row(null, null)` if specified schema is `key STRING, value INT`.
+  - In Spark version 2.4 and earlier, the `from_json` function produces 
`null`s for JSON strings without valid root JSON tokens (` ` for example). 
Since Spark 3.0, such input is treated as a bad record and handled according to 
specified mode. For example, in the `PERMISSIVE` mode the ` ` input is 
converted to `Row(null, null)` if specified schema is `key STRING, value INT`.
 
   - The `ADD JAR` command previously returned a result set with the single 
value 0. It now returns an empty result set.
 
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
index e0cab537ce1c6..27b6a63956198 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
@@ -583,7 +583,11 @@ case class JsonToStructs(
 (StructType(StructField("value", other) :: Nil), other)
 }
 
-val rawParser = new JacksonParser(actualSchema, parsedOptions, 
allowArrayAsStructs = false)
+val rawParser = new JacksonParser(
+  actualSchema,
+  parsedOptions,
+  allowArrayAsStructs = false,
+  skipInputWithoutTokens = false)
 val createParser = CreateJacksonParser.utf8String _
 
 new FailureSafeParser[UTF8String](
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
index 3f245e1400fa1..0f206f843cc6f 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
@@ -40,7 +40,8 @@ import org.apache.spark.util.Utils
 class JacksonParser(
 schema: DataType,
 val options: JSONOptions,
-allowArrayAsStructs: Boolean) extends Logging {
+allowArrayAsStructs: Boolean,
+skipInputWithoutTokens: Boolean) extends Logging {
 
   import JacksonUtils._
   import com.fasterxml.jackson.core.JsonToken._
@@ -399,6 +400,7 @@ class JacksonParser(
 // a null first token is equivalent to testing for input.trim.isEmpty
 // but it works on any token stream and not just strings
 parser.nextToken() match {
+  case null if skipInputWithoutTokens => Nil
   case null => throw new RuntimeException("Not found any JSON token")
   case _ => rootConverter.apply(parser) match {
 case null => throw new RuntimeException("Root converter returned 
null")
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index ce8e4c8f5b82b..96b3897f1d038 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -455,7 +455,11 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
 
 val createParser = CreateJacksonParser.string _
 val parsed = jsonDataset.rdd.mapPartitions { iter =>
-  val rawParser = new JacksonParser(actualSchema, parsedOptions, 
allowArrayAsStructs = true)
+  val rawParser = new JacksonParser(
+actualSchema,
+parsedOptions,
+allowArrayAsStructs = true,
+skipInputWithoutTokens = true)
   val parser = new FailureSafeParser[String](
 input => rawParser.parse(input, createParser, UTF8String.fromString),
 parsedOptions.parseMode,
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala
 
b

svn commit: r31963 - in /dev/spark/3.0.0-SNAPSHOT-2019_01_14_20_51-abc937b-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2019-01-14 Thread pwendell
Author: pwendell
Date: Tue Jan 15 05:03:26 2019
New Revision: 31963

Log:
Apache Spark 3.0.0-SNAPSHOT-2019_01_14_20_51-abc937b docs


[This commit notification would consist of 1775 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]




[spark] branch master updated (abc937b -> 33b5039)

2019-01-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from abc937b  [MINOR][BUILD] Remove binary license/notice files in a source 
release for branch-2.4+ only
 add 33b5039  [SPARK-25935][SQL] Allow null rows for bad records from 
JSON/CSV parsers

No new revisions were added by this update.

Summary of changes:
 R/pkg/tests/fulltests/test_sparkSQL.R  |  2 +-
 docs/sql-migration-guide-upgrade.md|  2 --
 .../sql/catalyst/expressions/jsonExpressions.scala | 26 --
 .../spark/sql/catalyst/json/JacksonParser.scala|  2 +-
 .../expressions/JsonExpressionsSuite.scala |  2 +-
 .../scala/org/apache/spark/sql/SQLQuerySuite.scala | 10 +
 .../sql/execution/datasources/json/JsonSuite.scala |  7 +-
 7 files changed, 23 insertions(+), 28 deletions(-)
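
The migration-guide text quoted in the diffs further up describes the user-facing behavior this commit settles: since Spark 3.0, `from_json` supports the PERMISSIVE and FAILFAST modes, and in PERMISSIVE mode malformed records come back as null rows instead of failing the query. A minimal sketch of the two modes; the local session and the tiny inline dataset are assumptions for illustration only:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructType}

object FromJsonModesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("from_json-modes").getOrCreate()
    import spark.implicits._

    val schema = new StructType().add("a", IntegerType)
    // One well-formed record, one malformed record, and one input with no JSON token at all.
    val df = Seq("""{"a": 1}""", """{"a" 1}""", " ").toDF("json")

    // PERMISSIVE (the default): bad records yield null rows rather than an error.
    df.select(from_json($"json", schema, Map("mode" -> "PERMISSIVE")).as("parsed")).show()

    // FAILFAST: the same malformed inputs make the parse throw instead.
    // df.select(from_json($"json", schema, Map("mode" -> "FAILFAST")).as("parsed")).show()

    spark.stop()
  }
}
```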





svn commit: r31962 - in /dev/spark/2.4.1-SNAPSHOT-2019_01_14_18_37-743dedb-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2019-01-14 Thread pwendell
Author: pwendell
Date: Tue Jan 15 02:51:25 2019
New Revision: 31962

Log:
Apache Spark 2.4.1-SNAPSHOT-2019_01_14_18_37-743dedb docs


[This commit notification would consist of 1476 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]




[spark] Diff for: [GitHub] srowen closed pull request #23538: [MINOR][BUILD] Remove binary license/notice files in a source release for branch-2.4+ only

2019-01-14 Thread GitBox
diff --git a/dev/create-release/release-build.sh 
b/dev/create-release/release-build.sh
index 2fdb5c8dd38a1..3b5f9ef2afe8f 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -176,10 +176,14 @@ if [[ "$1" == "package" ]]; then
   # Source and binary tarballs
   echo "Packaging release source tarballs"
   cp -r spark spark-$SPARK_VERSION
-  # For source release, exclude copy of binary license/notice
-  rm spark-$SPARK_VERSION/LICENSE-binary
-  rm spark-$SPARK_VERSION/NOTICE-binary
-  rm -r spark-$SPARK_VERSION/licenses-binary
+
+  # For source release in v2.4+, exclude copy of binary license/notice
+  if [[ $SPARK_VERSION > "2.4" ]]; then
+rm spark-$SPARK_VERSION/LICENSE-binary
+rm spark-$SPARK_VERSION/NOTICE-binary
+rm -r spark-$SPARK_VERSION/licenses-binary
+  fi
+
   tar cvzf spark-$SPARK_VERSION.tgz spark-$SPARK_VERSION
   echo $GPG_PASSPHRASE | $GPG --passphrase-fd 0 --armour --output 
spark-$SPARK_VERSION.tgz.asc \
 --detach-sig spark-$SPARK_VERSION.tgz
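
The guard added above relies on bash's lexicographic string comparison: `[[ $SPARK_VERSION > "2.4" ]]` is true for 2.4.x and later version strings and false for 2.3.x, so the binary license/notice files are only stripped from branch-2.4+ source releases. A quick sketch of what that ordering gives for a few version strings (plain string comparison, the same idea as the bash test; the version list is illustrative):

```scala
object VersionGateSketch {
  def main(args: Array[String]): Unit = {
    val threshold = "2.4"
    // Lexicographic comparison, mirroring the [[ "$a" > "$b" ]] test in release-build.sh.
    Seq("2.3.2", "2.4.0", "2.4.1", "3.0.0").foreach { v =>
      val stripBinaryNotices = v > threshold
      println(s"$v > $threshold ? $stripBinaryNotices")
    }
    // Prints false for 2.3.2 and true for 2.4.0, 2.4.1, and 3.0.0.
  }
}
```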





[spark] branch branch-2.4 updated: [MINOR][BUILD] Remove binary license/notice files in a source release for branch-2.4+ only

2019-01-14 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
 new 743dedb  [MINOR][BUILD] Remove binary license/notice files in a source 
release for branch-2.4+ only
743dedb is described below

commit 743dedb7a8bf24bf7cd3d8b68add20875623b1c4
Author: Takeshi Yamamuro 
AuthorDate: Mon Jan 14 19:17:39 2019 -0600

[MINOR][BUILD] Remove binary license/notice files in a source release for 
branch-2.4+ only

## What changes were proposed in this pull request?
To skip some steps to remove binary license/notice files in a source 
release for branch2.3 (these files only exist in master/branch-2.4 now), this 
pr checked a Spark release version in `dev/create-release/release-build.sh`.

## How was this patch tested?
Manually checked.

Closes #23538 from maropu/FixReleaseScript.

Authored-by: Takeshi Yamamuro 
Signed-off-by: Sean Owen 
(cherry picked from commit abc937b24756e5d7479bac7229b0b4c1dc82efeb)
Signed-off-by: Sean Owen 
---
 dev/create-release/release-build.sh | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/dev/create-release/release-build.sh 
b/dev/create-release/release-build.sh
index 02c4193..5e65d99 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -174,10 +174,14 @@ if [[ "$1" == "package" ]]; then
   # Source and binary tarballs
   echo "Packaging release source tarballs"
   cp -r spark spark-$SPARK_VERSION
-  # For source release, exclude copy of binary license/notice
-  rm spark-$SPARK_VERSION/LICENSE-binary
-  rm spark-$SPARK_VERSION/NOTICE-binary
-  rm -r spark-$SPARK_VERSION/licenses-binary
+
+  # For source release in v2.4+, exclude copy of binary license/notice
+  if [[ $SPARK_VERSION > "2.4" ]]; then
+rm spark-$SPARK_VERSION/LICENSE-binary
+rm spark-$SPARK_VERSION/NOTICE-binary
+rm -r spark-$SPARK_VERSION/licenses-binary
+  fi
+
   tar cvzf spark-$SPARK_VERSION.tgz spark-$SPARK_VERSION
   echo $GPG_PASSPHRASE | $GPG --passphrase-fd 0 --armour --output 
spark-$SPARK_VERSION.tgz.asc \
 --detach-sig spark-$SPARK_VERSION.tgz





[spark] branch master updated: [MINOR][BUILD] Remove binary license/notice files in a source release for branch-2.4+ only

2019-01-14 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new abc937b  [MINOR][BUILD] Remove binary license/notice files in a source 
release for branch-2.4+ only
abc937b is described below

commit abc937b24756e5d7479bac7229b0b4c1dc82efeb
Author: Takeshi Yamamuro 
AuthorDate: Mon Jan 14 19:17:39 2019 -0600

[MINOR][BUILD] Remove binary license/notice files in a source release for 
branch-2.4+ only

## What changes were proposed in this pull request?
To skip some steps to remove binary license/notice files in a source 
release for branch2.3 (these files only exist in master/branch-2.4 now), this 
pr checked a Spark release version in `dev/create-release/release-build.sh`.

## How was this patch tested?
Manually checked.

Closes #23538 from maropu/FixReleaseScript.

Authored-by: Takeshi Yamamuro 
Signed-off-by: Sean Owen 
---
 dev/create-release/release-build.sh | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/dev/create-release/release-build.sh 
b/dev/create-release/release-build.sh
index 2fdb5c8..3b5f9ef 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -176,10 +176,14 @@ if [[ "$1" == "package" ]]; then
   # Source and binary tarballs
   echo "Packaging release source tarballs"
   cp -r spark spark-$SPARK_VERSION
-  # For source release, exclude copy of binary license/notice
-  rm spark-$SPARK_VERSION/LICENSE-binary
-  rm spark-$SPARK_VERSION/NOTICE-binary
-  rm -r spark-$SPARK_VERSION/licenses-binary
+
+  # For source release in v2.4+, exclude copy of binary license/notice
+  if [[ $SPARK_VERSION > "2.4" ]]; then
+rm spark-$SPARK_VERSION/LICENSE-binary
+rm spark-$SPARK_VERSION/NOTICE-binary
+rm -r spark-$SPARK_VERSION/licenses-binary
+  fi
+
   tar cvzf spark-$SPARK_VERSION.tgz spark-$SPARK_VERSION
   echo $GPG_PASSPHRASE | $GPG --passphrase-fd 0 --armour --output 
spark-$SPARK_VERSION.tgz.asc \
 --detach-sig spark-$SPARK_VERSION.tgz





svn commit: r31961 - in /dev/spark/3.0.0-SNAPSHOT-2019_01_14_16_23-bafc7ac-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2019-01-14 Thread pwendell
Author: pwendell
Date: Tue Jan 15 00:35:20 2019
New Revision: 31961

Log:
Apache Spark 3.0.0-SNAPSHOT-2019_01_14_16_23-bafc7ac docs


[This commit notification would consist of 1775 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]




[spark] Diff for: [GitHub] asfgit closed pull request #23301: [SPARK-26350][SS]Allow to override group id of the Kafka consumer

2019-01-14 Thread GitBox
diff --git a/docs/structured-streaming-kafka-integration.md 
b/docs/structured-streaming-kafka-integration.md
index 7040f8da2c614..88a772610578d 100644
--- a/docs/structured-streaming-kafka-integration.md
+++ b/docs/structured-streaming-kafka-integration.md
@@ -379,7 +379,25 @@ The following configurations are optional:
   string
   spark-kafka-source
   streaming and batch
-  Prefix of consumer group identifiers (`group.id`) that are generated by 
structured streaming queries
+  Prefix of consumer group identifiers (`group.id`) that are generated by 
structured streaming
+  queries. If "kafka.group.id" is set, this option will be ignored. 
+
+
+  kafka.group.id
+  string
+  none
+  streaming and batch
+  The Kafka group id to use in Kafka consumer while reading from Kafka. 
Use this with caution.
+  By default, each query generates a unique group id for reading data. This 
ensures that each Kafka
+  source has its own consumer group that does not face interference from any 
other consumer, and
+  therefore can read all of the partitions of its subscribed topics. In some 
scenarios (for example,
+  Kafka group-based authorization), you may want to use a specific authorized 
group id to read data.
+  You can optionally set the group id. However, do this with extreme caution 
as it can cause
+  unexpected behavior. Concurrently running queries (both, batch and 
streaming) or sources with the
+  same group id are likely interfere with each other causing each query to 
read only part of the
+  data. This may also occur when queries are started/restarted in quick 
succession. To minimize such
+  issues, set the Kafka consumer session timeout (by setting option 
"kafka.session.timeout.ms") to
+  be very small. When this is set, option "groupIdPrefix" will be ignored. 

 
 
 
@@ -592,8 +610,9 @@ for parameters related to writing data.
 Note that the following Kafka params cannot be set and the Kafka source or 
sink will throw an exception:
 
 - **group.id**: Kafka source will create a unique group id for each query 
automatically. The user can
-set the prefix of the automatically generated group.id's via the optional 
source option `groupIdPrefix`, default value
-is "spark-kafka-source".
+set the prefix of the automatically generated group.id's via the optional 
source option `groupIdPrefix`,
+default value is "spark-kafka-source". You can also set "kafka.group.id" to 
force Spark to use a special
+group id, however, please read warnings for this option and use it with 
caution.
 - **auto.offset.reset**: Set the source option `startingOffsets` to specify
  where to start instead. Structured Streaming manages which offsets are 
consumed internally, rather
  than rely on the kafka Consumer to do it. This will ensure that no data is 
missed when new
diff --git 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReadSupport.scala
 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReadSupport.scala
index 1753a28fba2fb..e2f476bd81eb8 100644
--- 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReadSupport.scala
+++ 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReadSupport.scala
@@ -20,7 +20,7 @@ package org.apache.spark.sql.kafka010
 import java.{util => ju}
 import java.util.concurrent.TimeoutException
 
-import org.apache.kafka.clients.consumer.{ConsumerRecord, 
OffsetOutOfRangeException}
+import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord, 
OffsetOutOfRangeException}
 import org.apache.kafka.common.TopicPartition
 
 import org.apache.spark.TaskContext
@@ -167,7 +167,13 @@ class KafkaContinuousScanConfigBuilder(
 
 val deletedPartitions = 
oldStartPartitionOffsets.keySet.diff(currentPartitionSet)
 if (deletedPartitions.nonEmpty) {
-  reportDataLoss(s"Some partitions were deleted: $deletedPartitions")
+  val message = if (
+  
offsetReader.driverKafkaParams.containsKey(ConsumerConfig.GROUP_ID_CONFIG)) {
+s"$deletedPartitions are gone. 
${KafkaSourceProvider.CUSTOM_GROUP_ID_ERROR_MESSAGE}"
+  } else {
+s"$deletedPartitions are gone. Some data may have been missed."
+  }
+  reportDataLoss(message)
 }
 
 val startOffsets = newPartitionOffsets ++
diff --git 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchReadSupport.scala
 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchReadSupport.scala
index bb4de674c3c72..a5b5690e5be15 100644
--- 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchReadSupport.scala
+++ 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchReadSupport.scala
@@ -22,6 +22,7 @@ import java.io._
 import java.nio.charset.StandardCharsets
 
 import org.apache.commons.io.IOUtils
+import org.apache.kafka.clients.consumer.ConsumerConfig
 
 i

[spark] branch master updated: [SPARK-26350][SS] Allow to override group id of the Kafka consumer

2019-01-14 Thread zsxwing
This is an automated email from the ASF dual-hosted git repository.

zsxwing pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new bafc7ac  [SPARK-26350][SS] Allow to override group id of the Kafka 
consumer
bafc7ac is described below

commit bafc7ac0259d5ea92b33a4005a28701c0c960a98
Author: Shixiong Zhu 
AuthorDate: Mon Jan 14 13:37:24 2019 -0800

[SPARK-26350][SS] Allow to override group id of the Kafka consumer

## What changes were proposed in this pull request?

This PR allows the user to override `kafka.group.id` for better monitoring 
or security. The user needs to make sure there are not multiple queries or 
sources using the same group id.

It also fixes a bug that the `groupIdPrefix` option cannot be retrieved.

## How was this patch tested?

The new added unit tests.

Closes #23301 from zsxwing/SPARK-26350.

Authored-by: Shixiong Zhu 
Signed-off-by: Shixiong Zhu 
---
 docs/structured-streaming-kafka-integration.md | 25 ---
 .../sql/kafka010/KafkaContinuousReadSupport.scala  | 10 ++--
 .../sql/kafka010/KafkaMicroBatchReadSupport.scala  |  9 ++-
 .../spark/sql/kafka010/KafkaOffsetReader.scala |  6 +++--
 .../apache/spark/sql/kafka010/KafkaSource.scala|  8 ++-
 .../spark/sql/kafka010/KafkaSourceProvider.scala   | 24 ++-
 .../sql/kafka010/KafkaMicroBatchSourceSuite.scala  | 28 +-
 .../spark/sql/kafka010/KafkaRelationSuite.scala| 12 ++
 8 files changed, 106 insertions(+), 16 deletions(-)
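
Before the full diff, a minimal sketch of the option this commit adds: a streaming read that pins `kafka.group.id` instead of relying on the generated `spark-kafka-source-*` group id. The broker address, topic, and group name below are illustrative assumptions, and the warning from the quoted docs applies: concurrently running queries must not share the same group id.

```scala
import org.apache.spark.sql.SparkSession

object KafkaFixedGroupIdSketch {
  def main(args: Array[String]): Unit = {
    // Assumes the spark-sql-kafka-0-10 package is on the classpath.
    val spark = SparkSession.builder().master("local[*]").appName("kafka-group-id").getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "events")                        // assumed topic
      .option("startingOffsets", "earliest")                // use this instead of auto.offset.reset
      .option("kafka.group.id", "authorized-group")         // added by this commit: overrides the generated group id
      .load()

    val query = stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```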

diff --git a/docs/structured-streaming-kafka-integration.md 
b/docs/structured-streaming-kafka-integration.md
index 3d64ec4..c19aa5c 100644
--- a/docs/structured-streaming-kafka-integration.md
+++ b/docs/structured-streaming-kafka-integration.md
@@ -379,7 +379,25 @@ The following configurations are optional:
   string
   spark-kafka-source
   streaming and batch
-  Prefix of consumer group identifiers (`group.id`) that are generated by 
structured streaming queries
+  Prefix of consumer group identifiers (`group.id`) that are generated by 
structured streaming
+  queries. If "kafka.group.id" is set, this option will be ignored. 
+
+
+  kafka.group.id
+  string
+  none
+  streaming and batch
+  The Kafka group id to use in Kafka consumer while reading from Kafka. 
Use this with caution.
+  By default, each query generates a unique group id for reading data. This 
ensures that each Kafka
+  source has its own consumer group that does not face interference from any 
other consumer, and
+  therefore can read all of the partitions of its subscribed topics. In some 
scenarios (for example,
+  Kafka group-based authorization), you may want to use a specific authorized 
group id to read data.
+  You can optionally set the group id. However, do this with extreme caution 
as it can cause
+  unexpected behavior. Concurrently running queries (both, batch and 
streaming) or sources with the
+  same group id are likely interfere with each other causing each query to 
read only part of the
+  data. This may also occur when queries are started/restarted in quick 
succession. To minimize such
+  issues, set the Kafka consumer session timeout (by setting option 
"kafka.session.timeout.ms") to
+  be very small. When this is set, option "groupIdPrefix" will be ignored. 

 
 
 
@@ -592,8 +610,9 @@ for parameters related to writing data.
 Note that the following Kafka params cannot be set and the Kafka source or 
sink will throw an exception:
 
 - **group.id**: Kafka source will create a unique group id for each query 
automatically. The user can
-set the prefix of the automatically generated group.id's via the optional 
source option `groupIdPrefix`, default value
-is "spark-kafka-source".
+set the prefix of the automatically generated group.id's via the optional 
source option `groupIdPrefix`,
+default value is "spark-kafka-source". You can also set "kafka.group.id" to 
force Spark to use a special
+group id, however, please read warnings for this option and use it with 
caution.
 - **auto.offset.reset**: Set the source option `startingOffsets` to specify
  where to start instead. Structured Streaming manages which offsets are 
consumed internally, rather
  than rely on the kafka Consumer to do it. This will ensure that no data is 
missed when new
diff --git 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReadSupport.scala
 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReadSupport.scala
index 02dfb9c..f328567 100644
--- 
a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReadSupport.scala
+++ 
b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReadSupport.scala
@@ -20,7 +20,7 @@ package org.apache.spark.sql.kafka010
 impor

[spark] Diff for: [GitHub] gatorsmile closed pull request #23496: [SPARK-26577][SQL] Add input optimizer when reading Hive table by SparkSQL

2019-01-14 Thread GitBox
diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
index cd321d41f43e8..20b1e1abce9c6 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
@@ -120,6 +120,23 @@ private[spark] object HiveUtils extends Logging {
 .toSequence
 .createWithDefault(jdbcPrefixes)
 
+  val HIVE_FILE_INPUT_FORMAT_ENABLED = 
buildConf("spark.sql.hive.fileInputFormat.enabled")
+.doc("When true, enable optimizing the `fileInputFormat` in Spark SQL.")
+.booleanConf
+.createWithDefault(false)
+
+  val HIVE_FILE_INPUT_FORMAT_SPLIT_MAXSIZE =
+buildConf("spark.sql.hive.fileInputFormat.split.maxsize")
+  .doc("The maxsize of per split while reading Hive tables.")
+  .longConf
+  .createWithDefault(128 * 1024 * 1024)
+
+  val HIVE_FILE_INPUT_FORMAT_SPLIT_MINSIZE =
+buildConf("spark.sql.hive.fileInputFormat.split.minsize")
+  .doc("The minsize of per split while reading Hive tables.")
+  .longConf
+  .createWithDefault(32 * 1024 * 1024)
+
   private def jdbcPrefixes = Seq(
 "com.mysql.jdbc", "org.postgresql", "com.microsoft.sqlserver", 
"oracle.jdbc")
 
diff --git 
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
index 7d57389947576..03abd49af1b56 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
@@ -123,8 +123,7 @@ class HadoopTableReader(
 val inputPathStr = applyFilterIfNeeded(tablePath, filterOpt)
 
 // logDebug("Table input: %s".format(tablePath))
-val ifc = hiveTable.getInputFormatClass
-  .asInstanceOf[java.lang.Class[InputFormat[Writable, Writable]]]
+val ifc = getAndOptimizeInput(hiveTable.getInputFormatClass.getName)
 val hadoopRDD = createHadoopRdd(localTableDesc, inputPathStr, ifc)
 
 val attrsWithIndex = attributes.zipWithIndex
@@ -164,7 +163,7 @@ class HadoopTableReader(
 def verifyPartitionPath(
 partitionToDeserializer: Map[HivePartition, Class[_ <: Deserializer]]):
 Map[HivePartition, Class[_ <: Deserializer]] = {
-  if (!sparkSession.sessionState.conf.verifyPartitionPath) {
+  if (!conf.verifyPartitionPath) {
 partitionToDeserializer
   } else {
 val existPathSet = collection.mutable.Set[String]()
@@ -202,8 +201,7 @@ class HadoopTableReader(
   val partDesc = Utilities.getPartitionDesc(partition)
   val partPath = partition.getDataLocation
   val inputPathStr = applyFilterIfNeeded(partPath, filterOpt)
-  val ifc = partDesc.getInputFileFormatClass
-.asInstanceOf[java.lang.Class[InputFormat[Writable, Writable]]]
+  val ifc = getAndOptimizeInput(partDesc.getInputFileFormatClassName)
   // Get partition field info
   val partSpec = partDesc.getPartSpec
   val partProps = partDesc.getProperties
@@ -311,6 +309,36 @@ class HadoopTableReader(
 // Only take the value (skip the key) because Hive works only with values.
 rdd.map(_._2)
   }
+
+  /**
+   * If `spark.sql.hive.fileInputFormat.enabled` is true, this function will 
optimize the input
+   * method(including format and the size of splits) while reading Hive tables.
+   */
+  private def getAndOptimizeInput(
+inputClassName: String): Class[InputFormat[Writable, Writable]] = {
+
+var ifc = Utils.classForName(inputClassName)
+  .asInstanceOf[java.lang.Class[InputFormat[Writable, Writable]]]
+if (conf.getConf(HiveUtils.HIVE_FILE_INPUT_FORMAT_ENABLED)) {
+  hadoopConf.set("mapreduce.input.fileinputformat.split.maxsize",
+conf.getConf(HiveUtils.HIVE_FILE_INPUT_FORMAT_SPLIT_MAXSIZE).toString)
+  hadoopConf.set("mapreduce.input.fileinputformat.split.minsize",
+conf.getConf(HiveUtils.HIVE_FILE_INPUT_FORMAT_SPLIT_MINSIZE).toString)
+  if ("org.apache.hadoop.mapreduce.lib.input.TextInputFormat"
+.equals(inputClassName)) {
+ifc = Utils.classForName(
+  "org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat")
+  .asInstanceOf[java.lang.Class[InputFormat[Writable, Writable]]]
+  }
+  if ("org.apache.hadoop.mapred.TextInputFormat"
+.equals(inputClassName)) {
+ifc = Utils.classForName(
+  "org.apache.hadoop.mapred.lib.CombineTextInputFormat")
+  .asInstanceOf[java.lang.Class[InputFormat[Writable, Writable]]]
+  }
+}
+ifc
+  }
 }
 
 private[hive] object HiveTableUtil {
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveTableScanSuite.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveTableScanSuite.scala
index 3f9bb8de42e09..8db8e1eaab5d7 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveTableScanSuite.scala
+++ 
b/sql/hive
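
For readers skimming the thread: a minimal sketch of how the configuration keys
added in this diff would be exercised from a Spark session. The option names come
from the hunk above; the table name and values are illustrative assumptions, and
the behaviour only applies if this change is merged.

// Sketch only: assumes the spark.sql.hive.fileInputFormat.* keys proposed above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("combine-small-hive-text-files")
  .enableHiveSupport()
  .getOrCreate()

// Opt in to the combined input format and bound the split sizes (in bytes).
spark.conf.set("spark.sql.hive.fileInputFormat.enabled", "true")
spark.conf.set("spark.sql.hive.fileInputFormat.split.maxsize", (128L * 1024 * 1024).toString)
spark.conf.set("spark.sql.hive.fileInputFormat.split.minsize", (32L * 1024 * 1024).toString)

// Hypothetical Hive table backed by many small text files.
spark.sql("SELECT count(*) FROM db.small_files_table").show()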

[spark] Diff for: [GitHub] gatorsmile closed pull request #23535: [SPARK-26613][SQL] Add another rename table grammar for spark sql

2019-01-14 Thread GitBox
diff --git 
a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 
b/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
index b39681d886c5c..de18fc93cba2d 100644
--- 
a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
+++ 
b/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
@@ -105,6 +105,7 @@ statement
 ADD COLUMNS '(' columns=colTypeList ')'
#addTableColumns
 | ALTER (TABLE | VIEW) from=tableIdentifier
 RENAME TO to=tableIdentifier   
#renameTable
+| RENAME (TABLE | VIEW) from=tableIdentifier TO to=tableIdentifier 
#renameTable
 | ALTER (TABLE | VIEW) tableIdentifier
 SET TBLPROPERTIES tablePropertyList
#setTableProperties
 | ALTER (TABLE | VIEW) tableIdentifier
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLParserSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLParserSuite.scala
index e0ccae15f1d05..0d8e75947c1a8 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLParserSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLParserSuite.scala
@@ -636,8 +636,12 @@ class DDLParserSuite extends PlanTest with 
SharedSQLContext {
   test("alter table/view: rename table/view") {
 val sql_table = "ALTER TABLE table_name RENAME TO new_table_name"
 val sql_view = sql_table.replace("TABLE", "VIEW")
+val sql_table2 = "RENAME TABLE table_name TO new_table_name"
+val sql_view2 = sql_table2.replace("TABLE", "VIEW")
 val parsed_table = parser.parsePlan(sql_table)
 val parsed_view = parser.parsePlan(sql_view)
+val parsed_table2 = parser.parsePlan(sql_table2)
+val parsed_view2 = parser.parsePlan(sql_view2)
 val expected_table = AlterTableRenameCommand(
   TableIdentifier("table_name"),
   TableIdentifier("new_table_name"),
@@ -646,15 +650,29 @@ class DDLParserSuite extends PlanTest with 
SharedSQLContext {
   TableIdentifier("table_name"),
   TableIdentifier("new_table_name"),
   isView = true)
+val expected_table2 = AlterTableRenameCommand(
+  TableIdentifier("table_name"),
+  TableIdentifier("new_table_name"),
+  isView = false)
+val expected_view2 = AlterTableRenameCommand(
+  TableIdentifier("table_name"),
+  TableIdentifier("new_table_name"),
+  isView = true)
 comparePlans(parsed_table, expected_table)
 comparePlans(parsed_view, expected_view)
+comparePlans(parsed_table2, expected_table2)
+comparePlans(parsed_view2, expected_view2)
   }
 
   test("alter table: rename table with database") {
 val query = "ALTER TABLE db1.tbl RENAME TO db1.tbl2"
+val query2 = "RENAME TABLE db1.tbl TO db1.tbl2"
 val plan = parseAs[AlterTableRenameCommand](query)
+val plan2 = parseAs[AlterTableRenameCommand](query2)
 assert(plan.oldName == TableIdentifier("tbl", Some("db1")))
 assert(plan.newName == TableIdentifier("tbl2", Some("db1")))
+assert(plan2.oldName == TableIdentifier("tbl", Some("db1")))
+assert(plan2.newName == TableIdentifier("tbl2", Some("db1")))
   }
 
   // ALTER TABLE table_name SET TBLPROPERTIES ('comment' = new_comment);
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala
index 052a5e757c445..d5012cdc6e3cc 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala
@@ -945,6 +945,37 @@ abstract class DDLSuite extends QueryTest with 
SQLTestUtils {
 }
   }
 
+  test("rename table to") {
+val catalog = spark.sessionState.catalog
+val tableIdent1 = TableIdentifier("tab1", Some("dbx"))
+val tableIdent2 = TableIdentifier("tab2", Some("dbx"))
+createDatabase(catalog, "dbx")
+createDatabase(catalog, "dby")
+createTable(catalog, tableIdent1)
+
+assert(catalog.listTables("dbx") == Seq(tableIdent1))
+sql("RENAME TABLE dbx.tab1 TO dbx.tab2")
+assert(catalog.listTables("dbx") == Seq(tableIdent2))
+
+// The database in destination table name can be omitted, and we will use 
the database of source
+// table for it.
+sql("RENAME TABLE dbx.tab2 TO tab1")
+assert(catalog.listTables("dbx") == Seq(tableIdent1))
+
+catalog.setCurrentDatabase("dbx")
+// rename without explicitly specifying database
+sql("RENAME TABLE tab1 TO tab2")
+assert(catalog.listTables("dbx") == Seq(tableIdent2))
+// table to rename does not exist
+intercept[AnalysisException] {
+  sql("RENAME TABLE dbx.does_not_exist TO dbx.tab2")
+}
+// destination database is different
+intercept[AnalysisException] {
+  sql("RENAME TABLE dbx.tab1 TO dby.tab2")
+}
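
If this grammar were adopted, the new statement would simply be an alternative
spelling of the existing ALTER TABLE ... RENAME TO form, as the DDLSuite cases
above exercise. A minimal sketch (table and database names are illustrative):

// Sketch only: the RENAME TABLE/VIEW form is the one proposed in this (closed) PR.
spark.sql("CREATE TABLE dbx.tab1 (id INT) USING parquet")
spark.sql("RENAME TABLE dbx.tab1 TO dbx.tab2")        // proposed shorthand
spark.sql("ALTER TABLE dbx.tab2 RENAME TO dbx.tab1")  // existing, equivalent statement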

svn commit: r31954 - in /dev/spark/3.0.0-SNAPSHOT-2019_01_14_06_38-115fecf-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2019-01-14 Thread pwendell
Author: pwendell
Date: Mon Jan 14 14:51:04 2019
New Revision: 31954

Log:
Apache Spark 3.0.0-SNAPSHOT-2019_01_14_06_38-115fecf docs


[This commit notification would consist of 1775 parts,
which exceeds the limit of 50, so it was shortened to this summary.]




[spark] Diff for: [GitHub] cloud-fan closed pull request #23391: [SPARK-26456][SQL] Cast date/timestamp to string by Date/TimestampFormatter

2019-01-14 Thread GitBox
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
index ee463bf5eb6ac..ff6a68b290206 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
@@ -230,12 +230,15 @@ case class Cast(child: Expression, dataType: DataType, 
timeZoneId: Option[String
   // [[func]] assumes the input is no longer null because eval already does 
the null check.
   @inline private[this] def buildCast[T](a: Any, func: T => Any): Any = 
func(a.asInstanceOf[T])
 
+  private lazy val dateFormatter = DateFormatter()
+  private lazy val timestampFormatter = TimestampFormatter(timeZone)
+
   // UDFToString
   private[this] def castToString(from: DataType): Any => Any = from match {
 case BinaryType => buildCast[Array[Byte]](_, UTF8String.fromBytes)
-case DateType => buildCast[Int](_, d => 
UTF8String.fromString(DateTimeUtils.dateToString(d)))
+case DateType => buildCast[Int](_, d => 
UTF8String.fromString(dateFormatter.format(d)))
 case TimestampType => buildCast[Long](_,
-  t => UTF8String.fromString(DateTimeUtils.timestampToString(t, timeZone)))
+  t => 
UTF8String.fromString(DateTimeUtils.timestampToString(timestampFormatter, t)))
 case ArrayType(et, _) =>
   buildCast[ArrayData](_, array => {
 val builder = new UTF8StringBuilder
@@ -843,12 +846,16 @@ case class Cast(child: Expression, dataType: DataType, 
timeZoneId: Option[String
   case BinaryType =>
 (c, evPrim, evNull) => code"$evPrim = UTF8String.fromBytes($c);"
   case DateType =>
-(c, evPrim, evNull) => code"""$evPrim = UTF8String.fromString(
-  
org.apache.spark.sql.catalyst.util.DateTimeUtils.dateToString($c));"""
+val df = JavaCode.global(
+  ctx.addReferenceObj("dateFormatter", dateFormatter),
+  dateFormatter.getClass)
+(c, evPrim, evNull) => code"""$evPrim = 
UTF8String.fromString(${df}.format($c));"""
   case TimestampType =>
-val tz = JavaCode.global(ctx.addReferenceObj("timeZone", timeZone), 
timeZone.getClass)
+val tf = JavaCode.global(
+  ctx.addReferenceObj("timestampFormatter", timestampFormatter),
+  timestampFormatter.getClass)
 (c, evPrim, evNull) => code"""$evPrim = UTF8String.fromString(
-  
org.apache.spark.sql.catalyst.util.DateTimeUtils.timestampToString($c, $tz));"""
+  
org.apache.spark.sql.catalyst.util.DateTimeUtils.timestampToString($tf, $c));"""
   case ArrayType(et, _) =>
 (c, evPrim, evNull) => {
   val buffer = ctx.freshVariable("buffer", classOf[UTF8StringBuilder])
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
index 8fc0112c02577..e173f8091f869 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
@@ -562,7 +562,7 @@ case class DateFormatClass(left: Expression, right: 
Expression, timeZoneId: Opti
 copy(timeZoneId = Option(timeZoneId))
 
   override protected def nullSafeEval(timestamp: Any, format: Any): Any = {
-val df = TimestampFormatter(format.toString, timeZone, Locale.US)
+val df = TimestampFormatter(format.toString, timeZone)
 UTF8String.fromString(df.format(timestamp.asInstanceOf[Long]))
   }
 
@@ -667,7 +667,7 @@ abstract class UnixTime
   private lazy val constFormat: UTF8String = 
right.eval().asInstanceOf[UTF8String]
   private lazy val formatter: TimestampFormatter =
 try {
-  TimestampFormatter(constFormat.toString, timeZone, Locale.US)
+  TimestampFormatter(constFormat.toString, timeZone)
 } catch {
   case NonFatal(_) => null
 }
@@ -700,7 +700,7 @@ abstract class UnixTime
   } else {
 val formatString = f.asInstanceOf[UTF8String].toString
 try {
-  TimestampFormatter(formatString, timeZone, Locale.US).parse(
+  TimestampFormatter(formatString, timeZone).parse(
 t.asInstanceOf[UTF8String].toString) / MICROS_PER_SECOND
 } catch {
   case NonFatal(_) => null
@@ -821,7 +821,7 @@ case class FromUnixTime(sec: Expression, format: 
Expression, timeZoneId: Option[
   private lazy val constFormat: UTF8String = 
right.eval().asInstanceOf[UTF8String]
   private lazy val formatter: TimestampFormatter =
 try {
-  TimestampFormatter(constFormat.toString, timeZone, Locale.US)
+  TimestampFormatter(constFormat.toString, timeZone)
 } catch {
   case NonFatal(_) => null
 }
@@ -847,7 +847,7 @@ case class FromUnixTime(sec: Expression, format: 
E

[spark] branch master updated: [SPARK-26456][SQL] Cast date/timestamp to string by Date/TimestampFormatter

2019-01-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 115fecf  [SPARK-26456][SQL] Cast date/timestamp to string by 
Date/TimestampFormatter
115fecf is described below

commit 115fecfd840d58ce3211bf1dd7b130cb862730a5
Author: Maxim Gekk 
AuthorDate: Mon Jan 14 21:59:25 2019 +0800

[SPARK-26456][SQL] Cast date/timestamp to string by Date/TimestampFormatter

## What changes were proposed in this pull request?

In this PR, I propose switching to `TimestampFormatter`/`DateFormatter` when
casting dates/timestamps to strings. The change makes date/timestamp casting
consistent with the JSON/CSV datasources and with time-related functions such
as `to_date` and `to_unix_timestamp`/`from_unixtime`.

Local formatters are moved out of `DateTimeUtils` to the places where they are
actually used, which avoids re-creating a formatter instance on every call.
Another reason is to give `PartitioningUtils` a separate parser, because the
default parsing pattern cannot be used there (it expects the optional section
`[.S]`).
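
A minimal sketch of the "create once, reuse per value" pattern described above,
written against plain java.time rather than Spark's internal
DateFormatter/TimestampFormatter (the class and method names below are
illustrative, not Spark internals):

// Sketch only: cache the formatter in a lazy val instead of rebuilding it per call.
class TimestampStringifier(zoneId: java.time.ZoneId) {
  private lazy val formatter =
    java.time.format.DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(zoneId)

  // micros: microseconds since the epoch, matching Spark's internal TimestampType encoding.
  def format(micros: Long): String = {
    val instant = java.time.Instant.EPOCH.plus(micros, java.time.temporal.ChronoUnit.MICROS)
    formatter.format(instant)
  }
}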

## How was this patch tested?

It was tested by `DateTimeUtilsSuite`, `CastSuite` and `JDBC*Suite`.

Closes #23391 from MaxGekk/thread-local-date-format.

Lead-authored-by: Maxim Gekk 
Co-authored-by: Maxim Gekk 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/expressions/Cast.scala  | 19 ++---
 .../catalyst/expressions/datetimeExpressions.scala | 10 ++---
 .../spark/sql/catalyst/util/DateFormatter.scala|  7 
 .../spark/sql/catalyst/util/DateTimeUtils.scala| 45 +-
 .../sql/catalyst/util/TimestampFormatter.scala | 13 ++-
 .../sql/catalyst/util/DateTimeUtilsSuite.scala |  3 +-
 .../apache/spark/sql/util/DateFormatterSuite.scala | 10 ++---
 .../spark/sql/util/TimestampFormatterSuite.scala   | 13 +++
 .../apache/spark/sql/execution/HiveResult.scala| 12 +++---
 .../execution/datasources/PartitioningUtils.scala  | 43 +++--
 .../execution/datasources/jdbc/JDBCRelation.scala  |  9 +++--
 .../spark/sql/execution/HiveResultSuite.scala  | 21 +-
 .../parquet/ParquetPartitionDiscoverySuite.scala   | 18 ++---
 13 files changed, 125 insertions(+), 98 deletions(-)
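
For the user-visible surface this touches, a quick illustrative check (the exact
output strings depend on the session time zone and the formatters' default
patterns, so this is a sketch rather than an expected result):

// Illustrative only; run in spark-shell on a build containing this commit.
spark.sql(
  "SELECT CAST(DATE '2019-01-14' AS STRING) AS d, " +
  "CAST(TIMESTAMP '2019-01-14 21:59:25' AS STRING) AS ts"
).show(truncate = false)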

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
index ee463bf..ff6a68b 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
@@ -230,12 +230,15 @@ case class Cast(child: Expression, dataType: DataType, 
timeZoneId: Option[String
   // [[func]] assumes the input is no longer null because eval already does 
the null check.
   @inline private[this] def buildCast[T](a: Any, func: T => Any): Any = 
func(a.asInstanceOf[T])
 
+  private lazy val dateFormatter = DateFormatter()
+  private lazy val timestampFormatter = TimestampFormatter(timeZone)
+
   // UDFToString
   private[this] def castToString(from: DataType): Any => Any = from match {
 case BinaryType => buildCast[Array[Byte]](_, UTF8String.fromBytes)
-case DateType => buildCast[Int](_, d => 
UTF8String.fromString(DateTimeUtils.dateToString(d)))
+case DateType => buildCast[Int](_, d => 
UTF8String.fromString(dateFormatter.format(d)))
 case TimestampType => buildCast[Long](_,
-  t => UTF8String.fromString(DateTimeUtils.timestampToString(t, timeZone)))
+  t => 
UTF8String.fromString(DateTimeUtils.timestampToString(timestampFormatter, t)))
 case ArrayType(et, _) =>
   buildCast[ArrayData](_, array => {
 val builder = new UTF8StringBuilder
@@ -843,12 +846,16 @@ case class Cast(child: Expression, dataType: DataType, 
timeZoneId: Option[String
   case BinaryType =>
 (c, evPrim, evNull) => code"$evPrim = UTF8String.fromBytes($c);"
   case DateType =>
-(c, evPrim, evNull) => code"""$evPrim = UTF8String.fromString(
-  
org.apache.spark.sql.catalyst.util.DateTimeUtils.dateToString($c));"""
+val df = JavaCode.global(
+  ctx.addReferenceObj("dateFormatter", dateFormatter),
+  dateFormatter.getClass)
+(c, evPrim, evNull) => code"""$evPrim = 
UTF8String.fromString(${df}.format($c));"""
   case TimestampType =>
-val tz = JavaCode.global(ctx.addReferenceObj("timeZone", timeZone), 
timeZone.getClass)
+val tf = JavaCode.global(
+  ctx.addReferenceObj("timestampFormatter", timestampFormatter),
+  timestampFormatter.getClass)
 (c, evPrim, evNull) => code"""$evPrim = UTF8String.fromString(
-  
org.apache.spark.sql.catalyst.util.DateTimeUtils.timestampToString($c, $tz));"""
+ 

[spark] Diff for: [GitHub] rberenguel closed pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes before 1900

2019-01-14 Thread GitBox
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index f0a9a0400e392..71559af624f6c 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -1568,6 +1568,18 @@ def test_datetime_at_epoch(self):
         self.assertEqual(first['date'], epoch)
         self.assertEqual(first['lit_date'], epoch)
 
+    # regression test for SPARK-20787
+    def test_datetype_accepts_calendar_dates(self):
+        df1 = self.spark.createDataFrame(self.sc.parallelize([[datetime.datetime(1899, 12, 31)]]))
+        df2 = self.spark.createDataFrame(self.sc.parallelize([[datetime.datetime(100, 1, 1)]]))
+        try:
+            counted1 = df1.count()
+            counted2 = df2.count()
+            self.assertEqual(counted1, 1)
+            self.assertEqual(counted2, 1)
+        except Exception:
+            self.fail("Internal conversion should handle years 100-1899")
+
     def test_decimal(self):
         from decimal import Decimal
         schema = StructType([StructField("decimal", DecimalType(10, 5))])
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 26b54a7fb3709..d8e10e4d11289 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -187,8 +187,12 @@ def needConversion(self):
 
     def toInternal(self, dt):
         if dt is not None:
-            seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
-                       else time.mktime(dt.timetuple()))
+            # Avoiding the invalid range of years (100-1899) for mktime in Python < 3
+            if dt.year > 1899 or dt.year < 100:
+                seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
+                           else time.mktime(dt.timetuple()))
+            else:
+                seconds = calendar.timegm(dt.utctimetuple())
             return int(seconds) * 1000000 + dt.microsecond
 
     def fromInternal(self, ts):

