[GitHub] spark issue #18025: [WIP][SparkR] Update doc and examples for sql functions
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/18025

@felixcheung @HyukjinKwon Per this [suggestion](https://github.com/apache/spark/pull/18003#discussion-diff-116853922L57), I'm creating more meaningful examples for the SQL functions. Since these functions can be grouped, we can create a single doc page for each group of functions and construct concrete, useful examples for each group. The benefits are obvious:
- Centralized documentation of related functions. This makes it easier for users to navigate. Right now there are TOO many items in the `see also` section.
- Examples can share the same data. This avoids creating a data frame for each function, which happens when they are documented separately.
- Cleaner structure and far fewer Rd files.

Indeed, this is part of what was discussed in #17161. I have explored this for a few functions to illustrate the idea. Since this is a big effort, I would like to get folks' opinions before extending it to all functions. In this commit, I created docs for some sample functions in three groups:
- 'column_datetime_functions' to document all datetime functions
- 'column_aggregate_functions' to document all aggregate functions
- 'column_math_functions' to document all math functions
- ...

Below is what 'column_datetime_functions.Rd' looks like:
![image](https://cloud.githubusercontent.com/assets/11082368/26189797/426029f0-3b5b-11e7-9175-c63b0e5c0014.png)
![image](https://cloud.githubusercontent.com/assets/11082368/26189810/56630954-3b5b-11e7-9d70-3e74b6d3b032.png)
[GitHub] spark issue #17997: [SPARK-20763][SQL] The function of `month` and `day` retu...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/17997 ok to test
[GitHub] spark pull request #18025: [WIP][SparkR] Update doc and examples for sql fun...
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/18025

[WIP][SparkR] Update doc and examples for sql functions

## What changes were proposed in this pull request?

Create better examples for SQL functions.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/actuaryzhang/spark sparkRDoc4

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18025.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18025

commit 5c8cd1e5da896d78ea3cb4fcf5e046d22090dc2a
Author: Wayne Zhang
Date: 2017-05-18T06:32:42Z

sql function examples prototype
[GitHub] spark pull request #18020: [SPARK-20700][SQL] InferFiltersFromConstraints st...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18020
[GitHub] spark issue #18020: [SPARK-20700][SQL] InferFiltersFromConstraints stackover...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18020 Thanks! Merging to master/2.2.
[GitHub] spark issue #16989: [SPARK-19659] Fetch big blocks to disk when shuffle-read...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16989 **[Test build #77039 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77039/testReport)** for PR 16989 at commit [`4ece142`](https://github.com/apache/spark/commit/4ece142d2a3c4b46a712539e3aa7f7ee0d4e6b5b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18011: [SPARK-19089][SQL] Add support for nested sequences
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18011 **[Test build #77040 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77040/testReport)** for PR 18011 at commit [`dd3bf01`](https://github.com/apache/spark/commit/dd3bf0113cbf66ebf784f68d7f602c39f4a46b8b).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #16989: [SPARK-19659] Fetch big blocks to disk when shuffle-read...
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/16989

I think that the current use of `MemoryMode.OFF_HEAP` allocation will cause problems in out-of-the-box deployments using the default configurations. In Spark's current memory manager implementation, the total amount of Spark-managed off-heap memory is controlled by `spark.memory.offHeap.size`, whose default value is 0. In this PR, the comment on `spark.reducer.maxReqSizeShuffleToMem` says that it should be smaller than `spark.memory.offHeap.size`, and yet its default is 200 megabytes, so the default configuration is invalid. Because `preferDirectBufs()` is `true` by default, it looks like the code here will always try to reserve memory using `MemoryMode.OFF_HEAP`, and these reservations will always fail in the default configuration because the off-heap size will be zero. So I think the net effect of this patch will be to always spill to disk.

One way to address this problem is to configure the default value of `spark.memory.offHeap.size` to match the JVM's internal limit on the amount of direct buffers that it can allocate, minus some percentage or fixed overhead. Basically, the problem is that Spark's off-heap memory manager was originally designed to manage only the off-heap memory explicitly allocated by Spark itself when creating its own buffers/pages or caching blocks, not to account for off-heap memory used by lower-level code or third-party libraries. I'll see if I can think of a clean way to fix this, which I think will need to be done before the defaults used here can work as intended.
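The mismatch being described can be stated in a few lines. This is an illustrative sketch using the config names and defaults quoted above, not code from the patch:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
// Defaults quoted in the discussion above:
val offHeapSize = conf.getSizeAsBytes("spark.memory.offHeap.size", "0")
val maxReqToMem = conf.getSizeAsBytes("spark.reducer.maxReqSizeShuffleToMem", "200m")

// The comment on spark.reducer.maxReqSizeShuffleToMem says it should stay below
// the Spark-managed off-heap size; with the defaults (200 MB vs 0) this fails,
// so every MemoryMode.OFF_HEAP reservation fails and large fetches always spill.
require(maxReqToMem < offHeapSize,
  s"maxReqSizeShuffleToMem ($maxReqToMem) must be smaller than offHeap.size ($offHeapSize)")
```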
[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17819 ping @MLnick Do you have more comments on this? Thanks.
[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18000 LGTM
[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18000#discussion_r117168737 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -538,6 +538,21 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex // scalastyle:on nonascii } } + + test("SPARK-20364: Disable Parquet predicate pushdown for fields having dots in the names") { --- End diff -- Looks much better now.
[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18000#discussion_r117168546 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -47,39 +49,47 @@ import org.apache.spark.util.{AccumulatorContext, AccumulatorV2} *data type is nullable. */ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext { --- End diff -- Sure, I just reverted it and made a simple test.
[GitHub] spark pull request #18002: [SPARK-20770][SQL] Improve ColumnStats
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/18002#discussion_r117168094 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala --- @@ -53,219 +53,299 @@ [flattened diff context: null bookkeeping moves from the base `gatherStats` into a shared `gatherNullStats()`; each `ColumnStats` subclass becomes `final` and implements `gatherStats` plus a typed `gatherValueStats`; `collectedStatistics` now returns `Array[Any]` instead of `GenericInternalRow`. The review comment itself is truncated in the archive.]
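For readability, here is a condensed Scala restatement of the refactoring pattern visible in the flattened diff above. It is a sketch of the shape of the change (one representative subclass, assuming it lives in the original `org.apache.spark.sql.execution.columnar` package so that `BOOLEAN` resolves), not the full patch:

```scala
import org.apache.spark.sql.catalyst.InternalRow

private[columnar] sealed trait ColumnStats extends Serializable {
  protected var count = 0
  protected var nullCount = 0
  protected var sizeInBytes = 0L

  // Now abstract: each subclass gathers its own stats instead of calling super.
  def gatherStats(row: InternalRow, ordinal: Int): Unit

  // Shared null bookkeeping, factored out of the old base implementation.
  def gatherNullStats(): Unit = {
    nullCount += 1
    sizeInBytes += 4 // 4 bytes for the null position
    count += 1
  }

  // Returned a GenericInternalRow before the patch.
  def collectedStatistics: Array[Any]
}

private[columnar] final class BooleanColumnStats extends ColumnStats {
  protected var upper = false
  protected var lower = true

  override def gatherStats(row: InternalRow, ordinal: Int): Unit = {
    if (!row.isNullAt(ordinal)) gatherValueStats(row.getBoolean(ordinal))
    else gatherNullStats()
  }

  def gatherValueStats(value: Boolean): Unit = {
    if (value > upper) upper = value
    if (value < lower) lower = value
    sizeInBytes += BOOLEAN.defaultSize
    count += 1
  }

  override def collectedStatistics: Array[Any] =
    Array[Any](lower, upper, nullCount, count, sizeInBytes)
}
```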
[GitHub] spark pull request #18002: [SPARK-20770][SQL] Improve ColumnStats
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/18002#discussion_r117168074 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala --- @@ -53,219 +53,299 @@ [same flattened `ColumnStats.scala` diff context as the entry above; the review comment itself is truncated in the archive.]
[GitHub] spark pull request #18002: [SPARK-20770][SQL] Improve ColumnStats
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18002#discussion_r117167259 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala --- @@ -53,219 +53,299 @@ [same flattened `ColumnStats.scala` diff context as the entries above; the review comment itself is truncated in the archive.]
[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...
Github user kevinyu98 commented on a diff in the pull request: https://github.com/apache/spark/pull/12646#discussion_r117167238 --- Diff: common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java --- @@ -730,4 +726,49 @@ public void testToLong() throws IOException { assertFalse(negativeInput, UTF8String.fromString(negativeInput).toLong(wrapper)); } } + @Test + public void trimsChar() { --- End diff -- sure
[GitHub] spark pull request #18002: [SPARK-20770][SQL] Improve ColumnStats
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/18002#discussion_r117167072 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala --- @@ -53,219 +53,299 @@ [same flattened `ColumnStats.scala` diff context as the entries above; the review comment itself is truncated in the archive.]
[GitHub] spark pull request #18015: [SPARK-20785][WEB-UI][SQL] Spark should provide ju...
Github user guoxiaolongzte commented on a diff in the pull request: https://github.com/apache/spark/pull/18015#discussion_r117166546 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala --- @@ -20,7 +20,7 @@ package org.apache.spark.sql.execution.ui import javax.servlet.http.HttpServletRequest import scala.collection.mutable -import scala.xml.Node +import scala.xml.{NodeSeq, Node} --- End diff -- please see ![scala](https://cloud.githubusercontent.com/assets/26266482/26188588/a9682798-3bd2-11e7-99b0-31587235f9a3.png)
[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...
Github user kevinyu98 commented on a diff in the pull request: https://github.com/apache/spark/pull/12646#discussion_r117166463 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java --- @@ -510,6 +510,67 @@ public UTF8String trim() { } } + /** + * Removes all specified trim character string either from the beginning or the ending of a string + * @param trimString the trim character string + */ + public UTF8String trim(UTF8String trimString) { +// this method do the trimLeft first, then trimRight +int s = 0; // the searching byte position of the input string +int i = 0; // the first beginning byte position of a non-matching character +int e = 0; // the last byte position +int numChars = 0; // number of characters from the input string +int[] stringCharLen = new int[numBytes]; // array of character length for the input string +int[] stringCharPos = new int[numBytes]; // array of the first byte position for each character in the input string +int searchCharBytes; + +while (s < this.numBytes) { + UTF8String searchChar = copyUTF8String(s, s + numBytesForFirstByte(this.getByte(s)) - 1); + searchCharBytes = searchChar.numBytes; + // try to find the matching for the searchChar in the trimString set + if (trimString.find(searchChar, 0) >= 0) { --- End diff -- I described the behavior in the comments. Thanks.
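As a behavioral reference, this is what the new overload computes on the example strings used elsewhere in this PR (a sketch based on the diff above, which runs the trim-left pass first and then the trim-right pass):

```scala
import org.apache.spark.unsafe.types.UTF8String

// trim(trimString) strips, from both ends, every character that occurs in trimString.
val trimmed = UTF8String.fromString("SSparkSQLS").trim(UTF8String.fromString("SL"))
assert(trimmed == UTF8String.fromString("parkSQ"))
```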
[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...
Github user kevinyu98 commented on a diff in the pull request: https://github.com/apache/spark/pull/12646#discussion_r117166353 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala --- @@ -1069,6 +1069,8 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging override def visitFunctionCall(ctx: FunctionCallContext): Expression = withOrigin(ctx) { // Create the function call. val name = ctx.qualifiedName.getText +val trimFuncName = Option(ctx.trimOperator).map { + o => visitTrimFuncName(ctx, o)} --- End diff -- changed
[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...
Github user kevinyu98 commented on a diff in the pull request: https://github.com/apache/spark/pull/12646#discussion_r117166374 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java --- @@ -510,6 +510,67 @@ public UTF8String trim() { } } + /** + * Removes all specified trim character string either from the beginning or the ending of a string --- End diff -- changed
[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...
Github user kevinyu98 commented on a diff in the pull request: https://github.com/apache/spark/pull/12646#discussion_r117166341 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala --- @@ -461,68 +462,246 @@ case class FindInSet(left: Expression, right: Expression) extends BinaryExpressi } /** - * A function that trim the spaces from both ends for the specified string. + * A function that trims leading or trailing characters (or both) from the specified string. --- End diff -- added
[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...
Github user kevinyu98 commented on a diff in the pull request: https://github.com/apache/spark/pull/12646#discussion_r117166332 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala --- @@ -461,68 +462,246 @@ case class FindInSet(left: Expression, right: Expression) extends BinaryExpressi } /** - * A function that trim the spaces from both ends for the specified string. + * A function that trims leading or trailing characters (or both) from the specified string. */ @ExpressionDescription( - usage = "_FUNC_(str) - Removes the leading and trailing space characters from `str`.", + usage = """ +_FUNC_(str) - Removes the leading and trailing space characters from `str`. +_FUNC_(BOTH trimString FROM str) - Remove the leading and trailing trimString from `str` +_FUNC_(LEADING trimChar FROM str) - Remove the leading trimString from `str` +_FUNC_(TRAILING trimChar FROM str) - Remove the trailing trimString from `str` + """, extended = """ +Arguments: + str - a string expression + trimString - the trim string + BOTH, FROM - these are keyword to specify for trim string from both ends of the string + LEADING, FROM - these are keyword to specify for trim string from left end of the string + TRAILING, FROM - these are keyword to specify for trim string from right end of the string Examples: > SELECT _FUNC_('SparkSQL '); SparkSQL + > SELECT _FUNC_(BOTH 'SL' FROM 'SSparkSQLS'); + parkSQ + > SELECT _FUNC_(LEADING 'paS' FROM 'SSparkSQLS'); + rkSQLS + > SELECT _FUNC_(TRAILING 'SLQ' FROM 'SSparkSQLS'); + SSparkS """) -case class StringTrim(child: Expression) - extends UnaryExpression with String2StringExpression { +case class StringTrim(children: Seq[Expression]) + extends Expression with ImplicitCastInputTypes { + + require(children.size <= 2 && children.nonEmpty, +s"$prettyName requires at least one argument and no more than two.") + + override def dataType: DataType = StringType + override def inputTypes: Seq[AbstractDataType] = Seq.fill(children.size)(StringType) - def convert(v: UTF8String): UTF8String = v.trim() + override def nullable: Boolean = children.exists(_.nullable) + override def foldable: Boolean = children.forall(_.foldable) override def prettyName: String = "trim" + override def eval(input: InternalRow): Any = { +val inputs = children.map(_.eval(input).asInstanceOf[UTF8String]) +if (inputs(0) != null) { --- End diff -- sure.
[GitHub] spark issue #17992: [SPARK-20759] SCALA_VERSION in _config.yml should be con...
Github user liu-zhaokun commented on the issue: https://github.com/apache/spark/pull/17992 @srowen The test hasn't finished. Do I need to do anything?
[GitHub] spark pull request #18015: [SPARK-20785][WEB-UI][SQL] Spark should provide ju...
Github user ajbozarth commented on a diff in the pull request: https://github.com/apache/spark/pull/18015#discussion_r117163859 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala --- @@ -20,7 +20,7 @@ package org.apache.spark.sql.execution.ui import javax.servlet.http.HttpServletRequest import scala.collection.mutable -import scala.xml.Node +import scala.xml.{NodeSeq, Node} --- End diff -- I can't remember what flags/options run the style check with mvn, but you can always run it directly with `dev/scalastyle`
[GitHub] spark pull request #18024: [SPARK-20792][SS] Support same timeout operations...
GitHub user tdas opened a pull request: https://github.com/apache/spark/pull/18024

[SPARK-20792][SS] Support same timeout operations in mapGroupsWithState function in batch queries as in streaming queries

## What changes were proposed in this pull request?

Currently, in batch queries, timeout is disabled (i.e. GroupStateTimeout.NoTimeout), which means any GroupState.setTimeout*** operation would throw UnsupportedOperationException. This makes it weird when converting a streaming query into a batch query by changing the input DF from streaming to a batch DF: if the timeout was enabled and used, the batch query will start throwing UnsupportedOperationException. This PR creates the dummy state in batch queries with the provided timeoutConf so that it behaves in the same way.

## How was this patch tested?

Additional tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tdas/spark SPARK-20792

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18024.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18024

commit eef789fe1fd04a98b4d82da6864ca4f4b23c2bfb
Author: Tathagata Das
Date: 2017-05-18T05:31:44Z

Fixed bug
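A minimal sketch of the scenario the description refers to: the same `mapGroupsWithState` call applied to a batch Dataset. The `spark` session and sample data are placeholders; before this patch, the `setTimeoutDuration` call below threw `UnsupportedOperationException` in batch mode:

```scala
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import spark.implicits._

val ds = Seq("a", "b", "a").toDS() // a batch Dataset, not a streaming one

val counts = ds
  .groupByKey(x => x)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout) {
    (key: String, values: Iterator[String], state: GroupState[Int]) =>
      state.setTimeoutDuration("10 seconds") // no longer throws in batch queries
      (key, values.size)
  }
```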
[GitHub] spark pull request #18015: [SPARK-20785][WEB-UI][SQL] Spark should provide ju...
Github user guoxiaolongzte commented on a diff in the pull request: https://github.com/apache/spark/pull/18015#discussion_r117163563 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala --- @@ -20,7 +20,7 @@ package org.apache.spark.sql.execution.ui import javax.servlet.http.HttpServletRequest import scala.collection.mutable -import scala.xml.Node +import scala.xml.{NodeSeq, Node} --- End diff -- How do I run the style checker? I can build the code successfully with Maven.
[GitHub] spark pull request #18015: [SPARK-20785][WEB-UI][SQL] Spark should provide ju...
Github user ajbozarth commented on a diff in the pull request: https://github.com/apache/spark/pull/18015#discussion_r117163321 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala --- @@ -20,7 +20,7 @@ package org.apache.spark.sql.execution.ui import javax.servlet.http.HttpServletRequest import scala.collection.mutable -import scala.xml.Node +import scala.xml.{NodeSeq, Node} --- End diff -- have you run the style checker? I think this may be in the wrong order
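For reference, Spark's Scala import style sorts the selectors inside braces alphabetically, so the fixed line would presumably be:

```scala
import scala.xml.{Node, NodeSeq}
```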
[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18000#discussion_r117162950 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -47,39 +49,47 @@ import org.apache.spark.util.{AccumulatorContext, AccumulatorV2} *data type is nullable. */ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext { --- End diff -- can we just have a simple end-to-end test? The fix is actually very simple and seems not worth such complex tests to verify it.
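One possible shape for such an end-to-end test, sketched against `ParquetFilterSuite`'s existing helpers (an illustration, not necessarily the test that was eventually committed):

```scala
test("SPARK-20364: filter pushdown on fields having dots in the names") {
  import testImplicits._
  withTempPath { path =>
    // The column name contains a dot; with pushdown enabled this used to drop rows.
    Seq(Some(1), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
    val readBack = spark.read.parquet(path.getAbsolutePath).where("`col.dots` IS NOT NULL")
    assert(readBack.count() == 1)
  }
}
```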
[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/18014 I thought that idea was for Apache Arrow. We could use the binary type for `UnsafeArrayData`, but it involves some complexity to use [`ColumnVector.Array`](https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java#L1015-L1017). Is it better to use the existing code?
[GitHub] spark issue #17995: [SPARK-20762][ML]Make String Params Case-Insensitive
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/17995 ping @yanboliang
[GitHub] spark issue #17999: [SPARK-20751][SQL] Add built-in SQL Function - COT
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17999 **[Test build #77041 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77041/testReport)** for PR 17999 at commit [`c80c184`](https://github.com/apache/spark/commit/c80c184d5a9f85e2bff740e8cf96bd9a97d0f8a7).
[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18000#discussion_r117162403 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -166,7 +166,14 @@ private[parquet] object ParquetFilters { * Converts data sources filters to Parquet filter predicates. */ def createFilter(schema: StructType, predicate: sources.Filter): Option[FilterPredicate] = { -val dataTypeOf = getFieldMap(schema) +val nameTypeMap = getFieldMap(schema) --- End diff -- nit: `nameToType`
[GitHub] spark issue #18011: [SPARK-19089][SQL] Add support for nested sequences
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18011 LGTM
[GitHub] spark pull request #18011: [SPARK-19089][SQL] Add support for nested sequenc...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18011#discussion_r117161759 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetPrimitiveSuite.scala --- @@ -258,6 +258,10 @@ class DatasetPrimitiveSuite extends QueryTest with SharedSQLContext { ListClass(List(1)) -> Queue("test" -> SeqClass(Seq(2 } + test("nested sequences") { +checkDataset(Seq(Seq(Seq(1))).toDS(), Seq(Seq(1))) --- End diff -- let's also add a test for specific collection types, e.g. `List(Queue(1))`
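The suggested addition might look like this in `DatasetPrimitiveSuite` (a sketch; the suite already uses `Queue` from `scala.collection.immutable`):

```scala
test("nested sequences with specific collection types") {
  checkDataset(Seq(List(Queue(1))).toDS(), List(Queue(1)))
  checkDataset(Seq(Queue(List(1))).toDS(), Queue(List(1)))
}
```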
[GitHub] spark issue #18011: [SPARK-19089][SQL] Add support for nested sequences
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18011 **[Test build #77040 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77040/testReport)** for PR 18011 at commit [`dd3bf01`](https://github.com/apache/spark/commit/dd3bf0113cbf66ebf784f68d7f602c39f4a46b8b).
[GitHub] spark issue #18011: [SPARK-19089][SQL] Add support for nested sequences
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18011 ok to test
[GitHub] spark pull request #16986: [SPARK-18891][SQL] Support for Map collection typ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16986#discussion_r117160501 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala --- @@ -329,35 +329,19 @@ object ScalaReflection extends ScalaReflection { } UnresolvedMapObjects(mapFunction, getPath, Some(cls)) - case t if t <:< localTypeOf[Map[_, _]] => + case t if t <:< localTypeOf[Map[_, _]] || t <:< localTypeOf[java.util.Map[_, _]] => --- End diff -- we should handle java map in `JavaTypeInference`, but I think it's better to do it in another PR and focus on scala map in this PR.
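At the API level, the Scala side of the change is about letting Datasets deserialize into `Map`-typed fields. A minimal sketch of what that enables, assuming a SparkSession with `import spark.implicits._` in scope:

```scala
case class MapHolder(m: Map[Int, String])

val ds = Seq(MapHolder(Map(1 -> "a", 2 -> "b"))).toDS()
// Before SPARK-18891, collecting back into a Map-typed field was unsupported.
assert(ds.collect().head.m == Map(1 -> "a", 2 -> "b"))
```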
[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18000 I would rather say it is a limitation of the Parquet API. It looks like there is no way to properly set column names that contain dots in Parquet filters. https://github.com/apache/spark/pull/17680 suggests a hacky workaround to set this.
[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18000 A high-level question: is this a Parquet bug, or is Spark not using the Parquet reader correctly?
[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18014 I may be missing something, but can we just treat the array type as binary and put it in `ColumnVector`?
[GitHub] spark pull request #17997: [SPARK-20763][SQL]The function of `month` and `da...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/17997#discussion_r117158817 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -603,7 +603,13 @@ object DateTimeUtils { */ private[this] def getYearAndDayInYear(daysSince1970: SQLDate): (Int, Int) = { // add the difference (in days) between 1.1.1970 and the artificial year 0 (-17999) -val daysNormalized = daysSince1970 + toYearZero +var daysSince1970Tmp = daysSince1970 +// In history,the period(5.10.1582 ~ 14.10.1582) is not exist --- End diff -- OK, I will do that, thanks @kiszk @cloud-fan
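The ten-day gap that the diff's comment refers to is the Julian-to-Gregorian calendar cutover, which `java.util.GregorianCalendar` models by default. A quick illustration:

```scala
import java.util.{Calendar, GregorianCalendar, TimeZone}

val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
cal.clear()
cal.set(1582, Calendar.OCTOBER, 4) // the last Julian day
cal.add(Calendar.DAY_OF_MONTH, 1)
// October 5-14, 1582 never existed: the next day is October 15.
assert(cal.get(Calendar.DAY_OF_MONTH) == 15)
```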
[GitHub] spark pull request #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated S...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14971#discussion_r117158766

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---

@@ -175,7 +178,7 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
     sql(s"INSERT INTO TABLE $textTable SELECT * FROM src")
     checkTableStats(
       textTable,
-      hasSizeInBytes = false,
+      hasSizeInBytes = true,

--- End diff --

Why is the behavior changed?
[GitHub] spark pull request #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated S...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14971#discussion_r117158738

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ShowCreateTableSuite.scala ---

@@ -325,26 +325,24 @@ class ShowCreateTableSuite extends QueryTest with SQLTestUtils with TestHiveSing
       "last_modified_by",
       "last_modified_time",
       "Owner:",
-      "COLUMN_STATS_ACCURATE",
       // The following are hive specific schema parameters which we do not need to match exactly.
-      "numFiles",
-      "numRows",
-      "rawDataSize",
-      "totalSize",
       "totalNumberFiles",
       "maxFileSize",
-      "minFileSize",
-      // EXTERNAL is not non-deterministic, but it is filtered out for external tables.
-      "EXTERNAL"
+      "minFileSize"
     )

     table.copy(
       createTime = 0L,
       lastAccessTime = 0L,
-      properties = table.properties.filterKeys(!nondeterministicProps.contains(_))
+      properties = table.properties.filterKeys(!nondeterministicProps.contains(_)),
+      stats = None,
+      ignoredProperties = Map.empty
     )
   }

+    val e = normalize(actual)
+    val m = normalize(expected)

--- End diff --

remove this?
[GitHub] spark pull request #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated S...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14971#discussion_r117158531

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---

@@ -414,6 +415,50 @@ private[hive] class HiveClientImpl(
       val properties = Option(h.getParameters).map(_.asScala.toMap).orNull

+      // Hive-generated Statistics are also recorded in ignoredProperties
+      val ignoredProperties = scala.collection.mutable.Map.empty[String, String]
+      for (key <- HiveStatisticsProperties; value <- properties.get(key)) {
+        ignoredProperties += key -> value
+      }
+
+      val excludedTableProperties = HiveStatisticsProperties ++ Set(
+        // The property value of "comment" is moved to the dedicated field "comment"
+        "comment",
+        // For EXTERNAL_TABLE, the table properties has a particular field "EXTERNAL". This is added
+        // in the function toHiveTable.
+        "EXTERNAL"
+      )
+
+      val filteredProperties = properties.filterNot {
+        case (key, _) => excludedTableProperties.contains(key)
+      }
+      val comment = properties.get("comment")
+
+      val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_))
+      val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_))
+      def rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)) match {
+        case Some(c) if c >= 0 => Some(c)
+        case _ => None
+      }
+      // TODO: check if this estimate is valid for tables after partition pruning.
+      // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be
+      // relatively cheap if parameters for the table are populated into the metastore.
+      // Currently, only totalSize, rawDataSize, and row_count are used to build the field `stats`
+      // TODO: stats should include all the other two fields (`numFiles` and `numPartitions`).
+      // (see StatsSetupConst in Hive)
+      val stats =
+        // When table is external, `totalSize` is always zero, which will influence join strategy
+        // so when `totalSize` is zero, use `rawDataSize` instead. When `rawDataSize` is also zero,
+        // return None. Later, we will use the other ways to estimate the statistics.
+        if (totalSize.isDefined && totalSize.get > 0L) {

--- End diff --

the indentation is wrong
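For reference, the fallback logic in the quoted block boils down to something like the following condensed sketch (illustrative only, with sizes as `BigInt` as in the quote; `chooseSizeInBytes` is not the actual method name):

```scala
// Prefer totalSize; if it is absent or zero (typical for external tables),
// fall back to rawDataSize; if that is also unusable, report no stats.
def chooseSizeInBytes(
    totalSize: Option[BigInt],
    rawDataSize: Option[BigInt]): Option[BigInt] =
  totalSize.filter(_ > 0).orElse(rawDataSize.filter(_ > 0))

assert(chooseSizeInBytes(Some(BigInt(0)), Some(BigInt(42))).contains(BigInt(42)))
assert(chooseSizeInBytes(Some(BigInt(0)), Some(BigInt(0))).isEmpty)
```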
[GitHub] spark pull request #17997: [SPARK-20763][SQL]The function of `month` and `da...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/17997#discussion_r117158477

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala ---

@@ -76,6 +76,9 @@ class DateExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
       }
     }
     checkEvaluation(DayOfYear(Literal.create(null, DateType)), null)
+
+    checkEvaluation(DayOfYear(Literal(new Date(sdf.parse("1582-10-15 13:10:15").getTime))), 288)

--- End diff --

OK, thanks @cloud-fan
[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18000#discussion_r117158402

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---

@@ -490,6 +516,42 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     }
   }

+  test("SPARK-20364 Do not push down filters when column names have dots") {
+    implicit class StringToAttribute(str: String) {
+      // Implicits for attr, $ and symbol do not handle backticks.
+      def attribute: Attribute = UnresolvedAttribute.quotedString(str)

--- End diff --

Yea, actually my initial local version included the change for `symbol` and `$` to match them to `Column`. It also looks sensible per https://github.com/apache/spark/pull/7969. I believe this is an internal API - https://github.com/apache/spark/blob/e9c91badce64731ffd3e53cbcd9f044a7593e6b8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala#L24 - so I guess it would be fine even if it introduces a behaviour change. Nevertheless, I believe some folks don't like this change much, and I wanted to avoid such changes here for now (it is the single place that needs it for now ... ).
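For context, a quick sketch of the difference between the plain and quoted constructors being discussed (assuming Spark's catalyst module on the classpath; the behavior shown is as commonly understood and worth double-checking):

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

// The default factory splits on dots, reading "a.b" as field b of a:
println(UnresolvedAttribute("a.b").nameParts)                // Seq(a, b)
// quotedString honors backticks, keeping "a.b" as one flat column name:
println(UnresolvedAttribute.quotedString("`a.b`").nameParts) // Seq(a.b)
```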
[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18000#discussion_r117157965

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---

@@ -490,6 +516,42 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     }
   }

+  test("SPARK-20364 Do not push down filters when column names have dots") {
+    implicit class StringToAttribute(str: String) {
+      // Implicits for attr, $ and symbol do not handle backticks.
+      def attribute: Attribute = UnresolvedAttribute.quotedString(str)

--- End diff --

Shall we make `$` use `UnresolvedAttribute.quotedString`?
[GitHub] spark issue #17995: [SPARK-20762][ML]Make String Params Case-Insensitive
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17995

Merged build finished. Test PASSed.
[GitHub] spark issue #17995: [SPARK-20762][ML]Make String Params Case-Insensitive
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17995

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77038/
Test PASSed.
[GitHub] spark pull request #17997: [SPARK-20763][SQL]The function of `month` and `da...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17997#discussion_r117157765

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala ---

@@ -76,6 +76,9 @@ class DateExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
       }
     }
     checkEvaluation(DayOfYear(Literal.create(null, DateType)), null)
+
+    checkEvaluation(DayOfYear(Literal(new Date(sdf.parse("1582-10-15 13:10:15").getTime))), 288)

--- End diff --

Let's follow MySQL.
[GitHub] spark issue #17995: [SPARK-20762][ML]Make String Params Case-Insensitive
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17995

**[Test build #77038 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77038/testReport)** for PR 17995 at commit [`bed4c41`](https://github.com/apache/spark/commit/bed4c4183fa94b20d978ac9e61d225ea989c8a73).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #17994: [SPARK-20505][ML] Add docs and examples for ml.st...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17994
[GitHub] spark issue #17994: [SPARK-20505][ML] Add docs and examples for ml.stat.Corr...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/17994

Merged into master and branch-2.2. Thanks for reviewing.
[GitHub] spark issue #16989: [SPARK-19659] Fetch big blocks to disk when shuffle-read...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16989

**[Test build #77039 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77039/testReport)** for PR 16989 at commit [`4ece142`](https://github.com/apache/spark/commit/4ece142d2a3c4b46a712539e3aa7f7ee0d4e6b5b).
[GitHub] spark pull request #17996: [SPARK-20506][DOCS] 2.2 migration guide
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/17996#discussion_r117155950

--- Diff: docs/ml-guide.md ---

@@ -72,35 +72,26 @@ MLlib is under active development.
 The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
 and the migration guide below will explain all changes between releases.

-## From 2.0 to 2.1
+## From 2.1 to 2.2

 ### Breaking changes
-
-**Deprecated methods removed**
-* `setLabelCol` in `feature.ChiSqSelectorModel`
-* `numTrees` in `classification.RandomForestClassificationModel` (This now refers to the Param called `numTrees`)
-* `numTrees` in `regression.RandomForestRegressionModel` (This now refers to the Param called `numTrees`)
-* `model` in `regression.LinearRegressionSummary`
-* `validateParams` in `PipelineStage`
-* `validateParams` in `Evaluator`
+There are no breaking changes.

 ### Deprecations and changes of behavior

 **Deprecations**

-* [SPARK-18592](https://issues.apache.org/jira/browse/SPARK-18592):
-  Deprecate all Param setter methods except for input/output column Params for `DecisionTreeClassificationModel`, `GBTClassificationModel`, `RandomForestClassificationModel`, `DecisionTreeRegressionModel`, `GBTRegressionModel` and `RandomForestRegressionModel`
+There are no deprecations.

 **Changes of behavior**

--- End diff --

Should we include #17233 in this section?
[GitHub] spark pull request #17997: [SPARK-20763][SQL]The function of `month` and `da...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/17997#discussion_r117155497

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala ---

@@ -76,6 +76,9 @@ class DateExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
       }
     }
     checkEvaluation(DayOfYear(Literal.create(null, DateType)), null)
+
+    checkEvaluation(DayOfYear(Literal(new Date(sdf.parse("1582-10-15 13:10:15").getTime))), 288)

--- End diff --

@cloud-fan Because, historically, the period 5.10.1582 ~ 14.10.1582 does not exist (those ten days were skipped when the Gregorian calendar was introduced).
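A quick way to observe the skipped range (a sketch assuming `java.util`'s default Julian-to-Gregorian cutover, which `SimpleDateFormat` inherits; with lenient parsing the nonexistent dates should roll forward by ten days):

```scala
import java.text.SimpleDateFormat

val fmt = new SimpleDateFormat("yyyy-MM-dd")
// 1582-10-05 through 1582-10-14 do not exist under the default cutover,
// so a lenient parse should normalize 1582-10-05 to 1582-10-15.
println(fmt.format(fmt.parse("1582-10-05"))) // expected: 1582-10-15
println(fmt.format(fmt.parse("1582-10-15"))) // 1582-10-15
```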
[GitHub] spark issue #18017: [INFRA] Close stale PRs
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18017

(#16654 was taken out as it was closed.)
[GitHub] spark pull request #17997: [SPARK-20763][SQL]The function of `month` and `da...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17997#discussion_r117155315

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala ---

@@ -76,6 +76,9 @@ class DateExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
       }
     }
     checkEvaluation(DayOfYear(Literal.create(null, DateType)), null)
+
+    checkEvaluation(DayOfYear(Literal(new Date(sdf.parse("1582-10-15 13:10:15").getTime))), 288)

--- End diff --

Why is `278` better?
[GitHub] spark issue #16989: [SPARK-19659] Fetch big blocks to disk when shuffle-read...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/16989

Checking the code: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/ConfigProvider.scala#L59

`SparkConfigProvider` just checks whether the key is in the JMap and, if not, returns the default value. It doesn't check the alternatives. I think this is the reason `org.apache.spark.memory.TaskMemoryManagerSuite.offHeapConfigurationBackwardsCompatibility` fails.
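A self-contained sketch of the missing lookup being described; the names here are hypothetical, not the actual Spark internals. The point is just that a provider should consult deprecated alternative keys before falling back to the default:

```scala
// Try the primary key first, then each deprecated alternative key,
// and only then fall back to the default value.
def getWithAlternatives(
    conf: java.util.Map[String, String],
    key: String,
    alternatives: Seq[String],
    default: String): String =
  (key +: alternatives).iterator
    .map(k => conf.get(k))   // java.util.Map returns null for absent keys
    .find(_ != null)
    .getOrElse(default)

val m = new java.util.HashMap[String, String]()
m.put("spark.unsafe.offHeap", "true") // only the deprecated key is set
println(getWithAlternatives(m, "spark.memory.offHeap.enabled",
  Seq("spark.unsafe.offHeap"), "false")) // prints: true
```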
[GitHub] spark issue #17869: [SPARK-20609][CORE]Run the SortShuffleSuite unit tests h...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/17869

@srowen I have committed the modifications to the PR. Can you help me run the test build again? Thanks.
[GitHub] spark issue #16989: [SPARK-19659] Fetch big blocks to disk when shuffle-read...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16989

That seems impossible; can you give an example? BTW, if this blocks you, just revert the off-heap config changes.
[GitHub] spark issue #18016: [SPARK-20786][SQL]Improve ceil and floor handle the valu...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/18016

@hvanhovell @srowen I have modified it again, and `floor` has the same problem. Please review, thanks.
[GitHub] spark issue #17995: [SPARK-20762][ML]Make String Params Case-Insensitive
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17995

**[Test build #77038 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77038/testReport)** for PR 17995 at commit [`bed4c41`](https://github.com/apache/spark/commit/bed4c4183fa94b20d978ac9e61d225ea989c8a73).
[GitHub] spark issue #17995: [SPARK-20762][ML]Make String Params Case-Insensitive
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/17995

Jenkins, retest this please
[GitHub] spark pull request #17997: [SPARK-20763][SQL]The function of `month` and `da...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/17997#discussion_r117153595

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala ---

@@ -76,6 +76,9 @@ class DateExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
       }
     }
     checkEvaluation(DayOfYear(Literal.create(null, DateType)), null)
+
+    checkEvaluation(DayOfYear(Literal(new Date(sdf.parse("1582-10-15 13:10:15").getTime))), 288)

--- End diff --

In MySQL, the result is:

mysql> select dayofyear("1982-10-04");
+-------------------------+
| dayofyear("1982-10-04") |
+-------------------------+
|                     277 |
+-------------------------+
1 row in set (0.00 sec)

mysql> select dayofyear("1982-10-015");
+--------------------------+
| dayofyear("1982-10-015") |
+--------------------------+
|                      288 |
+--------------------------+
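These numbers are easy to cross-check with `java.time`, which uses the proleptic Gregorian calendar:

```scala
import java.time.LocalDate

println(LocalDate.of(1982, 10, 4).getDayOfYear)  // 277
println(LocalDate.of(1982, 10, 15).getDayOfYear) // 288
```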
[GitHub] spark pull request #18002: [SPARK-20770][SQL] Improve ColumnStats
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18002#discussion_r117153570

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala ---

@@ -53,219 +53,299 @@ private[columnar] sealed trait ColumnStats extends Serializable {
   /**
    * Gathers statistics information from `row(ordinal)`.
    */
-  def gatherStats(row: InternalRow, ordinal: Int): Unit = {
-    if (row.isNullAt(ordinal)) {
-      nullCount += 1
-      // 4 bytes for null position
-      sizeInBytes += 4
-    }
+  def gatherStats(row: InternalRow, ordinal: Int): Unit
+
+  /**
+   * Gathers statistics information on `null`.
+   */
+  def gatherNullStats(): Unit = {
+    nullCount += 1
+    // 4 bytes for null position
+    sizeInBytes += 4
     count += 1
   }

   /**
-   * Column statistics represented as a single row, currently including closed lower bound, closed
+   * Column statistics represented as an array, currently including closed lower bound, closed
    * upper bound and null count.
    */
-  def collectedStatistics: GenericInternalRow
+  def collectedStatistics: Array[Any]
 }

 /**
  * A no-op ColumnStats only used for testing purposes.
  */
-private[columnar] class NoopColumnStats extends ColumnStats {
-  override def gatherStats(row: InternalRow, ordinal: Int): Unit = super.gatherStats(row, ordinal)
+private[columnar] final class NoopColumnStats extends ColumnStats {
+  override def gatherStats(row: InternalRow, ordinal: Int): Unit = {
+    if (!row.isNullAt(ordinal)) {
+      count += 1
+    } else {
+      gatherNullStats
+    }
+  }

-  override def collectedStatistics: GenericInternalRow =
-    new GenericInternalRow(Array[Any](null, null, nullCount, count, 0L))
+  override def collectedStatistics: Array[Any] = Array[Any](null, null, nullCount, count, 0L)
 }

-private[columnar] class BooleanColumnStats extends ColumnStats {
+private[columnar] final class BooleanColumnStats extends ColumnStats {
   protected var upper = false
   protected var lower = true

   override def gatherStats(row: InternalRow, ordinal: Int): Unit = {
-    super.gatherStats(row, ordinal)
     if (!row.isNullAt(ordinal)) {
       val value = row.getBoolean(ordinal)
-      if (value > upper) upper = value
-      if (value < lower) lower = value
-      sizeInBytes += BOOLEAN.defaultSize
+      gatherValueStats(value)
+    } else {
+      gatherNullStats
     }
   }

-  override def collectedStatistics: GenericInternalRow =
-    new GenericInternalRow(Array[Any](lower, upper, nullCount, count, sizeInBytes))
+  def gatherValueStats(value: Boolean): Unit = {
+    if (value > upper) upper = value
+    if (value < lower) lower = value
+    sizeInBytes += BOOLEAN.defaultSize
+    count += 1
+  }
+
+  override def collectedStatistics: Array[Any] =
+    Array[Any](lower, upper, nullCount, count, sizeInBytes)
 }

-private[columnar] class ByteColumnStats extends ColumnStats {
+private[columnar] final class ByteColumnStats extends ColumnStats {
   protected var upper = Byte.MinValue
   protected var lower = Byte.MaxValue

   override def gatherStats(row: InternalRow, ordinal: Int): Unit = {
-    super.gatherStats(row, ordinal)
     if (!row.isNullAt(ordinal)) {
       val value = row.getByte(ordinal)
-      if (value > upper) upper = value
-      if (value < lower) lower = value
-      sizeInBytes += BYTE.defaultSize
+      gatherValueStats(value)
+    } else {
+      gatherNullStats
    }
   }

-  override def collectedStatistics: GenericInternalRow =
-    new GenericInternalRow(Array[Any](lower, upper, nullCount, count, sizeInBytes))
+  def gatherValueStats(value: Byte): Unit = {
+    if (value > upper) upper = value
+    if (value < lower) lower = value
+    sizeInBytes += BYTE.defaultSize
+    count += 1
+  }
+
+  override def collectedStatistics: Array[Any] =
+    Array[Any](lower, upper, nullCount, count, sizeInBytes)
 }

-private[columnar] class ShortColumnStats extends ColumnStats {
+private[columnar] final class ShortColumnStats extends ColumnStats {
   protected var upper = Short.MinValue
   protected var lower = Short.MaxValue

   override def gatherStats(row: InternalRow, ordinal: Int): Unit = {
-    super.gatherStats(row, ordinal)
     if (!row.isNullAt(ordinal)) {
       val value = row.getShort(ordinal)
-      if (value > upper) upper = value
-      if (value < lower) lower = value
-      sizeInBytes += SHORT.defaultSize
+      gatherValueStat
[GitHub] spark pull request #18002: [SPARK-20770][SQL] Improve ColumnStats
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18002#discussion_r117153480

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala --- (quotes the same hunk as the comment above)
[GitHub] spark pull request #16654: [SPARK-19303][ML][WIP] Add evaluate method in clu...
Github user zhengruifeng closed the pull request at: https://github.com/apache/spark/pull/16654
[GitHub] spark pull request #18002: [SPARK-20770][SQL] Improve ColumnStats
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18002#discussion_r117153431

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala --- (quotes the same hunk as the comment above)
[GitHub] spark pull request #17997: [SPARK-20763][SQL]The function of `month` and `da...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17997#discussion_r117153106

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala ---

@@ -76,6 +76,9 @@ class DateExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
       }
     }
     checkEvaluation(DayOfYear(Literal.create(null, DateType)), null)
+
+    checkEvaluation(DayOfYear(Literal(new Date(sdf.parse("1582-10-15 13:10:15").getTime))), 288)

--- End diff --

Can we check with other databases?
[GitHub] spark pull request #17997: [SPARK-20763][SQL]The function of `month` and `da...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17997#discussion_r117153080

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---

@@ -603,7 +603,13 @@ object DateTimeUtils {
    */
   private[this] def getYearAndDayInYear(daysSince1970: SQLDate): (Int, Int) = {
     // add the difference (in days) between 1.1.1970 and the artificial year 0 (-17999)
-    val daysNormalized = daysSince1970 + toYearZero
+    var daysSince1970Tmp = daysSince1970
+    // In history,the period(5.10.1582 ~ 14.10.1582) is not exist

--- End diff --

It's only about the comment, and I think 1582-10-5 or Oct. 5, 1582 is more human-readable.
[GitHub] spark issue #16989: [SPARK-19659] Fetch big blocks to disk when shuffle-read...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/16989

It seems like `SparkConfigProvider` is not checking the alternatives in `SparkConf`. That's why `spark.memory.offHeap.enabled` is not set (it still has the default value), even though we've already set `spark.unsafe.offHeap`.
[GitHub] spark pull request #16989: [SPARK-19659] Fetch big blocks to disk when shuff...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/16989#discussion_r117152091

--- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---

@@ -278,4 +278,39 @@ package object config {
       "spark.io.compression.codec.")
     .booleanConf
     .createWithDefault(false)
+
+  private[spark] val SHUFFLE_ACCURATE_BLOCK_THRESHOLD =
+    ConfigBuilder("spark.shuffle.accurateBlkThreshold")
+      .doc("When we compress the size of shuffle blocks in HighlyCompressedMapStatus, we will " +
+        "record the size accurately if it's above the threshold specified by this config. This " +
+        "helps to prevent OOM by avoiding underestimating shuffle block size when fetch shuffle " +
+        "blocks.")
+      .longConf
+      .createWithDefault(100 * 1024 * 1024)
+
+  private[spark] val MEMORY_OFF_HEAP_ENABLED =
+    ConfigBuilder("spark.memory.offHeap.enabled")
+      .doc("If true, Spark will attempt to use off-heap memory for certain operations(e.g. sort, " +
+        "aggregate, etc. However, the buffer used for fetching shuffle blocks is always " +
+        "off-heap). If off-heap memory use is enabled, then spark.memory.offHeap.size must be " +
+        "positive.")
+      .booleanConf
+      .createWithDefault(false)
+
+  private[spark] val MEMORY_OFF_HEAP_SIZE =
+    ConfigBuilder("spark.memory.offHeap.size")
+      .doc("The absolute amount of memory in bytes which can be used for off-heap allocation." +
+        " This setting has no impact on heap memory usage, so if your executors' total memory" +
+        " consumption must fit within some hard limit then be sure to shrink your JVM heap size" +
+        " accordingly. This must be set to a positive value when " +
+        "spark.memory.offHeap.enabled=true.")
+      .longConf

--- End diff --

Yes, I should refine it.
[GitHub] spark pull request #16989: [SPARK-19659] Fetch big blocks to disk when shuff...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16989#discussion_r117151567

--- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala --- (quotes the same hunk as the comment above, ending at the `.longConf` line)

--- End diff --

we should use `.bytesConf(ByteUnit.BYTE)`, see `SQLConf.SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE` as an example
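If it helps, the suggested declaration would look roughly like this (a sketch mirroring the pattern the reviewer points to, not the final code; it assumes `ByteUnit` from `org.apache.spark.network.util` is imported in the `config` package object):

```scala
// Declaring the size as a bytes config lets users write values like "1g" or
// "512m", while the entry still resolves to a Long number of bytes.
private[spark] val MEMORY_OFF_HEAP_SIZE =
  ConfigBuilder("spark.memory.offHeap.size")
    .doc("The absolute amount of memory in bytes which can be used for off-heap " +
      "allocation. This must be set to a positive value when " +
      "spark.memory.offHeap.enabled=true.")
    .bytesConf(ByteUnit.BYTE)
    .createWithDefault(0)
```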
[GitHub] spark issue #18015: [SAPRK-20785][WEB-UI][SQL]Spark should provide jump link...
Github user guoxiaolongzte commented on the issue: https://github.com/apache/spark/pull/18015

@ajbozarth Thank you very much for the suggestions; I have made the modifications.
[GitHub] spark issue #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated Stats In...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14971

Merged build finished. Test PASSed.
[GitHub] spark issue #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated Stats In...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14971

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77037/
Test PASSed.
[GitHub] spark issue #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated Stats In...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14971

**[Test build #77037 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77037/testReport)** for PR 14971 at commit [`cce31db`](https://github.com/apache/spark/commit/cce31db80cdc66516e3e537f33a3611b07186b6b).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated Stats In...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14971

Merged build finished. Test PASSed.
[GitHub] spark issue #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated Stats In...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14971

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77036/
Test PASSed.
[GitHub] spark issue #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated Stats In...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14971

**[Test build #77036 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77036/testReport)** for PR 14971 at commit [`22a2c00`](https://github.com/apache/spark/commit/22a2c00333ffc39458f45d629c1b3199f73f1f3e).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17435

I think we need a test and @holdenk's review.
[GitHub] spark issue #18017: [INFRA] Close stale PRs
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18017

(Actually, let me take out #17435. It was recently updated and I believe it has a point there.)
[GitHub] spark pull request #18015: [SAPRK-20785][WEB-UI][SQL]Spark should provide ju...
Github user ajbozarth commented on a diff in the pull request: https://github.com/apache/spark/pull/18015#discussion_r117148652

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala ---

@@ -33,24 +33,24 @@ private[ui] class AllExecutionsPage(parent: SQLTab) extends WebUIPage("") with L
   override def render(request: HttpServletRequest): Seq[Node] = {
     val currentTime = System.currentTimeMillis()

-    val content = listener.synchronized {
+    var content : NodeSeq = listener.synchronized {

--- End diff --

I'd rather not switch to a `var` (it's very un-Scala); see below for an alternative suggestion.
[GitHub] spark pull request #18015: [SAPRK-20785][WEB-UI][SQL]Spark should provide ju...
Github user ajbozarth commented on a diff in the pull request: https://github.com/apache/spark/pull/18015#discussion_r117148750

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala ---

@@ -61,6 +61,36 @@ private[ui] class AllExecutionsPage(parent: SQLTab) extends WebUIPage("") with L
           details.parentNode.querySelector('.stage-details').classList.toggle('collapsed')
         }}

+    content =
+      <div>
+        <ul>
+          {
+            if (listener.getRunningExecutions.nonEmpty) {
+              <li>
+                <strong>Running Queries:</strong>
+                {listener.getRunningExecutions.size}
+              </li>
+            }
+          }
+          {
+            if (listener.getCompletedExecutions.nonEmpty) {
+              <li>
+                <strong>Completed Queries:</strong>
+                {listener.getCompletedExecutions.size}
+              </li>
+            }
+          }
+          {
+            if (listener.getFailedExecutions.nonEmpty) {
+              <li>
+                <strong>Failed Queries:</strong>
+                {listener.getFailedExecutions.size}
+              </li>
+            }
+          }
+        </ul>
+      </div> ++ content
+
     UIUtils.headerSparkPage("SQL", content, parent, Some(5000))

--- End diff --

then you could replace `content` here with `summary ++ content`
[GitHub] spark pull request #18015: [SAPRK-20785][WEB-UI][SQL]Spark should provide ju...
Github user ajbozarth commented on a diff in the pull request: https://github.com/apache/spark/pull/18015#discussion_r117148693

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala ---

@@ -61,6 +61,36 @@ private[ui] class AllExecutionsPage(parent: SQLTab) extends WebUIPage("") with L
           details.parentNode.querySelector('.stage-details').classList.toggle('collapsed')
         }}

+    content =

--- End diff --

perhaps leave this as `summary`, but without the `++ content` at the end
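Putting the three suggestions together, the resulting shape would be roughly the following (a self-contained sketch using scala-xml; the names mirror the quoted diff, and the data is made up):

```scala
import scala.xml.{Node, NodeSeq}

val running = Seq("query-1", "query-2") // stand-in for listener.getRunningExecutions
val content: NodeSeq = <div>existing execution tables</div> // stays a val

// Build the counts block as its own `summary` value instead of mutating a var.
val summary: NodeSeq =
  <div>
    <ul class="unstyled">
      {if (running.nonEmpty) <li><strong>Running Queries:</strong> {running.size}</li>
       else NodeSeq.Empty}
    </ul>
  </div>

// ...then concatenate at the call site, e.g. headerSparkPage("SQL", summary ++ content, ...).
val page: Seq[Node] = summary ++ content
```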
[GitHub] spark issue #18020: [SPARK-20700][SQL] InferFiltersFromConstraints stackover...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18020

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77035/
[GitHub] spark issue #18020: [SPARK-20700][SQL] InferFiltersFromConstraints stackover...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18020

Merged build finished. Test PASSed.
[GitHub] spark issue #18020: [SPARK-20700][SQL] InferFiltersFromConstraints stackover...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18020

**[Test build #77035 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77035/testReport)** for PR 18020 at commit [`aa16ab3`](https://github.com/apache/spark/commit/aa16ab38fc0e0c80b179a5860f477c3650f64609).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/12646#discussion_r117148664

--- Diff: common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java ---
@@ -730,4 +726,49 @@ public void testToLong() throws IOException {
       assertFalse(negativeInput, UTF8String.fromString(negativeInput).toLong(wrapper));
     }
   }
+
+  @Test
+  public void trimsChar() {
--- End diff --

Could you split this test case into three test cases, for trim, trimLeft, and trimRight?
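A hypothetical sketch of that split, written as a ScalaTest suite for consistency with the other snippets in this thread (the actual suite is Java/JUnit, and the trim/trimLeft/trimRight overloads taking a trim-character string are assumed from the PR under review):

```scala
import org.apache.spark.unsafe.types.UTF8String
import org.scalatest.funsuite.AnyFunSuite

// One focused test per operation instead of a single combined trimsChar case.
// Expected values assume trim characters are stripped from the respective end(s).
class UTF8StringTrimCharSketch extends AnyFunSuite {
  private def utf8(s: String): UTF8String = UTF8String.fromString(s)

  test("trim with trim characters") {
    assert(utf8("xxhelloxx").trim(utf8("x")) === utf8("hello"))
  }

  test("trimLeft with trim characters") {
    assert(utf8("xxhelloxx").trimLeft(utf8("x")) === utf8("helloxx"))
  }

  test("trimRight with trim characters") {
    assert(utf8("xxhelloxx").trimRight(utf8("x")) === utf8("xxhello"))
  }
}
```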
[GitHub] spark issue #18015: [SPARK-20785][WEB-UI][SQL] Spark should provide jump link...
Github user guoxiaolongzte commented on the issue: https://github.com/apache/spark/pull/18015

@ajbozarth Rebuilt and renamed the variable as suggested; I added two screenshots. Thanks.
[GitHub] spark pull request #18015: [SPARK-20785][WEB-UI][SQL] Spark should provide ju...
Github user guoxiaolongzte commented on a diff in the pull request: https://github.com/apache/spark/pull/18015#discussion_r117148012

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala ---
@@ -61,7 +61,37 @@ private[ui] class AllExecutionsPage(parent: SQLTab) extends WebUIPage("") with L
         details.parentNode.querySelector('.stage-details').classList.toggle('collapsed')
       }}

-    UIUtils.headerSparkPage("SQL", content, parent, Some(5000))
+
+    val summary: NodeSeq =
--- End diff --

Rebuilt and renamed the variable as suggested; I added two screenshots.
[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18000

Thank you @viirya.
[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18000#discussion_r117145159

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
@@ -166,7 +166,14 @@ private[parquet] object ParquetFilters {
    * Converts data sources filters to Parquet filter predicates.
    */
  def createFilter(schema: StructType, predicate: sources.Filter): Option[FilterPredicate] = {
-    val dataTypeOf = getFieldMap(schema)
+    val nameTypeMap = getFieldMap(schema)
+
+    // Parquet does not allow dots in the column name because dots are used as a column path
--- End diff --

Not just for speed, but also for the amount of code that needs to change. Still, this is fine with me.
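A minimal standalone sketch of the check being discussed (the helper name `canMakeFilterOn` and the stand-in `nameTypeMap` literal are illustrative; in the real code the map comes from `getFieldMap(schema)`):

```scala
// Stand-in for getFieldMap(schema): column name -> data type.
val nameTypeMap: Map[String, String] = Map("id" -> "int", "a.b" -> "int")

// Only consider a source filter for pushdown when the referenced column exists
// in the schema and its name contains no dots, since Parquet treats a dot as a
// column-path delimiter and would wrongly resolve "a.b" as a nested field.
def canMakeFilterOn(name: String): Boolean =
  nameTypeMap.contains(name) && !name.contains(".")

assert(canMakeFilterOn("id"))
assert(!canMakeFilterOn("a.b"))
```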
[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18000

Sounds ok to me.
[GitHub] spark issue #15821: [SPARK-13534][PySpark] Using Apache Arrow to increase pe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77032/
[GitHub] spark issue #15821: [SPARK-13534][PySpark] Using Apache Arrow to increase pe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15821

**[Test build #77032 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77032/testReport)** for PR 15821 at commit [`b4eebc2`](https://github.com/apache/spark/commit/b4eebc27e261eddb4d8b0b829245fa3c187dade1).

* This patch **fails PySpark pip packaging tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15821: [SPARK-13534][PySpark] Using Apache Arrow to increase pe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15821

Merged build finished. Test FAILed.
[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18000

Just to make sure: I don't feel strongly about either comment, @viirya. I am willing to fix them if you do. Please let me know.