GitHub user yang0228 opened a pull request:
https://github.com/apache/spark/pull/12408
[SPARK-14622] Retain lost executors status
## What changes were proposed in this pull request?
Retain history info for lost executors in the "Executors" dashboard of the
Spark Web UI.
## How was this patch tested?
Unit tests, plus manual inspection of the Web UI (screenshot attached).
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12408.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12408
----
commit ac1b8b302a92678bbeece6e9c7879f1cb8fdad12
Author: Nishkam Ravi <[email protected]>
Date: 2016-03-31T19:03:05Z
[SPARK-13796] Redirect error message to logWarning
## What changes were proposed in this pull request?
Redirect error message to logWarning
## How was this patch tested?
Unit tests, manual tests
JoshRosen
Author: Nishkam Ravi <[email protected]>
Closes #12052 from nishkamravi2/master_warning.
commit 446c45bd87035e20653394fcaf9dc8caa4299038
Author: gatorsmile <[email protected]>
Date: 2016-03-31T19:03:55Z
[SPARK-14182][SQL] Parse DDL Command: Alter View
This PR is to provide native parsing support for the DDL command `Alter
View`. Since its AST trees are highly similar to `Alter Table`, both
implementations are integrated into the same one.
Based on the Hive DDL document:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL and
https://cwiki.apache.org/confluence/display/Hive/PartitionedViews
**Syntax:**
```SQL
ALTER VIEW view_name RENAME TO new_view_name
```
- to change the name of a view to a different name
**Syntax:**
```SQL
ALTER VIEW view_name SET TBLPROPERTIES ('comment' = new_comment);
```
- to add metadata to a view
**Syntax:**
```SQL
ALTER VIEW view_name UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key')
```
- to remove metadata from a view
**Syntax:**
```SQL
ALTER VIEW view_name ADD [IF NOT EXISTS] PARTITION spec1[, PARTITION spec2, ...]
```
- to add the partitioning metadata for a view.
- the syntax of partition spec in `ALTER VIEW` is identical to `ALTER
TABLE`, **EXCEPT** that it is **ILLEGAL** to specify a `LOCATION` clause.
**Syntax:**
```SQL
ALTER VIEW view_name DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...]
```
- to drop the related partition metadata for a view.
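As a rough illustration, these statements could be issued from Scala once the
parser lands (view and property names below are made up):
```scala
// Hypothetical view and property names; assumes a SQLContext with the new
// ALTER VIEW parsing support is in scope as `sqlContext`.
sqlContext.sql("ALTER VIEW sales_view RENAME TO sales_view_v2")
sqlContext.sql("ALTER VIEW sales_view_v2 SET TBLPROPERTIES ('comment' = 'weekly sales')")
sqlContext.sql("ALTER VIEW sales_view_v2 UNSET TBLPROPERTIES IF EXISTS ('comment')")
```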
Added the related test cases to `DDLCommandSuite`
Author: gatorsmile <[email protected]>
Author: xiaoli <[email protected]>
Author: Xiao Li <[email protected]>
Closes #11987 from gatorsmile/parseAlterView.
commit 8a333d2da859fd593bda183413630bc3757529c9
Author: jeanlyn <[email protected]>
Date: 2016-03-31T19:04:42Z
[SPARK-14243][CORE] update task metrics when removing blocks
## What changes were proposed in this pull request?
This PR uses `incUpdatedBlockStatuses` to update `updatedBlockStatuses` when
removing blocks, making sure `BlockManager` correctly updates
`updatedBlockStatuses`.
## How was this patch tested?
test("updated block statuses") in BlockManagerSuite.scala
Author: jeanlyn <[email protected]>
Closes #12091 from jeanlyn/updateBlock.
commit 4d93b653f7294698526674950d3dc303691260f8
Author: Michael Gummelt <[email protected]>
Date: 2016-03-31T19:06:16Z
[Docs] Update monitoring.md to accurately describe the history server
It looks like the docs were recently updated to reflect the History
Server's support for incomplete applications, but they still had wording that
suggested only completed applications were viewable. This fixes that.
My editor also introduced several whitespace-removal changes, which I hope
are OK, as text files shouldn't have trailing whitespace. To verify they're
purely whitespace changes, add `&w=1` to the diff URL in your browser. If this
isn't acceptable, let me know and I'll update the PR.
I also didn't think this required a JIRA. Let me know if I should create
one.
Not tested
Author: Michael Gummelt <[email protected]>
Closes #12045 from mgummelt/update-history-docs.
commit 0abee534f0ad9bbe84d8d3d3478ecaa594f1e0f4
Author: Wenchen Fan <[email protected]>
Date: 2016-03-31T19:07:19Z
[SPARK-14069][SQL] Improve SparkStatusTracker to also track executor
information
## What changes were proposed in this pull request?
Track executor information such as host and port, cache size, and running tasks.
TODO: tests
## How was this patch tested?
N/A
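For reference, a minimal sketch of reading the new executor information
(assuming the accessor names below match the final `SparkExecutorInfo` API):
```scala
// Sketch: print per-executor host/port, cache size and running task count.
// Assumes a live SparkContext `sc`; accessor names follow this PR's intent.
sc.statusTracker.getExecutorInfos.foreach { info =>
  println(s"${info.host()}:${info.port()} cacheSize=${info.cacheSize()} " +
    s"runningTasks=${info.numRunningTasks()}")
}
```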
Author: Wenchen Fan <[email protected]>
Closes #11888 from cloud-fan/status-tracker.
commit 10508f36adcb74a563010636dffcd1f68efd8468
Author: Jo Voordeckers <[email protected]>
Date: 2016-03-31T19:08:10Z
[SPARK-11327][MESOS] Dispatcher does not respect all args from the Submit
request
Supersedes https://github.com/apache/spark/pull/9752
Author: Jo Voordeckers <[email protected]>
Author: Iulian Dragos <[email protected]>
Closes #10370 from jayv/mesos_cluster_params.
commit 3cfbeb70b1feb1f3a8c4d0b2d2f3715a356c80f2
Author: Michel Lemay <[email protected]>
Date: 2016-03-31T19:15:32Z
[SPARK-13710][SHELL][WINDOWS] Fix jline dependency on Windows
## What changes were proposed in this pull request?
Exclude jline from curator-recipes since it conflicts with Scala 2.11 when
running spark-shell. This should not affect Scala 2.10, where jline is built in.
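For reference, the same kind of exclusion expressed in sbt syntax (coordinates
and version are illustrative; the actual change is in the Maven build):
```scala
// sbt sketch: exclude jline from curator-recipes so it does not clash with the
// jline version used by the Scala 2.11 REPL. Version number is illustrative.
libraryDependencies += "org.apache.curator" % "curator-recipes" % "2.4.0" exclude("jline", "jline")
```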
## How was this patch tested?
Ran spark-shell manually.
Author: Michel Lemay <[email protected]>
Closes #12043 from michellemay/spark-13710-fix-jline-on-windows.
commit e785402826dcd984d9312470464714ba6c908a49
Author: Shixiong Zhu <[email protected]>
Date: 2016-03-31T19:17:25Z
[SPARK-14304][SQL][TESTS] Fix tests that don't create temp files in the
`java.io.tmpdir` folder
## What changes were proposed in this pull request?
If I press `CTRL-C` when running these tests, the temp files will be left
in the `sql/core` folder and I need to delete them manually. It's annoying. This
PR just moves the temp files to the `java.io.tmpdir` folder and adds a name
prefix for them.
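The pattern the tests move toward looks roughly like this (a sketch, not the
exact helper used in the patch):
```scala
import java.nio.file.Files

// Sketch: create test output under java.io.tmpdir with a recognizable prefix,
// so files left behind by an interrupted run are easy to find and clean up.
val tempDir = Files.createTempDirectory("spark-test-").toFile
tempDir.deleteOnExit()
println(s"writing test data under ${tempDir.getAbsolutePath}")
```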
## How was this patch tested?
Existing Jenkins tests
Author: Shixiong Zhu <[email protected]>
Closes #12093 from zsxwing/temp-file.
commit b11887c086974dbab18b9f53e99a26bbe06e9c86
Author: sethah <[email protected]>
Date: 2016-03-31T20:00:10Z
[SPARK-14264][PYSPARK][ML] Add feature importance for GBTs in pyspark
## What changes were proposed in this pull request?
Feature importances are exposed in the python API for GBTs.
Other changes:
* Update the random forest feature importance documentation to not repeat the
decision tree docstring and instead place a reference to it.
## How was this patch tested?
Python doc tests were updated to validate GBT feature importance.
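The Python API mirrors what the Scala side already exposes; a rough Scala
equivalent (dataset and column names are illustrative):
```scala
import org.apache.spark.ml.classification.GBTClassifier

// Sketch: fit a small GBT model and inspect per-feature importances.
// Assumes a DataFrame `train` with "features" and "label" columns.
val model = new GBTClassifier().setMaxIter(10).fit(train)
println(model.featureImportances)  // a Vector of normalized importances
```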
Author: sethah <[email protected]>
Closes #12056 from sethah/Pyspark_GBT_feature_importance.
commit a7af6cd2eaf9f6ff491b9e1fabfc9c6f3d0f54bf
Author: Josh Rosen <[email protected]>
Date: 2016-03-31T20:52:59Z
[SPARK-14281][TESTS] Fix java8-tests and simplify their build
This patch fixes a compilation / build break in Spark's `java8-tests` and
refactors their POM to simplify the build. See individual commit messages for
more details.
Author: Josh Rosen <[email protected]>
Closes #12073 from JoshRosen/fix-java8-tests.
commit 8de201baedc8e839e06098c536ba31b3dafd54b5
Author: Sital Kedia <[email protected]>
Date: 2016-03-31T23:06:44Z
[SPARK-14277][CORE] Upgrade Snappy Java to 1.1.2.4
## What changes were proposed in this pull request?
Upgrade snappy to 1.1.2.4 to improve snappy read/write performance.
## How was this patch tested?
Tested by running a job on the cluster; saw 7.5% CPU savings after this
change.
Author: Sital Kedia <[email protected]>
Closes #12096 from sitalkedia/snappyRelease.
commit f0afafdc5dfee80d7e5cd2fc1fa8187def7f262d
Author: Davies Liu <[email protected]>
Date: 2016-03-31T23:40:20Z
[SPARK-14267] [SQL] [PYSPARK] execute multiple Python UDFs within single
batch
## What changes were proposed in this pull request?
This PR supports multiple Python UDFs within a single batch and also improves
performance.
```python
>>> from pyspark.sql.types import IntegerType
>>> sqlContext.registerFunction("double", lambda x: x * 2, IntegerType())
>>> sqlContext.registerFunction("add", lambda x, y: x + y, IntegerType())
>>> sqlContext.sql("SELECT double(add(1, 2)), add(double(2), 1)").explain(True)
== Parsed Logical Plan ==
'Project [unresolvedalias('double('add(1, 2)), None),unresolvedalias('add('double(2), 1), None)]
+- OneRowRelation$
== Analyzed Logical Plan ==
double(add(1, 2)): int, add(double(2), 1): int
Project [double(add(1, 2))#14,add(double(2), 1)#15]
+- Project [double(add(1, 2))#14,add(double(2), 1)#15]
   +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
      +- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
         +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
            +- OneRowRelation$
== Optimized Logical Plan ==
Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
+- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
   +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
      +- OneRowRelation$
== Physical Plan ==
WholeStageCodegen
:  +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
:     +- INPUT
+- !BatchPythonEvaluation [add(pythonUDF1#17, 1)], [pythonUDF0#16,pythonUDF1#17,pythonUDF0#18]
   +- !BatchPythonEvaluation [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
      +- Scan OneRowRelation[]
```
## How was this patch tested?
Added new tests.
The following script was used to benchmark 1, 2 and 3 UDFs:
```
from pyspark.sql import functions as F
from pyspark.sql.types import LongType

# assumes `sqlContext` from the PySpark shell
df = sqlContext.range(1, 1 << 23, 1, 4)
double = F.udf(lambda x: x * 2, LongType())
print df.select(double(df.id)).count()
print df.select(double(df.id), double(df.id + 1)).count()
print df.select(double(df.id), double(df.id + 1), double(df.id + 2)).count()
```
Here are the results:
N | Before | After | speed up
---- |------------ | -------------|------
1 | 22 s | 7 s | 3.1X
2 | 38 s | 13 s | 2.9X
3 | 58 s | 16 s | 3.6X
This benchmark ran locally with 4 CPUs. For 3 UDFs, it launched 12 Python
processes before this patch and 4 processes after this patch. After this patch,
it also uses less memory for multiple UDFs than before (less buffering).
Author: Davies Liu <[email protected]>
Closes #12057 from davies/multi_udfs.
commit 96941b12f8b465df21423275f3cd3ade579b4fa1
Author: Zhang, Liye <[email protected]>
Date: 2016-04-01T03:17:52Z
[SPARK-14242][CORE][NETWORK] avoid copy in compositeBuffer for frame decoder
## What changes were proposed in this pull request?
In this patch, we set the initial `maxNumComponents` to `Integer.MAX_VALUE`
instead of the default size (which is 16) when allocating the `compositeBuffer`
in `TransportFrameDecoder`, because the `compositeBuffer` introduces too many
underlying memory copies with the default `maxNumComponents` when the frame
size is large (which results in many transport messages). For details, please
refer to
[SPARK-14242](https://issues.apache.org/jira/browse/SPARK-14242).
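In Netty terms, the change amounts to something like the following sketch (the
real code lives in `TransportFrameDecoder`):
```scala
import io.netty.buffer.{ByteBufAllocator, CompositeByteBuf}

// Sketch: with the default maxNumComponents (16), appending many components
// forces Netty to consolidate (copy) them into fewer buffers; passing
// Integer.MAX_VALUE avoids those copies for large, many-message frames.
def newFrameBuffer(alloc: ByteBufAllocator): CompositeByteBuf =
  alloc.compositeBuffer(Integer.MAX_VALUE)
```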
## How was this patch tested?
spark unit tests and manual tests.
For manual tests, we can reproduce the performance issue with the following
code:
`sc.parallelize(Array(1,2,3),3).mapPartitions(a=>Array(new Array[Double](1024 * 1024 * 50)).iterator).reduce((a,b)=> a).length`
It's easy to see the performance gain, both from the running time and CPU
usage.
Author: Zhang, Liye <[email protected]>
Closes #12038 from liyezhang556520/spark-14242.
commit 1b070637fa03ab4966f76427b15e433050eaa956
Author: Cheng Lian <[email protected]>
Date: 2016-04-01T06:46:08Z
[SPARK-14295][SPARK-14274][SQL] Implements buildReader() for LibSVM
## What changes were proposed in this pull request?
This PR implements `FileFormat.buildReader()` for the LibSVM data source.
Besides that, a new interface method `prepareRead()` is added to `FileFormat`:
```scala
def prepareRead(
sqlContext: SQLContext,
options: Map[String, String],
files: Seq[FileStatus]): Map[String, String] = options
```
After migrating from `buildInternalScan()` to `buildReader()`, we lost the
opportunity to collect necessary global information, since `buildReader()`
works in a per-partition manner. For example, LibSVM needs to infer the total
number of features if the `numFeatures` data source option is not set. Any
necessary collected global information should be returned using the data source
options map. By default, this method just returns the original options
untouched.
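A sketch of how a data source such as LibSVM might use this hook, as a fragment
inside a `FileFormat` implementation (the `inferNumFeatures` helper below is
hypothetical; "numFeatures" mirrors the LibSVM data source option):
```scala
// Sketch of a FileFormat implementation overriding prepareRead() to stash
// globally inferred information into the options map.
override def prepareRead(
    sqlContext: SQLContext,
    options: Map[String, String],
    files: Seq[FileStatus]): Map[String, String] = {
  if (options.contains("numFeatures")) {
    options
  } else {
    // hypothetical helper that scans the input to count features
    options + ("numFeatures" -> inferNumFeatures(sqlContext, files).toString)
  }
}
```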
An alternative approach is to absorb `inferSchema()` into `prepareRead()`,
since schema inference is also some kind of global information gathering.
However, this approach wasn't chosen because schema inference is optional,
while `prepareRead()` must be called whenever a `HadoopFsRelation` based data
source relation is instantiated.
One unaddressed problem is that, when `numFeatures` is absent, now the
input data will be scanned twice. The `buildInternalScan()` code path doesn't
need to do this because it caches the raw parsed RDD in memory before computing
the total number of features. However, with `FileScanRDD`, the raw parsed RDD
is created in a different way (e.g. partitioning) from the final RDD.
## How was this patch tested?
Tested using existing test suites.
Author: Cheng Lian <[email protected]>
Closes #12088 from liancheng/spark-14295-libsvm-build-reader.
commit 26867ebc67edab97376c5d8fee76df294359e461
Author: Alexander Ulanov <[email protected]>
Date: 2016-04-01T06:48:36Z
[SPARK-11262][ML] Unit test for gradient, loss layers, memory management
for multilayer perceptron
1. Implement LossFunction trait and implement squared error and cross entropy loss with it
2. Implement unit test for gradient and loss
3. Implement InPlace trait and in-place layer evaluation
4. Refactor interface for ActivationFunction
5. Update of Layer and LayerModel interfaces
6. Fix random weights assignment
7. Implement memory allocation by MLP model instead of individual layers
These features decreased the memory usage and increased flexibility of
internal API.
Author: Alexander Ulanov <[email protected]>
Author: avulanov <[email protected]>
Closes #9229 from avulanov/mlp-refactoring.
commit 22249afb4a932a82ff1f7a3befea9fda5a60a3f4
Author: Yanbo Liang <[email protected]>
Date: 2016-04-01T06:49:58Z
[SPARK-14303][ML][SPARKR] Define and use KMeansWrapper for SparkR::kmeans
## What changes were proposed in this pull request?
Define and use ```KMeansWrapper``` for ```SparkR::kmeans```. It is only a
code refactor of the original ```KMeans``` wrapper.
## How was this patch tested?
Existing tests.
cc mengxr
Author: Yanbo Liang <[email protected]>
Closes #12039 from yanboliang/spark-14059.
commit 3715ecdf417b47423ff07145a5623d8d817c45ef
Author: Cheng Lian <[email protected]>
Date: 2016-04-01T09:02:48Z
[SPARK-14295][MLLIB][HOTFIX] Fixes Scala 2.10 compilation failure
## What changes were proposed in this pull request?
Fixes a compilation failure introduced in PR #12088 under Scala 2.10.
## How was this patch tested?
Compilation.
Author: Cheng Lian <[email protected]>
Closes #12107 from liancheng/spark-14295-hotfix.
commit 0b04f8fdf1614308cb3e7e0c7282f7365cc3d1bb
Author: Dilip Biswal <[email protected]>
Date: 2016-04-01T16:27:11Z
[SPARK-14184][SQL] Support native execution of SHOW DATABASE command and
fix SHOW TABLE to use table identifier pattern
## What changes were proposed in this pull request?
This PR addresses the following
1. Supports native execution of SHOW DATABASES command
2. Fixes SHOW TABLES to apply the identifier_with_wildcards pattern if
supplied.
SHOW TABLE syntax
```
SHOW TABLES [IN database_name] ['identifier_with_wildcards'];
```
SHOW DATABASES syntax
```
SHOW (DATABASES|SCHEMAS) [LIKE 'identifier_with_wildcards'];
```
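For example, invoked from Scala (database name and patterns are made up):
```scala
// Sketch: exercise the new native commands; results come back as DataFrames.
sqlContext.sql("SHOW DATABASES LIKE 'test*'").show()
sqlContext.sql("SHOW TABLES IN default 'sales*'").show()
```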
## How was this patch tested?
Tests added in SQLQuerySuite (both hive and sql contexts) and
DDLCommandSuite
Note: Since the table name pattern was not working, tests are added in both
SQLQuerySuites (hive and sql contexts) to verify the application of the table
pattern.
Author: Dilip Biswal <[email protected]>
Closes #11991 from dilipbiswal/dkb_show_database.
commit a471c7f9eaa59d55dfff5b9d1a858f304a6b3a84
Author: sureshthalamati <[email protected]>
Date: 2016-04-01T16:33:31Z
[SPARK-14133][SQL] Throws exception for unsupported create/drop/alter index,
and lock/unlock operations.
## What changes were proposed in this pull request?
This PR throws an Unsupported Operation exception for create index, drop
index, alter index, lock table, lock database, unlock table, and unlock
database operations that are not supported in Spark SQL. Currently these
operations are executed by Hive.
Error:
spark-sql> drop index my_index on my_table;
Error in query:
Unsupported operation: drop index(line 1, pos 0)
## How was this patch tested?
Added test cases to HiveQuerySuite
yhuai hvanhovell andrewor14
Author: sureshthalamati <[email protected]>
Closes #12069 from sureshthalamati/unsupported_ddl_spark-14133.
commit 58e6bc827f1f9dc1afee07dca1bee1f56553dd20
Author: Dongjoon Hyun <[email protected]>
Date: 2016-04-01T17:36:01Z
[MINOR] [SQL] Update usage of `debug` by removing `typeCheck` and adding
`debugCodegen`
## What changes were proposed in this pull request?
This PR updates the usage comments of `debug` according to the following
commits.
- [SPARK-9754](https://issues.apache.org/jira/browse/SPARK-9754) removed
`typeCheck`.
- [SPARK-14227](https://issues.apache.org/jira/browse/SPARK-14227) added
`debugCodegen`.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <[email protected]>
Closes #12094 from dongjoon-hyun/minor_fix_debug_usage.
commit 8ba2b7f28fee39c4839e5ea125bd25f5091a3a1e
Author: jerryshao <[email protected]>
Date: 2016-04-01T17:52:13Z
[SPARK-12343][YARN] Simplify Yarn client and client argument
## What changes were proposed in this pull request?
Currently in Spark on YARN, configurations can be passed through SparkConf,
env, and command arguments; some parts are duplicated, such as the client
arguments and SparkConf. So here I propose to simplify the command arguments.
## How was this patch tested?
This patch is tested manually and with unit tests.
CC vanzin tgravescs, please help review this proposal. The original
purpose of this JIRA is to remove `ClientArguments`; through refactoring, some
arguments like `--class` and `--arg` are not so easy to replace, so here I
remove most of the command line arguments and only keep the minimal set.
Author: jerryshao <[email protected]>
Closes #11603 from jerryshao/SPARK-12343.
commit 381358fbe9afbe205299cbbea4c43148e2e69468
Author: Yanbo Liang <[email protected]>
Date: 2016-04-01T19:53:39Z
[SPARK-14305][ML][PYSPARK] PySpark ml.clustering BisectingKMeans support
export/import
## What changes were proposed in this pull request?
PySpark ml.clustering BisectingKMeans support export/import
## How was this patch tested?
doc test.
cc jkbradley
Author: Yanbo Liang <[email protected]>
Closes #12112 from yanboliang/spark-14305.
commit df68beb85de59bb6d35b2a8a3b85dbc447798bf5
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-04-01T20:00:55Z
[SPARK-13995][SQL] Extract correct IsNotNull constraints for Expression
## What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-13995
We currently infer `IsNotNull` constraints from a logical plan's expressions
in `constructIsNotNullConstraints`. However, we don't consider the case of
(nested) `Cast`.
For example:
val tr = LocalRelation('a.int, 'b.long)
val plan = tr.where('a.attr === 'b.attr).analyze
Then, the plan's constraints will have `IsNotNull(Cast(resolveColumn(tr,
"a"), LongType))`, instead of `IsNotNull(resolveColumn(tr, "a"))`. This PR
fixes it.
Besides, as `IsNotNull` constraints are most useful for `Attribute`s, we
should recurse through any `Expression` that is null-intolerant and
construct `IsNotNull` constraints for all `Attribute`s under these expressions.
For example, consider the following constraints:
val df = Seq((1,2,3)).toDF("a", "b", "c")
df.where("a + b = c").queryExecution.analyzed.constraints
The inferred isnotnull constraints should be isnotnull(a), isnotnull(b),
isnotnull(c), instead of isnotnull(a + c) and isnotnull(c).
## How was this patch tested?
Test is added into `ConstraintPropagationSuite`.
Author: Liang-Chi Hsieh <[email protected]>
Closes #11809 from viirya/constraint-cast.
commit a884daad805a701494e87393dc307937472a985d
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-04-01T20:03:27Z
[SPARK-14191][SQL] Remove invalid Expand operator constraints
The `Expand` operator currently uses its child plan's constraints as its valid
constraints (i.e., the base of constraints). This is not correct because
`Expand` will set its group-by attributes to null values, so the nullability of
these attributes should be true.
E.g., for an `Expand` operator like:
val input = LocalRelation('a.int, 'b.int, 'c.int).where('c.attr > 10 &&
'a.attr < 5 && 'b.attr > 2)
Expand(
Seq(
Seq('c, Literal.create(null, StringType), 1),
Seq('c, 'a, 2)),
Seq('c, 'a, 'gid.int),
Project(Seq('a, 'c), input))
The `Project` operator has the constraints `IsNotNull('a)`, `IsNotNull('b)`
and `IsNotNull('c)`. But the `Expand` should not have `IsNotNull('a)` in its
constraints.
This PR is the first step for this issue and removes the invalid constraints of
the `Expand` operator.
A test is added to `ConstraintPropagationSuite`.
Author: Liang-Chi Hsieh <[email protected]>
Author: Michael Armbrust <[email protected]>
Closes #11995 from viirya/fix-expand-constraints.
commit 1e886159849e3918445d3fdc3c4cef86c6c1a236
Author: Tejas Patil <[email protected]>
Date: 2016-04-01T20:13:16Z
[SPARK-14070][SQL] Use ORC data source for SQL queries on ORC tables
## What changes were proposed in this pull request?
This patch enables use of OrcRelation for SQL queries which read data from
Hive tables. Changes in this patch:
- Added a new rule `OrcConversions` which would alter the plan to use
`OrcRelation`. In this diff, the conversion is done only for reads.
- Added a new config `spark.sql.hive.convertMetastoreOrc` to control the
conversion
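Turning the conversion on from a Scala shell might look like the following
sketch (`hqlContext` is the HiveContext used in the plans below):
```scala
// Sketch: enable conversion of metastore ORC tables to the ORC data source
// path, then check the plan; `orc_table` is the table from the example below.
hqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "true")
hqlContext.sql("SELECT * FROM orc_table").explain(true)
```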
BEFORE
```
scala> hqlContext.sql("SELECT * FROM orc_table").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias(*, None)]
+- 'UnresolvedRelation `orc_table`, None
== Analyzed Logical Plan ==
key: string, value: string
Project [key#171,value#172]
+- MetastoreRelation default, orc_table, None
== Optimized Logical Plan ==
MetastoreRelation default, orc_table, None
== Physical Plan ==
HiveTableScan [key#171,value#172], MetastoreRelation default, orc_table, None
```
AFTER
```
scala> hqlContext.sql("SELECT * FROM orc_table").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias(*, None)]
+- 'UnresolvedRelation `orc_table`, None
== Analyzed Logical Plan ==
key: string, value: string
Project [key#76,value#77]
+- SubqueryAlias orc_table
+- Relation[key#76,value#77] ORC part: struct<>, data: struct<key:string,value:string>
== Optimized Logical Plan ==
Relation[key#76,value#77] ORC part: struct<>, data: struct<key:string,value:string>
== Physical Plan ==
WholeStageCodegen
: +- Scan ORC part: struct<>, data: struct<key:string,value:string>[key#76,value#77] InputPaths: file:/user/hive/warehouse/orc_table
```
## How was this patch tested?
- Added a new unit test. Ran existing unit tests
- Ran with production like data
## Performance gains
Ran on a production table in Facebook (note that the data was in the DWRF file
format, which is similar to ORC).
Best case: when there were no matching rows for the predicate in the query
(everything is filtered out)
```
                     CPU time      Wall time   Total wall time
                                               across all tasks
================================================================
Without the change   541_515 sec   25.0 mins   165.8 hours
With change          407 sec       1.5 mins    15 mins
```
Average case: A subset of rows in the data match the query predicate
```
                     CPU time      Wall time   Total wall time
                                               across all tasks
================================================================
Without the change   624_630 sec   31.0 mins   199.0 h
With change          14_769 sec    5.3 mins    7.7 h
```
Author: Tejas Patil <[email protected]>
Closes #11891 from tejasapatil/orc_ppd.
commit 1b829ce13990b40fd8d7c9efcc2ae55c4dbc861c
Author: Burak Yavuz <[email protected]>
Date: 2016-04-01T20:19:24Z
[SPARK-14160] Time Windowing functions for Datasets
## What changes were proposed in this pull request?
This PR adds the function `window` as a column expression.
`window` can be used to bucket rows into time windows given a time column.
With this expression, performing time series analysis on batch data, as well as
streaming data, should become much simpler.
### Usage
Assume the following schema:
`sensor_id, measurement, timestamp`
To average 5 minute data every 1 minute (window length of 5 minutes, slide
duration of 1 minute), we will use:
```scala
df.groupBy(window("timestamp", â5 minutesâ, â1 minuteâ),
"sensor_id")
.agg(mean("measurement").as("avg_meas"))
```
This will generate windows such as:
```
09:00:00-09:05:00
09:01:00-09:06:00
09:02:00-09:07:00 ...
```
Intervals will start at every `slideDuration` starting at the unix epoch
(1970-01-01 00:00:00 UTC).
To start intervals at a different point of time, e.g. 30 seconds after a
minute, the `startTime` parameter can be used.
```scala
df.groupBy(window("timestamp", â5 minutesâ, â1 minuteâ, "30
second"), "sensor_id")
.agg(mean("measurement").as("avg_meas"))
```
This will generate windows such as:
```
09:00:30-09:05:30
09:01:30-09:06:30
09:02:30-09:07:30 ...
```
Support for Python will be added in a follow-up PR after this.
## How was this patch tested?
This patch has some basic unit tests for the `TimeWindow` expression
testing that the parameters pass validation, and it also has some
unit/integration tests testing the correctness of the windowing and usability
in complex operations (multi-column grouping, multi-column projections, joins).
Author: Burak Yavuz <[email protected]>
Author: Michael Armbrust <[email protected]>
Closes #12008 from brkyvz/df-time-window.
commit 3e991dbc310a4a33eec7f3909adce50bf8268d04
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-04-01T21:02:32Z
[SPARK-13674] [SQL] Add wholestage codegen support to Sample
JIRA: https://issues.apache.org/jira/browse/SPARK-13674
## What changes were proposed in this pull request?
The Sample operator doesn't support whole-stage codegen now. This PR adds
support for it.
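A quick way to check whether Sample now participates in whole-stage codegen (a
sketch; the row count and fraction are arbitrary):
```scala
// Sketch: with this patch, the explain output for a sampled query should show
// the Sample operator inside a WholeStageCodegen block.
val df = sqlContext.range(0, 1000000)
df.sample(withReplacement = false, fraction = 0.1).explain()
```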
## How was this patch tested?
A test is added into `BenchmarkWholeStageCodegen`. Besides, all existing tests
should pass.
Author: Liang-Chi Hsieh <[email protected]>
Author: Liang-Chi Hsieh <[email protected]>
Closes #11517 from viirya/add-wholestage-sample.
commit bd7b91cefb0d192d808778e6182dcdd2c143e132
Author: zhonghaihua <[email protected]>
Date: 2016-04-01T21:23:14Z
[SPARK-12864][YARN] initialize executorIdCounter after ApplicationMaster
killed for max n…
Currently, when the number of executor failures reaches
`maxNumExecutorFailures`, the `ApplicationMaster` will be killed and another
one re-registered. At that point a new `YarnAllocator` instance is created,
but the value of the property `executorIdCounter` in `YarnAllocator` resets
to `0`, so the IDs of new executors start from `1`. These conflict with the
executors created before, which causes FetchFailedException.
This situation only occurs in yarn-client mode. For more details, [link to
jira issue
SPARK-12864](https://issues.apache.org/jira/browse/SPARK-12864)
This PR introduces a mechanism to initialize `executorIdCounter` after the
`ApplicationMaster` is killed.
Author: zhonghaihua <[email protected]>
Closes #10794 from zhonghaihua/initExecutorIdCounterAfterAMKilled.
commit e41acb757327e3226ffe312766ec759c16616588
Author: Josh Rosen <[email protected]>
Date: 2016-04-01T21:34:59Z
[SPARK-13992] Add support for off-heap caching
This patch adds support for caching blocks in the executor processes using
direct / off-heap memory.
## User-facing changes
**Updated semantics of `OFF_HEAP` storage level**: In Spark 1.x, the
`OFF_HEAP` storage level indicated that an RDD should be cached in Tachyon.
Spark 2.x removed the external block store API that Tachyon caching was based
on (see #10752 / SPARK-12667), so `OFF_HEAP` became an alias for
`MEMORY_ONLY_SER`. As of this patch, `OFF_HEAP` means "serialized and cached in
off-heap memory or on disk". Via the `StorageLevel` constructor, `useOffHeap`
can be set if `serialized == true` and can be used to construct custom storage
levels which support replication.
**Storage UI reporting**: the storage UI will now report whether in-memory
blocks are stored on- or off-heap.
**Only supported by UnifiedMemoryManager**: for simplicity, this feature is
only supported when the default UnifiedMemoryManager is used; applications
which use the legacy memory manager (`spark.memory.useLegacyMode=true`) are not
currently able to allocate off-heap storage memory, so using off-heap caching
will fail with an error when legacy memory management is enabled. Given that we
plan to eventually remove the legacy memory manager, this is not a significant
restriction.
**Memory management policies:** the policies for dividing available memory
between execution and storage are the same for both on- and off-heap memory.
For off-heap memory, the total amount of memory available for use by Spark is
controlled by `spark.memory.offHeap.size`, which is an absolute size. Off-heap
storage memory obeys `spark.memory.storageFraction` in order to control the
amount of unevictable storage memory. For example, if
`spark.memory.offHeap.size` is 1 gigabyte and Spark uses the default
`storageFraction` of 0.5, then up to 500 megabytes of off-heap cached blocks
will be protected from eviction due to execution memory pressure. If necessary,
we can split `spark.memory.storageFraction` into separate on- and off-heap
configurations, but this doesn't seem necessary now and can be done later
without any breaking changes.
**Use of off-heap memory does not imply use of off-heap execution (or
vice-versa)**: for now, the settings controlling the use of off-heap execution
memory (`spark.memory.offHeap.enabled`) and off-heap caching are completely
independent, so Spark SQL can be configured to use off-heap memory for
execution while continuing to cache blocks on-heap. If desired, we can change
this in a follow-up patch so that `spark.memory.offHeap.enabled` affects the
default storage level for cached SQL tables.
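A minimal sketch of opting into off-heap caching under these semantics
(configuration values and the sample RDD are illustrative):
```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Sketch: give Spark 1 GB of off-heap memory (requires the default
// UnifiedMemoryManager) and cache an RDD with the new OFF_HEAP semantics.
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("offheap-cache-sketch")
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "1g")
val sc = new SparkContext(conf)
sc.parallelize(1 to 1000000).persist(StorageLevel.OFF_HEAP).count()
```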
## Internal changes
- Rename `ByteArrayChunkOutputStream` to `ChunkedByteBufferOutputStream`
- It now returns a `ChunkedByteBuffer` instead of an array of byte arrays.
- Its constructor now accepts an `allocator` function which is called to
allocate `ByteBuffer`s. This allows us to control whether it allocates regular
ByteBuffers or off-heap DirectByteBuffers.
- Because block serialization is now performed during the unroll process,
a `ChunkedByteBufferOutputStream` which is configured with a `DirectByteBuffer`
allocator will use off-heap memory for both unroll and storage memory.
- The `MemoryStore`'s MemoryEntries now tracks whether blocks are stored
on- or off-heap.
- `evictBlocksToFreeSpace()` now accepts a `MemoryMode` parameter so that
we don't try to evict off-heap blocks in response to on-heap memory pressure
(or vice-versa).
- Make sure that off-heap buffers are properly de-allocated during
MemoryStore eviction.
- The JVM limits the total size of allocated direct byte buffers using the
`-XX:MaxDirectMemorySize` flag and the default tends to be fairly low (< 512
megabytes in some JVMs). To work around this limitation, this patch adds a
custom DirectByteBuffer allocator which ignores this memory limit.
Author: Josh Rosen <[email protected]>
Closes #11805 from JoshRosen/off-heap-caching.
commit 0b7d4966ca7e02f351c4b92a74789cef4799fcb1
Author: Shixiong Zhu <[email protected]>
Date: 2016-04-01T22:00:38Z
[SPARK-14316][SQL] StateStoreCoordinator should extend ThreadSafeRpcEndpoint
## What changes were proposed in this pull request?
RpcEndpoint is not thread safe and allows multiple messages to be processed
at the same time. StateStoreCoordinator should use ThreadSafeRpcEndpoint.
## How was this patch tested?
Existing unit tests.
Author: Shixiong Zhu <[email protected]>
Closes #12100 from zsxwing/fix-StateStoreCoordinator.
----