GitHub user rekhajoshm opened a pull request:
https://github.com/apache/spark/pull/11837
[SPARK-14006] [SparkR] R style changes
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-14006
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rekhajoshm/spark brnach-1.6
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11837.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11837
----
commit 941b270b706d3b4aea73dbf102cfb6eee0beff63
Author: Dongjoon Hyun <[email protected]>
Date: 2016-03-03T22:42:12Z
[MINOR] Fix typos in comments and testcase name of code
## What changes were proposed in this pull request?
This PR fixes typos in code comments and in a test case name.
## How was this patch tested?
manual.
Author: Dongjoon Hyun <[email protected]>
Closes #11481 from dongjoon-hyun/minor_fix_typos_in_code.
commit 3edcc40223c8af12f64c2286420dc77817ab770e
Author: Andrew Or <[email protected]>
Date: 2016-03-03T23:24:38Z
[SPARK-13632][SQL] Move commands.scala to command package
## What changes were proposed in this pull request?
This patch simply moves things to a new package in an effort to reduce the
size of the diff in #11048. Currently the new package only has one file, but in
the future we'll add many new commands in SPARK-13139.
## How was this patch tested?
Jenkins.
Author: Andrew Or <[email protected]>
Closes #11482 from andrewor14/commands-package.
commit ad0de99f3d3167990d501297f1df069fe15e0678
Author: Shixiong Zhu <[email protected]>
Date: 2016-03-03T23:41:56Z
[SPARK-13584][SQL][TESTS] Make ContinuousQueryManagerSuite not output logs
to the console
## What changes were proposed in this pull request?
Make ContinuousQueryManagerSuite not output logs to the console. The logs
will still output to `unit-tests.log`.
I also updated `SQLListenerMemoryLeakSuite` to use `quietly` instead of
changing the log level, since changing the level would keep logs out of
`unit-tests.log`.
## How was this patch tested?
Just check Jenkins output.
Author: Shixiong Zhu <[email protected]>
Closes #11439 from zsxwing/quietly-ContinuousQueryManagerSuite.
commit b373a888621ba6f0dd499f47093d4e2e42086dfc
Author: Davies Liu <[email protected]>
Date: 2016-03-04T01:36:48Z
[SPARK-13415][SQL] Visualize subquery in SQL web UI
## What changes were proposed in this pull request?
This PR adds visualization of subqueries to the SQL web UI and also improves
the explain output for subqueries, especially when they are used together with
whole-stage codegen.
For example:
```python
>>> sqlContext.range(100).registerTempTable("range")
>>> sqlContext.sql("select id / (select sum(id) from range) from range
where id > (select id from range limit 1)").explain(True)
== Parsed Logical Plan ==
'Project [unresolvedalias(('id / subquery#9), None)]
: +- 'SubqueryAlias subquery#9
: +- 'Project [unresolvedalias('sum('id), None)]
: +- 'UnresolvedRelation `range`, None
+- 'Filter ('id > subquery#8)
: +- 'SubqueryAlias subquery#8
: +- 'GlobalLimit 1
: +- 'LocalLimit 1
: +- 'Project [unresolvedalias('id, None)]
: +- 'UnresolvedRelation `range`, None
+- 'UnresolvedRelation `range`, None
== Analyzed Logical Plan ==
(id / scalarsubquery()): double
Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id /
scalarsubquery())#11]
: +- SubqueryAlias subquery#9
: +- Aggregate [(sum(id#0L),mode=Complete,isDistinct=false) AS
sum(id)#10L]
: +- SubqueryAlias range
: +- Range 0, 100, 1, 4, [id#0L]
+- Filter (id#0L > subquery#8)
: +- SubqueryAlias subquery#8
: +- GlobalLimit 1
: +- LocalLimit 1
: +- Project [id#0L]
: +- SubqueryAlias range
: +- Range 0, 100, 1, 4, [id#0L]
+- SubqueryAlias range
+- Range 0, 100, 1, 4, [id#0L]
== Optimized Logical Plan ==
Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id /
scalarsubquery())#11]
: +- SubqueryAlias subquery#9
: +- Aggregate [(sum(id#0L),mode=Complete,isDistinct=false) AS
sum(id)#10L]
: +- Range 0, 100, 1, 4, [id#0L]
+- Filter (id#0L > subquery#8)
: +- SubqueryAlias subquery#8
: +- GlobalLimit 1
: +- LocalLimit 1
: +- Project [id#0L]
: +- Range 0, 100, 1, 4, [id#0L]
+- Range 0, 100, 1, 4, [id#0L]
== Physical Plan ==
WholeStageCodegen
: +- Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id
/ scalarsubquery())#11]
: : +- Subquery subquery#9
: : +- WholeStageCodegen
: : : +- TungstenAggregate(key=[],
functions=[(sum(id#0L),mode=Final,isDistinct=false)], output=[sum(id)#10L])
: : : +- INPUT
: : +- Exchange SinglePartition, None
: : +- WholeStageCodegen
: : : +- TungstenAggregate(key=[],
functions=[(sum(id#0L),mode=Partial,isDistinct=false)], output=[sum#14L])
: : : +- Range 0, 1, 4, 100, [id#0L]
: +- Filter (id#0L > subquery#8)
: : +- Subquery subquery#8
: : +- CollectLimit 1
: : +- WholeStageCodegen
: : : +- Project [id#0L]
: : : +- Range 0, 1, 4, 100, [id#0L]
: +- Range 0, 1, 4, 100, [id#0L]
```
The web UI looks like: [screenshot omitted]
This PR also changes the tree structure of WholeStageCodegen to make it
consistent with other operators. Before this change, both WholeStageCodegen and
InputAdapter held references to the same plans, which could be updated without
notifying the other, causing the problems discovered in #11403.
## How was this patch tested?
Existing tests, also manual tests with the example query, check the explain
and web UI.
Author: Davies Liu <[email protected]>
Closes #11417 from davies/viz_subquery.
commit d062587dd2c4ed13998ee8bcc9d08f29734df228
Author: Davies Liu <[email protected]>
Date: 2016-03-04T01:46:28Z
[SPARK-13601] [TESTS] use 1 partition in tests to avoid race conditions
## What changes were proposed in this pull request?
Fix race conditions when cleaning up files.
## How was this patch tested?
Existing tests.
Author: Davies Liu <[email protected]>
Closes #11507 from davies/flaky.
commit 15d57f9c23145ace37d1631d8f9c19675c142214
Author: Wenchen Fan <[email protected]>
Date: 2016-03-04T04:16:37Z
[SPARK-13647] [SQL] also check if numeric value is within allowed range in
_verify_type
## What changes were proposed in this pull request?
This PR makes `_verify_type` in `types.py` stricter: it also checks whether a
numeric value is within the allowed range for its type.
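For illustration, a minimal sketch of the kind of range check involved (a
hypothetical helper, not the actual `types.py` code):
```python
# Hypothetical sketch of a numeric range check in the spirit of _verify_type.
# The bounds are the standard ranges of Spark SQL's fixed-width integer types.
_NUMERIC_RANGES = {
    "ByteType": (-128, 127),
    "ShortType": (-32768, 32767),
    "IntegerType": (-2147483648, 2147483647),
    "LongType": (-9223372036854775808, 9223372036854775807),
}

def _check_numeric_range(value, type_name):
    """Raise ValueError if value falls outside the allowed range for the type."""
    lo, hi = _NUMERIC_RANGES[type_name]
    if not (lo <= value <= hi):
        raise ValueError("object of %s out of range: %r" % (type_name, value))

_check_numeric_range(100, "ByteType")     # passes
# _check_numeric_range(1000, "ByteType")  # would raise ValueError
```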
## How was this patch tested?
newly added doc test.
Author: Wenchen Fan <[email protected]>
Closes #11492 from cloud-fan/py-verify.
commit f6ac7c30d48e666618466e825578fa457e2a0ed4
Author: thomastechs <[email protected]>
Date: 2016-03-04T04:35:40Z
[SPARK-12941][SQL][MASTER] Spark-SQL JDBC Oracle dialect fails to map
string datatypes to Oracle VARCHAR datatype mapping
## What changes were proposed in this pull request?
A test suite was added for the SPARK-12941 bug fix, covering the mapping of
StringType to the corresponding Oracle datatype.
## How was this patch tested?
manual tests done
Author: thomastechs <[email protected]>
Author: THOMAS SEBASTIAN <[email protected]>
Closes #11489 from thomastechs/thomastechs-12941-master-new.
commit 465c665db1dc65e3b02c584cf7f8d06b24909b0c
Author: Shixiong Zhu <[email protected]>
Date: 2016-03-04T06:53:07Z
[SPARK-13652][CORE] Copy ByteBuffer in sendRpcSync as it will be recycled
## What changes were proposed in this pull request?
`sendRpcSync` should copy the response content because the underlying
buffer will be recycled and reused.
## How was this patch tested?
Jenkins unit tests.
Author: Shixiong Zhu <[email protected]>
Closes #11499 from zsxwing/SPARK-13652.
commit dd83c209f1692a2e5afb72fa7a2d039fd1e682c8
Author: Davies Liu <[email protected]>
Date: 2016-03-04T08:18:15Z
[SPARK-13603][SQL] support SQL generation for subquery
## What changes were proposed in this pull request?
This adds support for SQL generation for subquery expressions, which are
recursively replaced with a `SubqueryHolder` inside `SQLBuilder`.
## How was this patch tested?
Added unit tests.
Author: Davies Liu <[email protected]>
Closes #11453 from davies/sql_subquery.
commit 27e88faa058c1364d0e99fffc0c5cb64ef817bd3
Author: Abou Haydar Elias <[email protected]>
Date: 2016-03-04T10:01:52Z
[SPARK-13646][MLLIB] QuantileDiscretizer counts dataset twice in get…
## What changes were proposed in this pull request?
It avoids counting the DataFrame twice.
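The general idea, sketched in PySpark purely for illustration (the actual
MLlib change is on the Scala side):
```python
# Hypothetical sketch: trigger the expensive count action once and reuse the
# result, instead of counting the DataFrame twice.
def choose_sample_fraction(df, required_samples):
    total = df.count()                       # single pass over the data
    fraction = min(1.0, float(required_samples) / max(total, 1))
    return fraction, total
```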
Author: Abou Haydar Elias <[email protected]>
Author: Elie A <[email protected]>
Closes #11491 from eliasah/quantile-discretizer-patch.
commit c04dc27cedd3d75781fda4c24da16b6ada44d3e4
Author: Holden Karau <[email protected]>
Date: 2016-03-04T10:56:58Z
[SPARK-13398][STREAMING] Move away from thread pool task support to forkjoin
## What changes were proposed in this pull request?
Remove the old deprecated ThreadPoolExecutor and replace it with an
ExecutionContext backed by a ForkJoinPool. The downside is that Scala's
ForkJoinPool doesn't give us a way to specify the thread pool name (and is a
wrapper of Java's in 2.12) except by providing a custom factory. Note that we
can't use Java's ForkJoinPool directly in Scala 2.11 since it uses an
ExecutionContext which reports system parallelism. One other implicit change is
that the old ExecutionContext would have reported a different default
parallelism, since it used system parallelism rather than thread pool
parallelism (this was likely not intended but also likely not a huge
difference).
The previous version of this PR attempted to use an execution context
constructed on the ThreadPool (but not the deprecated ThreadPoolExecutor class)
so as to keep human-readable thread names, but this reported system
parallelism.
## How was this patch tested?
unit tests: streaming/testOnly org.apache.spark.streaming.util.*
Author: Holden Karau <[email protected]>
Closes #11423 from
holdenk/SPARK-13398-move-away-from-ThreadPoolTaskSupport-java-forkjoin.
commit 204b02b56afe358b7f2d403fb6e2b9e8a7122798
Author: Rajesh Balamohan <[email protected]>
Date: 2016-03-04T10:59:40Z
[SPARK-12925] Improve HiveInspectors.unwrap for StringObjectInspector.…
The earlier fix did not copy the bytes, and it is possible for a higher level
to reuse the Text object, which was causing issues. The proposed fix copies the
bytes from Text while still avoiding the expensive encoding/decoding.
Author: Rajesh Balamohan <[email protected]>
Closes #11477 from rajeshbalamohan/SPARK-12925.2.
commit e617508244b508b59b4debb35cad3258cddbb9cf
Author: Masayoshi TSUZUKI <[email protected]>
Date: 2016-03-04T13:53:53Z
[SPARK-13673][WINDOWS] Fixed not to pollute environment variables.
## What changes were proposed in this pull request?
This patch fixes the problem that `bin\beeline.cmd` pollutes environment
variables.
A similar problem was reported and fixed in
https://issues.apache.org/jira/browse/SPARK-3943, but `bin\beeline.cmd` seems
to have been added later.
## How was this patch tested?
manual tests:
I executed the new `bin\beeline.cmd` and confirmed that %SPARK_HOME%
doesn't remain in the command prompt.
Author: Masayoshi TSUZUKI <[email protected]>
Closes #11516 from tsudukim/feature/SPARK-13673.
commit c8f25459ed4ad6b51a5f11665364cfe0b84f7b3c
Author: Dongjoon Hyun <[email protected]>
Date: 2016-03-04T16:25:41Z
[SPARK-13676] Fix mismatched default values for regParam in
LogisticRegression
## What changes were proposed in this pull request?
The default value of regularization parameter for `LogisticRegression`
algorithm is different in Scala and Python. We should provide the same value.
**Scala**
```
scala> new
org.apache.spark.ml.classification.LogisticRegression().getRegParam
res0: Double = 0.0
```
**Python**
```
>>> from pyspark.ml.classification import LogisticRegression
>>> LogisticRegression().getRegParam()
0.1
```
## How was this patch tested?
manual. Check the following in `pyspark`.
```
>>> from pyspark.ml.classification import LogisticRegression
>>> LogisticRegression().getRegParam()
0.0
```
Author: Dongjoon Hyun <[email protected]>
Closes #11519 from dongjoon-hyun/SPARK-13676.
commit 83302c3bff13bd7734426c81d9c83bf4beb211c9
Author: Xusen Yin <[email protected]>
Date: 2016-03-04T16:32:24Z
[SPARK-13036][SPARK-13318][SPARK-13319] Add save/load for feature.py
Add save/load for feature.py. Meanwhile, add save/load for
`ElementwiseProduct` on the Scala side and fix a bug where `setDefault` was
missing in `VectorSlicer` and `StopWordsRemover`.
In this PR I ignore `RFormula` and `RFormulaModel` because their Scala
implementation is pending in https://github.com/apache/spark/pull/9884. I'll
add them in this PR if https://github.com/apache/spark/pull/9884 gets merged
first, or file a follow-up JIRA for `RFormula`.
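As a rough illustration of what the Python-side persistence enables (a sketch
against a recent PySpark; API details at the time of this PR may differ):
```python
# Hypothetical usage sketch of save/load for a Python feature transformer.
from pyspark.sql import SparkSession
from pyspark.ml.feature import StopWordsRemover

spark = SparkSession.builder.master("local[1]").appName("persist-demo").getOrCreate()

remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
remover.save("/tmp/stopwords-remover")                  # write params to a path
restored = StopWordsRemover.load("/tmp/stopwords-remover")
assert restored.getInputCol() == "raw"
```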
Author: Xusen Yin <[email protected]>
Closes #11203 from yinxusen/SPARK-13036.
commit b7d41474216787e9cd38c04a15c43d5d02f02f93
Author: Andrew Or <[email protected]>
Date: 2016-03-04T18:32:00Z
[SPARK-13633][SQL] Move things into catalyst.parser package
## What changes were proposed in this pull request?
This patch simply moves things to the existing package
`o.a.s.sql.catalyst.parser` in an effort to reduce the size of the diff in
#11048. This is conceptually the same as the recently merged patch #11482.
## How was this patch tested?
Jenkins.
Author: Andrew Or <[email protected]>
Closes #11506 from andrewor14/parser-package.
commit 5f42c28b119b79c0ea4910c478853d451cd1a967
Author: Alex Bozarth <[email protected]>
Date: 2016-03-04T23:04:09Z
[SPARK-13459][WEB UI] Separate Alive and Dead Executors in Executor Totals
Table
## What changes were proposed in this pull request?
Now that dead executors are shown in the executors table (#10058), the
totals table is updated to include separate totals for alive and dead
executors as well as the current total, as originally discussed in #10668.
## How was this patch tested?
Manually verified by running the Standalone Web UI in the latest Safari and
Firefox ESR
Author: Alex Bozarth <[email protected]>
Closes #11381 from ajbozarth/spark13459.
commit a6e2bd31f52f9e9452e52ab5b846de3dee8b98a7
Author: Nong Li <[email protected]>
Date: 2016-03-04T23:15:48Z
[SPARK-13255] [SQL] Update vectorized reader to directly return
ColumnarBatch instead of InternalRows.
## What changes were proposed in this pull request?
Currently, the parquet reader returns rows one by one, which is bad for
performance. This patch updates the reader to directly return ColumnarBatches.
This is only enabled with whole-stage codegen, which is currently the only
operator able to consume ColumnarBatches (instead of rows). The current
implementation is a bit of a hack to get this to work, and we should do more
refactoring of these low-level interfaces to make this work better.
## How was this patch tested?
```
Results:
TPCDS:            Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)
---------------------------------------------------------------
q55 (before)            8897 / 9265         12.9           77.2
q55                     5486 / 5753         21.0           47.6
```
Author: Nong Li <[email protected]>
Closes #11435 from nongli/spark-13255.
commit f19228eed89cf8e22a07a7ef7f37a5f6f8a3d455
Author: Jason White <[email protected]>
Date: 2016-03-05T00:04:56Z
[SPARK-12073][STREAMING] backpressure rate controller consumes events
preferentially from lagging partitions
I'm pretty sure this is the reason we couldn't easily recover from an
unbalanced Kafka partition under heavy load when using backpressure.
`maxMessagesPerPartition` calculates an appropriate limit for the message
rate from all partitions, and then divides by the number of partitions to
determine how many messages to retrieve per partition. The problem with this
approach is that when one partition is behind by millions of records (due to
random Kafka issues) and the rate estimator calculates that only 100k total
messages can be retrieved, each partition (out of, say, 32) retrieves at most
100k/32 = 3125 messages.
This PR (still needing a test) determines a per-partition desired message
count by using the current lag for each partition to preferentially weight the
total message limit among the partitions. In this situation, if each partition
gets 1k messages, but 1 partition starts 1M behind, then the total number of
messages to retrieve is (32 * 1k + 1M) = 1032000 messages, of which the one
partition needs 1001000. So, it gets (1001000 / 1032000) = 97% of the 100k
messages, and the other 31 partitions share the remaining 3%.
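The weighting described above can be sketched as follows (an illustration of
the arithmetic only, not the actual streaming code):
```python
# Hypothetical sketch of lag-weighted message allocation across partitions.
def allocate_messages(lags, total_limit):
    """Split total_limit across partitions in proportion to each partition's lag."""
    total_lag = sum(lags.values())
    if total_lag == 0:
        return {p: 0 for p in lags}
    return {p: int(total_limit * lag / total_lag) for p, lag in lags.items()}

# 31 partitions that are 1k behind, one that is 1M + 1k behind, 100k budget:
lags = {p: 1000 for p in range(31)}
lags[31] = 1001000
allocation = allocate_messages(lags, 100000)
print(allocation[31])   # ~97k messages go to the lagging partition
```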
Assuming all 100k messages are retrieved and processed within the batch
window, the rate calculator will increase the number of messages to retrieve
in the next batch until it reaches a new stable point or the backlog is
finished processing.
We're going to try deploying this internally at Shopify to see if this
resolves our issue.
tdas koeninger holdenk
Author: Jason White <[email protected]>
Closes #10089 from JasonMWhite/rate_controller_offsets.
commit adce5ee721c6a844ff21dfcd8515859458fe611d
Author: gatorsmile <[email protected]>
Date: 2016-03-05T11:25:03Z
[SPARK-12720][SQL] SQL Generation Support for Cube, Rollup, and Grouping
Sets
#### What changes were proposed in this pull request?
This PR is for supporting SQL generation for cube, rollup and grouping sets.
For example, a query using rollup:
```SQL
SELECT count(*) as cnt, key % 5, grouping_id() FROM t1 GROUP BY key % 5
WITH ROLLUP
```
Original logical plan:
```
Aggregate [(key#17L % cast(5 as bigint))#47L,grouping__id#46],
[(count(1),mode=Complete,isDistinct=false) AS cnt#43L,
(key#17L % cast(5 as bigint))#47L AS _c1#45L,
grouping__id#46 AS _c2#44]
+- Expand [List(key#17L, value#18, (key#17L % cast(5 as bigint))#47L, 0),
List(key#17L, value#18, null, 1)],
[key#17L,value#18,(key#17L % cast(5 as
bigint))#47L,grouping__id#46]
+- Project [key#17L,
value#18,
(key#17L % cast(5 as bigint)) AS (key#17L % cast(5 as
bigint))#47L]
+- Subquery t1
+- Relation[key#17L,value#18] ParquetRelation
```
Converted SQL:
```SQL
SELECT count( 1) AS `cnt`,
(`t1`.`key` % CAST(5 AS BIGINT)),
grouping_id() AS `_c2`
FROM `default`.`t1`
GROUP BY (`t1`.`key` % CAST(5 AS BIGINT))
GROUPING SETS (((`t1`.`key` % CAST(5 AS BIGINT))), ())
```
#### How was this patch tested?
Added eight test cases in `LogicalPlanToSQLSuite`.
Author: gatorsmile <[email protected]>
Author: xiaoli <[email protected]>
Author: Xiao Li <[email protected]>
Closes #11283 from gatorsmile/groupingSetsToSQL.
commit 8290004d94760c22d6d3ca8dda3003ac8644422f
Author: Shixiong Zhu <[email protected]>
Date: 2016-03-05T23:26:27Z
[SPARK-13693][STREAMING][TESTS] Stop StreamingContext before deleting
checkpoint dir
## What changes were proposed in this pull request?
Stop the StreamingContext before deleting the checkpoint dir, to avoid the
race condition where deleting the checkpoint dir and writing a checkpoint
happen at the same time.
The flaky test log is here:
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/
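In test-cleanup terms the ordering looks roughly like this (a PySpark sketch
of the idea; the actual suite is Scala):
```python
# Hypothetical sketch: stop the streaming context before removing its
# checkpoint dir, so no checkpoint write can race with the deletion.
import shutil

def cleanup(ssc, checkpoint_dir):
    ssc.stop(stopSparkContext=False, stopGraceFully=True)  # no more checkpoint writes
    shutil.rmtree(checkpoint_dir, ignore_errors=True)      # now safe to delete
```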
## How was this patch tested?
unit tests
Author: Shixiong Zhu <[email protected]>
Closes #11531 from zsxwing/SPARK-13693.
commit 8ff88094daa4945e7d718baa7b20703fd8087ab0
Author: Cheng Lian <[email protected]>
Date: 2016-03-06T04:54:04Z
Revert "[SPARK-13616][SQL] Let SQLBuilder convert logical plan without a
project on top of it"
This reverts commit f87ce0504ea0697969ac3e67690c78697b76e94a.
According to the discussion in #11466, let's revert PR #11466 to be safe.
Author: Cheng Lian <[email protected]>
Closes #11539 from liancheng/revert-pr-11466.
commit ee913e6e2d58dfac20f3f06ff306081bd0e48066
Author: Shixiong Zhu <[email protected]>
Date: 2016-03-06T16:57:01Z
[SPARK-13697] [PYSPARK] Fix the missing module name of
TransformFunctionSerializer.loads
## What changes were proposed in this pull request?
Set the function's module name to `__main__` if it's missing in
`TransformFunctionSerializer.loads`.
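Conceptually the fix amounts to something like the following (a sketch, not
the exact `util.py` change):
```python
# Hypothetical sketch: after deserializing, fall back to "__main__" when the
# function's module name was lost during pickling.
def _restore_module(func):
    if getattr(func, "__module__", None) is None:
        func.__module__ = "__main__"
    return func
```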
## How was this patch tested?
Manually tested in the shell.
Before this patch:
```
>>> from pyspark.streaming import StreamingContext
>>> from pyspark.streaming.util import TransformFunction
>>> ssc = StreamingContext(sc, 1)
>>> func = TransformFunction(sc, lambda x: x, sc.serializer)
>>> func.rdd_wrapper(lambda x: x)
TransformFunction(<function <lambda> at 0x106ac8b18>)
>>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
>>> func2 = ssc._transformerSerializer.loads(bytes)
>>> print(func2.func.__module__)
None
>>> print(func2.rdd_wrap_func.__module__)
None
>>>
```
After this patch:
```
>>> from pyspark.streaming import StreamingContext
>>> from pyspark.streaming.util import TransformFunction
>>> ssc = StreamingContext(sc, 1)
>>> func = TransformFunction(sc, lambda x: x, sc.serializer)
>>> func.rdd_wrapper(lambda x: x)
TransformFunction(<function <lambda> at 0x108bf1b90>)
>>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
>>> func2 = ssc._transformerSerializer.loads(bytes)
>>> print(func2.func.__module__)
__main__
>>> print(func2.rdd_wrap_func.__module__)
__main__
>>>
```
Author: Shixiong Zhu <[email protected]>
Closes #11535 from zsxwing/loads-module.
commit bc7a3ec290904f2d8802583bb0557bca1b8b01ff
Author: Andrew Or <[email protected]>
Date: 2016-03-07T08:14:40Z
[SPARK-13685][SQL] Rename catalog.Catalog to ExternalCatalog
## What changes were proposed in this pull request?
Today we have `analysis.Catalog` and `catalog.Catalog`. In the future the
former will call the latter. When that happens, if both of them are still
called `Catalog` it will be very confusing. This patch renames the latter
`ExternalCatalog` because it is expected to talk to external systems.
## How was this patch tested?
Jenkins.
Author: Andrew Or <[email protected]>
Closes #11526 from andrewor14/rename-catalog.
commit 4b13896ebf7cecf9d50514a62165b612ee18124a
Author: rmishra <[email protected]>
Date: 2016-03-07T09:55:49Z
[SPARK-13705][DOCS] UpdateStateByKey Operation documentation incorrectly
refers to StatefulNetworkWordCount
## What changes were proposed in this pull request?
The reference to StatefulNetworkWordCount.scala in the updateStateByKey
documentation should be removed until there is an example for updateStateByKey.
## How was this patch tested?
Have tested the new documentation with jekyll build.
Author: rmishra <[email protected]>
Closes #11545 from rishitesh/SPARK-13705.
commit 03f57a6c2dd6ffd4038ca9cecbfc221deaf52393
Author: Yury Liavitski <[email protected]>
Date: 2016-03-07T10:54:33Z
Fixing the type of the sentiment happiness value
## What changes were proposed in this pull request?
Added a conversion to int for the 'happiness value' read from the file.
Otherwise, the multiplication on line 75 would multiply a string by a number,
yielding values like "-2-2" instead of -4.
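The pitfall is easy to reproduce; string repetition behaves the same way in
Python, shown here purely as an illustration:
```python
# Multiplying a string by an int repeats the string, so a raw "happiness"
# value must be converted to int before doing arithmetic with it.
happiness = "-2"            # value as read from the file
print(happiness * 2)        # "-2-2"  -- the buggy result
print(int(happiness) * 2)   # -4      -- the intended result
```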
## How was this patch tested?
Tested manually.
Author: Yury Liavitski <[email protected]>
Author: Yury Liavitski <[email protected]>
Closes #11540 from heliocentrist/fix-sentiment-value-type.
commit d7eac9d7951c19302ed41fe03eaa38394aeb9c1a
Author: Dilip Biswal <[email protected]>
Date: 2016-03-07T17:46:28Z
[SPARK-13651] Generator outputs are not resolved correctly resulting in run
time error
## What changes were proposed in this pull request?
```
Seq(("id1", "value1")).toDF("key", "value").registerTempTable("src")
sqlContext.sql("SELECT t1.* FROM src LATERAL VIEW explode(map('key1', 100,
'key2', 200)) t1 AS key, value")
```
This results in the following logical plan:
```
Project [key#2,value#3]
+- Generate
explode(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFMap(key1,100,key2,200)),
true, false, Some(genoutput), [key#2,value#3]
+- SubqueryAlias src
+- Project [_1#0 AS key#2,_2#1 AS value#3]
+- LocalRelation [_1#0,_2#1], [[id1,value1]]
```
The above query fails with the following runtime error:
```
java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String
  at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
  at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:221)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:42)
  at org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:98)
  at org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:96)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
  at scala.collection.Iterator$class.foreach(Iterator.scala:742)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
  <stack-trace omitted.....>
```
In this case the generator outputs are wrongly resolved from the child
(LocalRelation) due to
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L537-L548
## How was this patch tested?
Added unit tests in hive/SQLQuerySuite and AnalysisSuite
Author: Dilip Biswal <[email protected]>
Closes #11497 from dilipbiswal/spark-13651.
commit 489641117651d11806d2773b7ded7c163d0260e5
Author: Wenchen Fan <[email protected]>
Date: 2016-03-07T18:32:34Z
[SPARK-13694][SQL] QueryPlan.expressions should always include all
expressions
## What changes were proposed in this pull request?
It's weird that `expressions` doesn't always contain all the expressions of a
plan. This PR marks `QueryPlan.expressions` final to forbid subclasses from
overriding it to exclude some expressions. Currently only `Generate` overrides
it; we can use `producedAttributes` to fix the unresolved attribute problem
for it.
Note that this PR doesn't fix the problem in #11497
## How was this patch tested?
existing tests.
Author: Wenchen Fan <[email protected]>
Closes #11532 from cloud-fan/generate.
commit ef77003178eb5cdcb4fe519fc540917656c5d577
Author: Sameer Agarwal <[email protected]>
Date: 2016-03-07T20:04:59Z
[SPARK-13495][SQL] Add Null Filters in the query plan for Filters/Joins
based on their data constraints
## What changes were proposed in this pull request?
This PR adds an optimizer rule to eliminate reading (unnecessary) NULL
values, when they are not required for correctness, by inserting `isNotNull`
filters in the query plan. These filters are currently inserted beneath
existing `Filter` and `Join` operators and are inferred based on their data
constraints.
Note: While this optimization is applicable to all types of join, it
primarily benefits `Inner` and `LeftSemi` joins.
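As a rough PySpark-level illustration of the effect (a sketch; the PR itself
adds a Catalyst optimizer rule, and `left`/`right` are assumed existing
DataFrames with a `key` column):
```python
# For an inner equi-join, rows with a NULL join key can never match, so
# filtering them out below the join is semantically equivalent and avoids
# reading NULL values.
from pyspark.sql.functions import col

def join_with_null_filters(left, right, key="key"):
    return (left.filter(col(key).isNotNull())
                .join(right.filter(col(key).isNotNull()), key))
```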
## How was this patch tested?
1. Added a new `NullFilteringSuite` that tests for `IsNotNull` filters in
the query plan for joins and filters. Also, tests interaction with the
`CombineFilters` optimizer rules.
2. Test generated ExpressionTrees via `OrcFilterSuite`
3. Test filter source pushdown logic via `SimpleTextHadoopFsRelationSuite`
cc yhuai nongli
Author: Sameer Agarwal <[email protected]>
Closes #11372 from sameeragarwal/gen-isnotnull.
commit e72914f37de85519fc2aa131bac69d7582de98c8
Author: Dongjoon Hyun <[email protected]>
Date: 2016-03-07T20:06:46Z
[SPARK-12243][BUILD][PYTHON] PySpark tests are slow in Jenkins.
## What changes were proposed in this pull request?
In the Jenkins pull request builder, PySpark tests take around [962 seconds](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52530/console)
of end-to-end time to run, despite the fact that we run four Python test
suites in parallel. According to the log, the basic reason is that the
long-running tests start at the end due to the FIFO queue. We first try to
reduce the test time by starting some long-running tests first, using a simple
priority queue.
```
========================================================================
Running PySpark tests
========================================================================
...
Finished test(python3.4): pyspark.streaming.tests (213s)
Finished test(pypy): pyspark.sql.tests (92s)
Finished test(pypy): pyspark.streaming.tests (280s)
Tests passed in 962 seconds
```
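A minimal sketch of the scheduling idea (hypothetical; the durations are
illustrative and the real `run-tests.py` details differ):
```python
# Hypothetical sketch: launch the longest-running suites first by popping from
# a max-heap keyed on (estimated) historical duration.
import heapq

estimated_seconds = {
    "pyspark.streaming.tests": 280,
    "pyspark.tests": 150,
    "pyspark.sql.tests": 92,
}

heap = [(-duration, name) for name, duration in estimated_seconds.items()]
heapq.heapify(heap)
while heap:
    _, suite = heapq.heappop(heap)
    print("launching", suite)   # longest suites get launched first
```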
## How was this patch tested?
Manual check.
Check 'Running PySpark tests' part of the Jenkins log.
Author: Dongjoon Hyun <[email protected]>
Closes #11551 from dongjoon-hyun/SPARK-12243.
----