GitHub user AnthonyTruchet opened a pull request:
https://github.com/apache/spark/pull/16042
Fix dev scripts and add new, Criteo-specific ones (WIP)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/AnthonyTruchet/spark dev-tools
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16042.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16042
----
commit 18661a2bb527adbd01e98158696a16f6d8162411
Author: Tommy YU <[email protected]>
Date: 2016-02-12T02:38:49Z
[SPARK-13153][PYSPARK] ML persistence fails when handling a parameter with
no default value
Fix this defect by checking whether a default value exists or not.
yanboliang, please help to review.
Author: Tommy YU <[email protected]>
Closes #11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue.
(cherry picked from commit d3e2e202994e063856c192e9fdd0541777b88e0e)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 93a55f3df3c9527ecf4143cb40ac7212bc3a975a
Author: markpavey <[email protected]>
Date: 2016-02-13T08:39:43Z
[SPARK-13142][WEB UI] Problem accessing Web UI /logPage/ on Microsoft
Windows
Due to being on a Windows platform I have been unable to run the tests as
described in the "Contributing to Spark" instructions. As the change is only to
two lines of code in the Web UI, which I have manually built and tested, I am
submitting this pull request anyway. I hope this is OK.
Is it worth considering also including this fix in any future 1.5.x
releases (if any)?
I confirm this is my own original work and license it to the Spark project
under its open source license.
Author: markpavey <[email protected]>
Closes #11135 from markpavey/JIRA_SPARK-13142_WindowsWebUILogFix.
(cherry picked from commit 374c4b2869fc50570a68819cf0ece9b43ddeb34b)
Signed-off-by: Sean Owen <[email protected]>
commit 107290c94312524bfc4560ebe0de268be4ca56af
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-02-13T23:56:20Z
[SPARK-12363][MLLIB] Remove setRuns and fix PowerIterationClustering failed
test
JIRA: https://issues.apache.org/jira/browse/SPARK-12363
This issue was pointed out by yanboliang. When `setRuns` is removed from
PowerIterationClustering, one of the tests fails. I found that some
`dstAttr`s of the normalized graph are not the correct values but 0.0. Setting
`TripletFields.All` in `mapTriplets` makes it work.
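A minimal sketch of the call shape (spark-shell `sc` assumed; the graph and
the normalization shown are hypothetical), the point being the explicit
`TripletFields.All`:
```Scala
import org.apache.spark.graphx._

// A tiny hypothetical graph; in PowerIterationClustering this is the
// normalized affinity graph.
val edges = sc.parallelize(Seq(Edge(1L, 2L, 0.5), Edge(2L, 3L, 0.8)))
val graph = Graph.fromEdges(edges, defaultValue = 1.0)

// Ask GraphX to ship all triplet fields so srcAttr/dstAttr hold real
// values instead of stale defaults.
val normalized = graph.mapTriplets(
  e => e.attr / math.max(e.srcAttr, 1e-12),  // hypothetical normalization
  TripletFields.All)
```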
Author: Liang-Chi Hsieh <[email protected]>
Author: Xiangrui Meng <[email protected]>
Closes #10539 from viirya/fix-poweriter.
(cherry picked from commit e3441e3f68923224d5b576e6112917cf1fe1f89a)
Signed-off-by: Xiangrui Meng <[email protected]>
commit ec40c5a59fe45e49496db6e0082ddc65c937a857
Author: Amit Dev <[email protected]>
Date: 2016-02-14T11:41:27Z
[SPARK-13300][DOCUMENTATION] Added pygments.rb dependency
It looks like the pygments.rb gem is also required for the jekyll build to
work. At least on Ubuntu/RHEL I could not build without this dependency, so I
added it to the steps.
Author: Amit Dev <[email protected]>
Closes #11180 from amitdev/master.
(cherry picked from commit 331293c30242dc43e54a25171ca51a1c9330ae44)
Signed-off-by: Sean Owen <[email protected]>
commit 71f53edc0e39bc907755153b9603be8c6fcc1d93
Author: JeremyNixon <[email protected]>
Date: 2016-02-15T09:25:13Z
[SPARK-13312][MLLIB] Update java train-validation-split example in ml-guide
Response to JIRA https://issues.apache.org/jira/browse/SPARK-13312.
This contribution is my original work and I license the work to this
project.
Author: JeremyNixon <[email protected]>
Closes #11199 from JeremyNixon/update_train_val_split_example.
(cherry picked from commit adb548365012552e991d51740bfd3c25abf0adec)
Signed-off-by: Sean Owen <[email protected]>
commit d95089190d714e3e95579ada84ac42d463f824b5
Author: Miles Yucht <[email protected]>
Date: 2016-02-16T13:01:21Z
Correct SparseVector.parse documentation
There's a small error in the SparseVector.parse docstring: it says the
method returns a DenseVector when it actually returns a SparseVector.
Author: Miles Yucht <[email protected]>
Closes #11213 from mgyucht/fix-sparsevector-docs.
(cherry picked from commit 827ed1c06785692d14857bd41f1fd94a0853874a)
Signed-off-by: Sean Owen <[email protected]>
commit 98354cae984e3719a49050e7a6aa75dae78b12bb
Author: Sital Kedia <[email protected]>
Date: 2016-02-17T06:27:34Z
[SPARK-13279] Remove O(n^2) operation from scheduler.
This commit removes an unnecessary duplicate check in addPendingTask that
meant scheduling a task set took time proportional to (# tasks)^2.
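The general pattern, as a sketch with hypothetical names rather than the
actual TaskSetManager code:
```Scala
import scala.collection.mutable.ArrayBuffer

// Quadratic: every insert scans the whole pending list for duplicates,
// so inserting n tasks costs O(n^2) overall.
def addPendingSlow(pending: ArrayBuffer[Int], taskIndex: Int): Unit = {
  if (!pending.contains(taskIndex)) {  // O(n) scan per insert
    pending += taskIndex
  }
}

// Linear: append unconditionally and let the dequeue side skip entries
// that are stale or duplicated.
def addPendingFast(pending: ArrayBuffer[Int], taskIndex: Int): Unit = {
  pending += taskIndex  // O(1) amortized
}
```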
Author: Sital Kedia <[email protected]>
Closes #11175 from sitalkedia/fix_stuck_driver.
(cherry picked from commit 1e1e31e03df14f2e7a9654e640fb2796cf059fe0)
Signed-off-by: Kay Ousterhout <[email protected]>
commit 66106a660149607348b8e51994eb2ce29d67abc0
Author: Christopher C. Aycock <[email protected]>
Date: 2016-02-17T19:24:18Z
[SPARK-13350][DOCS] Config doc updated to state that PYSPARK_PYTHON's
default is "python2.7"
Author: Christopher C. Aycock <[email protected]>
Closes #11239 from chrisaycock/master.
(cherry picked from commit a7c74d7563926573c01baf613708a0f105a03e57)
Signed-off-by: Josh Rosen <[email protected]>
commit 16f35c4c6e7e56bdb1402eab0877da6e8497cb3f
Author: Sean Owen <[email protected]>
Date: 2016-02-18T20:14:30Z
[SPARK-13371][CORE][STRING] TaskSetManager.dequeueSpeculativeTask compares
Option and String directly.
## What changes were proposed in this pull request?
Fix some comparisons between unequal types that cause IJ warnings and, in at
least one case, a likely bug (TaskSetManager).
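As a generic sketch of the bug pattern (not the actual TaskSetManager code):
```Scala
val execId: Option[String] = Some("exec-1")

// Always false: an Option[String] is never equal to a String,
// so the branch silently never fires.
if (execId == "exec-1") println("never reached")

// Correct: compare against the wrapped value.
if (execId.contains("exec-1")) println("matched")
```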
## How was this patch tested?
Running Jenkins tests
Author: Sean Owen <[email protected]>
Closes #11253 from srowen/SPARK-13371.
(cherry picked from commit 78562535feb6e214520b29e0bbdd4b1302f01e93)
Signed-off-by: Andrew Or <[email protected]>
commit 699644c692472e5b78baa56a1a6c44d8d174e70e
Author: Michael Armbrust <[email protected]>
Date: 2016-02-22T23:27:29Z
[SPARK-12546][SQL] Change default number of open parquet files
A common problem that users encounter with Spark 1.6.0 is that writing to a
partitioned parquet table OOMs. The root cause is that parquet allocates a
significant amount of memory that is not accounted for by our own mechanisms.
As a workaround, we can ensure that only a single file is open per task unless
the user explicitly asks for more.
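On the user side, a related mitigation (a sketch, not the patch; `df`, the
column name, and the path are hypothetical) is to cluster rows by the
partition column before writing, so each task holds few open writers:
```Scala
// Each task interleaves rows from many partition values, so it must keep
// one open parquet writer (plus its buffers) per value it encounters:
df.write.partitionBy("date").parquet("/tmp/events")

// Clustering by the partition column first lets each task write one
// partition value at a time, keeping far fewer files open concurrently:
df.repartition(df("date")).write.partitionBy("date").parquet("/tmp/events")
```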
Author: Michael Armbrust <[email protected]>
Closes #11308 from marmbrus/parquetWriteOOM.
(cherry picked from commit 173aa949c309ff7a7a03e9d762b9108542219a95)
Signed-off-by: Michael Armbrust <[email protected]>
commit 85e6a2205d4549c81edbc2238fd15659120cee78
Author: Shixiong Zhu <[email protected]>
Date: 2016-02-23T01:42:30Z
[SPARK-13298][CORE][UI] Escape "label" to avoid DAG being broken by some
special character
## What changes were proposed in this pull request?
When there are special characters (e.g., `"`, `\`) in `label`, the DAG
visualization breaks. This patch escapes `label` so that special characters
can no longer break it.
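The escaping idea, sketched with a hypothetical helper (the patch's actual
implementation may differ):
```Scala
// Escape characters that would otherwise terminate or corrupt a quoted
// DOT-format label (hypothetical helper, for illustration only).
def escapeLabel(label: String): String =
  label.replace("\\", "\\\\").replace("\"", "\\\"")

println(escapeLabel("""map at "Foo.scala" \ stage 10"""))
// map at \"Foo.scala\" \\ stage 10
```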
## How was this patch tested?
Jenkins tests
Author: Shixiong Zhu <[email protected]>
Closes #11309 from zsxwing/SPARK-13298.
(cherry picked from commit a11b3995190cb4a983adcc8667f7b316cce18d24)
Signed-off-by: Andrew Or <[email protected]>
commit f7898f9e2df131fa78200f6034508e74a78c2a44
Author: Daoyuan Wang <[email protected]>
Date: 2016-02-23T02:13:32Z
[SPARK-11624][SPARK-11972][SQL] fix commands that need hive to exec
In SparkSQLCLI, we have created a `CliSessionState`, but then we call
`SparkSQLEnv.init()`, which starts another `SessionState`. This leads to an
exception because `processCmd` needs to get the `CliSessionState` instance by
calling `SessionState.get()`, but the return value is an instance of
`SessionState`. See the exception below.
spark-sql> !echo "test";
Exception in thread "main" java.lang.ClassCastException:
org.apache.hadoop.hive.ql.session.SessionState cannot be cast to
org.apache.hadoop.hive.cli.CliSessionState
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:112)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:301)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:242)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:691)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
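A sketch of one way to avoid the clash, assuming the goal is to reuse the
session the CLI already started (hypothetical wiring, not the literal patch):
```Scala
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.session.SessionState

// Reuse whatever SessionState (e.g. the CLI's CliSessionState) is already
// attached to this thread; only start a fresh one when none exists.
def getOrStartSessionState(hiveConf: HiveConf): SessionState =
  Option(SessionState.get()).getOrElse {
    val state = new SessionState(hiveConf)
    SessionState.start(state)
    state
  }
```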
Author: Daoyuan Wang <[email protected]>
Closes #9589 from adrian-wang/clicommand.
(cherry picked from commit 5d80fac58f837933b5359a8057676f45539e53af)
Signed-off-by: Michael Armbrust <[email protected]>
Conflicts:
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala
commit 40d11d0492bcdf4aa442e527e69804e53b4135e9
Author: Michael Armbrust <[email protected]>
Date: 2016-02-23T02:25:48Z
Update branch-1.6 for 1.6.1 release
commit 152252f15b7ee2a9b0d53212474e344acd8a55a9
Author: Patrick Wendell <[email protected]>
Date: 2016-02-23T02:30:24Z
Preparing Spark release v1.6.1-rc1
commit 290279808e5e9e91d7c349ccec12ff12b99a4556
Author: Patrick Wendell <[email protected]>
Date: 2016-02-23T02:30:30Z
Preparing development version 1.6.1-SNAPSHOT
commit d31854da5155550f4e9c5e717c92dfec87d0ff6a
Author: Earthson Lu <[email protected]>
Date: 2016-02-23T07:40:36Z
[SPARK-12746][ML] ArrayType(_, true) should also accept ArrayType(_, false)
fix for branch-1.6
https://issues.apache.org/jira/browse/SPARK-13359
Author: Earthson Lu <[email protected]>
Closes #11237 from Earthson/SPARK-13359.
commit 0784e02fd438e5fa2e6639d6bba114fa647dad23
Author: Xiangrui Meng <[email protected]>
Date: 2016-02-23T07:54:21Z
[SPARK-13355][MLLIB] replace GraphImpl.fromExistingRDDs by Graph.apply
`GraphImpl.fromExistingRDDs` expects a preprocessed vertex RDD as input. We
call it in LDA without validating this requirement, so it might introduce
errors. Replacing it with `Graph.apply` would be safer and more proper because
it is a public API. The tests still pass, so maybe it is safe to use
`fromExistingRDDs` here (though it doesn't seem so based on the implementation)
or the test cases are special. jkbradley ankurdave
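The shape of the change, sketched in spark-shell style with hypothetical RDDs:
```Scala
import org.apache.spark.graphx._

val vertices = sc.parallelize(Seq((1L, 1.0), (2L, 1.0)))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 0.5)))

// Before: GraphImpl.fromExistingRDDs(...) silently assumes the vertex RDD
// has already been preprocessed (partitioned, routing tables built, ...).
// After: the public constructor does the required setup itself.
val graph = Graph(vertices, edges)
```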
Author: Xiangrui Meng <[email protected]>
Closes #11226 from mengxr/SPARK-13355.
(cherry picked from commit 764ca18037b6b1884fbc4be9a011714a81495020)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 573a2c97e9a9b8feae22f8af173fb158d59e5332
Author: Franklyn D'souza <[email protected]>
Date: 2016-02-23T23:34:04Z
[SPARK-13410][SQL] Support unionAll for DataFrames with UDT columns.
## What changes were proposed in this pull request?
This PR adds equality operators to UDT classes so that they can be
correctly tested for dataType equality during union operations.
This was previously causing an `AnalysisException: unresolved operator
'Union;'` when trying to unionAll two DataFrames with UDT columns, as below.
```python
from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
from pyspark.sql import types

schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])
a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)
c = a.unionAll(b)
```
## How was this patch tested?
Tested using two unit tests in sql/test.py and the DataFrameSuite.
Additional information here:
https://issues.apache.org/jira/browse/SPARK-13410
rxin
Author: Franklyn D'souza <[email protected]>
Closes #11333 from damnMeddlingKid/udt-union-patch.
commit 06f4fce29227f9763d9f9abff6e7459542dce261
Author: Shixiong Zhu <[email protected]>
Date: 2016-02-24T13:35:36Z
[SPARK-13390][SQL][BRANCH-1.6] Fix the issue that Iterator.map().toSeq is
not Serializable
## What changes were proposed in this pull request?
`scala.collection.Iterator`'s methods (e.g., map, filter) will return an
`AbstractIterator` which is not Serializable. E.g.,
```Scala
scala> val iter = Array(1, 2, 3).iterator.map(_ + 1)
iter: Iterator[Int] = non-empty iterator
scala> println(iter.isInstanceOf[Serializable])
false
```
If we call something like `Iterator.map(...).toSeq`, it will create a
`Stream` that contains a non-serializable `AbstractIterator` field, making the
`Stream` non-serializable.
This PR uses `toArray` instead of `toSeq` to fix this issue in `def
createDataFrame(data: java.util.List[_], beanClass: Class[_]): DataFrame`.
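Continuing the example above, a sketch of why `toArray` avoids the problem:
```Scala
val iter = Array(1, 2, 3).iterator.map(_ + 1)

// toSeq builds a lazy Stream whose unevaluated tail still references the
// non-serializable iterator, so serializing the Stream fails.
val lazySeq = iter.toSeq

// toArray materializes everything eagerly; the result no longer
// references the iterator at all.
val eager = Array(1, 2, 3).iterator.map(_ + 1).toArray
```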
## How was this patch tested?
Jenkins tests.
Author: Shixiong Zhu <[email protected]>
Closes #11334 from zsxwing/SPARK-13390.
commit fe71cabd46e4d384e8790dbfdda892df24b48e92
Author: Yin Huai <[email protected]>
Date: 2016-02-24T21:34:53Z
[SPARK-13475][TESTS][SQL] HiveCompatibilitySuite should still run in PR
builder even if a PR only changes sql/core
## What changes were proposed in this pull request?
`HiveCompatibilitySuite` should still run in the PR builder even if a PR only
changes sql/core. So, I am going to remove the `ExtendedHiveTest` annotation
from `HiveCompatibilitySuite`.
https://issues.apache.org/jira/browse/SPARK-13475
Author: Yin Huai <[email protected]>
Closes #11351 from yhuai/SPARK-13475.
(cherry picked from commit bc353805bd98243872d520e05fa6659da57170bf)
Signed-off-by: Yin Huai <[email protected]>
commit 897599601a5ca0f95fd70f16e89df58b9b17705c
Author: huangzhaowei <[email protected]>
Date: 2016-02-25T07:52:17Z
[SPARK-13482][MINOR][CONFIGURATION] Make the configuration named in
TransportConf consistent.
`spark.storage.memoryMapThreshold` accepts two kinds of values: one is
2*1024*1024 as an integer, and the other is '2m' as a string.
"2m" is recommended in the documentation, but it goes wrong if the code path
goes through `TransportConf#memoryMapBytes`.
[Jira](https://issues.apache.org/jira/browse/SPARK-13482)
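For the consistent behavior, Spark's `JavaUtils.byteStringAsBytes` parses
both spellings; a sketch (the patch's exact wiring is assumed):
```Scala
import org.apache.spark.network.util.JavaUtils

// Both spellings of the threshold resolve to the same byte count.
val fromSuffix = JavaUtils.byteStringAsBytes("2m")       // 2097152
val fromDigits = JavaUtils.byteStringAsBytes("2097152")  // 2097152
```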
Author: huangzhaowei <[email protected]>
Closes #11360 from SaintBacchus/SPARK-13482.
(cherry picked from commit 264533b553be806b6c45457201952e83c028ec78)
Signed-off-by: Reynold Xin <[email protected]>
commit 3cc938ac8124b8445f171baa365fa44a47962cc9
Author: Cheng Lian <[email protected]>
Date: 2016-02-25T12:43:03Z
[SPARK-13473][SQL] Don't push predicate through project with
nondeterministic field(s)
## What changes were proposed in this pull request?
Predicates shouldn't be pushed through project with nondeterministic
field(s).
See https://github.com/graphframes/graphframes/pull/23 and SPARK-13473 for
more details.
This PR targets master, branch-1.6, and branch-1.5.
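To make the rule concrete, a small illustration (1.6-era API, `sqlContext`
assumed): the filter refers to a nondeterministic projected field, so pushing
it below the project would re-evaluate `rand()` and change the result:
```Scala
import org.apache.spark.sql.functions._

val df = sqlContext.range(0, 100).select(col("id"), rand().as("r"))
// `r` is nondeterministic: rewriting this as a filter on a freshly
// computed rand() below the project would select different rows.
val filtered = df.filter(col("r") > 0.5)
```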
## How was this patch tested?
A test case is added in `FilterPushdownSuite`. It constructs a query plan
where a filter sits over a project with a nondeterministic field. The
optimized query plan shouldn't change in this case.
Author: Cheng Lian <[email protected]>
Closes #11348 from
liancheng/spark-13473-no-ppd-through-nondeterministic-project-field.
(cherry picked from commit 3fa6491be66dad690ca5329dd32e7c82037ae8c1)
Signed-off-by: Wenchen Fan <[email protected]>
commit cb869a143d338985c3d99ef388dd78b1e3d90a73
Author: Oliver Pierson <[email protected]>
Date: 2016-02-25T13:24:46Z
[SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits on large
DataFrames
Change line 113 of QuantileDiscretizer.scala to
`val requiredSamples = math.max(numBins * numBins, 10000.0)`
so that `requiredSamples` is a `Double`. This fixes the division on line 114,
which currently results in zero whenever `requiredSamples < dataset.count`.
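The pitfall in isolation, with made-up numbers:
```Scala
val numBins = 10
val count = 20000000L  // a made-up dataset.count

val intSamples = math.max(numBins * numBins, 10000)    // Int
val badFraction = intSamples / count                   // 0: integer division

val dblSamples = math.max(numBins * numBins, 10000.0)  // Double, as in the fix
val goodFraction = dblSamples / count                  // 5.0E-4
```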
Manual tests. I was having problems using QuantileDiscretizer with a
dataset, and after making this change QuantileDiscretizer behaves as expected.
Author: Oliver Pierson <[email protected]>
Author: Oliver Pierson <[email protected]>
Closes #11319 from oliverpierson/SPARK-13444.
(cherry picked from commit 6f8e835c68dff6fcf97326dc617132a41ff9d043)
Signed-off-by: Sean Owen <[email protected]>
commit 1f031635ffb4df472ad0d9c00bc82ebb601ebbb5
Author: Terence Yim <[email protected]>
Date: 2016-02-25T13:29:30Z
[SPARK-13441][YARN] Fix NPE in yarn Client.createConfArchive method
## What changes were proposed in this pull request?
Instead of using the result of File.listFiles() directly, which may be null
and cause an NPE, check for null first. If it is null, log a warning instead.
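The null-check pattern, sketched generically (hypothetical helper;
`File.listFiles()` returns null on I/O errors or non-directories):
```Scala
import java.io.File

def safeListFiles(dir: File): Seq[File] = {
  val files = dir.listFiles()  // null on I/O error or non-directory
  if (files == null) {
    // Matches the patch's spirit: warn and continue rather than NPE.
    System.err.println(s"Failed to list files under ${dir.getAbsolutePath}")
    Seq.empty
  } else {
    files.toSeq
  }
}
```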
## How was this patch tested?
Ran the ./dev/run-tests locally
Tested manually on a cluster
Author: Terence Yim <[email protected]>
Closes #11337 from chtyim/fixes/SPARK-13441-null-check.
(cherry picked from commit fae88af18445c5a88212b4644e121de4b30ce027)
Signed-off-by: Sean Owen <[email protected]>
commit e3802a7522a83b91c84d0ee6f721a768a485774b
Author: Michael Gummelt <[email protected]>
Date: 2016-02-25T13:32:09Z
[SPARK-13439][MESOS] Document that spark.mesos.uris is comma-separated
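For example (illustrative URIs only), in spark-defaults.conf:
```
spark.mesos.uris  http://example.com/app.tgz,http://example.com/config.json
```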
Author: Michael Gummelt <[email protected]>
Closes #11311 from mgummelt/document_csv.
(cherry picked from commit c98a93ded36db5da2f3ebd519aa391de90927688)
Signed-off-by: Sean Owen <[email protected]>
commit 5f7440b2529a0f6edfed5038756c004acecbce39
Author: huangzhaowei <[email protected]>
Date: 2016-02-25T15:14:19Z
[SPARK-12316] Wait a minute to avoid endless cyclic calls.
When the application ends, the AM will clean the staging dir.
But if the driver then triggers a delegation token update, it can't find the
right token file and ends up calling the method 'updateCredentialsIfRequired'
in an endless cycle.
This leads to a driver StackOverflowError.
https://issues.apache.org/jira/browse/SPARK-12316
Author: huangzhaowei <[email protected]>
Closes #10475 from SaintBacchus/SPARK-12316.
(cherry picked from commit 5fcf4c2bfce4b7e3543815c8e49ffdec8072c9a2)
Signed-off-by: Tom Graves <[email protected]>
commit d59a08f7c1c455d86e7ee3d6522a3e9c55f9ee02
Author: Xiangrui Meng <[email protected]>
Date: 2016-02-25T20:28:03Z
Revert "[SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits on
large DataFrames"
This reverts commit cb869a143d338985c3d99ef388dd78b1e3d90a73.
commit abe8f991a32bef92fbbcd2911836bb7d8e61ca97
Author: Yu ISHIKAWA <[email protected]>
Date: 2016-02-25T21:21:33Z
[SPARK-12874][ML] ML StringIndexer does not protect itself from column name
duplication
## What changes were proposed in this pull request?
ML StringIndexer does not protect itself from column name duplication.
We should still improve the way the schemas of `StringIndexer` and
`StringIndexerModel` are validated; however, that is better addressed in a
separate issue.
## How was this patch tested?
unit test
Author: Yu ISHIKAWA <[email protected]>
Closes #11370 from yu-iskw/SPARK-12874.
(cherry picked from commit 14e2700de29d06460179a94cc9816bcd37344cf7)
Signed-off-by: Xiangrui Meng <[email protected]>
commit a57f87ee4aafdb97c15f4076e20034ea34c7e2e5
Author: Yin Huai <[email protected]>
Date: 2016-02-26T20:34:03Z
[SPARK-13454][SQL] Allow users to drop a table with a name starting with an
underscore.
## What changes were proposed in this pull request?
This change adds a workaround to allow users to drop a table with a name
starting with an underscore. Without this patch, we can create such a table,
but we cannot drop it. The reason is that Hive's parser unquotes a quoted
identifier (see
https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/parse/HiveLexer.g#L453).
So, when we issue a drop table command to Hive, a table name starting with an
underscore is actually not quoted. Then Hive complains about it because it
does not support a table name starting with an underscore without using
backticks (underscores are allowed as long as the first char is not one).
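A sketch of the user-visible behavior (1.6-era HiveContext assumed; the table
name is hypothetical):
```Scala
// Creating a table whose name starts with an underscore already worked:
sqlContext.sql("CREATE TABLE `_spark13454_test` (key INT)")

// Before this patch the name reached Hive unquoted and DROP failed;
// with the workaround the backticks are preserved and it succeeds:
sqlContext.sql("DROP TABLE `_spark13454_test`")
```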
## How was this patch tested?
Add a test to make sure we can drop a table with a name starting with an
underscore.
https://issues.apache.org/jira/browse/SPARK-13454
Author: Yin Huai <[email protected]>
Closes #11349 from yhuai/fixDropTable.
commit 8a43c3bfbcd9d6e3876e09363dba604dc7e63dc3
Author: Josh Rosen <[email protected]>
Date: 2016-02-27T02:40:00Z
[SPARK-13474][PROJECT INFRA] Update packaging scripts to push artifacts to
home.apache.org
Due to the people.apache.org -> home.apache.org migration, we need to
update our packaging scripts to publish artifacts to the new server. Because
the new server only supports sftp instead of ssh, we need to update the scripts
to use lftp instead of ssh + rsync.
Author: Josh Rosen <[email protected]>
Closes #11350 from JoshRosen/update-release-scripts-for-apache-home.
(cherry picked from commit f77dc4e1e202942aa8393fb5d8f492863973fe17)
Signed-off-by: Josh Rosen <[email protected]>
----