GitHub user debasish83 reopened a pull request:
https://github.com/apache/spark/pull/2705
[MLLIB] [WIP] SPARK-2426: Quadratic Minimization for MLlib ALS
ALS is a generic algorithm for matrix factorization that is equally
applicable in both feature space and similarity space. The current ALS supports L2
regularization and a positivity constraint. This PR introduces userConstraint and
productConstraint to ALS and lets the user select different constraints for the
user and product solves. The supported constraints are the following (a hedged
sketch of the corresponding projection/proximal operators follows the list):
1. SMOOTH : default ALS with L2 regularization
2. POSITIVE: ALS with positive factors
3. BOUNDS: ALS with factors bounded between a lower and an upper bound (default
between 0 and 1)
4. SPARSE: ALS with L1 regularization
5. EQUALITY: ALS with an equality constraint (default: the factors are positive
and sum to 1)
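As a rough illustration of what each constraint reduces to inside a proximal solver, here is a hedged Scala sketch. The names (ConstraintProx, proxL1, projectBox, projectSimplex) are illustrative only and are not the API of this PR.
```
// Hedged sketch, not the PR's code: the projection/proximal operators that the
// indicator functions (or penalties) of the constraints above correspond to.
object ConstraintProx {

  // SPARSE: soft-thresholding, the proximal operator of lambda * ||x||_1.
  // (The constrained form ||x||_1 <= c would instead use a projection onto the L1 ball.)
  def proxL1(x: Array[Double], lambda: Double): Array[Double] =
    x.map(v => math.signum(v) * math.max(math.abs(v) - lambda, 0.0))

  // POSITIVE / BOUNDS: projection onto the box [lower, upper]
  // (POSITIVE is the special case lower = 0, upper = Double.PositiveInfinity).
  def projectBox(x: Array[Double], lower: Double, upper: Double): Array[Double] =
    x.map(v => math.min(math.max(v, lower), upper))

  // EQUALITY (default form): Euclidean projection onto the probability simplex
  // { x : x_i >= 0, sum_i x_i = 1 }, following Duchi et al. (ICML 2008).
  def projectSimplex(x: Array[Double]): Array[Double] = {
    val n = x.length
    val sorted = x.sorted(Ordering[Double].reverse)
    val cumSums = sorted.scanLeft(0.0)(_ + _).tail // cumSums(j) = sum of the top j+1 entries
    val rho = (0 until n).reverse
      .find(j => sorted(j) - (cumSums(j) - 1.0) / (j + 1) > 0)
      .getOrElse(0)
    val theta = (cumSums(rho) - 1.0) / (rho + 1)
    x.map(v => math.max(v - theta, 0.0))
  }
}
```
These operators play the role of the z-update in the splitting scheme described next.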
First, let's focus on the problem formulation. Both the implicit and the explicit
feedback ALS formulations can be written as a quadratic minimization problem,
with a quadratic objective of the form x^T H x + c^T x. Each constraint then
yields a problem of the form
minimize x^T H x + c^T x
s.t. ||x||_1 <= c (the SPARSE constraint, for example)
We rewrite the objective as f(x) = x^T H x + c^T x and the constraint as an
indicator function g(x).
Minimization of f(x) + g(x) can then be carried out using various forward-backward
splitting algorithms. We choose ADMM for the first version, based on our
experiments comparing against the ECOS interior-point solver and MOSEK; I will
document those comparisons.
Details of the algorithm are in the following reference:
http://web.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf
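As a quick reference, the scaled-form ADMM iterations from that paper, specialized to this quadratic f and an indicator g, look as follows (a sketch only; the factor of 2 in the x-update assumes the objective is written exactly as x^T H x + c^T x with symmetric H, which may differ from the code):
$$
\begin{aligned}
x^{k+1} &= \arg\min_x \; x^T H x + c^T x + \tfrac{\rho}{2}\,\lVert x - z^k + u^k\rVert_2^2
        \;\Longleftrightarrow\; (2H + \rho I)\,x^{k+1} = \rho\,(z^k - u^k) - c,\\
z^{k+1} &= \operatorname{prox}_{g/\rho}\!\bigl(x^{k+1} + u^k\bigr)
        \quad\text{(a projection or soft-thresholding, depending on the constraint)},\\
u^{k+1} &= u^k + x^{k+1} - z^{k+1}.
\end{aligned}
$$
The x-update is a single linear solve with a fixed matrix, so its factorization can be reused across iterations; rho (and any over-relaxation parameter alpha) governs how quickly the iterates converge, which is what the issues below are about.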
Right now the default ADMM parameters alpha and rho are set to 1.0, but the
following issues show up in experiments on the MovieLens dataset:
1. ~3x more iterations compared to NNLS
2. For SPARSE we hit the maximum iteration count (400) around 10% of the time
3. For EQUALITY, rho is set to 50, based on a reference from Professor Boyd
on optimal control
We chose ADMM as the baseline solver, but this PR will explore the
following solver enhancements to decrease the iteration count:
1. Accelerated ADMM using Nesterov acceleration
2. FISTA-style forward-backward splitting
In terms of use-cases, the PR focuses on the following:
1. Sparse matrix factorization to improve recommendations
On MovieLens data the RMSE with SPARSE (1.04, versus the Mahout/Spark baseline
of 0.9) is currently worse, but we have not yet looked at MAP, prec@k, and
ndcg@k measures. We are using the PR from @coderxiang to look into the IR measures.
Example run:
MASTER=spark://localhost:7077 ./bin/run-example mllib.MovieLensALS --rank
20 --numIterations 10 --userConstraint SMOOTH --lambdaUser 0.065
--productConstraint SPARSE --lambdaProduct 0.1 --kryo
hdfs://localhost:8020/sandbox/movielens/
2. Topic modeling using LSA
References:
2007 Sparse coding:
papers.nips.cc/paper/2979-efficient-sparse-coding-algorithms.pdf
2011 Sparse Latent Semantic Analysis (LSA) (parts of it are implemented in
GraphLab):
https://www.cs.cmu.edu/~xichen/images/SLSA-sdm11-final.pdf
2012 Sparse Coding + MR/MPI Microsoft:
http://web.stanford.edu/group/mmds/slides2012/s-hli.pdf
We are implementing the 20 Newsgroups (20NG) flow to validate the sparse coding
result improvement over LDA-based topic modeling.
3. Topic modeling using PLSA
Reference:
Tutorial on Probabilistic Topic Modeling: Additive Regularization for
Stochastic Matrix Factorization
The EQUALITY formulation with a quadratic loss is an approximation to the
KL-divergence loss used in PLSA (a sketch of this approximation follows). We are
interested to see whether it improves the results further compared to sparse coding.
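As a hedged sketch of why the quadratic approximation is reasonable (a standard second-order argument, not taken from the PR or the tutorial): for distributions p = q + delta on the simplex, so that the deltas sum to zero,
$$
\mathrm{KL}(p \,\|\, q) \;=\; \sum_i p_i \log\frac{p_i}{q_i}
\;\approx\; \sum_i \delta_i + \frac{1}{2}\sum_i \frac{\delta_i^2}{q_i}
\;=\; \frac{1}{2}\sum_i \frac{(p_i - q_i)^2}{q_i},
$$
i.e. a weighted quadratic, which is the kind of loss the EQUALITY-constrained quadratic solve can fit directly.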
Next steps:
1. Improve the convergence rate of forward-backward splitting on quadratic
problems
2. Move the test-cases to QuadraticMinimizerSuite.scala
3. Generate results for each of the use-cases and add tests related to each
use-case
Related future PRs:
1. Scale the factorization rank and remove the need to construct the H matrix
2. Replace the quadratic loss x^T H x + c^T x with a general convex loss
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/debasish83/spark qp-als
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2705.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2705
----
commit 9c439d33160ef3b31173381735dfa8cfb7d552ba
Author: Xiangrui Meng <[email protected]>
Date: 2014-10-09T05:35:14Z
[SPARK-3856][MLLIB] use norm operator after breeze 0.10 upgrade
Got warning msg:
~~~
[warn]
/Users/meng/src/spark/mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala:50:
method norm in trait NumericOps is deprecated: Use norm(XXX) instead of
XXX.norm
[warn] var norm = vector.toBreeze.norm(p)
~~~
dbtsai
Author: Xiangrui Meng <[email protected]>
Closes #2718 from mengxr/SPARK-3856 and squashes the following commits:
4f38169 [Xiangrui Meng] use norm operator
commit b9df8af62e8d7b263a668dfb6e9668ab4294ea37
Author: Anand Avati <[email protected]>
Date: 2014-10-09T06:45:17Z
[SPARK-2805] Upgrade to akka 2.3.4
Upgrade to akka 2.3.4
Author: Anand Avati <[email protected]>
Closes #1685 from avati/SPARK-1812-akka-2.3 and squashes the following
commits:
57a2315 [Anand Avati] SPARK-1812: streaming - remove tests which depend on
akka.actor.IO
2a551d3 [Anand Avati] SPARK-1812: core - upgrade to akka 2.3.4
commit 86b392942daf61fed2ff7490178b128107a0e856
Author: Xiangrui Meng <[email protected]>
Date: 2014-10-09T07:00:24Z
[SPARK-3844][UI] Truncate appName in WebUI if it is too long
Truncate appName in WebUI if it is too long.
Author: Xiangrui Meng <[email protected]>
Closes #2707 from mengxr/truncate-app-name and squashes the following
commits:
87834ce [Xiangrui Meng] move scala import below java
c7111dc [Xiangrui Meng] truncate appName in WebUI if it is too long
commit 13cab5ba44e2f8d2d2204b3b0d39d7c23a819bdb
Author: nartz <[email protected]>
Date: 2014-10-09T07:02:11Z
add spark.driver.memory to config docs
It took me a minute to track this down, so I thought it could be useful to
have it in the docs.
I'm unsure whether 512mb is the default for spark.driver.memory. Also, there
could be a better 'description' value to differentiate it from
spark.executor.memory.
Author: nartz <[email protected]>
Author: Nathan Artz <[email protected]>
Closes #2410 from nartz/docs/add-spark-driver-memory-to-config-docs and
squashes the following commits:
a2f6c62 [nartz] Update configuration.md
74521b8 [Nathan Artz] add spark.driver.memory to config docs
commit 14f222f7f76cc93633aae27a94c0e556e289ec56
Author: Qiping Li <[email protected]>
Date: 2014-10-09T08:36:58Z
[SPARK-3158][MLLIB]Avoid 1 extra aggregation for DecisionTree training
Currently, the implementation does one unnecessary aggregation step. The
aggregation step for level L (to choose splits) gives enough information to set
the predictions of any leaf nodes at level L+1. We can use that info and skip
the aggregation step for the last level of the tree (which only has leaf nodes).
### Implementation Details
Each node now has an `impurity` field, and `predict` is changed from type
`Double` to type `Predict` (this can be used to compute prediction probabilities
in the future). When computing the best split for each node, we also compute the
impurity and predict for its child nodes, and use them to construct the newly
allocated child nodes. So at level L we have already set impurity and predict for
the nodes at level L+1. If level L+1 is the last level, then we can avoid the
aggregation and, what's more, the calculation of parent impurity.
Top nodes of each tree need to be treated differently because we have to
compute impurity and predict for them first. In `binsToBestSplit`, if the current
node is a top node (level == 0), we calculate impurity and predict first; after
finding the best split, the top node's predict and impurity are set to the
calculated values. Non-top nodes' impurity and predict are already calculated and
don't need to be recalculated. I considered adding an initialization step to set
the top nodes' impurity and predict so that all nodes could be treated the same
way, but this would require a lot of code duplication (all the code doing the seq
operation (BinSeqOp) would need to be duplicated), so I chose the current approach.
CC mengxr manishamde jkbradley, please help me review this, thanks.
Author: Qiping Li <[email protected]>
Closes #2708 from chouqin/avoid-agg and squashes the following commits:
8e269ea [Qiping Li] adjust code and comments
eefeef1 [Qiping Li] adjust comments and check child nodes' impurity
c41b1b6 [Qiping Li] fix pyspark unit test
7ad7a71 [Qiping Li] fix unit test
822c912 [Qiping Li] add comments and unit test
e41d715 [Qiping Li] fix bug in test suite
6cc0333 [Qiping Li] SPARK-3158: Avoid 1 extra aggregation for DecisionTree
training
commit 1e0aa4deba65aa1241b9a30edb82665eae27242f
Author: GuoQiang Li <[email protected]>
Date: 2014-10-09T16:22:32Z
[Minor] use norm operator after breeze 0.10 upgrade
cc mengxr
Author: GuoQiang Li <[email protected]>
Closes #2730 from witgo/SPARK-3856 and squashes the following commits:
2cffce1 [GuoQiang Li] use norm operator after breeze 0.10 upgrade
commit 73bf3f2e0c03216aa29c25fea2d97205b5977903
Author: zsxwing <[email protected]>
Date: 2014-10-09T18:27:21Z
[SPARK-3741] Make ConnectionManager propagate errors properly and add more
logs to avoid Executors swallowing errors
This PR made the following changes:
* Register a callback to `Connection` so that the error will be propagated
properly.
* Add more logs so that the errors won't be swallowed by Executors.
* Use trySuccess/tryFailure because `Promise` doesn't allow calling
success/failure more than once (illustrated below).
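A minimal illustration of the Promise behavior that motivates this (not taken from the PR itself):
```
import scala.concurrent.Promise

// Completing a Promise twice with success() throws, while trySuccess() simply
// returns false, which is why the PR switches to the try* variants.
object PromiseOnce {
  val p = Promise[Int]()
  p.success(1)                            // completes the promise
  val accepted: Boolean = p.trySuccess(2) // false: already completed, no exception
  // p.success(3)                         // would throw IllegalStateException: Promise already completed
}
```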
Author: zsxwing <[email protected]>
Closes #2593 from zsxwing/SPARK-3741 and squashes the following commits:
1d5aed5 [zsxwing] Fix naming
0b8a61c [zsxwing] Merge branch 'master' into SPARK-3741
764aec5 [zsxwing] [SPARK-3741] Make ConnectionManager propagate errors
properly and add more logs to avoid Executors swallowing errors
commit b77a02f41c60d869f48b65e72ed696c05b30bc48
Author: Vida Ha <[email protected]>
Date: 2014-10-09T20:13:31Z
[SPARK-3752][SQL]: Add tests for different UDF's
Author: Vida Ha <[email protected]>
Closes #2621 from vidaha/vida/SPARK-3752 and squashes the following commits:
d7fdbbc [Vida Ha] Add tests for different UDF's
commit 752e90f15e0bb82d283f05eff08df874b48caed9
Author: Yash Datta <[email protected]>
Date: 2014-10-09T19:59:14Z
[SPARK-3711][SQL] Optimize where in clause filter queries
When all the filters are literals, the In case class is replaced by an InSet
class, which uses a HashSet instead of a Sequence, giving a significant
performance improvement (previously the Seq used a worst-case linear match via
the exists method, since the filter list was assumed to contain arbitrary
expressions); see the sketch below. The improvement should be most visible when a
small percentage of a large dataset matches the filter list.
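A minimal sketch of the idea only, not the actual Catalyst expression classes:
```
import scala.collection.immutable.HashSet

// When every element of an IN list is a literal, membership can be tested
// against a precomputed HashSet in (amortized) constant time instead of
// scanning a Seq for every evaluated row.
object InVsInSet {
  val literals: Seq[Any] = Seq(1, 2, 3, 5, 8, 13)

  // Before: worst-case linear scan per evaluated row.
  def inSeq(value: Any): Boolean = literals.exists(_ == value)

  // After: constant-time lookup against a set built once.
  val literalSet: HashSet[Any] = HashSet(literals: _*)
  def inSet(value: Any): Boolean = literalSet.contains(value)
}
```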
Author: Yash Datta <[email protected]>
Closes #2561 from saucam/branch-1.1 and squashes the following commits:
4bf2d19 [Yash Datta] SPARK-3711: 1. Fix code style and import order
2. Fix optimization condition 3. Add tests for null in filter
list 4. Add test case that optimization is not triggered in case of
attributes in filter list
afedbcd [Yash Datta] SPARK-3711: 1. Add test cases for InSet class in
ExpressionEvaluationSuite 2. Add class OptimizedInSuite on the
lines of ConstantFoldingSuite, for the optimized In clause
0fc902f [Yash Datta] SPARK-3711: UnaryMinus will be handled by
constantFolding
bd84c67 [Yash Datta] SPARK-3711: Incorporate review comments. Move
optimization of In clause to Optimizer.scala by adding a rule. Add appropriate
comments
430f5d1 [Yash Datta] SPARK-3711: Optimize the filter list in case of
negative values as well
bee98aa [Yash Datta] SPARK-3711: Optimize where in clause filter queries
commit 2c8851343a2e4d1d5b3a2b959eaa651a92982a72
Author: scwf <[email protected]>
Date: 2014-10-09T20:22:36Z
[SPARK-3806][SQL] Minor fix for CliSuite
This fixes two issues in CliSuite.
1. CliSuite throws IndexOutOfBoundsException:
Exception in thread "Thread-6" java.lang.IndexOutOfBoundsException: 6
at
scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
at
org.apache.spark.sql.hive.thriftserver.CliSuite.org$apache$spark$sql$hive$thriftserver$CliSuite$$captureOutput$1(CliSuite.scala:67)
at
org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78)
at
org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78)
at scala.sys.process.ProcessLogger$$anon$1.out(ProcessLogger.scala:96)
at
scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135)
at
scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135)
at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:175)
at scala.sys.process.BasicIO$.processLinesFully(BasicIO.scala:179)
at
scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:164)
at
scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:162)
at
scala.sys.process.ProcessBuilderImpl$Simple$$anonfun$3.apply$mcV$sp(ProcessBuilderImpl.scala:73)
at scala.sys.process.ProcessImpl$Spawn$$anon$1.run(ProcessImpl.scala:22)
Actually, the problem is caused by multiple threads.
2. Use ```line.startsWith``` instead of ```line.contains``` to assert the
expected answer. This fixes a tiny bug in CliSuite: for the test case "Simple
commands" there is an expected answer "5"; if we use ```contains```, output like
"14/10/06 11:```5```4:36 INFO CliDriver: Time taken: 1.078 seconds"
or "14/10/06 11:54:36 INFO StatsReportListener: 0% ```5```% 10%
25% 50% 75% 90% 95% 100%" will also make the assertion true (see the example below).
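A minimal illustration of the false positive (hypothetical values, not the test's code):
```
object StartsWithVsContains {
  val logLine  = "14/10/06 11:54:36 INFO CliDriver: Time taken: 1.078 seconds"
  val expected = "5"

  val looseMatch  = logLine.contains(expected)   // true: matches the 5 inside the timestamp
  val strictMatch = logLine.startsWith(expected) // false: only a line beginning with the answer matches
}
```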
Author: scwf <[email protected]>
Closes #2666 from scwf/clisuite and squashes the following commits:
11430db [scwf] fix-clisuite
commit e7edb723d22869f228b838fd242bf8e6fe73ee19
Author: cocoatomo <[email protected]>
Date: 2014-10-09T20:46:26Z
[SPARK-3868][PySpark] Hard to recognize which module is tested from
unit-tests.log
The ./python/run-tests script displays messages about which test it is currently
running on stdout, but does not write them to unit-tests.log.
This makes it harder to recognize which test programs were executed and which
test failed.
Author: cocoatomo <[email protected]>
Closes #2724 from cocoatomo/issues/3868-display-testing-module-name and
squashes the following commits:
c63d9fa [cocoatomo] [SPARK-3868][PySpark] Hard to recognize which module is
tested from unit-tests.log
commit ec4d40e48186af18e25517e0474020720645f583
Author: Mike Timper <[email protected]>
Date: 2014-10-09T21:02:27Z
[SPARK-3853][SQL] JSON Schema support for Timestamp fields
In JSONRDD.scala, add 'case TimestampType' in the enforceCorrectType
function and a toTimestamp function.
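A hedged sketch of what such a coercion could look like; the accepted input types here are an assumption, not necessarily what the PR implements:
```
// Illustrative only: coerce a parsed JSON value to java.sql.Timestamp.
def toTimestamp(value: Any): java.sql.Timestamp = value match {
  case l: Long   => new java.sql.Timestamp(l)            // epoch milliseconds
  case i: Int    => new java.sql.Timestamp(i.toLong)
  case s: String => java.sql.Timestamp.valueOf(s)         // "yyyy-mm-dd hh:mm:ss[.f...]"
}
```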
Author: Mike Timper <[email protected]>
Closes #2720 from mtimper/master and squashes the following commits:
9386ab8 [Mike Timper] Fix and tests for SPARK-3853
commit 1faa1135a3fc0acd89f934f01a4a2edefcb93d33
Author: Patrick Wendell <[email protected]>
Date: 2014-10-09T21:50:36Z
Revert "[SPARK-2805] Upgrade to akka 2.3.4"
This reverts commit b9df8af62e8d7b263a668dfb6e9668ab4294ea37.
commit 1c7f0ab302de9f82b1bd6da852d133823bc67c66
Author: Yin Huai <[email protected]>
Date: 2014-10-09T21:57:27Z
[SPARK-3339][SQL] Support for skipping json lines that fail to parse
This PR aims to provide a way to skip/query corrupt JSON records. To do so,
we introduce an internal column to hold corrupt records (the default name is
`_corrupt_record`. This name can be changed by setting the value of
`spark.sql.columnNameOfCorruptRecord`). When there is a parsing error, we
put the corrupt record, in its unparsed form, into the internal column. Users can
skip/query this column through SQL.
* To query those corrupt records
```
-- For Hive parser
SELECT `_corrupt_record`
FROM jsonTable
WHERE `_corrupt_record` IS NOT NULL
-- For our SQL parser
SELECT _corrupt_record
FROM jsonTable
WHERE _corrupt_record IS NOT NULL
```
* To skip corrupt records and query regular records
```
-- For Hive parser
SELECT field1, field2
FROM jsonTable
WHERE `_corrupt_record` IS NULL
-- For our SQL parser
SELECT field1, field2
FROM jsonTable
WHERE _corrupt_record IS NULL
```
Generally, it is not recommended to change the name of the internal column.
If the name has to be changed to avoid possible name conflicts, you can use
`sqlContext.setConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD, <new column name>)`
or `sqlContext.sql(SET spark.sql.columnNameOfCorruptRecord=<new column name>)`.
Author: Yin Huai <[email protected]>
Closes #2680 from yhuai/corruptJsonRecord and squashes the following
commits:
4c9828e [Yin Huai] Merge remote-tracking branch 'upstream/master' into
corruptJsonRecord
309616a [Yin Huai] Change the default name of corrupt record to
"_corrupt_record".
b4a3632 [Yin Huai] Merge remote-tracking branch 'upstream/master' into
corruptJsonRecord
9375ae9 [Yin Huai] Set the column name of corrupt json record back to the
default one after the unit test.
ee584c0 [Yin Huai] Provide a way to query corrupt json records as unparsed
strings.
commit 0c0e09f567deb775ee378f5385a16884f68b332d
Author: Daoyuan Wang <[email protected]>
Date: 2014-10-09T21:59:03Z
[SPARK-3412][SQL]add missing row api
chenghao-intel assigned this to me, check PR #2284 for previous discussion
Author: Daoyuan Wang <[email protected]>
Closes #2529 from adrian-wang/rowapi and squashes the following commits:
c6594b2 [Daoyuan Wang] using boxed
7b7e6e3 [Daoyuan Wang] update pattern match
7a39456 [Daoyuan Wang] rename file and refresh getAs[T]
4c18c29 [Daoyuan Wang] remove setAs[T] and null judge
1614493 [Daoyuan Wang] add missing row api
commit bc3b6cb06153d6b05f311dd78459768b6cf6a404
Author: Nathan Howell <[email protected]>
Date: 2014-10-09T22:03:01Z
[SPARK-3858][SQL] Pass the generator alias into logical plan node
The alias parameter is being ignored, which makes it more difficult to
specify a qualifier for Generator expressions.
Author: Nathan Howell <[email protected]>
Closes #2721 from NathanHowell/SPARK-3858 and squashes the following
commits:
8aa0f43 [Nathan Howell] [SPARK-3858][SQL] Pass the generator alias into
logical plan node
commit ac302052870a650d56f2d3131c27755bb2960ad7
Author: ravipesala <[email protected]>
Date: 2014-10-09T22:14:58Z
[SPARK-3813][SQL] Support "case when" conditional functions in Spark SQL.
"case when" conditional function is already supported in Spark SQL but
there is no support in SqlParser. So added parser support to it.
Author : ravipesala ravindra.pesalahuawei.com
Author: ravipesala <[email protected]>
Closes #2678 from ravipesala/SPARK-3813 and squashes the following commits:
70c75a7 [ravipesala] Fixed styles
713ea84 [ravipesala] Updated as per admin comments
709684f [ravipesala] Changed parser to support case when function.
commit 4e9b551a0b807f5a2cc6679165c8be4e88a3d077
Author: Josh Rosen <[email protected]>
Date: 2014-10-09T23:08:07Z
[SPARK-3772] Allow `ipython` to be used by Pyspark workers; IPython support
improvements:
This pull request addresses a few issues related to PySpark's IPython
support:
- Fix the remaining uses of the '-u' flag, which IPython doesn't support
(see SPARK-3772).
- Change PYSPARK_PYTHON_OPTS to PYSPARK_DRIVER_PYTHON_OPTS, so that the old
name is reserved in case we ever want to allow the worker Python options to be
customized (this variable was introduced in #2554 and hasn't landed in a
release yet, so this doesn't break any compatibility).
- Introduce a PYSPARK_DRIVER_PYTHON option that allows the driver to use
`ipython` while the workers use a different Python version.
- Attempt to use Python 2.7 by default if PYSPARK_PYTHON is not specified.
- Retain the old semantics for IPYTHON=1 and IPYTHON_OPTS (to avoid
breaking existing example programs).
There are more details in a block comment in `bin/pyspark`.
Author: Josh Rosen <[email protected]>
Closes #2651 from JoshRosen/SPARK-3772 and squashes the following commits:
7b8eb86 [Josh Rosen] More changes to PySpark python executable
configuration:
c4f5778 [Josh Rosen] [SPARK-3772] Allow ipython to be used by Pyspark
workers; IPython fixes:
commit 2837bf8548db7e9d43f6eefedf5a73feb22daedb
Author: Michael Armbrust <[email protected]>
Date: 2014-10-10T00:54:02Z
[SPARK-3798][SQL] Store the output of a generator in a val
This prevents it from changing during serialization, leading to corrupted
results.
Author: Michael Armbrust <[email protected]>
Closes #2656 from marmbrus/generateBug and squashes the following commits:
efa32eb [Michael Armbrust] Store the output of a generator in a val. This
prevents it from changing during serialization.
commit 363baacaded56047bcc63276d729ab911e0336cf
Author: Sean Owen <[email protected]>
Date: 2014-10-10T01:21:59Z
SPARK-3811 [CORE] More robust / standard Utils.deleteRecursively,
Utils.createTempDir
I noticed a few issues with how temp directories are created and deleted:
*Minor*
* Guava's `Files.createTempDir()` plus `File.deleteOnExit()` is used in
many tests to make a temp dir, but `Utils.createTempDir()` seems to be the
standard Spark mechanism
* Call to `File.deleteOnExit()` could be pushed into
`Utils.createTempDir()` as well, along with this replacement
* _I messed up the message in an exception in `Utils` in SPARK-3794; fixed
here_
*Bit Less Minor*
* `Utils.deleteRecursively()` fails immediately if any `IOException`
occurs, instead of trying to delete the remaining files and subdirectories.
I've observed this leave temp dirs around. I suggest changing it to continue in
the face of an exception and to throw, at the end, one of the possibly several
exceptions that occurred.
* `Utils.createTempDir()` adds a JVM shutdown hook every time the
method is called, even if the new subdir is under an already-registered parent
dir, since that check only happens inside the hook. However, `Utils` already
manages a set of all dirs to delete on shutdown, called `shutdownDeletePaths`; a
single hook can be registered to delete all of these on exit. This is how Tachyon
temp paths are cleaned up in `TachyonBlockManager`. A sketch of this pattern
appears right after this list.
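A hedged sketch of the single-shutdown-hook pattern described in the last bullet; the names used here (TempDirCleanup, shutdownDeletePaths, registerTempDirForCleanup) are illustrative, not Spark's actual code:
```
import java.io.File
import scala.collection.mutable

object TempDirCleanup {
  private val shutdownDeletePaths = mutable.HashSet[File]()

  // One hook, registered once, deletes everything accumulated in the set on JVM exit.
  Runtime.getRuntime.addShutdownHook(new Thread("delete temp dirs") {
    override def run(): Unit = shutdownDeletePaths.synchronized {
      shutdownDeletePaths.foreach(deleteRecursively)
    }
  })

  def registerTempDirForCleanup(dir: File): Unit =
    shutdownDeletePaths.synchronized { shutdownDeletePaths += dir }

  // Keep deleting siblings even when one deletion fails; rethrow one failure at the end.
  def deleteRecursively(file: File): Unit = {
    var firstError: Option[Throwable] = None
    if (file.isDirectory) {
      Option(file.listFiles()).getOrElse(Array.empty[File]).foreach { child =>
        try deleteRecursively(child)
        catch { case t: Throwable => if (firstError.isEmpty) firstError = Some(t) }
      }
    }
    if (!file.delete() && file.exists() && firstError.isEmpty) {
      firstError = Some(new java.io.IOException(s"Failed to delete: $file"))
    }
    firstError.foreach(e => throw e)
  }
}
```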
I noticed a few other things that might be changed but wanted to ask first:
* Shouldn't the set of dirs to delete be `File`, not just `String` paths?
* `Utils` manages the set of `TachyonFile` that have been registered for
deletion, but the shutdown hook is managed in `TachyonBlockManager`. Should
this logic not live together, rather than in `Utils`? It's more specific to
Tachyon, and it looks slightly odd to import it in such a generic place.
Author: Sean Owen <[email protected]>
Closes #2670 from srowen/SPARK-3811 and squashes the following commits:
071ae60 [Sean Owen] Update per @vanzin's review
da0146d [Sean Owen] Make Utils.deleteRecursively try to delete all paths
even when an exception occurs; use one shutdown hook instead of one per method
call to delete temp dirs
3a0faa4 [Sean Owen] Standardize on Utils.createTempDir instead of
Files.createTempDir
commit edf02da389f75df5a42465d41f035d6b65599848
Author: Cheng Lian <[email protected]>
Date: 2014-10-10T01:25:06Z
[SPARK-3654][SQL] Unifies SQL and HiveQL parsers
This PR is a follow up of #2590, and tries to introduce a top level SQL
parser entry point for all SQL dialects supported by Spark SQL.
A top level parser `SparkSQLParser` is introduced to handle the syntaxes
that all SQL dialects should recognize (e.g. `CACHE TABLE`, `UNCACHE TABLE` and
`SET`, etc.). For any syntax this parser doesn't recognize directly, it
falls back to a specified function that tries to parse arbitrary input into a
`LogicalPlan`; this function is typically another parser combinator like
`SqlParser` (a sketch of this fallback pattern is below). DDL syntaxes introduced
in #2475 can be moved here.
The `ExtendedHiveQlParser` now only handles Hive-specific extensions.
Also took the chance to refactor/reformat `SqlParser` for better
readability.
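A hedged, highly simplified sketch of the fallback pattern being described; the types and regex-based dispatch here are illustrative only (the real SparkSQLParser is a parser combinator):
```
// Illustrative only: handle the dialect-independent commands directly and
// delegate everything else to a dialect-specific parser (e.g. SqlParser or HiveQl).
sealed trait LogicalPlan
case class CacheTableCommand(table: String) extends LogicalPlan
case class SetCommand(key: String, value: String) extends LogicalPlan

class TopLevelSqlParser(fallback: String => LogicalPlan) {
  private val CachePattern = """(?i)CACHE\s+TABLE\s+(\w+)""".r
  private val SetPattern   = """(?i)SET\s+(\S+)\s*=\s*(\S+)""".r

  def parse(sql: String): LogicalPlan = sql.trim match {
    case CachePattern(table)    => CacheTableCommand(table)
    case SetPattern(key, value) => SetCommand(key, value)
    case other                  => fallback(other) // unrecognized: hand off to the dialect parser
  }
}
```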
Author: Cheng Lian <[email protected]>
Closes #2698 from liancheng/gen-sql-parser and squashes the following
commits:
ceada76 [Cheng Lian] Minor styling fixes
9738934 [Cheng Lian] Minor refactoring, removes optional trailing ";" in
the parser
bb2ab12 [Cheng Lian] SET property value can be empty string
ce8860b [Cheng Lian] Passes test suites
e86968e [Cheng Lian] Removes debugging code
8bcace5 [Cheng Lian] Replaces digit.+ to rep1(digit) (Scala style checking
doesn't like it)
d15d54f [Cheng Lian] Unifies SQL and HiveQL parsers
commit 421382d0e728940caa3e61bc11237c61f256378a
Author: Cheng Lian <[email protected]>
Date: 2014-10-10T01:26:43Z
[SPARK-3824][SQL] Sets in-memory table default storage level to
MEMORY_AND_DISK
Using `MEMORY_AND_DISK` as the default storage level for in-memory table
caching. Due to the in-memory columnar representation, recomputing the partitions
of an in-memory cached table can be very expensive.
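A hedged illustration using the generic RDD API rather than the PR's internals: with MEMORY_AND_DISK, partitions evicted from memory spill to local disk instead of being recomputed, which is what makes it the safer default when recomputation is expensive.
```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object StorageLevelDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("storage-level-demo").setMaster("local[*]"))

    // Simulate data that is expensive to recompute.
    val expensive = sc.parallelize(1 to 1000000).map(i => (i, i.toString * 10))

    // MEMORY_ONLY: partitions evicted under memory pressure are recomputed.
    // MEMORY_AND_DISK: evicted partitions spill to local disk instead.
    expensive.persist(StorageLevel.MEMORY_AND_DISK)
    expensive.count()

    sc.stop()
  }
}
```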
Author: Cheng Lian <[email protected]>
Closes #2686 from liancheng/spark-3824 and squashes the following commits:
35d2ed0 [Cheng Lian] Removes extra space
1ab7967 [Cheng Lian] Reduces test data size to fit DiskStore.getBytes()
ba565f0 [Cheng Lian] Maks CachedBatch serializable
07f0204 [Cheng Lian] Sets in-memory table default storage level to
MEMORY_AND_DISK
commit 6f98902a3d7749e543bc493a8c62b1e3a7b924cc
Author: ravipesala <[email protected]>
Date: 2014-10-10T01:41:36Z
[SPARK-3834][SQL] Backticks not correctly handled in subquery aliases
Queries like SELECT a.key FROM (SELECT key FROM src) \`a\` do not
work because backticks in subquery aliases are not handled properly. This PR
fixes that.
Author : ravipesala ravindra.pesalahuawei.com
Author: ravipesala <[email protected]>
Closes #2737 from ravipesala/SPARK-3834 and squashes the following commits:
0e0ab98 [ravipesala] Fixing issue in backtick handling for subquery aliases
commit 411cf29fff011561f0093bb6101af87842828369
Author: Anand Avati <[email protected]>
Date: 2014-10-10T07:46:56Z
[SPARK-2805] Upgrade Akka to 2.3.4
This is a second rev of the Akka upgrade (earlier merged, but reverted). I
made a slight modification: I also upgraded Hive to deal with a compatibility
issue related to the protocol buffers library.
Author: Anand Avati <[email protected]>
Author: Patrick Wendell <[email protected]>
Closes #2752 from pwendell/akka-upgrade and squashes the following commits:
4c7ca3f [Patrick Wendell] Upgrading to new hive->protobuf version
57a2315 [Anand Avati] SPARK-1812: streaming - remove tests which depend on
akka.actor.IO
2a551d3 [Anand Avati] SPARK-1812: core - upgrade to akka 2.3.4
commit 90f73fcc47c7bf881f808653d46a9936f37c3c31
Author: Aaron Davidson <[email protected]>
Date: 2014-10-10T08:44:36Z
[SPARK-3889] Attempt to avoid SIGBUS by not mmapping files in
ConnectionManager
In general, individual shuffle blocks are frequently small, so mmapping
them often creates a lot of waste. It may not be bad to mmap the larger ones,
but it is pretty inconvenient to get configuration into ManagedBuffer, and
besides it is unlikely to help all that much.
Author: Aaron Davidson <[email protected]>
Closes #2742 from aarondav/mmap and squashes the following commits:
a152065 [Aaron Davidson] Add other pathway back
52b6cd2 [Aaron Davidson] [SPARK-3889] Attempt to avoid SIGBUS by not
mmapping files in ConnectionManager
commit 72f36ee571ad27c7c7c70bb9aecc7e6ef51dfd44
Author: Davies Liu <[email protected]>
Date: 2014-10-10T21:14:05Z
[SPARK-3886] [PySpark] use AutoBatchedSerializer by default
Use AutoBatchedSerializer by default. It chooses the proper batch size based
on the size of the serialized objects, keeping the serialized batch size within
[64k, 640k].
In the JVM, the serializer also tracks the objects in a batch to detect
duplicated objects, so a larger batch may cause an OOM in the JVM.
Author: Davies Liu <[email protected]>
Closes #2740 from davies/batchsize and squashes the following commits:
52cdb88 [Davies Liu] update docs
185f2b9 [Davies Liu] use AutoBatchedSerializer by default
commit 1d72a30874a88bdbab75217f001cf2af409016e7
Author: Patrick Wendell <[email protected]>
Date: 2014-10-10T23:49:19Z
HOTFIX: Fix build issue with Akka 2.3.4 upgrade.
We had to upgrade our Hive 0.12 version as well to deal with a protobuf
conflict (both hive and akka have been using a shaded protobuf version).
This is testing a correctly patched version of Hive 0.12.
Author: Patrick Wendell <[email protected]>
Closes #2756 from pwendell/hotfix and squashes the following commits:
cc979d0 [Patrick Wendell] HOTFIX: Fix build issue with Akka 2.3.4 upgrade.
commit 0e8203f4fb721158fb27897680da476174d24c4b
Author: Prashant Sharma <[email protected]>
Date: 2014-10-11T01:39:55Z
[SPARK-2924] Required by Scala 2.11: only one function/constructor amongst
overridden alternatives can have default argument(s).
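A hedged, minimal illustration of the restriction being referenced (shown here for same-scope overloads; the exact cases Scala 2.11 rejects in Spark's sources may differ):
```
// Among overloaded alternatives, only one may declare default arguments;
// uncommenting the second definition makes the compiler reject the object with
// "multiple overloaded alternatives of method fit define default arguments".
object Estimator {
  def fit(data: Array[Double], iterations: Int = 10): Double = data.sum / iterations
  // def fit(data: Array[Int], iterations: Int = 10): Double = data.sum.toDouble / iterations
  def fit(data: Array[Int], iterations: Int): Double = data.sum.toDouble / iterations
}
```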
Author: Prashant Sharma <[email protected]>
Closes #2750 from ScrapCodes/SPARK-2924/default-args-removed and squashes
the following commits:
d9785c3 [Prashant Sharma] [SPARK-2924] Required by scala 2.11, only one
function/ctor amongst overriden alternatives, can have default argument.
commit 81015a2ba49583d730ce65b2262f50f1f2451a79
Author: cocoatomo <[email protected]>
Date: 2014-10-11T18:26:17Z
[SPARK-3867][PySpark] ./python/run-tests failed when it run with Python 2.6
and unittest2 is not installed
The ./python/run-tests script searches for a Python 2.6 executable on PATH and
uses it if available.
When using Python 2.6, it tries to import the unittest2 module, which is not
part of the Python 2.6 standard library, so it fails with an ImportError.
Author: cocoatomo <[email protected]>
Closes #2759 from cocoatomo/issues/3867-unittest2-import-error and squashes
the following commits:
f068eb5 [cocoatomo] [SPARK-3867] ./python/run-tests failed when it run with
Python 2.6 and unittest2 is not installed
commit 7a3f589ef86200f99624fea8322e5af0cad774a7
Author: cocoatomo <[email protected]>
Date: 2014-10-11T18:51:59Z
[SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and
building warnings
The Sphinx documents contain corrupted reST formatting and produce some warnings.
The purpose of this issue is the same as
https://issues.apache.org/jira/browse/SPARK-3773.
commit: 0e8203f4fb721158fb27897680da476174d24c4b
output
```
$ cd ./python/docs
$ make clean html
rm -rf _build/*
sphinx-build -b html -d _build/doctrees . _build/html
Making output directory...
Running Sphinx v1.2.3
loading pickled environment... not yet created
building [html]: targets for 4 source files that are out of date
updating environment: 4 added, 0 changed, 0 removed
reading sources... [100%] pyspark.sql
/Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/feature.py:docstring
of pyspark.mllib.feature.Word2VecModel.findSynonyms:4: WARNING: Field list ends
without a blank line; unexpected unindent.
/Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/feature.py:docstring
of pyspark.mllib.feature.Word2VecModel.transform:3: WARNING: Field list ends
without a blank line; unexpected unindent.
/Users/<user>/MyRepos/Scala/spark/python/pyspark/sql.py:docstring of
pyspark.sql:4: WARNING: Bullet list ends without a blank line; unexpected
unindent.
looking for now-outdated files... none found
pickling environment... done
checking consistency... done
preparing documents... done
writing output... [100%] pyspark.sql
writing additional files... (12 module code pages) _modules/index search
copying static files... WARNING: html_static_path entry
u'/Users/<user>/MyRepos/Scala/spark/python/docs/_static' does not exist
done
copying extra files... done
dumping search index... done
dumping object inventory... done
build succeeded, 4 warnings.
Build finished. The HTML pages are in _build/html.
```
Author: cocoatomo <[email protected]>
Closes #2766 from cocoatomo/issues/3909-sphinx-build-warnings and squashes
the following commits:
2c7faa8 [cocoatomo] [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx
documents and building warnings
----