GitHub user hxfeng opened a pull request:

    https://github.com/apache/spark/pull/3880

    Branch 1.2

    update


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-1.2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3880.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3880
    
----
commit 21f582f12b4d00017b990bcc232dcbf546b5dbe7
Author: Dan McClary <[email protected]>
Date:   2014-11-20T21:36:50Z

    [SPARK-4228][SQL] SchemaRDD to JSON
    
    Here's a simple fix for SchemaRDD to JSON.
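    
    A minimal pyspark sketch of the new method (the input path is illustrative, and a
    SparkContext `sc` is assumed):
    
    ```
    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)
    
    # Load any SchemaRDD, then serialize each row back out as a JSON string.
    people = sqlContext.jsonFile("people.json")   # illustrative input path
    json_strings = people.toJSON()                # RDD of JSON-formatted strings
    print(json_strings.take(2))
    ```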
    
    Author: Dan McClary <[email protected]>
    
    Closes #3213 from dwmclary/SPARK-4228 and squashes the following commits:
    
    d714e1d [Dan McClary] fixed PEP 8 error
    cac2879 [Dan McClary] move pyspark comment and doctest to correct location
    f9471d3 [Dan McClary] added pyspark doc and doctest
    6598cee [Dan McClary] adding complex type queries
    1a5fd30 [Dan McClary] removing SPARK-4228 from SQLQuerySuite
    4a651f0 [Dan McClary] cleaned PEP and Scala style failures.  Moved tests to 
JsonSuite
    47ceff6 [Dan McClary] cleaned up scala style issues
    2ee1e70 [Dan McClary] moved rowToJSON to JsonRDD
    4387dd5 [Dan McClary] Added UserDefinedType, cleaned up case formatting
    8f7bfb6 [Dan McClary] Map type added to SchemaRDD.toJSON
    1b11980 [Dan McClary] Map and UserDefinedTypes partially done
    11d2016 [Dan McClary] formatting and unicode deserialization default fixed
    6af72d1 [Dan McClary] deleted extaneous comment
    4d11c0c [Dan McClary] JsonFactory rewrite of toJSON for SchemaRDD
    149dafd [Dan McClary] wrapped scala toJSON in sql.py
    5e5eb1b [Dan McClary] switched to Jackson for JSON processing
    6c94a54 [Dan McClary] added toJSON to pyspark SchemaRDD
    aaeba58 [Dan McClary] added toJSON to pyspark SchemaRDD
    1d171aa [Dan McClary] upated missing brace on if statement
    319e3ba [Dan McClary] updated to upstream master with merged SPARK-4228
    424f130 [Dan McClary] tests pass, ready for pull and PR
    626a5b1 [Dan McClary] added toJSON to SchemaRDD
    f7d166a [Dan McClary] added toJSON method
    5d34e37 [Dan McClary] merge resolved
    d6d19e9 [Dan McClary] pr example
    
    (cherry picked from commit b8e6886fb8ff8f667fb7e600cd727d8649cad1d1)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 72f5ba1fc152fa5dee11740f6193d5cd95bcdce3
Author: Davies Liu <[email protected]>
Date:   2014-11-20T23:31:28Z

    [SPARK-4439] [MLlib] add python api for random forest
    
    ```
        class RandomForestModel
         |  A model trained by RandomForest
         |
         |  numTrees(self)
         |      Get number of trees in forest.
         |
         |  predict(self, x)
         |      Predict values for a single data point or an RDD of points 
using the model trained.
         |
         |  toDebugString(self)
         |      Full model
         |
         |  totalNumNodes(self)
         |      Get total number of nodes, summed over all trees in the forest.
         |
    
        class RandomForest
         |  trainClassifier(cls, data, numClassesForClassification, 
categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', 
impurity='gini', maxDepth=4, maxBins=32, seed=None):
         |      Method to train a decision tree model for binary or multiclass 
classification.
         |
         |      :param data: Training dataset: RDD of LabeledPoint.
         |                   Labels should take values {0, 1, ..., 
numClasses-1}.
         |      :param numClassesForClassification: number of classes for 
classification.
         |      :param categoricalFeaturesInfo: Map storing arity of 
categorical features.
         |                                  E.g., an entry (n -> k) indicates 
that feature n is categorical
         |                                  with k categories indexed from 0: 
{0, 1, ..., k-1}.
         |      :param numTrees: Number of trees in the random forest.
         |      :param featureSubsetStrategy: Number of features to consider 
for splits at each node.
         |                                Supported: "auto" (default), "all", 
"sqrt", "log2", "onethird".
         |                                If "auto" is set, this parameter is 
set based on numTrees:
         |                                  if numTrees == 1, set to "all";
         |                                  if numTrees > 1 (forest) set to 
"sqrt".
         |      :param impurity: Criterion used for information gain 
calculation.
         |                   Supported values: "gini" (recommended) or 
"entropy".
         |      :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 
1 leaf node; depth 1 means
         |                       1 internal node + 2 leaf nodes. (default: 4)
         |      :param maxBins: maximum number of bins used for splitting 
features (default: 100)
         |      :param seed:  Random seed for bootstrapping and choosing 
feature subsets.
         |      :return: RandomForestModel that can be used for prediction
         |
         |   trainRegressor(cls, data, categoricalFeaturesInfo, numTrees, 
featureSubsetStrategy='auto', impurity='variance', maxDepth=4, maxBins=32, 
seed=None):
         |      Method to train a decision tree model for regression.
         |
         |      :param data: Training dataset: RDD of LabeledPoint.
         |                   Labels are real numbers.
         |      :param categoricalFeaturesInfo: Map storing arity of 
categorical features.
         |                                   E.g., an entry (n -> k) indicates 
that feature n is categorical
         |                                   with k categories indexed from 0: 
{0, 1, ..., k-1}.
         |      :param numTrees: Number of trees in the random forest.
         |      :param featureSubsetStrategy: Number of features to consider 
for splits at each node.
         |                                 Supported: "auto" (default), "all", 
"sqrt", "log2", "onethird".
         |                                 If "auto" is set, this parameter is 
set based on numTrees:
         |                                 if numTrees == 1, set to "all";
         |                                 if numTrees > 1 (forest) set to 
"onethird".
         |      :param impurity: Criterion used for information gain 
calculation.
         |                       Supported values: "variance".
         |      :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 
1 leaf node; depth 1 means
         |                       1 internal node + 2 leaf nodes.(default: 4)
         |      :param maxBins: maximum number of bins used for splitting 
features (default: 100)
         |      :param seed:  Random seed for bootstrapping and choosing 
feature subsets.
         |      :return: RandomForestModel that can be used for prediction
         |
    ```
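    
    A minimal usage sketch of this API (the tiny dataset and parameter values are
    illustrative, and a SparkContext `sc` is assumed):
    
    ```
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import RandomForest
    
    # Tiny illustrative dataset: label followed by two numeric features.
    data = sc.parallelize([
        LabeledPoint(0.0, [0.0, 1.0]),
        LabeledPoint(1.0, [1.0, 0.0]),
        LabeledPoint(1.0, [1.0, 1.0]),
    ])
    
    model = RandomForest.trainClassifier(
        data, numClassesForClassification=2, categoricalFeaturesInfo={},
        numTrees=3, featureSubsetStrategy="auto", impurity="gini",
        maxDepth=4, maxBins=32, seed=42)
    
    print(model.numTrees())
    print(model.predict([1.0, 0.0]))
    ```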
    
    Author: Davies Liu <[email protected]>
    
    Closes #3320 from davies/forest and squashes the following commits:
    
    8003dfc [Davies Liu] reorder
    53cf510 [Davies Liu] fix docs
    4ca593d [Davies Liu] fix docs
    e0df852 [Davies Liu] fix docs
    0431746 [Davies Liu] rebased
    2b6f239 [Davies Liu] Merge branch 'master' of github.com:apache/spark into 
forest
    885abee [Davies Liu] address comments
    dae7fc0 [Davies Liu] address comments
    89a000f [Davies Liu] fix docs
    565d476 [Davies Liu] add python api for random forest
    
    (cherry picked from commit 1c53a5db993193122bfa79574d2540149fe2cc08)
    Signed-off-by: Xiangrui Meng <[email protected]>

commit 8608ff59881b3cfa6c4cd407ba2c0af7a78e88a9
Author: ravipesala <[email protected]>
Date:   2014-11-20T23:34:03Z

    [SPARK-4513][SQL] Support relational operator '<=>' in Spark SQL
    
    The relational operator '<=>' is not working in Spark SQL, although the same
    operator works in Spark HiveQL.
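    
    For reference, a minimal pyspark sketch of the null-safe equality semantics this
    enables (the table and column names are illustrative, and a SQLContext
    `sqlContext` with a registered table `t` is assumed):
    
    ```
    # '<=>' is null-safe equality: NULL <=> NULL evaluates to true,
    # whereas NULL = NULL evaluates to NULL.
    sqlContext.sql("SELECT a, b FROM t WHERE a <=> b").collect()
    ```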
    
    Author: ravipesala <[email protected]>
    
    Closes #3387 from ravipesala/<=> and squashes the following commits:
    
    7198e90 [ravipesala] Supporting relational operator '<=>' in Spark SQL
    
    (cherry picked from commit 98e9419784a9ad5096cfd563fa9a433786a90bd4)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 1d7ee2b79b23f08f73a6d53f41ac8fa140b91c19
Author: Takuya UESHIN <[email protected]>
Date:   2014-11-20T23:41:24Z

    [SPARK-4318][SQL] Fix empty sum distinct.
    
    Executing sum distinct on an empty table throws
    `java.lang.UnsupportedOperationException: empty.reduceLeft`.
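    
    A minimal pyspark sketch of a query hitting this path (the table name is
    illustrative, and a SQLContext `sqlContext` with a registered empty table is
    assumed):
    
    ```
    # Before this fix, SUM(DISTINCT ...) over a table with no rows threw
    # java.lang.UnsupportedOperationException: empty.reduceLeft instead of
    # returning NULL.
    sqlContext.sql("SELECT SUM(DISTINCT value) FROM empty_table").collect()
    ```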
    
    Author: Takuya UESHIN <[email protected]>
    
    Closes #3184 from ueshin/issues/SPARK-4318 and squashes the following 
commits:
    
    8168c42 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4318
    66fdb0a [Takuya UESHIN] Re-refine aggregate functions.
    6186eb4 [Takuya UESHIN] Fix Sum of GeneratedAggregate.
    d2975f6 [Takuya UESHIN] Refine Sum and Average of GeneratedAggregate.
    1bba675 [Takuya UESHIN] Refine Sum, SumDistinct and Average functions.
    917e533 [Takuya UESHIN] Use aggregate instead of groupBy().
    1a5f874 [Takuya UESHIN] Add tests to be executed as non-partial aggregation.
    a5a57d2 [Takuya UESHIN] Fix empty Average.
    22799dc [Takuya UESHIN] Fix empty Sum and SumDistinct.
    65b7dd2 [Takuya UESHIN] Fix empty sum distinct.
    
    (cherry picked from commit 2c2e7a44db2ebe44121226f3eac924a0668b991a)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 29e8d50773c40abe949d6b3284e0e89a0acb45af
Author: Cheng Hao <[email protected]>
Date:   2014-11-20T23:46:00Z

    [SPARK-2918] [SQL] Support the CTAS in EXPLAIN command
    
    Hive supports `EXPLAIN` for CTAS statements, and Spark SQL used to support it as
    well; however, it seems the support was lost during the code refactoring in
    HiveQL.
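    
    A sketch of the kind of statement this re-enables (the table names are
    illustrative, and a HiveContext `hiveContext` is assumed):
    
    ```
    # EXPLAIN over a CREATE TABLE ... AS SELECT (CTAS) statement.
    hiveContext.sql(
        "EXPLAIN CREATE TABLE tmp_ctas AS SELECT key, value FROM src").collect()
    ```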
    
    Author: Cheng Hao <[email protected]>
    
    Closes #3357 from chenghao-intel/explain and squashes the following commits:
    
    7aace63 [Cheng Hao] Support the CTAS in EXPLAIN command
    
    (cherry picked from commit 6aa0fc9f4d95f09383cbcb5f79166c60697e6683)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 69e28046b5ebc1ec3afb678b4c81c69e48c02aa8
Author: Jacky Li <[email protected]>
Date:   2014-11-20T23:48:36Z

    [SQL] fix function description mistake
    
    Sample code in the description of SchemaRDD.where is not correct
    
    Author: Jacky Li <[email protected]>
    
    Closes #3344 from jackylk/patch-6 and squashes the following commits:
    
    62cd126 [Jacky Li] [SQL] fix function description mistake
    
    (cherry picked from commit ad5f1f3ca240473261162c06ffc5aa70d15a5991)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 5153aa041fd4ca8b2a4df4d635598090280655c6
Author: Davies Liu <[email protected]>
Date:   2014-11-21T00:40:25Z

    [SPARK-4477] [PySpark] remove numpy from RDDSampler
    
    RDDSampler tries to use numpy to gain better performance for poisson(), but the
    number of calls to random() is only (1 + fraction) * N in the pure Python
    implementation of poisson(), so there is not much performance gain from numpy.
    
    numpy is not a dependency of pyspark, so it may introduce problems, such as numpy
    being installed on the master but not on the slaves, as reported in SPARK-927.
    
    It also complicates the code a lot, so we should remove numpy from RDDSampler.
    
    I also did some benchmark to verify that:
    ```
    >>> from pyspark.mllib.random import RandomRDDs
    >>> rdd = RandomRDDs.uniformRDD(sc, 1 << 20, 1).cache()
    >>> rdd.count()  # cache it
    >>> rdd.sample(True, 0.9).count()    # measure this line
    ```
    the results:
    
    | withReplacement | random | numpy.random |
    | --------------- | ------ | ------------ |
    | True            | 1.5 s  | 1.4 s        |
    | False           | 0.6 s  | 0.8 s        |
    
    closes #2313
    
    Note: this patch includes some commits that are not yet mirrored to GitHub; it
    will be OK after the mirror catches up.
    
    Author: Davies Liu <[email protected]>
    Author: Xiangrui Meng <[email protected]>
    
    Closes #3351 from davies/numpy and squashes the following commits:
    
    5c438d7 [Davies Liu] fix comment
    c5b9252 [Davies Liu] Merge pull request #1 from mengxr/SPARK-4477
    98eb31b [Xiangrui Meng] make poisson sampling slightly faster
    ee17d78 [Davies Liu] remove = for float
    13f7b05 [Davies Liu] Merge branch 'master' of 
http://git-wip-us.apache.org/repos/asf/spark into numpy
    f583023 [Davies Liu] fix tests
    51649f5 [Davies Liu] remove numpy in RDDSampler
    78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no 
performance gain
    f5fdf63 [Davies Liu] fix bug with int in weights
    4dfa2cd [Davies Liu] refactor
    f866bcf [Davies Liu] remove unneeded change
    c7a2007 [Davies Liu] switch to python implementation
    95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into 
randomSplit
    0d9b256 [Davies Liu] refactor
    1715ee3 [Davies Liu] address comments
    41fce54 [Davies Liu] randomSplit()
    
    (cherry picked from commit d39f2e9c683a4ab78b29eb3c5668325bf8568e8c)
    Signed-off-by: Xiangrui Meng <[email protected]>

commit 0f6a2eeaf20363061f9ed2d9062f3a7022c2c8ba
Author: Cheng Hao <[email protected]>
Date:   2014-11-21T00:50:59Z

    [SPARK-4244] [SQL] Support Hive Generic UDFs with constant object inspector 
parameters
    
    The query `SELECT named_struct(lower("AA"), "12", lower("Bb"), "13") FROM src
    LIMIT 1` will throw an exception: some Hive Generic UDFs/UDAFs require the input
    object inspector to be a `ConstantObjectInspector`, but we don't get that until
    expression optimization (constant folding) has executed.
    
    This PR is a workaround to fix this. (Ideally, the `output` of a LogicalPlan
    should be identical before and after optimization.)
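    
    For reference, the failing query from the description as a runnable pyspark call
    (a HiveContext `hiveContext` and Hive's sample table `src` are assumed):
    
    ```
    # Works once constant-folded literals are visible to the UDF's
    # constant object inspectors.
    hiveContext.sql(
        'SELECT named_struct(lower("AA"), "12", lower("Bb"), "13") FROM src LIMIT 1'
    ).collect()
    ```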
    
    Author: Cheng Hao <[email protected]>
    
    Closes #3109 from chenghao-intel/optimized and squashes the following 
commits:
    
    487ff79 [Cheng Hao] rebase to the latest master & update the unittest
    
    (cherry picked from commit 84d79ee9ec47465269f7b0a7971176da93c96f3f)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 64b30be7e4cb86059bbfeb3e2f8f47f41d015862
Author: Michael Armbrust <[email protected]>
Date:   2014-11-21T02:31:02Z

    [SPARK-4413][SQL] Parquet support through datasource API
    
    Goals:
     - Support for accessing parquet using SQL but not requiring Hive (thus 
allowing support of parquet tables with decimal columns)
     - Support for folder based partitioning with automatic discovery of 
available partitions
     - Caching of file metadata
    
    See scaladoc of `ParquetRelation2` for more details.
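    
    A sketch of exposing a parquet directory through the new datasource API without
    Hive (the path and table name are illustrative, and a SQLContext `sqlContext` is
    assumed):
    
    ```
    # Register a parquet directory as a temporary table via the datasources API,
    # then query it with plain SQL.
    sqlContext.sql("""
        CREATE TEMPORARY TABLE events
        USING org.apache.spark.sql.parquet
        OPTIONS (path '/data/events.parquet')
    """)
    sqlContext.sql("SELECT COUNT(*) FROM events").collect()
    ```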
    
    Author: Michael Armbrust <[email protected]>
    
    Closes #3269 from marmbrus/newParquet and squashes the following commits:
    
    1dd75f1 [Michael Armbrust] Pass all paths for FileInputFormat at once.
    645768b [Michael Armbrust] Review comments.
    abd8e2f [Michael Armbrust] Alternative implementation of parquet based on 
the datasources API.
    938019e [Michael Armbrust] Add an experimental interface to data sources 
that exposes catalyst expressions.
    e9d2641 [Michael Armbrust] logging / formatting improvements.
    
    (cherry picked from commit 02ec058efe24348cdd3691b55942e6f0ef138732)
    Signed-off-by: Michael Armbrust <[email protected]>

commit e445d3ce4e4fb9ee3c2feddb9734d541b61c6c01
Author: Davies Liu <[email protected]>
Date:   2014-11-21T03:12:45Z

    add Sphinx as a dependency of building docs
    
    Author: Davies Liu <[email protected]>
    
    Closes #3388 from davies/doc_readme and squashes the following commits:
    
    daa1482 [Davies Liu] add Sphinx dependency
    
    (cherry picked from commit 8cd6eea6298fc8e811dece38c2875e94ff863948)
    Signed-off-by: Patrick Wendell <[email protected]>

commit 668643b8de0958094766fa62e7e2a7a0909f11da
Author: Michael Armbrust <[email protected]>
Date:   2014-11-21T04:34:43Z

    [SPARK-4522][SQL] Parse schema with missing metadata.
    
    This is just a quick fix for 1.2.  SPARK-4523 describes a more complete 
solution.
    
    Author: Michael Armbrust <[email protected]>
    
    Closes #3392 from marmbrus/parquetMetadata and squashes the following 
commits:
    
    bcc6626 [Michael Armbrust] Parse schema with missing metadata.
    
    (cherry picked from commit 90a6a46bd11030672597f015dd443d954107123a)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 6f70e0295572e3037660004797040e026e440dbd
Author: zsxwing <[email protected]>
Date:   2014-11-21T08:42:43Z

    [SPARK-4472][Shell] Print "Spark context available as sc." only when 
SparkContext is created...
    
    ... successfully
    
    It's weird to print "Spark context available as sc" when the SparkContext was not
    created successfully.
    
    Author: zsxwing <[email protected]>
    
    Closes #3341 from zsxwing/SPARK-4472 and squashes the following commits:
    
    4850093 [zsxwing] Print "Spark context available as sc." only when 
SparkContext is created successfully
    
    (cherry picked from commit f1069b84b82b932751604bc20d5c2e451d57c455)
    Signed-off-by: Reynold Xin <[email protected]>

commit 6a01689a913a1a223fad66848c4fc17ab2931f22
Author: Patrick Wendell <[email protected]>
Date:   2014-11-21T20:10:04Z

    SPARK-4532: Fix bug in detection of Hive in Spark 1.2
    
    Because the Hive profile is no longer defined in the root pom,
    we need to check specifically in the sql/hive pom when we
    perform the check in make-distribution.sh.
    
    Author: Patrick Wendell <[email protected]>
    
    Closes #3398 from pwendell/make-distribution and squashes the following 
commits:
    
    8a58279 [Patrick Wendell] Fix bug in detection of Hive in Spark 1.2
    
    (cherry picked from commit a81918c5a66fc6040f9796fc1a9d4e0bfb8d0cbe)
    Signed-off-by: Patrick Wendell <[email protected]>

commit 9309ddfc3b9cca3780555fb3ac52d96343cb9545
Author: Davies Liu <[email protected]>
Date:   2014-11-21T23:02:31Z

    [SPARK-4531] [MLlib] cache serialized java object
    
    Pyrolite is pretty slow (compared to the ad hoc serializer in 1.1), and it causes
    a significant performance regression in 1.2, because we cache the serialized
    Python objects in the JVM and deserialize them into Java objects in each step.
    
    This PR changes the code to cache the deserialized JavaRDD instead of the
    PythonRDD, to avoid the Pyrolite deserialization. It should have similar memory
    usage as before, but be much faster.
    
    Author: Davies Liu <[email protected]>
    
    Closes #3397 from davies/cache and squashes the following commits:
    
    7f6e6ce [Davies Liu] Update -> Updater
    4b52edd [Davies Liu] using named argument
    63b984e [Davies Liu] fix
    7da0332 [Davies Liu] add unpersist()
    dff33e1 [Davies Liu] address comments
    c2bdfc2 [Davies Liu] refactor
    d572f00 [Davies Liu] Merge branch 'master' into cache
    f1063e1 [Davies Liu] cache serialized java object
    
    (cherry picked from commit ce95bd8e130b2c7688b94be40683bdd90d86012d)
    Signed-off-by: Xiangrui Meng <[email protected]>

commit 4b68cabf5894643deb99042268fb5b343e8d31f3
Author: DB Tsai <[email protected]>
Date:   2014-11-22T02:15:07Z

    [SPARK-4431][MLlib] Implement efficient foreachActive for dense and sparse 
vector
    
    Previously, we were using Breeze's activeIterator to access the non-zero elements
    in dense/sparse vectors. Due to the overhead, we switched back to a native
    `while loop` in #SPARK-4129.
    
    However, #SPARK-4129 requires de-referencing dv.values/sv.values on each access
    to a value, which is very expensive. Also, in MultivariateOnlineSummarizer, we're
    using Breeze's dense vector to store the partial stats, which is very expensive
    compared with using a primitive Scala array.
    
    In this PR, an efficient foreachActive is implemented to unify the code path for
    dense and sparse vector operations, which makes the codebase easier to maintain.
    The Breeze dense vector is replaced by a primitive array to further reduce the
    overhead.
    
    Benchmarking with mnist8m dataset on single JVM
    with first 200 samples loaded in memory, and repeating 5000 times.
    
    Before change:
    Sparse Vector - 30.02
    Dense Vector - 38.27
    
    With this PR:
    Sparse Vector - 6.29
    Dense Vector - 11.72
    
    Author: DB Tsai <[email protected]>
    
    Closes #3288 from dbtsai/activeIterator and squashes the following commits:
    
    844b0e6 [DB Tsai] formating
    03dd693 [DB Tsai] futher performance tunning.
    1907ae1 [DB Tsai] address feedback
    98448bb [DB Tsai] Made the override final, and had a local copy of 
variables which made the accessing a single step operation.
    c0cbd5a [DB Tsai] fix a bug
    6441f92 [DB Tsai] Finished SPARK-4431
    
    (cherry picked from commit b5d17ef10e2509d9886c660945449a89750c8116)
    Signed-off-by: Xiangrui Meng <[email protected]>

commit 1a12ca339cf038c44f5d7402d63851f48a055b35
Author: Sandy Ryza <[email protected]>
Date:   2014-11-24T19:28:48Z

    SPARK-4457. Document how to build for Hadoop versions greater than 2.4
    
    Author: Sandy Ryza <[email protected]>
    
    Closes #3322 from sryza/sandy-spark-4457 and squashes the following commits:
    
    5e72b77 [Sandy Ryza] Feedback
    0cf05c1 [Sandy Ryza] Caveat
    be8084b [Sandy Ryza] SPARK-4457. Document how to build for Hadoop versions 
greater than 2.4
    
    (cherry picked from commit 29372b63185a4a170178b6ec2362d7112f389852)
    Signed-off-by: Thomas Graves <[email protected]>

commit ee1bc892a32bb969b051b3bc3eaaf9a54af1c7a3
Author: Cheng Lian <[email protected]>
Date:   2014-11-24T20:43:45Z

    [SPARK-4479][SQL] Avoids unnecessary defensive copies when sort based 
shuffle is on
    
    This PR is a workaround for SPARK-4479. Two changes are introduced: when 
merge sort is bypassed in `ExternalSorter`,
    
    1. also bypass RDD elements buffering as buffering is the reason that 
`MutableRow` backed row objects must be copied, and
    2. avoids defensive copies in `Exchange` operator
    
    Author: Cheng Lian <[email protected]>
    
    Closes #3422 from liancheng/avoids-defensive-copies and squashes the 
following commits:
    
    591f2e9 [Cheng Lian] Passes all shuffle suites
    0c3c91e [Cheng Lian] Fixes shuffle write metrics when merge sort is bypassed
    ed5df3c [Cheng Lian] Fixes styling changes
    f75089b [Cheng Lian] Avoids unnecessary defensive copies when sort based 
shuffle is on
    
    (cherry picked from commit a6d7b61f92dc7c1f9632cecb232afa8040ab2b4d)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 1e3d22b9fd2c0a87330283c5097b2b7ec95a5715
Author: Daniel Darabos <[email protected]>
Date:   2014-11-24T20:45:07Z

    [SQL] Fix comment in HiveShim
    
    This file is for Hive 0.13.1 I think.
    
    Author: Daniel Darabos <[email protected]>
    
    Closes #3432 from darabos/patch-2 and squashes the following commits:
    
    4fd22ed [Daniel Darabos] Fix comment. This file is for Hive 0.13.1.
    
    (cherry picked from commit d5834f0732b586731034a7df5402c25454770fc5)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 0e7fa7f632ebe4db60938f2087c1f1a4d614ab32
Author: scwf <[email protected]>
Date:   2014-11-24T20:49:08Z

    [SQL] Fix path in HiveFromSpark
    
    It requires us to run `HiveFromSpark` in a specific dir because `HiveFromSpark`
    uses a relative path; this leads to a `run-example`
    error (http://apache-spark-developers-list.1001551.n3.nabble.com/src-main-resources-kv1-txt-not-found-in-example-of-HiveFromSpark-td9100.html).
    
    Author: scwf <[email protected]>
    
    Closes #3415 from scwf/HiveFromSpark and squashes the following commits:
    
    ed3d6c9 [scwf] revert no need change
    b00e20c [scwf] fix path usring spark_home
    dbd321b [scwf] fix path in hivefromspark
    
    (cherry picked from commit b384119304617459592b7ba435368dd6fcc3273e)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 97b7eb4d99613944d39f1421dccc2724c4165c9e
Author: Kousuke Saruta <[email protected]>
Date:   2014-11-24T20:54:37Z

    [SPARK-4487][SQL] Fix attribute reference resolution error when using ORDER 
BY.
    
    When we use an ORDER BY clause, the attributes referenced by the projection are
    resolved first (1). Then, the attributes referenced in the ORDER BY clause are
    resolved (2). But when resolving the attributes referenced in the ORDER BY
    clause, the resolution result generated in (1) is discarded, so, for example, the
    following query fails.
    
        SELECT c1 + c2 FROM mytable ORDER BY c1;
    
    The query above fails because when resolving the attribute reference 'c1', 
the resolution result of 'c2' is discarded.
    
    Author: Kousuke Saruta <[email protected]>
    
    Closes #3363 from sarutak/SPARK-4487 and squashes the following commits:
    
    fd314f3 [Kousuke Saruta] Fixed attribute resolution logic in Analyzer
    6e60c20 [Kousuke Saruta] Fixed conflicts
    cb5b7e9 [Kousuke Saruta] Added test case for SPARK-4487
    282d529 [Kousuke Saruta] Fixed attributes reference resolution error
    b6123e6 [Kousuke Saruta] Merge branch 'master' of 
git://git.apache.org/spark into concat-feature
    317b7fb [Kousuke Saruta] WIP
    
    (cherry picked from commit dd1c9cb36cde8202cede8014b5641ae8a0197812)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 2d35cc0852e5ce426b143b51d03a71f16ad06c11
Author: Josh Rosen <[email protected]>
Date:   2014-11-24T21:18:14Z

    [SPARK-4145] Web UI job pages
    
    This PR adds two new pages to the Spark Web UI:
    
    - A jobs overview page, which shows details on running / completed / failed 
jobs.
    - A job details page, which displays information on an individual job's 
stages.
    
    The jobs overview page is now the default UI homepage; the old homepage is 
still accessible at `/stages`.
    
    ### Screenshots
    
    #### New UI homepage
    
    
![image](https://cloud.githubusercontent.com/assets/50748/5119035/fd0a69e6-701f-11e4-89cb-db7e9705714f.png)
    
    #### Job details page
    
    (This is effectively a per-job version of the stages page that can be 
extended later with other things, such as DAG visualizations)
    
    
![image](https://cloud.githubusercontent.com/assets/50748/5134910/50b340d4-70c7-11e4-88e1-6b73237ea7c8.png)
    
    ### Key changes in this PR
    
    - Rename `JobProgressPage` to `AllStagesPage`
    - Expose `StageInfo` objects in the `SparkListenerJobStart` event; add
    backwards-compatibility tests to JsonProtocol.
    - Add additional data structures to `JobProgressListener` to map from 
stages to jobs.
    - Add several fields to `JobUIData`.
    
    I also added ~150 lines of Selenium tests as I uncovered UI issues while 
developing this patch.
    
    ### Limitations
    
    If a job contains stages that aren't run, then its overall job progress bar 
may be an underestimate of the total job progress; in other words, a completed 
job may appear to have a progress bar that's not at 100%.
    
    If stages or tasks fail, then the progress bar will not go backwards to 
reflect the true amount of remaining work.
    
    Author: Josh Rosen <[email protected]>
    
    Closes #3009 from JoshRosen/job-page and squashes the following commits:
    
    eb05e90 [Josh Rosen] Disable kill button in completed stages tables.
    f00c851 [Josh Rosen] Fix JsonProtocol compatibility
    b89c258 [Josh Rosen] More JSON protocol backwards-compatibility fixes.
    ff804cd [Josh Rosen] Don't write "Stage Ids" field in JobStartEvent JSON.
    6f17f3f [Josh Rosen] Only store StageInfos in SparkListenerJobStart event.
    2bbf41a [Josh Rosen] Update job progress bar to reflect skipped 
tasks/stages.
    61c265a [Josh Rosen] Add “skipped stages” table; only display non-empty 
tables.
    1f45d44 [Josh Rosen] Incorporate a bunch of minor review feedback.
    0b77e3e [Josh Rosen] More bug fixes for phantom stages.
    034aa8d [Josh Rosen] Use `.max()` to find result stage for job.
    eebdc2c [Josh Rosen] Don’t display pending stages for completed jobs.
    67080ba [Josh Rosen] Ensure that "phantom stages" don't cause memory leaks.
    7d10b97 [Josh Rosen] Merge remote-tracking branch 'apache/master' into 
job-page
    d69c775 [Josh Rosen] Fix table sorting on all jobs page.
    5eb39dc [Josh Rosen] Add pending stages table to job page.
    f2a15da [Josh Rosen] Add status field to job details page.
    171b53c [Josh Rosen] Move `startTime` to the start of SparkContext.
    e2f2c43 [Josh Rosen] Fix sorting of stages in job details page.
    8955f4c [Josh Rosen] Display information for pending stages on jobs page.
    8ab6c28 [Josh Rosen] Compute numTasks from job start stage infos.
    5884f91 [Josh Rosen] Add StageInfos to SparkListenerJobStart event.
    79793cd [Josh Rosen] Track indices of completed stage to avoid overcounting 
when failures occur.
    d62ea7b [Josh Rosen] Add failing Selenium test for stage overcounting issue.
    1145c60 [Josh Rosen] Display text instead of progress bar for stages.
    3d0a007 [Josh Rosen] Merge remote-tracking branch 'origin/master' into 
job-page
    8a2351b [Josh Rosen] Add help tooltip to Spark Jobs page.
    b7bf30e [Josh Rosen] Add stages progress bar; fix bug where active stages 
show as completed.
    4846ce4 [Josh Rosen] Hide "(Job Group") if no jobs were submitted in job 
groups.
    4d58e55 [Josh Rosen] Change label to "Tasks (for all stages)"
    85e9c85 [Josh Rosen] Extract startTime into separate variable.
    1cf4987 [Josh Rosen] Fix broken kill links; add Selenium test to avoid 
future regressions.
    56701fa [Josh Rosen] Move last stage name / description logic out of markup.
    a475ea1 [Josh Rosen] Add progress bars to jobs page.
    45343b8 [Josh Rosen] More comments
    4b206fb [Josh Rosen] Merge remote-tracking branch 'origin/master' into 
job-page
    bfce2b9 [Josh Rosen] Address review comments, except for progress bar.
    4487dcb [Josh Rosen] [SPARK-4145] Web UI job pages
    2568a6c [Josh Rosen] Rename JobProgressPage to AllStagesPage:
    
    (cherry picked from commit 4a90276ab22d6989dffb2ee2d8118d9253365646)
    Signed-off-by: Patrick Wendell <[email protected]>

commit 6fa3e415d419ee9b2f3d14106a714b627e251e7d
Author: Tathagata Das <[email protected]>
Date:   2014-11-24T21:50:20Z

    [SPARK-4518][SPARK-4519][Streaming] Refactored file stream to prevent files 
from being processed multiple times
    
    Because of a corner case, a file already selected for batch t can get 
considered again for batch t+2. This refactoring fixes it by remembering all 
the files selected in the last 1 minute, so that this corner case does not 
arise. Also uses spark context's hadoop configuration to access the file system 
API for listing directories.
    
    pwendell Please take look. I still have not run long-running integration 
tests, so I cannot say for sure whether this has indeed solved the issue. You 
could do a first pass on this in the meantime.
    
    Author: Tathagata Das <[email protected]>
    
    Closes #3419 from tdas/filestream-fix2 and squashes the following commits:
    
    c19dd8a [Tathagata Das] Addressed PR comments.
    513b608 [Tathagata Das] Updated docs.
    d364faf [Tathagata Das] Added the current time condition back
    5526222 [Tathagata Das] Removed unnecessary imports.
    38bb736 [Tathagata Das] Fix long line.
    203bbc7 [Tathagata Das] Un-ignore tests.
    eaef4e1 [Tathagata Das] Fixed SPARK-4519
    9dbd40a [Tathagata Das] Refactored FileInputDStream to remember last few 
batches.
    
    (cherry picked from commit cb0e9b0980f38befe88bf52aa037fe33262730f7)
    Signed-off-by: Tathagata Das <[email protected]>

commit 9ea67fc1ddd2aca70f6e2da38ebaf7ebc2398981
Author: Davies Liu <[email protected]>
Date:   2014-11-25T00:37:14Z

    [SPARK-4562] [MLlib] speedup vector
    
    This PR changes the underlying array of DenseVector to numpy.ndarray to avoid
    the conversion, because most users will be using numpy.array.
    
    It also improves the serialization of DenseVector.
    
    Before this change:
    
    | trial | trainingTime | testTime |
    | ----- | ------------ | -------- |
    | 0     | 5.126        | 1.786    |
    | 1     | 2.698        | 1.693    |
    
    After the change:
    
    | trial | trainingTime | testTime |
    | ----- | ------------ | -------- |
    | 0     | 4.692        | 0.554    |
    | 1     | 2.307        | 0.525    |
    
    This could partially fix the performance regression during test.
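    
    A minimal pyspark sketch of the affected round trip (the values are illustrative):
    
    ```
    import numpy as np
    from pyspark.mllib.linalg import Vectors
    
    # DenseVector is now backed by a numpy.ndarray, so converting to and from
    # numpy arrays avoids an extra copy.
    v = Vectors.dense(np.array([1.0, 2.0, 3.0]))
    arr = v.toArray()          # returns a numpy.ndarray
    print(arr.dot(arr))
    ```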
    
    Author: Davies Liu <[email protected]>
    
    Closes #3420 from davies/ser2 and squashes the following commits:
    
    0e1e6f3 [Davies Liu] fix tests
    426f5db [Davies Liu] impove toArray()
    44707ec [Davies Liu] add name for ISO-8859-1
    fa7d791 [Davies Liu] address comments
    1cfb137 [Davies Liu] handle zero sparse vector
    2548ee2 [Davies Liu] fix tests
    9e6389d [Davies Liu] bugfix
    470f702 [Davies Liu] speed up DenseMatrix
    f0d3c40 [Davies Liu] speedup SparseVector
    ef6ce70 [Davies Liu] speed up dense vector
    
    (cherry picked from commit b660de7a9cbdea3df4a37fbcf60c1c33c71782b8)
    Signed-off-by: Xiangrui Meng <[email protected]>

commit 2acbd2884f73c4503d753bb96e0acf75cd237536
Author: tkaessmann <[email protected]>
Date:   2014-11-25T00:40:19Z

    get raw vectors for further processing in Word2Vec
    
    e.g. clustering
    
    Author: tkaessmann <[email protected]>
    
    Closes #3309 from tkaessmann/branch-1.2 and squashes the following commits:
    
    e3a3142 [tkaessmann] changes the comment for getVectors
    58d3d83 [tkaessmann] removes sign from comment
    a5be213 [tkaessmann] fixes getVectors to fit code guidelines
    3782fa9 [tkaessmann] get raw vectors for further processing

commit 8371bc20821c39ee6d8116a867577e5c0fcd08ab
Author: Davies Liu <[email protected]>
Date:   2014-11-25T00:41:23Z

    [SPARK-4578] fix asDict() with nested Row()
    
    The Row object is created on the fly once a field is accessed, so we should
    access the fields via getattr() in asDict().
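    
    A minimal pyspark sketch of the method involved (the field names are
    illustrative):
    
    ```
    from pyspark.sql import Row
    
    # asDict() now reads fields through getattr(), so nested Row values are
    # materialized correctly.
    outer = Row(name="Alice", address=Row(city="SF", zip="94105"))
    d = outer.asDict()
    print(d["address"].city)
    ```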
    
    Author: Davies Liu <[email protected]>
    
    Closes #3434 from davies/fix_asDict and squashes the following commits:
    
    b20f1e7 [Davies Liu] fix asDict() with nested Row()
    
    (cherry picked from commit 050616b408c60eae02256913ceb645912dbff62e)
    Signed-off-by: Patrick Wendell <[email protected]>

commit 841f247a55df8b7f7252ab1b8067a1ea9aa45633
Author: Davies Liu <[email protected]>
Date:   2014-11-25T01:17:03Z

    [SPARK-4548] [SPARK-4517] improve performance of python broadcast
    
    Re-implement the Python broadcast using files:
    
    1) Serialize the Python object using cPickle and write it to disk.
    2) Create a wrapper in the JVM (for the dumped file) that reads the data from it
       during serialization.
    3) Use TorrentBroadcast or HttpBroadcast to transfer the (compressed) data to the
       executors.
    4) During deserialization, write the data to disk.
    5) Pass the path to the Python worker, which reads the data from disk and
       unpickles it into a Python object on first access.
    
    It fixes the performance regression introduced in #2659, has performance similar
    to 1.1, but supports objects larger than 2G and also improves memory efficiency
    (only one compressed copy in driver and executor).
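    
    The user-facing broadcast API is unchanged; a minimal sketch (the payload is
    illustrative, and a SparkContext `sc` is assumed):
    
    ```
    # The broadcast value is pickled to disk on the driver and only unpickled
    # in the worker on first access to bc.value.
    big = {i: str(i) for i in range(1000000)}   # stand-in for a large object
    bc = sc.broadcast(big)
    sizes = sc.parallelize(range(10)).map(lambda i: len(bc.value)).collect()
    print(sizes[0])
    ```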
    
    Testing with a 500M broadcast and 4 tasks (excluding the benefit from 
reused worker in 1.2):
    
    | name                     | 1.1   | 1.2 with this patch | improvement |
    | ------------------------ | ----- | ------------------- | ----------- |
    | python-broadcast-w-bytes | 25.20 | 9.33                | 170.13%     |
    | python-broadcast-w-set   | 4.13  | 4.50                | -8.35%      |
    
    Testing with 100 tasks (16 CPUs):
    
    | name                     | 1.1   | 1.2 with this patch | improvement |
    | ------------------------ | ----- | ------------------- | ----------- |
    | python-broadcast-w-bytes | 38.16 | 8.40                | 353.98%     |
    | python-broadcast-w-set   | 23.29 | 9.59                | 142.80%     |
    
    Author: Davies Liu <[email protected]>
    
    Closes #3417 from davies/pybroadcast and squashes the following commits:
    
    50a58e0 [Davies Liu] address comments
    b98de1d [Davies Liu] disable gc while unpickle
    e5ee6b9 [Davies Liu] support large string
    09303b8 [Davies Liu] read all data into memory
    dde02dd [Davies Liu] improve performance of python broadcast
    
    (cherry picked from commit 6cf507685efd01df77d663145ae08e48c7f92948)
    Signed-off-by: Josh Rosen <[email protected]>

commit 47d4fceffe90905fa8f50551e53c8d2e5b246cae
Author: Kay Ousterhout <[email protected]>
Date:   2014-11-25T02:03:10Z

    [SPARK-4266] [Web-UI] Reduce stage page load time.
    
    The commit changes the java script used to show/hide additional
    metrics in order to reduce page load time. SPARK-4016 significantly
    increased page load time for the stage page when stages had a lot
    (thousands or tens of thousands) of tasks, due to the additional
    Javascript to hide some metrics by default and stripe the tables.
    This commit reduces page load time in two ways:
    
    (1) Now, all of the metrics that are hidden by default are
    hidden by setting "display: none;" using CSS for the page,
    rather than hiding them using javascript after the page loads.
    Without this change, for stages with thousands of tasks, there
    was a few second delay after page load, where first the additional
    metrics were shown, and then after a delay were hidden once the
    relevant JS finished running.
    
    (2) CSS is used to stripe all of the tables except for the summary
    table. The summary table needs javascript to do the striping because
    some rows are hidden, but the javascript striping is slower, which
    again resulted in a delay when it was used for the task table (where
    for a few seconds after page load, all of the rows in the task table
    would be white, while the browser finished running the JS to stripe
    the table).
    
    cc pwendell
    
    This change is intended to be backported to 1.2 to avoid a regression in
    UI performance when users run large jobs.
    
    Author: Kay Ousterhout <[email protected]>
    
    Closes #3328 from kayousterhout/SPARK-4266 and squashes the following 
commits:
    
    f964091 [Kay Ousterhout] [SPARK-4266] [Web-UI] Reduce stage page load time.
    
    (cherry picked from commit d24d5bf064572a2319627736b1fbf112b4a78edf)
    Signed-off-by: Kay Ousterhout <[email protected]>

commit 4b4797309457b9301710b6e98550817337005eca
Author: Patrick Wendell <[email protected]>
Date:   2014-11-25T03:14:14Z

    [SPARK-4525] Mesos should decline unused offers
    
    Functionally, this is just a small change on top of #3393 (by jongyoul). 
The issue being addressed is discussed in the comments there. I have not yet 
added a test for the bug there. I will add one shortly.
    
    I've also done some minor renaming/clean-up of variables in this class and 
tests.
    
    Author: Patrick Wendell <[email protected]>
    Author: Jongyoul Lee <[email protected]>
    
    Closes #3436 from pwendell/mesos-issue and squashes the following commits:
    
    58c35b5 [Patrick Wendell] Adding unit test for this situation
    c4f0697 [Patrick Wendell] Additional clean-up and fixes on top of existing 
fix
    f20f1b3 [Jongyoul Lee] [SPARK-4525] MesosSchedulerBackend.resourceOffers 
cannot decline unused offers from acceptedOffers - Added code for declining 
unused offers among acceptedOffers - Edited testCase for checking declining 
unused offers
    
    (cherry picked from commit b043c27424d05e3200e7ba99a1a65656b57fa2f0)
    Signed-off-by: Patrick Wendell <[email protected]>

commit e7b8bf067a2606e381f2081db95d9c613391afef
Author: Patrick Wendell <[email protected]>
Date:   2014-11-25T03:20:09Z

    Revert "[SPARK-4525] Mesos should decline unused offers"
    
    This reverts commit 4b4797309457b9301710b6e98550817337005eca.
    
    I accidentally committed this using my own authorship credential. However,
    I should have given authorship to the original author: Jongyoul Lee.

commit 10e433919a9a3520007099a3876b47f74c046f12
Author: Jongyoul Lee <[email protected]>
Date:   2014-11-25T03:14:14Z

    [SPARK-4525] Mesos should decline unused offers
    
    Functionally, this is just a small change on top of #3393 (by jongyoul). 
The issue being addressed is discussed in the comments there. I have not yet 
added a test for the bug there. I will add one shortly.
    
    I've also done some minor renaming/clean-up of variables in this class and 
tests.
    
    Author: Patrick Wendell <[email protected]>
    Author: Jongyoul Lee <[email protected]>
    
    Closes #3436 from pwendell/mesos-issue and squashes the following commits:
    
    58c35b5 [Patrick Wendell] Adding unit test for this situation
    c4f0697 [Patrick Wendell] Additional clean-up and fixes on top of existing 
fix
    f20f1b3 [Jongyoul Lee] [SPARK-4525] MesosSchedulerBackend.resourceOffers 
cannot decline unused offers from acceptedOffers - Added code for declining 
unused offers among acceptedOffers - Edited testCase for checking declining 
unused offers
    
    (cherry picked from commit b043c27424d05e3200e7ba99a1a65656b57fa2f0)
    Signed-off-by: Patrick Wendell <[email protected]>

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
