GitHub user NamelessAnalyst opened a pull request:
https://github.com/apache/spark/pull/2568
SPARK-3716 [GraphX] Update Analytics.scala for partitionStrategy assignment
Previously, the val partitionStrategy was assigned by calling a function
in the Analytics object that was a copy of the PartitionStrategy.fromString()
method. That duplicate function has been removed, and partitionStrategy is
now assigned using the PartitionStrategy.fromString method directly. This
better matches the declarations of the edge/vertex StorageLevel variables.
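The fromString pattern being deduplicated here can be sketched outside Spark. This is a hypothetical Python stand-in, not GraphX code: the strategy names mirror GraphX's PartitionStrategy variants, but the classes and lookup are illustrative only.

```python
# Hypothetical sketch of a fromString-style lookup: one shared
# name-to-strategy mapping instead of a duplicated local copy.
class RandomVertexCut: pass
class EdgePartition1D: pass
class EdgePartition2D: pass
class CanonicalRandomVertexCut: pass

def from_string(name):
    # Build the dispatch table from the strategy classes themselves,
    # so adding a strategy needs no change to callers.
    strategies = {cls.__name__: cls for cls in
                  (RandomVertexCut, EdgePartition1D,
                   EdgePartition2D, CanonicalRandomVertexCut)}
    if name not in strategies:
        raise ValueError(f"Invalid PartitionStrategy: {name}")
    return strategies[name]()

assert isinstance(from_string("EdgePartition2D"), EdgePartition2D)
```

Keeping a single lookup like this is what lets the assignment site shrink to one call, matching the StorageLevel declarations.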
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/NamelessAnalyst/spark branch-1.1
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2568.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2568
----
commit 8c79574462eed113fc59d4323eedfc55c6e95c06
Author: Cheng Lian <[email protected]>
Date: 2014-08-16T18:26:51Z
[SQL] Using safe floating-point numbers in doctest
Test code in `sql.py` tries to compare two floating-point numbers directly,
which caused [build
failure(s)](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18365/consoleFull).
[Doctest
documentation](https://docs.python.org/3/library/doctest.html#warnings)
recommends using numbers in the form of `I/2**J` to avoid the precision issue.
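The I/2**J recommendation can be shown with a small self-contained doctest (plain Python, not the `sql.py` code in question): such fractions are exactly representable in binary floating point, so their printed form is stable.

```python
import doctest

def safe_fractions():
    """Floats of the form I/2**J are exactly representable in binary,
    so doctests that print them pass reliably across platforms.

    >>> 1/2 + 1/4
    0.75
    >>> 3/2**5
    0.09375
    """

# By contrast, 0.1 is not exactly representable, so direct comparison
# of decimal-looking floats is fragile:
assert 0.1 + 0.2 != 0.3
assert 1/2 + 1/4 == 0.75

failures = doctest.testmod().failed
assert failures == 0
```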
Author: Cheng Lian <[email protected]>
Closes #1925 from liancheng/fix-pysql-fp-test and squashes the following
commits:
0fbf584 [Cheng Lian] Removed unnecessary `...' from inferSchema doctest
e8059d4 [Cheng Lian] Using safe floating-point numbers in doctest
(cherry picked from commit b4a05928e95c0f6973fd21e60ff9c108f226e38c)
Signed-off-by: Michael Armbrust <[email protected]>
commit bd3ce2ffb8964abb4d59918ebb2c230fe4614aa2
Author: Kousuke Saruta <[email protected]>
Date: 2014-08-16T21:15:58Z
[SPARK-2677] BasicBlockFetchIterator#next can wait forever
Author: Kousuke Saruta <[email protected]>
Closes #1632 from sarutak/SPARK-2677 and squashes the following commits:
cddbc7b [Kousuke Saruta] Removed Exception throwing when
ConnectionManager#handleMessage receives ack for non-referenced message
d3bd2a8 [Kousuke Saruta] Modified configuration.md for
spark.core.connection.ack.timeout
e85f88b [Kousuke Saruta] Removed useless synchronized blocks
7ed48be [Kousuke Saruta] Modified ConnectionManager to use
ackTimeoutMonitor ConnectionManager-wide
9b620a6 [Kousuke Saruta] Merge branch 'master' of
git://git.apache.org/spark into SPARK-2677
0dd9ad3 [Kousuke Saruta] Modified typo in ConnectionManagerSuite.scala
7cbb8ca [Kousuke Saruta] Modified to match with scalastyle
8a73974 [Kousuke Saruta] Merge branch 'master' of
git://git.apache.org/spark into SPARK-2677
ade279a [Kousuke Saruta] Merge branch 'master' of
git://git.apache.org/spark into SPARK-2677
0174d6a [Kousuke Saruta] Modified ConnectionManager.scala to handle the
case remote Executor cannot ack
a454239 [Kousuke Saruta] Merge branch 'master' of
git://git.apache.org/spark into SPARK-2677
9b7b7c1 [Kousuke Saruta] (WIP) Modifying ConnectionManager.scala
(cherry picked from commit 76fa0eaf515fd6771cdd69422b1259485debcae5)
Signed-off-by: Josh Rosen <[email protected]>
commit 0b354be2f9ec35547a60591acf4f4773a4869690
Author: Xiangrui Meng <[email protected]>
Date: 2014-08-16T22:13:34Z
[SPARK-3048][MLLIB] add LabeledPoint.parse and remove
loadStreamingLabeledPoints
Move `parse()` from `LabeledPointParser` to `LabeledPoint` and make it
public. This breaks binary compatibility only when a user uses synthesized
methods like `tupled` and `curried`, which is rare.
`LabeledPoint.parse` is more consistent with `Vectors.parse`, which is why
`LabeledPointParser` is not preferred.
freeman-lab tdas
Author: Xiangrui Meng <[email protected]>
Closes #1952 from mengxr/labelparser and squashes the following commits:
c818fb2 [Xiangrui Meng] merge master
ce20e6f [Xiangrui Meng] update mima excludes
b386b8d [Xiangrui Meng] fix tests
2436b3d [Xiangrui Meng] add parse() to LabeledPoint
(cherry picked from commit 7e70708a99949549adde00cb6246a9582bbc4929)
Signed-off-by: Xiangrui Meng <[email protected]>
commit a12d3ae3223535e6e4c774e4a289b8b2f2e5228b
Author: Xiangrui Meng <[email protected]>
Date: 2014-08-16T22:14:43Z
[SPARK-3081][MLLIB] rename RandomRDDGenerators to RandomRDDs
`RandomRDDGenerators` means factory for `RandomRDDGenerator`. However, its
methods return RDDs but not RDDGenerators. So a more proper (and shorter) name
would be `RandomRDDs`.
dorx brkyvz
Author: Xiangrui Meng <[email protected]>
Closes #1979 from mengxr/randomrdds and squashes the following commits:
b161a2d [Xiangrui Meng] rename RandomRDDGenerators to RandomRDDs
(cherry picked from commit ac6411c6e75906997c78de23dfdbc8d225b87cfd)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 721f2fdc95032132af3d4a00dbc8399d356f8faf
Author: iAmGhost <[email protected]>
Date: 2014-08-16T23:48:38Z
[SPARK-3035] Wrong example with SparkContext.addFile
https://issues.apache.org/jira/browse/SPARK-3035
fix for wrong document.
Author: iAmGhost <[email protected]>
Closes #1942 from iAmGhost/master and squashes the following commits:
487528a [iAmGhost] [SPARK-3035] Wrong example with SparkContext.addFile fix
for wrong document.
(cherry picked from commit 379e7585c356f20bf8b4878ecba9401e2195da12)
Signed-off-by: Josh Rosen <[email protected]>
commit 5dd571c29ef97cadd23a54fcf4d5de869e3c56bc
Author: Davies Liu <[email protected]>
Date: 2014-08-16T23:59:34Z
[SPARK-1065] [PySpark] improve supporting for large broadcast
Passing large objects through py4j is very slow (and costs a lot of
memory), so broadcast objects are now passed via files (similar to
parallelize()). An option was added to keep the object in the driver
(False by default) to save memory in the driver.
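The file-based hand-off can be sketched in a few lines. This is an illustration of the idea only, not PySpark's broadcast implementation; the function names and file layout are invented for the sketch.

```python
# Sketch: pass a large object through a file so that only a short path
# string, not the serialized data, crosses the process boundary.
import os
import pickle
import tempfile

def write_broadcast(value):
    # Serialize the value to a temp file and hand back only the path.
    fd, path = tempfile.mkstemp(suffix=".broadcast")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(value, f, protocol=pickle.HIGHEST_PROTOCOL)
    return path

def read_broadcast(path):
    # The worker side loads the value from disk on first use.
    with open(path, "rb") as f:
        return pickle.load(f)

path = write_broadcast(list(range(10000)))
assert read_broadcast(path) == list(range(10000))
os.remove(path)
```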
Author: Davies Liu <[email protected]>
Closes #1912 from davies/broadcast and squashes the following commits:
e06df4a [Davies Liu] load broadcast from disk in driver automatically
db3f232 [Davies Liu] fix serialization of accumulator
631a827 [Davies Liu] Merge branch 'master' into broadcast
c7baa8c [Davies Liu] compress serrialized broadcast and command
9a7161f [Davies Liu] fix doc tests
e93cf4b [Davies Liu] address comments: add test
6226189 [Davies Liu] improve large broadcast
(cherry picked from commit 2fc8aca086a2679b854038b7e2c488f19039ecbd)
Signed-off-by: Josh Rosen <[email protected]>
commit f02e327f0bc975e7f33092e449bc0edd95f95580
Author: GuoQiang Li <[email protected]>
Date: 2014-08-17T03:05:55Z
In the stop method of ConnectionManager to cancel the ackTimeoutMonitor
cc JoshRosen sarutak
Author: GuoQiang Li <[email protected]>
Closes #1989 from witgo/cancel_ackTimeoutMonitor and squashes the following
commits:
4a700fa [GuoQiang Li] In the stop method of ConnectionManager to cancel the
ackTimeoutMonitor
(cherry picked from commit bc95fe08dff62a0abea314ab4ab9275c8f119598)
Signed-off-by: Josh Rosen <[email protected]>
commit 413a329e186de2ec96f80f614c36678bee6f332f
Author: Xiangrui Meng <[email protected]>
Date: 2014-08-17T04:16:27Z
[SPARK-3077][MLLIB] fix some chisq-test
- promote nullHypothesis field in ChiSqTestResult to TestResult. Every test
should have a null hypothesis
- correct null hypothesis statement for independence test
- p-value: 0.01 -> 0.1
Author: Xiangrui Meng <[email protected]>
Closes #1982 from mengxr/fix-chisq and squashes the following commits:
5f0de02 [Xiangrui Meng] make ChiSqTestResult constructor package private
bc74ea1 [Xiangrui Meng] update chisq-test
(cherry picked from commit fbad72288d8b6e641b00417a544cae6e8bfef2d7)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 91af120b4391656cb8f7b2300202dc622c032c33
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-17T06:53:14Z
[SPARK-3042] [mllib] DecisionTree Filter top-down instead of bottom-up
DecisionTree needs to match each example to a node at each iteration. It
currently does this with a set of filters very inefficiently: For each example,
it examines each node at the current level and traces up to the root to see if
that example should be handled by that node.
Fix: Filter top-down using the partly built tree itself.
Major changes:
* Eliminated Filter class, findBinsForLevel() method.
* Set up node parent links in main loop over levels in train().
* Added predictNodeIndex() for filtering top-down.
* Added DTMetadata class
Other changes:
* Pre-compute set of unorderedFeatures.
Notes for following expected PR based on
[https://issues.apache.org/jira/browse/SPARK-3043]:
* The unorderedFeatures set will next be stored in a metadata structure to
simplify function calls (to store other items such as the data in strategy).
I've done initial tests indicating that this speeds things up, but am only
now running large-scale ones.
CC: mengxr manishamde chouqin Any comments are welcome---thanks!
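The top-down filtering idea, finding an example's node by following splits from the root through the partly built tree, can be sketched with a toy tree. The dict-based node shape here is invented for illustration and does not mirror MLlib's actual Node class.

```python
# Sketch of predictNodeIndex-style top-down assignment: instead of
# checking every node's filter chain per example, walk splits from the
# root until reaching a node that has not been split yet.
def predict_node_index(node, features):
    if node.get("split") is None:
        return node["id"]  # leaf / not-yet-split node: example lands here
    feature, threshold = node["split"]
    child = "left" if features[feature] <= threshold else "right"
    return predict_node_index(node[child], features)

tree = {
    "id": 0, "split": (0, 5.0),
    "left": {"id": 1, "split": None},
    "right": {"id": 2, "split": (1, 2.0),
              "left": {"id": 3, "split": None},
              "right": {"id": 4, "split": None}},
}
assert predict_node_index(tree, [3.0, 9.9]) == 1
assert predict_node_index(tree, [7.0, 1.0]) == 3
```

Each example costs one root-to-frontier walk per iteration, rather than a per-node scan of filter ancestry.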
Author: Joseph K. Bradley <[email protected]>
Closes #1975 from jkbradley/dt-opt2 and squashes the following commits:
a0ed0da [Joseph K. Bradley] Renamed DTMetadata to DecisionTreeMetadata.
Small doc updates.
3726d20 [Joseph K. Bradley] Small code improvements based on code review.
ac0b9f8 [Joseph K. Bradley] Small updates based on code review. Main
change: Now using << instead of math.pow.
db0d773 [Joseph K. Bradley] scala style fix
6a38f48 [Joseph K. Bradley] Added DTMetadata class for cleaner code
931a3a7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-opt2
797f68a [Joseph K. Bradley] Fixed DecisionTreeSuite bug for training second
level. Needed to update treePointToNodeIndex with groupShift.
f40381c [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
6b5651e [Joseph K. Bradley] Updates based on code review. 1 major change:
persisting to memory + disk, not just memory.
2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-opt1
26d10dd [Joseph K. Bradley] Removed tree/model/Filter.scala since no longer
used. Removed debugging println calls in DecisionTree.scala.
356daba [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
430d782 [Joseph K. Bradley] Added more debug info on binning error. Added
some docs.
d036089 [Joseph K. Bradley] Print timing info to logDebug.
e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods
private
8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own
file, and cleaned it up. Removed debugging println calls from DecisionTree.
Made TreePoint extend Serialiable
a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-opt1
c1565a5 [Joseph K. Bradley] Small DecisionTree updates: * Simplification:
Updated calculateGainForSplit to take aggregates for a single (feature, split)
pair. * Internal doc: findAggForOrderedFeatureClassification
b914f3b [Joseph K. Bradley] DecisionTree optimization: eliminated filters +
small changes
b2ed1f3 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-opt
0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint
representation to avoid calling findBin multiple times. * (not working yet, but
debugging)
f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-timing
bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-timing
511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-timing
a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
(cherry picked from commit 73ab7f141c205df277c6ac19252e590d6806c41f)
Signed-off-by: Xiangrui Meng <[email protected]>
commit d411f4190252546b0ea99c1934efd5e5f84be50c
Author: Patrick Wendell <[email protected]>
Date: 2014-08-17T22:48:39Z
SPARK-2881: Upgrade to Snappy 1.0.5.3 to avoid SPARK-2881.
This version of Snappy was released with a backported fix specifically
for Spark. This fixes an issue where names collide in the snappy .so
file when users are submitting jobs as different users on the same
cluster.
Author: Patrick Wendell <[email protected]>
Closes #1999 from pwendell/snappy-upgrade and squashes the following
commits:
38974ff [Patrick Wendell] SPARK-2881: Upgrade to Snappy 1.0.5.3 to avoid
SPARK-2881.
commit c6a0091ea401e0bec58d7607eb42be89cc090868
Author: Michael Armbrust <[email protected]>
Date: 2014-08-18T01:10:45Z
Revert "[SPARK-2970] [SQL] spark-sql script ends with IOException when
EventLogging is enabled"
Revert #1891 due to issues with hadoop 1 compatibility.
Author: Michael Armbrust <[email protected]>
Closes #2007 from marmbrus/revert1891 and squashes the following commits:
68706c0 [Michael Armbrust] Revert "[SPARK-2970] [SQL] spark-sql script ends
with IOException when EventLogging is enabled"
(cherry picked from commit 5ecb08ea063166564178885b7515abef0d76eecb)
Signed-off-by: Michael Armbrust <[email protected]>
commit 4f776dfab726f54c948a83a7157b958903c15ecf
Author: Michael Armbrust <[email protected]>
Date: 2014-08-18T02:00:38Z
[SQL] Improve debug logging and toStrings.
Author: Michael Armbrust <[email protected]>
Closes #2004 from marmbrus/codgenDebugging and squashes the following
commits:
b7a7e41 [Michael Armbrust] Improve debug logging and toStrings.
(cherry picked from commit bfa09b01d7eddc572cd22ca2e418a735b4ccc826)
Signed-off-by: Michael Armbrust <[email protected]>
commit 826356725ffb3189180f7879d3f9c449924785f3
Author: Chris Fregly <[email protected]>
Date: 2014-08-18T02:33:15Z
[SPARK-1981] updated streaming-kinesis.md
Fixed markup, separated out sections more clearly, added more thorough
explanations.
Author: Chris Fregly <[email protected]>
Closes #1757 from cfregly/master and squashes the following commits:
9b1c71a [Chris Fregly] better explained why spark checkpoints are disabled
in the example (due to no stateful operations being used)
0f37061 [Chris Fregly] SPARK-1981: (Kinesis streaming support) updated
streaming-kinesis.md
862df67 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
8e1ae2e [Chris Fregly] Merge remote-tracking branch 'upstream/master'
4774581 [Chris Fregly] updated docs, renamed retry to retryRandom to be
more clear, removed retries around store() method
0393795 [Chris Fregly] moved Kinesis examples out of examples/ and back
into extras/kinesis-asl
691a6be [Chris Fregly] fixed tests and formatting, fixed a bug with
JavaKinesisWordCount during union of streams
0e1c67b [Chris Fregly] Merge remote-tracking branch 'upstream/master'
74e5c7c [Chris Fregly] updated per TD's feedback. simplified examples,
updated docs
e33cbeb [Chris Fregly] Merge remote-tracking branch 'upstream/master'
bf614e9 [Chris Fregly] per matei's feedback: moved the kinesis examples
into the examples/ dir
d17ca6d [Chris Fregly] per TD's feedback: updated docs, simplified the
KinesisUtils api
912640c [Chris Fregly] changed the foundKinesis class to be a
publically-avail class
db3eefd [Chris Fregly] Merge remote-tracking branch 'upstream/master'
21de67f [Chris Fregly] Merge remote-tracking branch 'upstream/master'
6c39561 [Chris Fregly] parameterized the versions of the aws java sdk and
kinesis client
338997e [Chris Fregly] improve build docs for kinesis
828f8ae [Chris Fregly] more cleanup
e7c8978 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
cd68c0d [Chris Fregly] fixed typos and backward compatibility
d18e680 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
b3b0ff1 [Chris Fregly] [SPARK-1981] Add AWS Kinesis streaming support
(cherry picked from commit 99243288b049f4a4fb4ba0505ea2310be5eb4bd2)
Signed-off-by: Tathagata Das <[email protected]>
commit 8438daf2c2a04e48465fc2681d142ca5a6dec747
Author: Xiangrui Meng <[email protected]>
Date: 2014-08-18T03:53:18Z
[SPARK-3087][MLLIB] fix col indexing bug in chi-square and add a check for
number of distinct values
There is a bug determining the column index. dorx
Author: Xiangrui Meng <[email protected]>
Closes #1997 from mengxr/chisq-index and squashes the following commits:
8fc2ab2 [Xiangrui Meng] fix col indexing bug and add a check for number of
distinct values
(cherry picked from commit c77f40668fbb5b8bca9a9b25c039895cb7a4a80c)
Signed-off-by: Xiangrui Meng <[email protected]>
commit a5ae720745d744ec29741b49d2d362f362d53fa4
Author: Patrick Wendell <[email protected]>
Date: 2014-08-18T05:29:58Z
SPARK-2884: Create binary builds in parallel with release script.
commit 0506539b0e853d474183078814fb0f550bfbbd67
Author: Sandy Ryza <[email protected]>
Date: 2014-08-18T05:39:06Z
SPARK-2900. aggregate inputBytes per stage
Author: Sandy Ryza <[email protected]>
Closes #1826 from sryza/sandy-spark-2900 and squashes the following commits:
43f9091 [Sandy Ryza] SPARK-2900
(cherry picked from commit df652ea02a3e42d987419308ef14874300347373)
Signed-off-by: Patrick Wendell <[email protected]>
commit 708cde99a142c90f5a06c7aa326b622d80022e3d
Author: Liquan Pei <[email protected]>
Date: 2014-08-18T06:29:44Z
[SPARK-3097][MLlib] Word2Vec performance improvement
mengxr Please review the code. Adding weights in reduceByKey soon.
Only output model entries for words that appeared in the partition before
merging, and use reduceByKey to combine models. In general, this
implementation is 30s or so faster than the implementation using a big array.
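The merge strategy can be sketched as a commutative reduce over per-partition dicts, keeping each partial model small by emitting only words actually seen. This is a plain-Python analogy (functools.reduce standing in for reduceByKey), not the Word2Vec code itself.

```python
# Sketch: combine per-partition partial models word-by-word. A word
# missing from one side keeps the other side's vector, so sparse
# per-partition dicts stay small until the final merge.
from functools import reduce

def merge_models(a, b):
    merged = dict(a)
    for word, vec in b.items():
        if word in merged:
            merged[word] = [x + y for x, y in zip(merged[word], vec)]
        else:
            merged[word] = vec
    return merged

partitions = [
    {"spark": [1.0, 0.0]},
    {"spark": [0.5, 0.5], "graph": [0.0, 1.0]},
]
model = reduce(merge_models, partitions)
assert model == {"spark": [1.5, 0.5], "graph": [0.0, 1.0]}
```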
Author: Liquan Pei <[email protected]>
Closes #1932 from Ishiihara/Word2Vec-improve2 and squashes the following
commits:
d5377a9 [Liquan Pei] use syn0Global and syn1Global to represent model
cad2011 [Liquan Pei] bug fix for synModify array out of bound
083aa66 [Liquan Pei] update synGlobal in place and reduce synOut size
9075e1c [Liquan Pei] combine syn0Global and syn1Global to synGlobal
aa2ab36 [Liquan Pei] use reduceByKey to combine models
(cherry picked from commit 3c8fa505900ac158d57de36f6b0fd6da05f8893b)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 518258f1ba4d79a72e1a97ebebb1b51cd392c503
Author: Liquan Pei <[email protected]>
Date: 2014-08-18T06:30:47Z
[SPARK-2842][MLlib]Word2Vec documentation
mengxr
Documentation for Word2Vec
Author: Liquan Pei <[email protected]>
Closes #2003 from Ishiihara/Word2Vec-doc and squashes the following commits:
4ff11d4 [Liquan Pei] minor fix
8d7458f [Liquan Pei] code reformat
6df0dcb [Liquan Pei] add Word2Vec documentation
(cherry picked from commit eef779b8d631de971d440051cae21040f4de558f)
Signed-off-by: Xiangrui Meng <[email protected]>
commit e0bc333b6ad36feac5397600fe6948dcb37a8e44
Author: Liquan Pei <[email protected]>
Date: 2014-08-18T08:15:45Z
[MLlib] Remove transform(dataset: RDD[String]) from Word2Vec public API
mengxr
Remove transform(dataset: RDD[String]) from public API.
Author: Liquan Pei <[email protected]>
Closes #2010 from Ishiihara/Word2Vec-api and squashes the following commits:
17b1031 [Liquan Pei] remove transform(dataset: RDD[String]) from public API
(cherry picked from commit 9306b8c6c8c412b9d0d5cffb6bd7a87784f0f6bf)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 12f16ba3fa1f3cde9f43c094029017f4192b1bac
Author: Chandan Kumar <[email protected]>
Date: 2014-08-18T16:52:25Z
[SPARK-2862] histogram method fails on some choices of bucketCount
Author: Chandan Kumar <[email protected]>
Closes #1787 from nrchandan/spark-2862 and squashes the following commits:
a76bbf6 [Chandan Kumar] [SPARK-2862] Fix for a broken test case and add new
test cases
4211eea [Chandan Kumar] [SPARK-2862] Add Scala bug id
13854f1 [Chandan Kumar] [SPARK-2862] Use shorthand range notation to avoid
Scala bug
(cherry picked from commit f45efbb8aaa65bc46d65e77e93076fbc29f4455d)
Signed-off-by: Xiangrui Meng <[email protected]>
commit ec0b91edd592cf89be349e0e5ad7553e02f70cd3
Author: Patrick Wendell <[email protected]>
Date: 2014-08-18T17:00:46Z
SPARK-3096: Include parquet hive serde by default in build
A small change - we should just add this dependency. It doesn't have any
recursive deps and it's needed for reading hive parquet tables.
Author: Patrick Wendell <[email protected]>
Closes #2009 from pwendell/parquet and squashes the following commits:
e411f9f [Patrick Wendell] SPARk-309: Include parquet hive serde by default
in build
(cherry picked from commit 7ae28d1247e4756219016206c51fec1656e3917b)
Signed-off-by: Michael Armbrust <[email protected]>
commit 55e9dd637bdef3a2acf56af95410219e23c9502a
Author: Matei Zaharia <[email protected]>
Date: 2014-08-18T17:05:52Z
[SPARK-3084] [SQL] Collect broadcasted tables in parallel in joins
BroadcastHashJoin has a broadcastFuture variable that tries to collect
the broadcasted table in a separate thread, but this doesn't help
because it's a lazy val that only gets initialized when you attempt to
build the RDD. Thus queries that broadcast multiple tables would collect
and broadcast them sequentially. I changed this to a val to let it start
collecting right when the operator is created.
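The val vs. lazy val difference can be demonstrated outside Spark. Here Python futures stand in for Scala's eager `val` and plain calls stand in for `lazy val`; the timings and table names are illustrative only.

```python
# Sketch: eager initialization lets independent collections overlap;
# lazy initialization serializes them.
import time
from concurrent.futures import ThreadPoolExecutor

def collect_table(name):
    time.sleep(0.2)  # stand-in for collecting one broadcasted table
    return name + "-collected"

with ThreadPoolExecutor(max_workers=2) as pool:
    # Eager (like a plain `val`): both collections start immediately.
    start = time.time()
    futures = [pool.submit(collect_table, t) for t in ("a", "b")]
    eager_results = [f.result() for f in futures]
    eager_seconds = time.time() - start

# Lazy (like a `lazy val`): each collection starts only when its result
# is first demanded, so the two run back to back.
start = time.time()
lazy_results = [collect_table(t) for t in ("a", "b")]
lazy_seconds = time.time() - start

assert eager_results == lazy_results == ["a-collected", "b-collected"]
assert eager_seconds < lazy_seconds
```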
Author: Matei Zaharia <[email protected]>
Closes #1990 from mateiz/spark-3084 and squashes the following commits:
f468766 [Matei Zaharia] [SPARK-3084] Collect broadcasted tables in parallel
in joins
(cherry picked from commit 6a13dca12fac06f3af892ffcc8922cc84f91b786)
Signed-off-by: Michael Armbrust <[email protected]>
commit 4da76fc81c224b04bd652c4a72fb77516a32de0c
Author: Matei Zaharia <[email protected]>
Date: 2014-08-18T17:45:24Z
[SPARK-3085] [SQL] Use compact data structures in SQL joins
This reuses the CompactBuffer from Spark Core to save memory and pointer
dereferences. I also tried AppendOnlyMap instead of java.util.HashMap
but unfortunately that slows things down because it seems to do more
equals() calls and the equals on GenericRow, and especially JoinedRow,
is pretty expensive.
Author: Matei Zaharia <[email protected]>
Closes #1993 from mateiz/spark-3085 and squashes the following commits:
188221e [Matei Zaharia] Remove unneeded import
5f903ee [Matei Zaharia] [SPARK-3085] [SQL] Use compact data structures in
SQL joins
(cherry picked from commit 4bf3de71074053af94f077c99e9c65a1962739e1)
Signed-off-by: Michael Armbrust <[email protected]>
commit 496f62d9a98067256d8a51fd1e7a485ff6492fa8
Author: Patrick Wendell <[email protected]>
Date: 2014-08-18T17:52:20Z
SPARK-3025 [SQL]: Allow JDBC clients to set a fair scheduler pool
This definitely needs review as I am not familiar with this part of Spark.
I tested this locally and it did seem to work.
Author: Patrick Wendell <[email protected]>
Closes #1937 from pwendell/scheduler and squashes the following commits:
b858e33 [Patrick Wendell] SPARK-3025: Allow JDBC clients to set a fair
scheduler pool
(cherry picked from commit 6bca8898a1aa4ca7161492229bac1748b3da2ad7)
Signed-off-by: Michael Armbrust <[email protected]>
commit 2ae2857986e94d5a8bd5f4660eabe5689463bd21
Author: Matei Zaharia <[email protected]>
Date: 2014-08-18T18:00:10Z
[SPARK-3091] [SQL] Add support for caching metadata on Parquet files
For larger Parquet files, reading the file footers (which is done in
parallel on up to 5 threads) and HDFS block locations (which is serial) can
take multiple seconds. We can add an option to cache this data within
FilteringParquetInputFormat. Unfortunately ParquetInputFormat only caches
footers within each instance of ParquetInputFormat, not across them.
Note: this PR leaves this turned off by default for 1.1, but I believe it's
safe to turn it on after. The keys in the hash maps are FileStatus objects that
include a modification time, so this will work fine if files are modified. The
location cache could become invalid if files have moved within HDFS, but that's
rare so I just made it invalidate entries every 15 minutes.
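The time-based invalidation described above (entries expiring after 15 minutes) can be sketched as a minimal TTL cache. The class and names here are invented for illustration; the real change uses Guava caches inside FilteringParquetInputFormat.

```python
# Sketch: cache expensive lookups (e.g. file footers, block locations)
# and expire entries after a fixed TTL so stale data self-corrects.
import time

class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (inserted_at, value)

    def get(self, key, loader):
        now = time.time()
        hit = self.entries.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]          # fresh entry: skip the expensive load
        value = loader()           # e.g. read footer / block locations
        self.entries[key] = (now, value)
        return value

cache = TTLCache(ttl_seconds=15 * 60)
calls = []
loader = lambda: calls.append(1) or "locations"
assert cache.get("part-0.parquet", loader) == "locations"
assert cache.get("part-0.parquet", loader) == "locations"
assert len(calls) == 1  # second lookup served from cache
```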
Author: Matei Zaharia <[email protected]>
Closes #2005 from mateiz/parquet-cache and squashes the following commits:
dae8efe [Matei Zaharia] Bug fix
c71e9ed [Matei Zaharia] Handle empty statuses directly
22072b0 [Matei Zaharia] Use Guava caches and add a config option for
caching metadata
8fb56ce [Matei Zaharia] Cache file block locations too
453bd21 [Matei Zaharia] Bug fix
4094df6 [Matei Zaharia] First attempt at caching Parquet footers
(cherry picked from commit 9eb74c7d2cbe127dd4c32bf1a8318497b2fb55b6)
Signed-off-by: Michael Armbrust <[email protected]>
commit cc4015d2fa3785b92e6ab079b3abcf17627f7c56
Author: Michael Armbrust <[email protected]>
Date: 2014-08-18T20:17:10Z
[SPARK-2406][SQL] Initial support for using ParquetTableScan to read
HiveMetaStore tables.
This PR adds an experimental flag `spark.sql.hive.convertMetastoreParquet`
that, when true, causes the planner to detect tables that use Hive's Parquet
SerDe and instead plan them using Spark SQL's native `ParquetTableScan`.
Author: Michael Armbrust <[email protected]>
Author: Yin Huai <[email protected]>
Closes #1819 from marmbrus/parquetMetastore and squashes the following
commits:
1620079 [Michael Armbrust] Revert "remove hive parquet bundle"
cc30430 [Michael Armbrust] Merge remote-tracking branch 'origin/master'
into parquetMetastore
4f3d54f [Michael Armbrust] fix style
41ebc5f [Michael Armbrust] remove hive parquet bundle
a43e0da [Michael Armbrust] Merge remote-tracking branch 'origin/master'
into parquetMetastore
4c4dc19 [Michael Armbrust] Fix bug with tree splicing.
ebb267e [Michael Armbrust] include parquet hive to tests pass (Remove this
later).
c0d9b72 [Michael Armbrust] Avoid creating a HadoopRDD per partition. Add
dirty hacks to retrieve partition values from the InputSplit.
8cdc93c [Michael Armbrust] Merge pull request #8 from yhuai/parquetMetastore
a0baec7 [Yin Huai] Partitioning columns can be resolved.
1161338 [Michael Armbrust] Add a test to make sure conversion is actually
happening
212d5cd [Michael Armbrust] Initial support for using ParquetTableScan to
read HiveMetaStore tables.
(cherry picked from commit 3abd0c1cda09bb575adc99847a619bc84af37fd0)
Signed-off-by: Michael Armbrust <[email protected]>
commit e083334634ca0d7a25dee864fb2b9558ee92a2f7
Author: Davies Liu <[email protected]>
Date: 2014-08-18T20:58:35Z
[SPARK-3103] [PySpark] fix saveAsTextFile() with utf-8
Bugfix: it raised an exception when it tried to encode non-ASCII strings
into unicode. It should only encode unicode as "utf-8".
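The fix's logic can be sketched in Python 3 terms: encode only text, and pass already-encoded bytes through untouched, instead of blindly encoding every record. This is an illustration of the principle, not the PySpark code itself.

```python
# Sketch: encode unicode text as utf-8; leave bytes alone. Blindly
# calling .encode on everything raises for non-ASCII byte data.
def to_utf8(record):
    if isinstance(record, str):
        return record.encode("utf-8")
    return record  # already bytes: pass through unchanged

assert to_utf8("héllo") == "héllo".encode("utf-8")
assert to_utf8(b"raw bytes") == b"raw bytes"
```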
Author: Davies Liu <[email protected]>
Closes #2018 from davies/fix_utf8 and squashes the following commits:
4db7967 [Davies Liu] fix saveAsTextFile() with utf-8
(cherry picked from commit d1d0ee41c27f1d07fed0c5d56ba26c723cc3dc26)
Signed-off-by: Josh Rosen <[email protected]>
commit 25cabd7eec6e499fce94bce0d45087e9d8726a50
Author: Marcelo Vanzin <[email protected]>
Date: 2014-08-18T21:10:10Z
[SPARK-2718] [yarn] Handle quotes and other characters in user args.
Due to the way Yarn runs things through bash, normal quoting doesn't
work as expected. This change applies the necessary voodoo to the user
args to avoid issues with bash and special characters.
The change also uncovered an issue with the event logger app name
sanitizing code; it wasn't cleaning up all "bad" characters, so
sometimes it would fail to create the log dirs. I just added some
more bad character replacements.
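The kind of escaping needed here can be shown with Python's standard shell escaper as an analogy; shlex.quote plays the role of the quoting the change applies to Yarn user args before bash interprets them. The example arg is invented.

```python
# Sketch: an argument containing quotes, spaces, variables, and
# backslashes survives shell tokenization only if properly escaped.
import shlex

user_arg = 'it\'s a "test" with $HOME and \\backslashes\\'

# Quoted, the arg comes back from shell word-splitting as a single,
# literal word; unquoted, the shell would split and expand it.
quoted = shlex.quote(user_arg)
assert shlex.split(quoted) == [user_arg]
```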
Author: Marcelo Vanzin <[email protected]>
Closes #1724 from vanzin/SPARK-2718 and squashes the following commits:
cc84b89 [Marcelo Vanzin] Review feedback.
c1a257a [Marcelo Vanzin] Add test for backslashes.
55571d4 [Marcelo Vanzin] Unbreak yarn-client.
515613d [Marcelo Vanzin] [SPARK-2718] [yarn] Handle quotes and other
characters in user args.
(cherry picked from commit 6201b27643023569e19b68aa9d5c4e4e59ce0d79)
Signed-off-by: Andrew Or <[email protected]>
commit 98778fffdb4e11593149eb7770071a0728653f19
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-18T21:40:05Z
[mllib] DecisionTree: treeAggregate + Python example bug fix
Small DecisionTree updates:
* Changed main DecisionTree aggregate to treeAggregate.
* Fixed bug in python example decision_tree_runner.py with missing argument
(since categoricalFeaturesInfo is no longer an optional argument for
trainClassifier).
* Fixed same bug in python doc tests, and added tree.py to doc tests.
CC: mengxr
Author: Joseph K. Bradley <[email protected]>
Closes #2015 from jkbradley/dt-opt2 and squashes the following commits:
b5114fa [Joseph K. Bradley] Fixed python tree.py doc test (extra newline)
8e4665d [Joseph K. Bradley] Added tree.py to python doc tests. Fixed bug
from missing categoricalFeaturesInfo argument.
b7b2922 [Joseph K. Bradley] Fixed bug in python example
decision_tree_runner.py with missing argument. Changed main DecisionTree
aggregate to treeAggregate.
85bbc1f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-opt2
66d076f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-opt2
a0ed0da [Joseph K. Bradley] Renamed DTMetadata to DecisionTreeMetadata.
Small doc updates.
3726d20 [Joseph K. Bradley] Small code improvements based on code review.
ac0b9f8 [Joseph K. Bradley] Small updates based on code review. Main
change: Now using << instead of math.pow.
db0d773 [Joseph K. Bradley] scala style fix
6a38f48 [Joseph K. Bradley] Added DTMetadata class for cleaner code
931a3a7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-opt2
797f68a [Joseph K. Bradley] Fixed DecisionTreeSuite bug for training second
level. Needed to update treePointToNodeIndex with groupShift.
f40381c [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
6b5651e [Joseph K. Bradley] Updates based on code review. 1 major change:
persisting to memory + disk, not just memory.
2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-opt1
26d10dd [Joseph K. Bradley] Removed tree/model/Filter.scala since no longer
used. Removed debugging println calls in DecisionTree.scala.
356daba [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
430d782 [Joseph K. Bradley] Added more debug info on binning error. Added
some docs.
d036089 [Joseph K. Bradley] Print timing info to logDebug.
e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods
private
8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own
file, and cleaned it up. Removed debugging println calls from DecisionTree.
Made TreePoint extend Serialiable
a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-opt1
c1565a5 [Joseph K. Bradley] Small DecisionTree updates: * Simplification:
Updated calculateGainForSplit to take aggregates for a single (feature, split)
pair. * Internal doc: findAggForOrderedFeatureClassification
b914f3b [Joseph K. Bradley] DecisionTree optimization: eliminated filters +
small changes
b2ed1f3 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-opt
0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint
representation to avoid calling findBin multiple times. * (not working yet, but
debugging)
f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-timing
bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-timing
511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-timing
a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
(cherry picked from commit 115eeb30dd9c9dd10685a71f2c23ca23794d3142)
Signed-off-by: Xiangrui Meng <[email protected]>
commit e3f89e971b117e11d15e4b9b47e63da55f4e488b
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-19T01:01:39Z
[SPARK-2850] [SPARK-2626] [mllib] MLlib stats examples + small fixes
Added examples for statistical summarization:
* Scala: StatisticalSummary.scala
** Tests: correlation, MultivariateOnlineSummarizer
* python: statistical_summary.py
** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
Added examples for random and sampled RDDs:
* Scala: RandomAndSampledRDDs.scala
* python: random_and_sampled_rdds.py
* Both test:
** RandomRDDGenerators.normalRDD, normalVectorRDD
** RDD.sample, takeSample, sampleByKey
Added sc.stop() to all examples.
CorrelationSuite.scala
* Added 1 test for RDDs with only 1 value
RowMatrix.scala
* numCols(): Added check for numRows = 0, with error message.
* computeCovariance(): Added check for numRows <= 1, with error message.
Python SparseVector (pyspark/mllib/linalg.py)
* Added toDense() function
python/run-tests script
* Added stat.py (doc test)
CC: mengxr dorx Main changes were examples to show usage across APIs.
Author: Joseph K. Bradley <[email protected]>
Closes #1878 from jkbradley/mllib-stats-api-check and squashes the
following commits:
ea5c047 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into mllib-stats-api-check
dafebe2 [Joseph K. Bradley] Bug fixes for examples SampledRDDs.scala and
sampled_rdds.py: Check for division by 0 and for missing key in maps.
8d1e555 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into mllib-stats-api-check
60c72d9 [Joseph K. Bradley] Fixed stat.py doc test to work for Python
versions printing nan or NaN.
b20d90a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into mllib-stats-api-check
4e5d15e [Joseph K. Bradley] Changed pyspark/mllib/stat.py doc tests to use
NaN instead of nan.
32173b7 [Joseph K. Bradley] Stats examples update.
c8c20dc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into mllib-stats-api-check
cf70b07 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into mllib-stats-api-check
0b7cec3 [Joseph K. Bradley] Small updates based on code review. Renamed
statistical_summary.py to correlations.py
ab48f6e [Joseph K. Bradley] RowMatrix.scala * numCols(): Added check for
numRows = 0, with error message. * computeCovariance(): Added check for numRows
<= 1, with error message.
65e4ebc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into mllib-stats-api-check
8195c78 [Joseph K. Bradley] Added examples for random and sampled RDDs: *
Scala: RandomAndSampledRDDs.scala * python: random_and_sampled_rdds.py * Both
test: ** RandomRDDGenerators.normalRDD, normalVectorRDD ** RDD.sample,
takeSample, sampleByKey
064985b [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into mllib-stats-api-check
ee918e9 [Joseph K. Bradley] Added examples for statistical summarization: *
Scala: StatisticalSummary.scala ** Tests: correlation,
MultivariateOnlineSummarizer * python: statistical_summary.py ** Tests:
correlation (since MultivariateOnlineSummarizer has no Python API)
(cherry picked from commit c8b16ca0d86cc60fb960eebf0cb383f159a88b03)
Signed-off-by: Xiangrui Meng <[email protected]>
----