GitHub user mrt opened a pull request:
https://github.com/apache/spark/pull/6445
Updated the "Running Tests" section in README.md with mvn test
Replaced ./dev/run-tests with a mvn test example
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mrt/spark doc-fix-may27-2015
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/6445.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #6445
----
commit 3a60bcb80d28b2679812b61c035c6ea21d42d8ed
Author: Wenchen Fan <[email protected]>
Date: 2015-05-13T19:47:48Z
[SPARK-7551][DataFrame] support backticks for DataFrame attribute resolution
Author: Wenchen Fan <[email protected]>
Closes #6074 from cloud-fan/7551 and squashes the following commits:
e6f579e [Wenchen Fan] allow space
2b86699 [Wenchen Fan] handle blank
e218d99 [Wenchen Fan] address comments
54c4209 [Wenchen Fan] fix 7551
(cherry picked from commit 213a6f30fee4a1c416ea76b678c71877fd36ef18)
Signed-off-by: Reynold Xin <[email protected]>
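For illustration, a minimal Scala sketch of the backtick syntax this commit enables for DataFrame attribute resolution; the setup and column names below are hypothetical, not taken from the commit.
```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical local setup; names are illustrative.
val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("backticks"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Column names containing a dot and a space.
val df = Seq((1, 2), (3, 4)).toDF("a.b", "c d")

// Backticks let attribute resolution treat the whole quoted name literally.
df.select("`a.b`", "`c d`").show()
```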
commit 11911b0ae92afe4b1dc3963ae2cec14d273e65c7
Author: Burak Yavuz <[email protected]>
Date: 2015-05-13T20:21:36Z
[SPARK-7593] [ML] Python Api for ml.feature.Bucketizer
Added `ml.feature.Bucketizer` to PySpark.
cc mengxr
Author: Burak Yavuz <[email protected]>
Closes #6124 from brkyvz/ml-bucket and squashes the following commits:
05285be [Burak Yavuz] added sphinx doc
6abb6ed [Burak Yavuz] added support for Bucketizer
(cherry picked from commit 5db18ba6e1bd8c6307c41549176c53590cf344a0)
Signed-off-by: Xiangrui Meng <[email protected]>
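For context, a small Scala sketch of the existing ml.feature.Bucketizer API that the new Python wrapper mirrors; the data, column names, and splits are illustrative, and an existing SQLContext is assumed.
```scala
import org.apache.spark.ml.feature.Bucketizer

// Assumes an existing SQLContext named sqlContext (as in the sketch above).
val data = sqlContext.createDataFrame(Seq(Tuple1(-0.5), Tuple1(0.3), Tuple1(1.7))).toDF("value")

val bucketizer = new Bucketizer()
  .setInputCol("value")
  .setOutputCol("bucket")
  .setSplits(Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity))

// Each value is mapped to the index of the split range it falls into.
bucketizer.transform(data).show()
```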
commit d9fb905bebfc6f814a5dbcdd2e19ad6f46070025
Author: leahmcguire <[email protected]>
Date: 2015-05-13T21:13:19Z
[SPARK-7545] [MLLIB] Added check in Bernoulli Naive Bayes to make sure that
both training and predict features have values of 0 or 1
Author: leahmcguire <[email protected]>
Closes #6073 from leahmcguire/binaryCheckNB and squashes the following
commits:
b8442c2 [leahmcguire] changed to if else for value checks
911bf83 [leahmcguire] undid reformat
4eedf1e [leahmcguire] moved bernoulli check
9ee9e84 [leahmcguire] fixed style error
3f3b32c [leahmcguire] fixed zero one check so only called in combiner
831fd27 [leahmcguire] got test working
f44bb3c [leahmcguire] removed changes from CV branch
67253f0 [leahmcguire] added check to bernoulli to ensure feature values are
zero or one
f191c71 [leahmcguire] fixed name
58d060b [leahmcguire] changed param name and test according to comments
04f0d3c [leahmcguire] Added stats from cross validation as a val in the
cross validation model to save them for user access
(cherry picked from commit 61e05fc58e1245de871c409b60951745b5db3420)
Signed-off-by: Joseph K. Bradley <[email protected]>
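An illustrative Scala sketch of the kind of 0/1 value check the commit describes; this is a simplified stand-in, not the actual MLlib code.
```scala
import org.apache.spark.SparkException
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

// Simplified sketch: Bernoulli Naive Bayes only makes sense for 0/1 features,
// so reject any vector containing other values (for both training and predict).
def requireZeroOneBernoulliValues(v: Vector): Unit = {
  val values = v match {
    case sv: SparseVector => sv.values
    case dv: DenseVector  => dv.values
  }
  if (!values.forall(x => x == 0.0 || x == 1.0)) {
    throw new SparkException(
      s"Bernoulli Naive Bayes requires 0 or 1 feature values but found $v.")
  }
}
```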
commit 51230f2a9e4ddc694a59b96693f1661b31b48c8e
Author: Burak Yavuz <[email protected]>
Date: 2015-05-13T22:13:09Z
[SPARK-7382] [MLLIB] Feature Parity in PySpark for ml.classification
The missing pieces in ml.classification for Python!
cc mengxr
Author: Burak Yavuz <[email protected]>
Closes #6106 from brkyvz/ml-class and squashes the following commits:
dd78237 [Burak Yavuz] fix style
1048e29 [Burak Yavuz] ready for PR
(cherry picked from commit df2fb1305aba6781017b0973b0965b664f835e31)
Signed-off-by: Xiangrui Meng <[email protected]>
commit d5c52d9ac1307522a5ed6ccb5d30a01c5d737c67
Author: scwf <[email protected]>
Date: 2015-05-13T23:13:48Z
[SPARK-7303] [SQL] push down project if possible when the child is sort
Optimize the case of `project(_, sort)`; an example is:
`select key from (select * from testData order by key) t`
before this PR:
```
== Parsed Logical Plan ==
'Project ['key]
'Subquery t
'Sort ['key ASC], true
'Project [*]
'UnresolvedRelation [testData], None
== Analyzed Logical Plan ==
Project [key#0]
Subquery t
Sort [key#0 ASC], true
Project [key#0,value#1]
Subquery testData
LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
== Optimized Logical Plan ==
Project [key#0]
Sort [key#0 ASC], true
LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
== Physical Plan ==
Project [key#0]
Sort [key#0 ASC], true
Exchange (RangePartitioning [key#0 ASC], 5), []
PhysicalRDD [key#0,value#1], MapPartitionsRDD[1]
```
after this PR:
```
== Parsed Logical Plan ==
'Project ['key]
'Subquery t
'Sort ['key ASC], true
'Project [*]
'UnresolvedRelation [testData], None
== Analyzed Logical Plan ==
Project [key#0]
Subquery t
Sort [key#0 ASC], true
Project [key#0,value#1]
Subquery testData
LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
== Optimized Logical Plan ==
Sort [key#0 ASC], true
Project [key#0]
LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
== Physical Plan ==
Sort [key#0 ASC], true
Exchange (RangePartitioning [key#0 ASC], 5), []
Project [key#0]
PhysicalRDD [key#0,value#1], MapPartitionsRDD[1]
```
With this rule we first do column pruning on the table and then do the
sorting.
Author: scwf <[email protected]>
This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <[email protected]>
Closes #5838 from scwf/pruning and squashes the following commits:
b00d833 [scwf] address michael's comment
e230155 [scwf] fix tests failure
b09b895 [scwf] improve column pruning
(cherry picked from commit 59250fe51486908f9e3f3d9ef10aadbcb9b4d62d)
Signed-off-by: Michael Armbrust <[email protected]>
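As a hedged usage sketch, the query from the description can be run through explain() to see the optimized plan; this assumes an existing sqlContext with a registered temporary table named testData.
```scala
// Assumes an existing SQLContext named sqlContext with a temp table "testData".
val df = sqlContext.sql("select key from (select * from testData order by key) t")

// extended = true prints the parsed, analyzed, optimized and physical plans,
// so the Project pushed below the Sort is visible in the optimized plan.
df.explain(true)
```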
commit acd872bbdbbd2b998d3fd0b79863fd9cdae62e78
Author: Reynold Xin <[email protected]>
Date: 2015-05-13T23:15:31Z
[SQL] Move some classes into packages that are more appropriate.
JavaTypeInference into catalyst
types.DateUtils into catalyst
CacheManager into execution
DefaultParserDialect into catalyst
Author: Reynold Xin <[email protected]>
Closes #6108 from rxin/sql-rename and squashes the following commits:
3fc9613 [Reynold Xin] Fixed import ordering.
83d9ff4 [Reynold Xin] Fixed codegen tests.
e271e86 [Reynold Xin] mima
f4e24a6 [Reynold Xin] [SQL] Move some classes into packages that are more
appropriate.
(cherry picked from commit e683182c3e6347afdac0e5658487f80e5e054ef4)
Signed-off-by: Michael Armbrust <[email protected]>
commit ec342308a8874b8a31c5d7ded7bce5c6afee187d
Author: Andrew Or <[email protected]>
Date: 2015-05-13T23:27:48Z
[SPARK-7608] Clean up old state in RDDOperationGraphListener
This is necessary for streaming and long-running Spark applications.
zsxwing tdas
Author: Andrew Or <[email protected]>
Closes #6125 from andrewor14/viz-listener-leak and squashes the following
commits:
8660949 [Andrew Or] Fix thing + add tests
33c0843 [Andrew Or] Clean up old job state
(cherry picked from commit f6e18388d993d99f768c6d547327e0720ec64224)
Signed-off-by: Andrew Or <[email protected]>
commit e6b8cef514d01e085a20b5a9163e71741bbc068f
Author: Andrew Or <[email protected]>
Date: 2015-05-13T23:28:37Z
[SPARK-7399] Spark compilation error for scala 2.11
Subsequent fix following #5966. I tried this out locally.
Author: Andrew Or <[email protected]>
Closes #6129 from andrewor14/211-compilation and squashes the following
commits:
713868f [Andrew Or] Fix compilation issue for scala 2.11
(cherry picked from commit f88ac701552a1a854247509db49d78f13515eae4)
Signed-off-by: Andrew Or <[email protected]>
commit 4b4f10bc9008b2dac39d9ede96c8bfe06f8441a8
Author: Andrew Or <[email protected]>
Date: 2015-05-13T23:29:10Z
[SPARK-7464] DAG visualization: highlight the same RDDs on hover
This is pretty useful for MLlib.
Screenshot: https://cloud.githubusercontent.com/assets/2133137/7599650/c7d03dd8-f8b8-11e4-8c0a-0a89e786c90f.png
Author: Andrew Or <[email protected]>
Closes #6100 from andrewor14/dag-viz-hover and squashes the following
commits:
fefe2af [Andrew Or] Link tooltips for nodes that belong to the same RDD
90c6a7e [Andrew Or] Assign classes to clusters and nodes, not IDs
(cherry picked from commit 44403414d3e754f7b991c0bbeb4868edb4135aa2)
Signed-off-by: Andrew Or <[email protected]>
commit 895d46a24a5428516491e66ff534c7886f9a4d45
Author: Andrew Or <[email protected]>
Date: 2015-05-13T23:29:52Z
[SPARK-7502] DAG visualization: gracefully handle removed stages
Old stages are removed without much feedback to the user. This happens very
often in streaming. See screenshots below for more detail. zsxwing
**Before**: https://cloud.githubusercontent.com/assets/2133137/7621031/643cc1e0-f978-11e4-8f42-09decaac44a7.png
**After**: https://cloud.githubusercontent.com/assets/2133137/7621037/6e37348c-f978-11e4-84a5-e44e154f9b13.png
Author: Andrew Or <[email protected]>
Closes #6132 from andrewor14/dag-viz-remove-gracefully and squashes the
following commits:
43175cd [Andrew Or] Handle removed jobs and stages gracefully
(cherry picked from commit aa1837875a3febad2f22b91a294f91749852b42f)
Signed-off-by: Andrew Or <[email protected]>
commit e499a1e61b4dab7b8625e9c9a7a081386885f27d
Author: Andrew Or <[email protected]>
Date: 2015-05-13T23:31:24Z
[STREAMING] [MINOR] Keep streaming.UIUtils private
zsxwing
Author: Andrew Or <[email protected]>
Closes #6134 from andrewor14/private-streaming-uiutils and squashes the
following commits:
225df94 [Andrew Or] Privatize class
(cherry picked from commit bb6dec3b160b54488892a509965fee70a530deff)
Signed-off-by: Andrew Or <[email protected]>
commit 6c0644ae225b5239998d896c1cf3482fbdd35254
Author: Hari Shreedharan <[email protected]>
Date: 2015-05-13T23:43:30Z
[SPARK-7356] [STREAMING] Fix flakey tests in FlumePollingStreamSuite using
SparkSink's batch CountDownLatch.
This is meant to make the FlumePollingStreamSuite deterministic. Now we
basically count the number of batches that have been completed - and then
verify the results rather than sleeping for random periods of time.
Author: Hari Shreedharan <[email protected]>
Closes #5918 from harishreedharan/flume-test-fix and squashes the following
commits:
93f24f3 [Hari Shreedharan] Add an eventually block to ensure that all
received data is processed. Refactor the dstream creation and remove redundant
code.
1108804 [Hari Shreedharan] [SPARK-7356][STREAMING] Fix flakey tests in
FlumePollingStreamSuite using SparkSink's batch CountDownLatch.
(cherry picked from commit 61d1e87c0d3d12dac0b724d1b84436f748227e99)
Signed-off-by: Andrew Or <[email protected]>
commit c53ebea9db418099df50f9adc1a18cee7849cd97
Author: Josh Rosen <[email protected]>
Date: 2015-05-14T00:07:31Z
[SPARK-7081] Faster sort-based shuffle path using binary processing
cache-aware sort
This patch introduces a new shuffle manager that enhances the existing
sort-based shuffle with a new cache-friendly sort algorithm that operates
directly on binary data. The goals of this patch are to lower memory usage and
Java object overheads during shuffle and to speed up sorting. It also lays
groundwork for follow-up patches that will enable end-to-end processing of
serialized records.
The new shuffle manager, `UnsafeShuffleManager`, can be enabled by setting
`spark.shuffle.manager=tungsten-sort` in SparkConf.
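A minimal sketch of what enabling this looks like, assuming a standard SparkConf setup (the app name below is illustrative):
```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("unsafe-shuffle-example")              // illustrative name
  .set("spark.shuffle.manager", "tungsten-sort")     // opt in to UnsafeShuffleManager
val sc = new SparkContext(conf)
```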
The new shuffle manager uses directly-managed memory to implement several
performance optimizations for certain types of shuffles. In cases where the new
performance optimizations cannot be applied, the new shuffle manager delegates
to SortShuffleManager to handle those shuffles.
UnsafeShuffleManager's optimizations will apply when _all_ of the following
conditions hold:
- The shuffle dependency specifies no aggregation or output ordering.
- The shuffle serializer supports relocation of serialized values (this is
currently supported
by KryoSerializer and Spark SQL's custom serializers).
- The shuffle produces fewer than 16777216 output partitions.
- No individual record is larger than 128 MB when serialized.
In addition, extra spill-merging optimizations are automatically applied
when the shuffle compression codec supports concatenation of serialized
streams. This is currently supported by Spark's LZF serializer.
At a high-level, UnsafeShuffleManager's design is similar to Spark's
existing SortShuffleManager. In sort-based shuffle, incoming records are
sorted according to their target partition ids, then written to a single map
output file. Reducers fetch contiguous regions of this file in order to read
their portion of the map output. In cases where the map output data is too
large to fit in memory, sorted subsets of the output are spilled to disk
and those on-disk files are merged to produce the final output file.
UnsafeShuffleManager optimizes this process in several ways:
- Its sort operates on serialized binary data rather than Java objects,
which reduces memory consumption and GC overheads. This optimization requires
the record serializer to have certain properties to allow serialized records to
be re-ordered without requiring deserialization. See SPARK-4550, where this
optimization was first proposed and implemented, for more details.
- It uses a specialized cache-efficient sorter
(UnsafeShuffleExternalSorter) that sorts arrays of compressed record pointers
and partition ids. By using only 8 bytes of space per record in the sorting
array, this fits more of the array into cache.
- The spill merging procedure operates on blocks of serialized records
that belong to the same partition and does not need to deserialize records
during the merge.
- When the spill compression codec supports concatenation of compressed
data, the spill merge simply concatenates the serialized and compressed spill
partitions to produce the final output partition. This allows efficient data
copying methods, like NIO's `transferTo`, to be used and avoids the need to
allocate decompression or copying buffers during the merge.
The shuffle read path is unchanged.
This patch is similar to
[SPARK-4550](http://issues.apache.org/jira/browse/SPARK-4550) / #4450 but uses
a slightly different implementation. The `unsafe`-based implementation featured
in this patch lays the groundwork for followup patches that will enable sorting
to operate on serialized data pages that will be prepared by Spark SQL's new
`unsafe` operators (such as the new aggregation operator introduced in #5725).
### Future work
There are several tasks that build upon this patch, which will be left to
future work:
- [SPARK-7271](https://issues.apache.org/jira/browse/SPARK-7271) Redesign /
extend the shuffle interfaces to accept binary data as input. The goal here is
to let us bypass serialization steps in cases where the sort input is produced
by an operator that operates directly on binary data.
- Extension / redesign of the `Serializer` API. We can add new methods
which allow serializers to determine the size requirements for serializing
objects and for serializing objects directly to a specified memory address
(similar to how `UnsafeRowConverter` works in Spark SQL).
Author: Josh Rosen <[email protected]>
Closes #5868 from JoshRosen/unsafe-sort and squashes the following commits:
ef0a86e [Josh Rosen] Fix scalastyle errors
7610f2f [Josh Rosen] Add tests for proper cleanup of shuffle data.
d494ffe [Josh Rosen] Fix deserialization of JavaSerializer instances.
52a9981 [Josh Rosen] Fix some bugs in the address packing code.
51812a7 [Josh Rosen] Change shuffle manager sort name to tungsten-sort
4023fa4 [Josh Rosen] Add @Private annotation to some Java classes.
de40b9d [Josh Rosen] More comments to try to explain metrics code
df07699 [Josh Rosen] Attempt to clarify confusing metrics update code
5e189c6 [Josh Rosen] Track time spend closing / flushing files; split
TimeTrackingOutputStream into separate file.
d5779c6 [Josh Rosen] Merge remote-tracking branch 'origin/master' into
unsafe-sort
c2ce78e [Josh Rosen] Fix a missed usage of MAX_PARTITION_ID
e3b8855 [Josh Rosen] Cleanup in UnsafeShuffleWriter
4a2c785 [Josh Rosen] rename 'sort buffer' to 'pointer array'
6276168 [Josh Rosen] Remove ability to disable spilling in
UnsafeShuffleExternalSorter.
57312c9 [Josh Rosen] Clarify fileBufferSize units
2d4e4f4 [Josh Rosen] Address some minor comments in
UnsafeShuffleExternalSorter.
fdcac08 [Josh Rosen] Guard against overflow when expanding sort buffer.
85da63f [Josh Rosen] Cleanup in UnsafeShuffleSorterIterator.
0ad34da [Josh Rosen] Fix off-by-one in nextInt() call
56781a1 [Josh Rosen] Rename UnsafeShuffleSorter to
UnsafeShuffleInMemorySorter
e995d1a [Josh Rosen] Introduce MAX_SHUFFLE_OUTPUT_PARTITIONS.
e58a6b4 [Josh Rosen] Add more tests for PackedRecordPointer encoding.
4f0b770 [Josh Rosen] Attempt to implement proper shuffle write metrics.
d4e6d89 [Josh Rosen] Update to bit shifting constants
69d5899 [Josh Rosen] Remove some unnecessary override vals
8531286 [Josh Rosen] Add tests that automatically trigger spills.
7c953f9 [Josh Rosen] Add test that covers
UnsafeShuffleSortDataFormat.swap().
e1855e5 [Josh Rosen] Fix a handful of misc. IntelliJ inspections
39434f9 [Josh Rosen] Avoid integer multiplication overflow in
getMemoryUsage (thanks FindBugs!)
1e3ad52 [Josh Rosen] Delete unused ByteBufferOutputStream class.
ea4f85f [Josh Rosen] Roll back an unnecessary change in Spillable.
ae538dc [Josh Rosen] Document UnsafeShuffleManager.
ec6d626 [Josh Rosen] Add notes on maximum # of supported shuffle partitions.
0d4d199 [Josh Rosen] Bump up shuffle.memoryFraction to make tests pass.
b3b1924 [Josh Rosen] Properly implement close() and flush() in
DummySerializerInstance.
1ef56c7 [Josh Rosen] Revise compression codec support in merger; test cross
product of configurations.
b57c17f [Josh Rosen] Disable some overly-verbose logs that rendered DEBUG
useless.
f780fb1 [Josh Rosen] Add test demonstrating which compression codecs
support concatenation.
4a01c45 [Josh Rosen] Remove unnecessary log message
27b18b0 [Josh Rosen] That for inserting records AT the max record size.
fcd9a3c [Josh Rosen] Add notes + tests for maximum record / page sizes.
9d1ee7c [Josh Rosen] Fix MiMa excludes for ShuffleWriter change
fd4bb9e [Josh Rosen] Use own ByteBufferOutputStream rather than Kryo's
67d25ba [Josh Rosen] Update Exchange operator's copying logic to account
for new shuffle manager
8f5061a [Josh Rosen] Strengthen assertion to check partitioning
01afc74 [Josh Rosen] Actually read data in UnsafeShuffleWriterSuite
1929a74 [Josh Rosen] Update to reflect upstream ShuffleBlockManager ->
ShuffleBlockResolver rename.
e8718dd [Josh Rosen] Merge remote-tracking branch 'origin/master' into
unsafe-sort
9b7ebed [Josh Rosen] More defensive programming RE: cleaning up spill files
and memory after errors
7cd013b [Josh Rosen] Begin refactoring to enable proper tests for spilling.
722849b [Josh Rosen] Add workaround for transferTo() bug in merging code;
refactor tests.
9883e30 [Josh Rosen] Merge remote-tracking branch 'origin/master' into
unsafe-sort
b95e642 [Josh Rosen] Refactor and document logic that decides when to spill.
1ce1300 [Josh Rosen] More minor cleanup
5e8cf75 [Josh Rosen] More minor cleanup
e67f1ea [Josh Rosen] Remove upper type bound in ShuffleWriter interface.
cfe0ec4 [Josh Rosen] Address a number of minor review comments:
8a6fe52 [Josh Rosen] Rename UnsafeShuffleSpillWriter to
UnsafeShuffleExternalSorter
11feeb6 [Josh Rosen] Update TODOs related to shuffle write metrics.
b674412 [Josh Rosen] Merge remote-tracking branch 'origin/master' into
unsafe-sort
aaea17b [Josh Rosen] Add comments to UnsafeShuffleSpillWriter.
4f70141 [Josh Rosen] Fix merging; now passes UnsafeShuffleSuite tests.
133c8c9 [Josh Rosen] WIP towards testing UnsafeShuffleWriter.
f480fb2 [Josh Rosen] WIP in mega-refactoring towards shuffle-specific sort.
57f1ec0 [Josh Rosen] WIP towards packed record pointers for use in
optimized shuffle sort.
69232fd [Josh Rosen] Enable compressible address encoding for off-heap mode.
7ee918e [Josh Rosen] Re-order imports in tests
3aeaff7 [Josh Rosen] More refactoring and cleanup; begin cleaning iterator
interfaces
3490512 [Josh Rosen] Misc. cleanup
f156a8f [Josh Rosen] Hacky metrics integration; refactor some interfaces.
2776aca [Josh Rosen] First passing test for ExternalSorter.
5e100b2 [Josh Rosen] Super-messy WIP on external sort
595923a [Josh Rosen] Remove some unused variables.
8958584 [Josh Rosen] Fix bug in calculating free space in current page.
f17fa8f [Josh Rosen] Add missing newline
c2fca17 [Josh Rosen] Small refactoring of SerializerPropertiesSuite to
enable test re-use:
b8a09fe [Josh Rosen] Back out accidental log4j.properties change
bfc12d3 [Josh Rosen] Add tests for serializer relocation property.
240864c [Josh Rosen] Remove PrefixComputer and require prefix to be
specified as part of insert()
1433b42 [Josh Rosen] Store record length as int instead of long.
026b497 [Josh Rosen] Re-use a buffer in UnsafeShuffleWriter
0748458 [Josh Rosen] Port UnsafeShuffleWriter to Java.
87e721b [Josh Rosen] Renaming and comments
d3cc310 [Josh Rosen] Flag that SparkSqlSerializer2 supports relocation
e2d96ca [Josh Rosen] Expand serializer API and use new function to help
control when new UnsafeShuffle path is used.
e267cee [Josh Rosen] Fix compilation of UnsafeSorterSuite
9c6cf58 [Josh Rosen] Refactor to use DiskBlockObjectWriter.
253f13e [Josh Rosen] More cleanup
8e3ec20 [Josh Rosen] Begin code cleanup.
4d2f5e1 [Josh Rosen] WIP
3db12de [Josh Rosen] Minor simplification and sanity checks in UnsafeSorter
767d3ca [Josh Rosen] Fix invalid range in UnsafeSorter.
e900152 [Josh Rosen] Add test for empty iterator in UnsafeSorter
57a4ea0 [Josh Rosen] Make initialSize configurable in UnsafeSorter
abf7bfe [Josh Rosen] Add basic test case.
81d52c5 [Josh Rosen] WIP on UnsafeSorter
(cherry picked from commit 73bed408fbb47dfc28063afa3898c27fbdec7735)
Signed-off-by: Reynold Xin <[email protected]>
commit 820aaa6b9abe2045f6d392c430082601770e91c9
Author: Venkata Ramana Gollamudi <[email protected]>
Date: 2015-05-14T00:24:04Z
[SPARK-7601] [SQL] Support Insert into JDBC Datasource
Supported InsertableRelation for JDBC Datasource JDBCRelation.
Example usage:
sqlContext.sql(
  s"""
     |CREATE TEMPORARY TABLE testram1
     |USING org.apache.spark.sql.jdbc
     |OPTIONS (url '$url', dbtable 'testram1', user 'xx', password 'xx', driver 'com.h2.Driver')
   """.stripMargin.replaceAll("\n", " "))
sqlContext.sql("insert into table testram1 select * from testsrc")
sqlContext.sql("insert overwrite table testram1 select * from testsrc")
Author: Venkata Ramana Gollamudi <[email protected]>
Closes #6121 from gvramana/JDBCDatasource_insert and squashes the following
commits:
f3fb5f1 [Venkata Ramana Gollamudi] Support for JDBC Datasource
InsertableRelation
(cherry picked from commit 59aaa1dad6bee06e38ee5c03bdf82354242286ee)
Signed-off-by: Michael Armbrust <[email protected]>
commit aec83949ab66632fbd665e5b886ee2de4e288fb8
Author: Tathagata Das <[email protected]>
Date: 2015-05-14T00:33:15Z
[SPARK-6752] [STREAMING] [REVISED] Allow StreamingContext to be recreated
from checkpoint and existing SparkContext
This is a revision of the earlier version (see #5773) that passed the
active SparkContext explicitly through a new set of Java and Scala APIs. The
drawbacks are:
* Hard to implement in Python.
* A new API is introduced. This is even more confusing since we are introducing
getActiveOrCreate in SPARK-7553.
Furthermore, there is now a direct way to get an existing active SparkContext
or create a new one: SparkContext.getOrCreate(conf). It's better to use this to
get the SparkContext rather than have a new API to explicitly pass the context.
So in this PR I have:
* Removed the new versions of StreamingContext.getOrCreate() which took a
SparkContext
* Added the ability to pick up an existing SparkContext when the
StreamingContext tries to create a SparkContext.
Author: Tathagata Das <[email protected]>
Closes #6096 from tdas/SPARK-6752 and squashes the following commits:
53f4b2d [Tathagata Das] Merge remote-tracking branch 'apache-github/master'
into SPARK-6752
f024b77 [Tathagata Das] Removed extra API and used SparkContext.getOrCreate
(cherry picked from commit bce00dac403d3be2be59218b7b93a56c34c68f1a)
Signed-off-by: Tathagata Das <[email protected]>
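A hedged Scala sketch of the flow described above: get or create the SparkContext directly, and let StreamingContext.getOrCreate recover from the checkpoint (the checkpoint path, app name, and batch interval are illustrative):
```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "/tmp/checkpoint"   // illustrative path

def createStreamingContext(): StreamingContext = {
  // Picks up the active SparkContext if one exists, otherwise creates it.
  val sc = SparkContext.getOrCreate(new SparkConf().setAppName("streaming-example"))
  val ssc = new StreamingContext(sc, Seconds(1))
  ssc.checkpoint(checkpointDir)
  ssc
}

// Recreates the StreamingContext from the checkpoint if one is present,
// otherwise calls the creating function.
val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext _)
```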
commit d518c0369fa412567855980c3f0f426cde5c190d
Author: zsxwing <[email protected]>
Date: 2015-05-14T00:58:29Z
[HOTFIX] Use 'new Job' in fsBasedParquet.scala
Same issue as #6095
cc liancheng
Author: zsxwing <[email protected]>
Closes #6136 from zsxwing/hotfix and squashes the following commits:
4beea54 [zsxwing] Use 'new Job' in fsBasedParquet.scala
(cherry picked from commit 728af88cf6be4c25a732ab7e4fe66c1ed0041164)
Signed-off-by: Michael Armbrust <[email protected]>
commit 2d4a961f82ccea3d3fc6d21fae1fc3a52e338634
Author: Andrew Or <[email protected]>
Date: 2015-05-14T04:04:13Z
[HOT FIX #6125] Do not wait for all stages to start rendering
zsxwing
Author: Andrew Or <[email protected]>
Closes #6138 from andrewor14/dag-viz-clean-properly and squashes the
following commits:
19d4e98 [Andrew Or] Add synchronize
02542d6 [Andrew Or] Rename overloaded variable
d11bee1 [Andrew Or] Don't wait until all stages have started before
rendering
commit 82f387fe23d3f5477df5d1be9a47d6df63fcbcf6
Author: Xiangrui Meng <[email protected]>
Date: 2015-05-14T04:27:17Z
[SPARK-7612] [MLLIB] update NB training to use mllib's BLAS
This is similar to the changes to k-means, which gives us better control over
the performance. dbtsai
Author: Xiangrui Meng <[email protected]>
Closes #6128 from mengxr/SPARK-7612 and squashes the following commits:
b5c24c5 [Xiangrui Meng] merge master
a90e3ec [Xiangrui Meng] update NB training to use mllib's BLAS
(cherry picked from commit d5f18de1657bfabf5493011e0b2c7ec29c02c64c)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 9ab4db29ffa9f0811a6ecc85638784fef7bf9c61
Author: DB Tsai <[email protected]>
Date: 2015-05-14T05:23:21Z
[SPARK-7620] [ML] [MLLIB] Removed calling size, length in while condition
to avoid extra JVM call
Author: DB Tsai <[email protected]>
Closes #6137 from dbtsai/clean and squashes the following commits:
185816d [DB Tsai] fix compilication issue
f418d08 [DB Tsai] first commit
(cherry picked from commit d3db2fd66752e80865e9c7a75d8e8d945121697e)
Signed-off-by: Xiangrui Meng <[email protected]>
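An illustrative sketch of the pattern this commit applies: evaluate the size once outside the while loop instead of calling it in the loop condition; the helper function below is hypothetical, not taken from the patch.
```scala
import org.apache.spark.mllib.linalg.Vector

// Hypothetical helper: sum the elements of an MLlib vector.
def sumOfElements(v: Vector): Double = {
  var total = 0.0
  var i = 0
  val size = v.size            // call v.size once, not on every iteration
  while (i < size) {
    total += v(i)
    i += 1
  }
  total
}
```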
commit c80e0cff2596cf7e9d149b92980177a43bb501a3
Author: linweizhong <[email protected]>
Date: 2015-05-14T07:23:27Z
[SPARK-7595] [SQL] Window will cause resolve failed with self join
For example, given the table src(key string, value string) and the SQL:
with v1 as (select key, count(value) over (partition by key) cnt_val from src),
     v2 as (select v1.key, v1_lag.cnt_val from v1, v1 v1_lag where v1.key = v1_lag.key)
select * from v2 limit 5;
then analysis will fail when resolving conflicting references in the Join:
'Limit 5
'Project [*]
'Subquery v2
'Project ['v1.key,'v1_lag.cnt_val]
'Filter ('v1.key = 'v1_lag.key)
'Join Inner, None
Subquery v1
Project [key#95,cnt_val#94L]
Window [key#95,value#96],
[HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount(value#96)
WindowSpecDefinition [key#95], [], ROWS BETWEEN UNBOUNDED PRECEDING AND
UNBOUNDED FOLLOWING AS cnt_val#94L], WindowSpecDefinition [key#95], [], ROWS
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Project [key#95,value#96]
MetastoreRelation default, src, None
Subquery v1_lag
Subquery v1
Project [key#97,cnt_val#94L]
Window [key#97,value#98],
[HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount(value#98)
WindowSpecDefinition [key#97], [], ROWS BETWEEN UNBOUNDED PRECEDING AND
UNBOUNDED FOLLOWING AS cnt_val#94L], WindowSpecDefinition [key#97], [], ROWS
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Project [key#97,value#98]
MetastoreRelation default, src, None
Conflicting attributes: cnt_val#94L
Author: linweizhong <[email protected]>
Closes #6114 from Sephiroth-Lin/spark-7595 and squashes the following
commits:
f8f2637 [linweizhong] Add unit test
dfe9169 [linweizhong] Handle windowExpression with self join
(cherry picked from commit 13e652b61a81b2d2e94088006fbd5fd4ed383e3d)
Signed-off-by: Michael Armbrust <[email protected]>
commit e45cd9f73a624d86e09f3a0f5649fa5dc7090b38
Author: Xiangrui Meng <[email protected]>
Date: 2015-05-14T08:22:15Z
[SPARK-7407] [MLLIB] use uid + name to identify parameters
A param instance is strongly attached to a parent in the current
implementation. So if we make a copy of an estimator or a transformer in
pipelines and other meta-algorithms, it becomes error-prone to copy the params
to the copied instances. In this PR, a param is identified by its parent's UID
and the param name. So it becomes loosely attached to its parent and all its
derivatives. The UID is preserved during copying or fitting. All components now
have a default constructor and a constructor that takes a UID as input. I keep
the constructors for Param in this PR to reduce the size of the diff, and made
`parent` a mutable field.
This PR still needs some clean-ups, and there are several spark.ml PRs
pending. I'll try to get them merged first and then update this PR.
jkbradley
Author: Xiangrui Meng <[email protected]>
Closes #6019 from mengxr/SPARK-7407 and squashes the following commits:
c4c8120 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into
SPARK-7407
520f0a2 [Xiangrui Meng] address comments
2569168 [Xiangrui Meng] fix tests
873caca [Xiangrui Meng] fix tests in OneVsRest; fix a racing condition in
shouldOwn
409ea08 [Xiangrui Meng] minor updates
83a163c [Xiangrui Meng] update JavaDeveloperApiExample
5db5325 [Xiangrui Meng] update OneVsRest
7bde7ae [Xiangrui Meng] merge master
697fdf9 [Xiangrui Meng] update Bucketizer
7b4f6c2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into
SPARK-7407
629d402 [Xiangrui Meng] fix LRSuite
154516f [Xiangrui Meng] merge master
aa4a611 [Xiangrui Meng] fix examples/compile
a4794dd [Xiangrui Meng] change Param to use to reduce the size of diff
fdbc415 [Xiangrui Meng] all tests passed
c255f17 [Xiangrui Meng] fix tests in ParamsSuite
818e1db [Xiangrui Meng] merge master
e1160cf [Xiangrui Meng] fix tests
fbc39f0 [Xiangrui Meng] pass test:compile
108937e [Xiangrui Meng] pass compile
8726d39 [Xiangrui Meng] use parent uid in Param
eaeed35 [Xiangrui Meng] update Identifiable
(cherry picked from commit 1b8625f4258d6d1a049d0ba60e39e9757f5a568b)
Signed-off-by: Xiangrui Meng <[email protected]>
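A simplified illustrative sketch of the identification scheme described above (not Spark's actual Param/Params classes): a param is keyed by its parent's UID plus its own name, so copies of a component keep referring to the same logical param.
```scala
// Simplified stand-in classes for illustration only.
case class MyParam[T](parentUid: String, name: String, doc: String) {
  override def toString: String = s"$parentUid__$name"
}

class MyEstimator(val uid: String) {
  // The param is identified by (uid, name), not by the object identity of its parent.
  val maxIter: MyParam[Int] = MyParam[Int](uid, "maxIter", "maximum number of iterations")

  // A copy keeps the same UID, so its params resolve to the same identifiers.
  def copyWithSameUid(): MyEstimator = new MyEstimator(uid)
}
```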
commit 58534b0ab3f3e5af2d2ac302e2f60b92548918ec
Author: DB Tsai <[email protected]>
Date: 2015-05-14T08:26:08Z
[SPARK-7568] [ML] ml.LogisticRegression doesn't output the right prediction
The difference arises because we previously didn't fit the intercept in Spark
1.3. Here, we change the input `String` so that instance 6 can be classified as
`1.0` without any ambiguity.
With lambda = 0.001 in the current LOR implementation, the predictions are
```
(4, spark i j k) --> prob=[0.1596407738787411,0.8403592261212589],
prediction=1.0
(5, l m n) --> prob=[0.8378325685476612,0.16216743145233883], prediction=0.0
(6, spark hadoop spark) --> prob=[0.0692663313297627,0.9307336686702373],
prediction=1.0
(7, apache hadoop) --> prob=[0.9821575333444208,0.01784246665557917],
prediction=0.0
```
and the training accuracy is
```
(0, a b c d e spark) --> prob=[0.0021342419881406746,0.9978657580118594],
prediction=1.0
(1, b d) --> prob=[0.9959176174854043,0.004082382514595685], prediction=0.0
(2, spark f g h) --> prob=[0.0014541569986711233,0.9985458430013289],
prediction=1.0
(3, hadoop mapreduce) --> prob=[0.9982978367343561,0.0017021632656438518],
prediction=0.0
```
Author: DB Tsai <[email protected]>
Closes #6109 from dbtsai/lor-example and squashes the following commits:
ac63ce4 [DB Tsai] first commit
(cherry picked from commit c1080b6fddb22d84694da2453e46a03fbc041576)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 67ed0aa0fd2a6712b0dc00c22d757de039ce4bf0
Author: FavioVazquez <[email protected]>
Date: 2015-05-14T14:22:58Z
[SPARK-7249] Updated Hadoop dependencies due to inconsistency in the
versions
Updated Hadoop dependencies due to inconsistency in the versions. Now the
global properties are the ones used by the hadoop-2.2 profile, and the profile
was set to empty but kept for backwards compatibility reasons.
These changes were proposed by vanzin following the previous pull request
https://github.com/apache/spark/pull/5783, which did not fix the problem
correctly.
Please let me know if this is the correct way of doing this; vanzin's comments
are in the pull request mentioned above.
Author: FavioVazquez <[email protected]>
Closes #5786 from FavioVazquez/update-hadoop-dependencies and squashes the
following commits:
11670e5 [FavioVazquez] - Added missing instance of -Phadoop-2.2 in
create-release.sh
379f50d [FavioVazquez] - Added instances of -Phadoop-2.2 in
create-release.sh, run-tests, scalastyle and building-spark.md - Reconstructed
docs to not ask users to rely on default behavior
3f9249d [FavioVazquez] Merge branch 'master' of
https://github.com/apache/spark into update-hadoop-dependencies
31bdafa [FavioVazquez] - Added missing instances in -Phadoop-1 in
create-release.sh, run-tests and in the building-spark documentation
cbb93e8 [FavioVazquez] - Added comment related to SPARK-3710 about
hadoop-yarn-server-tests in Hadoop 2.2 that fails to pull some needed
dependencies
83dc332 [FavioVazquez] - Cleaned up the main POM concerning the yarn
profile - Erased hadoop-2.2 profile from yarn/pom.xml and its content was
integrated into yarn/pom.xml
93f7624 [FavioVazquez] - Deleted unnecessary comments and <activation> tag
on the YARN profile in the main POM
668d126 [FavioVazquez] - Moved <dependencies> <activation> and <properties>
sections of the hadoop-2.2 profile in the YARN POM to the YARN profile in the
root POM - Erased unnecessary hadoop-2.2 profile from the YARN POM
fda6a51 [FavioVazquez] - Updated hadoop1 releases in create-release.sh due
to changes in the default hadoop version set - Erased unnecessary instance of
-Dyarn.version=2.2.0 in create-release.sh - Prettify comment in yarn/pom.xml
0470587 [FavioVazquez] - Erased unnecessary instance of -Phadoop-2.2
-Dhadoop.version=2.2.0 in create-release.sh - Updated how the releases are made
in the create-release.sh no that the default hadoop version is the 2.2.0 -
Erased unnecessary instance of -Phadoop-2.2 -Dhadoop.version=2.2.0 in
scalastyle - Erased unnecessary instance of -Phadoop-2.2 -Dhadoop.version=2.2.0
in run-tests - Better example given in the hadoop-third-party-distributions.md
now that the default hadoop version is 2.2.0
a650779 [FavioVazquez] - Default value of avro.mapred.classifier has been
set to hadoop2 in pom.xml - Cleaned up hadoop-2.3 and 2.4 profiles due to
change in the default set in avro.mapred.classifier in pom.xml
199f40b [FavioVazquez] - Erased unnecessary CDH5-specific note in
docs/building-spark.md - Remove example of instance -Phadoop-2.2
-Dhadoop.version=2.2.0 in docs/building-spark.md - Enabled hadoop-2.2 profile
when the Hadoop version is 2.2.0, which is now the default .Added comment in
the yarn/pom.xml to specify that.
88a8b88 [FavioVazquez] - Simplified Hadoop profiles due to new setting of
global properties in the pom.xml file - Added comment to specify that the
hadoop-2.2 profile is now the default hadoop profile in the pom.xml file -
Erased hadoop-2.2 from related hadoop profiles now that is a no-op in the
make-distribution.sh file
70b8344 [FavioVazquez] - Fixed typo in the make-distribution.sh file and
added hadoop-1 in the Related profiles
287fa2f [FavioVazquez] - Updated documentation about specifying the hadoop
version in building-spark. Now is clear that Spark will build against Hadoop
2.2.0 by default. - Added Cloudera CDH 5.3.3 without MapReduce example in the
building-spark doc.
1354292 [FavioVazquez] - Fixed hadoop-1 version to match jenkins build
profile in hadoop1.0 tests and documentation
6b4bfaf [FavioVazquez] - Cleanup in hadoop-2.x profiles since they
contained mostly redundant stuff.
7e9955d [FavioVazquez] - Updated Hadoop dependencies due to inconsistency
in the versions. Now the global properties are the ones used by the hadoop-2.2
profile, and the profile was set to empty but kept for backwards compatibility
reasons
660decc [FavioVazquez] - Updated Hadoop dependencies due to inconsistency
in the versions. Now the global properties are the ones used by the hadoop-2.2
profile, and the profile was set to empty but kept for backwards compatibility
reasons
ec91ce3 [FavioVazquez] - Updated protobuf-java version of
com.google.protobuf dependancy to fix blocking error when connecting to HDFS
via the Hadoop Cloudera HDFS CDH5 (fix for 2.5.0-cdh5.3.3 version)
(cherry picked from commit 7fb715de6d90c3eb756935440f75b1de674f8ece)
Signed-off-by: Sean Owen <[email protected]>
commit aa8a0f96378e71978ca8c07b7008488c165a8080
Author: Wenchen Fan <[email protected]>
Date: 2015-05-14T17:25:18Z
[SQL][minor] rename apply for QueryPlanner
A follow-up of https://github.com/apache/spark/pull/5624
Author: Wenchen Fan <[email protected]>
Closes #6142 from cloud-fan/tmp and squashes the following commits:
971a92b [Wenchen Fan] use plan instead of execute
24c5ffe [Wenchen Fan] rename apply
(cherry picked from commit f2cd00be350fdba3acfbfdf155701182d1c404fd)
Signed-off-by: Reynold Xin <[email protected]>
commit a49a145884e23e8d98f8114d02f396b42bfaf3b5
Author: ksonj <[email protected]>
Date: 2015-05-14T22:10:58Z
[SPARK-7278] [PySpark] DateType should find datetime.datetime acceptable
DateType should not be restricted to `datetime.date` but accept
`datetime.datetime` objects as well. Could someone with a little more insight
verify this?
Author: ksonj <[email protected]>
Closes #6057 from ksonj/dates and squashes the following commits:
68a158e [ksonj] DateType should find datetime.datetime acceptable too
(cherry picked from commit 5d7d4f887d509e6d037d8fc5247d2e5f8a4563c9)
Signed-off-by: Reynold Xin <[email protected]>
commit fceaffc49b02530ea6ebcf9c9e4a960ac0be31ab
Author: tedyu <[email protected]>
Date: 2015-05-14T22:26:35Z
Make SPARK prefix a variable
Author: tedyu <[email protected]>
Closes #6153 from ted-yu/master and squashes the following commits:
4e0bac5 [tedyu] Use JIRA_PROJECT_NAME as variable name
ab982aa [tedyu] Make SPARK prefix a variable
(cherry picked from commit 11a1a135d1fe892cd48a9116acc7554846aed84c)
Signed-off-by: Reynold Xin <[email protected]>
commit 894214f9eaf477c51c7b909f24e8be5c7d5343a5
Author: Rex Xiong <[email protected]>
Date: 2015-05-14T23:55:31Z
[SPARK-7598] [DEPLOY] Add aliveWorkers metrics in Master
In a Spark Standalone setup, when some workers are DEAD they will stay in the
master's worker list for a while.
The master.workers metric only shows the total number of workers; we need to
monitor how many workers are actually ALIVE to ensure the cluster is healthy.
Author: Rex Xiong <[email protected]>
Closes #6117 from twilightgod/add-aliveWorker-metrics and squashes the
following commits:
6be69a5 [Rex Xiong] Fix comment for aliveWorkers metrics
a882f39 [Rex Xiong] Fix style for aliveWorkers metrics
38ce955 [Rex Xiong] Add aliveWorkers metrics in Master
(cherry picked from commit 93dbb3ad83fd60444a38c3dc87a2053c667123af)
Signed-off-by: Andrew Or <[email protected]>
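A hedged sketch of what such a metric can look like using the Codahale metrics API that Spark's MetricsSystem is built on; the WorkerInfo/WorkerState stand-ins below represent the master's internal state and are illustrative, not the actual Master code.
```scala
import com.codahale.metrics.{Gauge, MetricRegistry}

// Illustrative stand-ins for the master's internal worker bookkeeping.
object WorkerState extends Enumeration { val ALIVE, DEAD = Value }
case class WorkerInfo(id: String, state: WorkerState.Value)

class MasterSource(workers: () => Seq[WorkerInfo]) {
  val metricRegistry = new MetricRegistry()

  // Gauge for the number of workers that are currently ALIVE, alongside the
  // existing total-workers gauge mentioned in the description.
  metricRegistry.register(MetricRegistry.name("aliveWorkers"), new Gauge[Int] {
    override def getValue: Int = workers().count(_.state == WorkerState.ALIVE)
  })
}
```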
commit 8d8876d3b3223557629ba4d9a4e71857755da6ea
Author: Xiangrui Meng <[email protected]>
Date: 2015-05-14T23:56:32Z
[SPARK-7643] [UI] use the correct size in RDDPage for storage info and
partitions
`dataDistribution` and `partitions` are `Option[Seq[_]]`. andrewor14 squito
Author: Xiangrui Meng <[email protected]>
Closes #6157 from mengxr/SPARK-7643 and squashes the following commits:
99fe8a4 [Xiangrui Meng] use the correct size in RDDPage for storage info
and partitions
(cherry picked from commit 57ed16cf9372c109e84bd51b728f2c82940949a7)
Signed-off-by: Andrew Or <[email protected]>
commit 3358485778b6c1ad23dc13856d3ba330f7e1a8e9
Author: zsxwing <[email protected]>
Date: 2015-05-14T23:57:33Z
[SPARK-7649] [STREAMING] [WEBUI] Use window.localStorage to store the
status rather than the url
Use window.localStorage to store the status rather than the url so that the
url won't be changed.
cc tdas
Author: zsxwing <[email protected]>
Closes #6158 from zsxwing/SPARK-7649 and squashes the following commits:
3c56fef [zsxwing] Use window.localStorage to store the status rather than
the url
(cherry picked from commit 0a317c124c3a43089cdb8f079345c8f2842238cd)
Signed-off-by: Andrew Or <[email protected]>
commit 79983f17d9d733c425e6ca00be229cb0c3ddadb6
Author: zsxwing <[email protected]>
Date: 2015-05-14T23:58:36Z
[SPARK-7645] [STREAMING] [WEBUI] Show milliseconds in the UI if the batch
interval < 1 second
I also updated the summary of the Streaming page.
Author: zsxwing <[email protected]>
Closes #6154 from zsxwing/SPARK-7645 and squashes the following commits:
5db6ca1 [zsxwing] Add UIUtils.formatBatchTime
e4802df [zsxwing] Show milliseconds in the UI if the batch interval < 1
second
(cherry picked from commit b208f998b5800bdba4ce6651f172c26a8d7d351b)
Signed-off-by: Andrew Or <[email protected]>
----