[GitHub] spark pull request: SPARK-6785: fix DateUtils.fromJavaDate(java.ut...

ckadner Mon, 18 May 2015 12:17:27 -0700

GitHub user ckadner opened a pull request:

    https://github.com/apache/spark/pull/6236


    SPARK-6785: fix DateUtils.fromJavaDate(java.util.Date) for Dates before 1970

    BUG: With the current implementation, converting a Java date to days and 
back is off by one day for Dates before 1970, because the function 
DateUtils.millisToDays(Long) truncates (rounds toward zero) instead of flooring.
    
    FIX: The fix is to do the conversion using Double and to floor the 
fractions instead of truncating them.
    
    NOTE: Before this fix, the code DID work for Dates that are not one 
millisecond before or after midnight in the system's local time zone.
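
    The difference between truncating and flooring can be sketched as follows. This is a minimal illustration, not the code in this pull request: the object and method names are hypothetical, and the time zone offset that the real conversion has to account for is left out, so the input is treated as milliseconds since the epoch in UTC.

```scala
// Hypothetical sketch; names are illustrative and not taken from Spark.
object MillisToDaysSketch {
  val MillisPerDay: Long = 24L * 60 * 60 * 1000

  // Long division rounds toward zero, so a negative fraction of a day
  // is rounded up to the later day.
  def truncatedDays(millis: Long): Int =
    (millis / MillisPerDay).toInt

  // Converting to Double and flooring rounds toward negative infinity,
  // which keeps instants before the epoch on the correct day.
  def flooredDays(millis: Long): Int =
    Math.floor(millis.toDouble / MillisPerDay).toInt
}

object MillisToDaysDemo extends App {
  import MillisToDaysSketch._

  val oneMilliBeforeEpoch = -1L // 1969-12-31T23:59:59.999Z
  println(truncatedDays(oneMilliBeforeEpoch)) // prints 0  (1970-01-01, one day too late)
  println(flooredDays(oneMilliBeforeEpoch))   // prints -1 (1969-12-31)

  // Both agree on exact day boundaries and on instants after the epoch.
  println(truncatedDays(-MillisPerDay) == flooredDays(-MillisPerDay)) // prints true
  println(truncatedDays(1L) == flooredDays(1L))                       // prints true
}
```

    The two functions only disagree for negative inputs that are not an exact multiple of one day, which is why the off-by-one shows up only for Dates before 1970.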


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ckadner/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6236.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6236
    
----
commit 514c5ca616fca01a3784441bea7493589675cf09
Author: Christian Kadner <[email protected]>
Date:   2015-05-09T01:00:22Z

    Merge pull request #1 from apache/master
    
    Update from original

commit be7e83a3278c8a171b051475f4fbd69610059795
Author: Christian Kadner <[email protected]>
Date:   2015-05-16T02:47:44Z

    Merge pull request #2 from apache/master
    
    Update my fork from original head, confirm merge

commit 8d014e14059b7c8bcde096bc052816f9e3b4e43e
Author: Tathagata Das <[email protected]>
Date:   2015-05-11T17:58:56Z

    [SPARK-7361] [STREAMING] Throw unambiguous exception when attempting to 
start multiple StreamingContexts in the same JVM
    
    Currently, attempting to start a StreamingContext while another one is started 
throws a confusing exception saying that the action name JobScheduler is already 
registered. Instead, it's best to throw a proper exception, as this is not supported.
    
    Author: Tathagata Das <[email protected]>
    
    Closes #5907 from tdas/SPARK-7361 and squashes the following commits:
    
    fb81c4a [Tathagata Das] Fix typo
    a9cd5bb [Tathagata Das] Added startSite to StreamingContext
    5fdfc0d [Tathagata Das] Merge remote-tracking branch 'apache-github/master' 
into SPARK-7361
    5870e2b [Tathagata Das] Added check for multiple streaming contexts

commit 5a92e9d59d5c8c1be5544d6ac6bf612a4f3e45af
Author: Reynold Xin <[email protected]>
Date:   2015-05-12T02:15:14Z

    [SPARK-7324] [SQL] DataFrame.dropDuplicates
    
    This should also close https://github.com/apache/spark/pull/5870
    
    Author: Reynold Xin <[email protected]>
    
    Closes #6066 from rxin/dropDups and squashes the following commits:
    
    130692f [Reynold Xin] [SPARK-7324][SQL] DataFrame.dropDuplicates

commit d749dcdbf38b6e3a790492eee8065b054f8ccc60
Author: Cheng Lian <[email protected]>
Date:   2015-05-12T17:32:28Z

    [SPARK-3928] [SPARK-5182] [SQL] Partitioning support for the data sources 
API
    
    This PR adds partitioning support for the external data sources API. It 
aims to simplify development of file system based data sources, and provide 
first class partitioning support for both read path and write path.  Existing 
data sources like JSON and Parquet can be simplified with this work.
    
    ## New features provided
    
    1. Hive compatible partition discovery
    
       This actually generalizes the partition discovery strategy used in 
Parquet data source in Spark 1.3.0.
    
    1. Generalized partition pruning optimization
    
       Now partition pruning is handled during physical planning phase.  
Specific data sources don't need to worry about this harness anymore.
    
       (This also implies that we can remove `CatalystScan` after migrating the 
Parquet data source, since now we don't need to pass Catalyst expressions to 
data source implementations.)
    
    1. Insertion with dynamic partitions
    
       When inserting data to a `FSBasedRelation`, data can be partitioned 
dynamically by specified partition columns.
    
    ## New structures provided
    
    ### Developer API
    
    1. `FSBasedRelation`
    
       Base abstract class for file system based data sources.
    
    1. `OutputWriter`
    
       Base abstract class for output row writers, responsible for writing a 
single row object.
    
    1. `FSBasedRelationProvider`
    
       A new relation provider for `FSBasedRelation` subclasses. Note that data 
sources extending `FSBasedRelation` don't need to extend `RelationProvider` and 
`SchemaRelationProvider`.
    
    ### User API
    
    New overloaded versions of
    
    1. `DataFrame.save()`
    1. `DataFrame.saveAsTable()`
    1. `SQLContext.load()`
    
    are provided to allow users to save/load DataFrames with user defined 
dynamic partition columns.
    
    ### Spark SQL query planning
    
    1. `InsertIntoFSBasedRelation`
    
       Used to implement write path for `FSBasedRelation`s.
    
    1. New rules for `FSBasedRelation` in `DataSourceStrategy`
    
       These are added to hook `FSBasedRelation` into physical query plan in 
read path, and perform partition pruning.
    
    ## TODO
    
    - [ ] Use scratch directories when overwriting a table with data selected 
from itself.
    
          Currently, this is not supported, because the table being overwritten 
is always deleted before writing any data to it.
    
    - [ ] When inserting with dynamic partition columns, use external sorter to 
group the data first.
    
          This ensures that we only need to open a single `OutputWriter` at a 
time.  For data sources like Parquet, `OutputWriter`s can be quite memory 
consuming.  One issue is that this approach breaks the row distribution in the 
original DataFrame.  However, we didn't promise to preserve data distribution 
when writing a DataFrame.
    
    - [x] More tests.  Specifically, test cases for
    
          - [x] Self-join
          - [x] Loading partitioned relations with a subset of partition 
columns stored in data files.
          - [x] `SQLContext.load()` with user defined dynamic partition columns.
    
    ## Parquet data source migration
    
    Parquet data source migration is covered in PR 
https://github.com/liancheng/spark/pull/6, which is against this PR branch and 
for preview only. A formal PR needs to be made after this one is merged.
    
    Author: Cheng Lian <[email protected]>
    
    Closes #5526 from liancheng/partitioning-support and squashes the following 
commits:
    
    5351a1b [Cheng Lian] Fixes compilation error introduced while rebasing
    1f9b1a5 [Cheng Lian] Tweaks data schema passed to FSBasedRelations
    43ba50e [Cheng Lian] Avoids serializing generated projection code
    edf49e7 [Cheng Lian] Removed commented stale code block
    348a922 [Cheng Lian] Adds projection in 
FSBasedRelation.buildScan(requiredColumns, inputPaths)
    ad4d4de [Cheng Lian] Enables HDFS style globbing
    8d12e69 [Cheng Lian] Fixes compilation error
    c71ac6c [Cheng Lian] Addresses comments from @marmbrus
    7552168 [Cheng Lian] Fixes typo in MimaExclude.scala
    0349e09 [Cheng Lian] Fixes compilation error introduced while rebasing
    52b0c9b [Cheng Lian] Adjusts project/MimaExclude.scala
    c466de6 [Cheng Lian] Addresses comments
    bc3f9b4 [Cheng Lian] Uses projection to separate partition columns and data 
columns while inserting rows
    795920a [Cheng Lian] Fixes compilation error after rebasing
    0b8cd70 [Cheng Lian] Adds Scala/Catalyst row conversion when writing 
non-partitioned tables
    fa543f3 [Cheng Lian] Addresses comments
    5849dd0 [Cheng Lian] Fixes doc typos.  Fixes partition discovery refresh.
    51be443 [Cheng Lian] Replaces FSBasedRelation.outputCommitterClass with 
FSBasedRelation.prepareForWrite
    c4ed4fe [Cheng Lian] Bug fixes and a new test suite
    a29e663 [Cheng Lian] Bug fix: should only pass actuall data files to 
FSBaseRelation.buildScan
    5f423d3 [Cheng Lian] Bug fixes. Lets data source to customize 
OutputCommitter rather than OutputFormat
    54c3d7b [Cheng Lian] Enforces that FileOutputFormat must be used
    be0c268 [Cheng Lian] Uses TaskAttempContext rather than Configuration in 
OutputWriter.init
    0bc6ad1 [Cheng Lian] Resorts to new Hadoop API, and now FSBasedRelation can 
customize output format class
    f320766 [Cheng Lian] Adds prepareForWrite() hook, refactored writer 
containers
    422ff4a [Cheng Lian] Fixes style issue
    ce52353 [Cheng Lian] Adds new SQLContext.load() overload with user defined 
dynamic partition columns
    8d2ff71 [Cheng Lian] Merges partition columns when reading partitioned 
relations
    ca1805b [Cheng Lian] Removes duplicated partition discovery code in new 
Parquet
    f18dec2 [Cheng Lian] More strict schema checking
    b746ab5 [Cheng Lian] More tests
    9b487bf [Cheng Lian] Fixes compilation errors introduced while rebasing
    ea6c8dd [Cheng Lian] Removes remote debugging stuff
    327bb1d [Cheng Lian] Implements partitioning support for data sources API
    3c5073a [Cheng Lian] Fixes SaveModes used in test cases
    fb5a607 [Cheng Lian] Fixes compilation error
    9d17607 [Cheng Lian] Adds the contract that OutputWriter should have 
zero-arg constructor
    5de194a [Cheng Lian] Forgot Apache licence header
    95d0b4d [Cheng Lian] Renames PartitionedSchemaRelationProvider to 
FSBasedRelationProvider
    770b5ba [Cheng Lian] Adds tests for FSBasedRelation
    3ba9bbf [Cheng Lian] Adds DataFrame.saveAsTable() overrides which support 
partitioning
    1b8231f [Cheng Lian] Renames FSBasedPrunedFilteredScan to FSBasedRelation
    aa8ba9a [Cheng Lian] Javadoc fix
    012ed2d [Cheng Lian] Adds PartitioningOptions
    7dd8dd5 [Cheng Lian] Adds new interfaces and stub methods for data sources 
API partitioning support

commit 50114f79d51ef50f97dff4e9dcd36f608964a3a1
Author: Wenchen Fan <[email protected]>
Date:   2015-05-12T18:51:55Z

    [SPARK-7276] [DATAFRAME] speed up DataFrame.select by collapsing Project
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #5831 from cloud-fan/7276 and squashes the following commits:
    
    ee4a1e1 [Wenchen Fan] fix rebase mistake
    a3b565d [Wenchen Fan] refactor
    99deb5d [Wenchen Fan] add test
    f1f67ad [Wenchen Fan] fix 7276

commit 8c32c5a9b1f265d72fcbbb70699a79688a07393b
Author: Ram Sriharsha <[email protected]>
Date:   2015-05-12T20:35:12Z

    [SPARK-7015] [MLLIB] [WIP] Multiclass to Binary Reduction: One Against All
    
    Initial cut of one against all. The test code is scaffolding and is not fully 
implemented.
    This WIP is to gather early feedback.
    
    Author: Ram Sriharsha <[email protected]>
    
    Closes #5830 from harsha2010/reduction and squashes the following commits:
    
    5f4b495 [Ram Sriharsha] Fix Test
    386e98b [Ram Sriharsha] Style fix
    49b4a17 [Ram Sriharsha] Simplify the test
    02279cc [Ram Sriharsha] Output Label Metadata in Prediction Col
    bc78032 [Ram Sriharsha] Code Review Updates
    8ce4845 [Ram Sriharsha] Merge with Master
    2a807be [Ram Sriharsha] Merge branch 'master' into reduction
    e21bfcc [Ram Sriharsha] Style Fix
    5614f23 [Ram Sriharsha] Style Fix
    c75583a [Ram Sriharsha] Cleanup
    7a5f136 [Ram Sriharsha] Fix TODOs
    804826b [Ram Sriharsha] Merge with Master
    1448a5f [Ram Sriharsha] Style Fix
    6e47807 [Ram Sriharsha] Style Fix
    d63e46b [Ram Sriharsha] Incorporate Code Review Feedback
    ced68b5 [Ram Sriharsha] Refactor OneVsAll to implement Predictor
    78fa82a [Ram Sriharsha] extra line
    0dfa1fb [Ram Sriharsha] Fix inexhaustive match cases that may arise from 
UnresolvedAttribute
    a59a4f4 [Ram Sriharsha] @Experimental
    4167234 [Ram Sriharsha] Merge branch 'master' into reduction
    868a4fd [Ram Sriharsha] @Experimental
    041d905 [Ram Sriharsha] Code Review Fixes
    df188d8 [Ram Sriharsha] Style fix
    612ec48 [Ram Sriharsha] Style Fix
    6ef43d3 [Ram Sriharsha] Prefer Unresolved Attribute to Option: Java APIs 
are cleaner
    6bf6bff [Ram Sriharsha] Update OneHotEncoder to new API
    e29cb89 [Ram Sriharsha] Merge branch 'master' into reduction
    1c7fa44 [Ram Sriharsha] Fix Tests
    ca83672 [Ram Sriharsha] Incorporate Code Review Feedback + Rename to 
OneVsRestClassifier
    221beeed [Ram Sriharsha] Upgrade to use Copy method for cloning Base 
Classifiers
    26f1ddb [Ram Sriharsha] Merge with SPARK-5956 API changes
    9738744 [Ram Sriharsha] Merge branch 'master' into reduction
    1a3e375 [Ram Sriharsha] More efficient Implementation: Use withColumn to 
generate label column dynamically
    32e0189 [Ram Sriharsha] Restrict reduction to Margin Based Classifiers
    ff272da [Ram Sriharsha] Style fix
    28771f5 [Ram Sriharsha] Add Tests for Multiclass to Binary Reduction
    b60f874 [Ram Sriharsha] Fix Style issues in Test
    3191cdf [Ram Sriharsha] Remove this test, accidental commit
    23f056c [Ram Sriharsha] Fix Headers for test
    1b5e929 [Ram Sriharsha] Fix Style issues and add Header
    8752863 [Ram Sriharsha] [SPARK-7015][MLLib][WIP] Multiclass to Binary 
Reduction: One Against All

commit fed2075d4793cb53696c2bba8c3d5559ea4a1259
Author: Joseph K. Bradley <[email protected]>
Date:   2015-05-12T23:42:30Z

    [SPARK-7573] [ML] OneVsRest cleanups
    
    Minor cleanups discussed with [~mengxr]:
    * move OneVsRest from reduction to classification sub-package
    * make model constructor private
    
    Some doc cleanups too
    
    CC: harsha2010  Could you please verify this looks OK?  Thanks!
    
    Author: Joseph K. Bradley <[email protected]>
    
    Closes #6097 from jkbradley/onevsrest-cleanup and squashes the following 
commits:
    
    4ecd48d [Joseph K. Bradley] org imports
    430b065 [Joseph K. Bradley] moved OneVsRest from reduction subpackage to 
classification.  small java doc style fixes
    9f8b9b9 [Joseph K. Bradley] Small cleanups to OneVsRest.  Made model 
constructor private to ml package.

commit 9fc984954ed929e023c1ae33c5495f599f0ffe05
Author: Tathagata Das <[email protected]>
Date:   2015-05-12T23:44:14Z

    [SPARK-7553] [STREAMING] Added methods to maintain a singleton 
StreamingContext
    
    In a REPL/notebook environment, it's very easy to lose a reference to a 
StreamingContext by overwriting the variable. So if you happen to execute 
the following commands
    ```
    val ssc = new StreamingContext(...) // cmd 1
    ssc.start() // cmd 2
    ...
    val ssc = new StreamingContext(...) // accidentally run cmd 1 again
    ```
    The value of ssc will be overwritten. Now you can neither start the new 
context (as only one context can be started), nor stop the previous context (as 
the reference is lost).
    Hence it's best to maintain a singleton reference to the active context, so 
that we never lose the reference to the active context.
    Since this problem usually occurs in REPL environments, it's best to add this 
as experimental support in the Scala API only, so that it can be used in 
Scala REPLs and notebooks.
    
    Author: Tathagata Das <[email protected]>
    
    Closes #6070 from tdas/SPARK-7553 and squashes the following commits:
    
    731c9a1 [Tathagata Das] Fixed style
    a797171 [Tathagata Das] Added more unit tests
    19fc70b [Tathagata Das] Added :: Experimental :: in docs
    64706c9 [Tathagata Das] Fixed test
    634db5d [Tathagata Das] Merge remote-tracking branch 'apache-github/master' 
into SPARK-7553
    3884a25 [Tathagata Das] Fixing test bug
    d37a846 [Tathagata Das] Added getActive and getActiveOrCreate

commit 84dc2e60d4051fc46c20fdc98e3efeb82199546c
Author: Cheng Lian <[email protected]>
Date:   2015-05-13T15:40:13Z

    [MINOR] [SQL] Removes debugging println
    
    Author: Cheng Lian <[email protected]>
    
    Closes #6123 from liancheng/remove-println and squashes the following 
commits:
    
    03356b6 [Cheng Lian] Removes debugging println

commit e3c941b3164454f7cb78a1c0d84b565d6f2e021c
Author: Cheng Lian <[email protected]>
Date:   2015-05-13T18:04:10Z

    [SPARK-7567] [SQL] Migrating Parquet data source to FSBasedRelation
    
    This PR migrates Parquet data source to the newly introduced 
`FSBasedRelation`. `FSBasedParquetRelation` is created to replace 
`ParquetRelation2`. Major differences are:
    
    1. Partition discovery code has been factored out to `FSBasedRelation`
    1. `AppendingParquetOutputFormat` is not used now. Instead, an anonymous 
subclass of `ParquetOutputFormat` is used to handle appending and writing 
dynamic partitions
    1. When scanning partitioned tables, `FSBasedParquetRelation.buildScan` 
only builds an `RDD[Row]` for a single selected partition
    1. `FSBasedParquetRelation` doesn't rely on Catalyst expressions for filter 
push down, thus it doesn't extend `CatalystScan` anymore
    
       After migrating `JSONRelation` (which extends `CatalystScan`), we can 
remove `CatalystScan`.
    
    Author: Cheng Lian <[email protected]>
    
    Closes #6090 from liancheng/parquet-migration and squashes the following 
commits:
    
    6063f87 [Cheng Lian] Casts to OutputCommitter rather than FileOutputCommtter
    bfd1cf0 [Cheng Lian] Fixes compilation error introduced while rebasing
    f9ea56e [Cheng Lian] Adds ParquetRelation2 related classes to MiMa check 
whitelist
    261d8c1 [Cheng Lian] Minor bug fix and more tests
    db65660 [Cheng Lian] Migrates Parquet data source to FSBasedRelation

commit 7077c10cb3117eeabb4a497f8affea0548f9c98b
Author: zsxwing <[email protected]>
Date:   2015-05-14T00:58:29Z

    [HOTFIX] Use 'new Job' in fsBasedParquet.scala
    
    Same issue as #6095
    
    cc liancheng
    
    Author: zsxwing <[email protected]>
    
    Closes #6136 from zsxwing/hotfix and squashes the following commits:
    
    4beea54 [zsxwing] Use 'new Job' in fsBasedParquet.scala

commit ed742fbf3a27e2ba7864bcc86dd4d6ff0232beef
Author: Xiangrui Meng <[email protected]>
Date:   2015-05-14T08:22:15Z

    [SPARK-7407] [MLLIB] use uid + name to identify parameters
    
    A param instance is strongly attached to a parent in the current 
implementation. So if we make a copy of an estimator or a transformer in 
pipelines and other meta-algorithms, it becomes error-prone to copy the params 
to the copied instances. In this PR, a param is identified by its parent's UID 
and the param name. So it becomes loosely attached to its parent and all its 
derivatives. The UID is preserved during copying or fitting. All components now 
have a default constructor and a constructor that takes a UID as input. I kept 
the constructors for Param in this PR to reduce the size of the diff and made 
`parent` a mutable field.
    
    This PR still needs some clean-ups, and there are several spark.ml PRs 
pending. I'll try to get them merged first and then update this PR.
    
    jkbradley
    
    Author: Xiangrui Meng <[email protected]>
    
    Closes #6019 from mengxr/SPARK-7407 and squashes the following commits:
    
    c4c8120 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into 
SPARK-7407
    520f0a2 [Xiangrui Meng] address comments
    2569168 [Xiangrui Meng] fix tests
    873caca [Xiangrui Meng] fix tests in OneVsRest; fix a racing condition in 
shouldOwn
    409ea08 [Xiangrui Meng] minor updates
    83a163c [Xiangrui Meng] update JavaDeveloperApiExample
    5db5325 [Xiangrui Meng] update OneVsRest
    7bde7ae [Xiangrui Meng] merge master
    697fdf9 [Xiangrui Meng] update Bucketizer
    7b4f6c2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into 
SPARK-7407
    629d402 [Xiangrui Meng] fix LRSuite
    154516f [Xiangrui Meng] merge master
    aa4a611 [Xiangrui Meng] fix examples/compile
    a4794dd [Xiangrui Meng] change Param to use  to reduce the size of diff
    fdbc415 [Xiangrui Meng] all tests passed
    c255f17 [Xiangrui Meng] fix tests in ParamsSuite
    818e1db [Xiangrui Meng] merge master
    e1160cf [Xiangrui Meng] fix tests
    fbc39f0 [Xiangrui Meng] pass test:compile
    108937e [Xiangrui Meng] pass compile
    8726d39 [Xiangrui Meng] use parent uid in Param
    eaeed35 [Xiangrui Meng] update Identifiable

commit 6006b60161b7f792ea4c50d21e8dccd8d850854c
Author: Cheng Lian <[email protected]>
Date:   2015-05-15T08:20:49Z

    [SPARK-7591] [SQL] Partitioning support API tweaks
    
    Please see [SPARK-7591] [1] for the details.
    
    /cc rxin marmbrus yhuai
    
    [1]: https://issues.apache.org/jira/browse/SPARK-7591
    
    Author: Cheng Lian <[email protected]>
    
    Closes #6150 from liancheng/spark-7591 and squashes the following commits:
    
    af422e7 [Cheng Lian] Addresses @rxin's comments
    37d1738 [Cheng Lian] Fixes HadoopFsRelation partition columns initialization
    2fc680a [Cheng Lian] Fixes Scala style issue
    189ad23 [Cheng Lian] Removes HadoopFsRelation constructor arguments
    522c24e [Cheng Lian] Adds OutputWriterFactory
    047d40d [Cheng Lian] Renames FSBased* to HadoopFs*, also renamed 
FSBasedParquetRelation back to ParquetRelation2

commit 1736caa5b3825319e08f459ad573636b2af5b7d1
Author: Christian Kadner <[email protected]>
Date:   2015-05-16T07:06:32Z

    Merge branch 'master' of https://github.com/ckadner/spark

commit dffeea40c5d59e9e755de78551523b9e68e1e36b
Author: Christian Kadner <[email protected]>
Date:   2015-05-16T07:21:09Z

    Merge branch 'master' of https://github.com/apache/spark

commit 0584d03cc4ab59e2f6f770a50f56048bd1daf432
Author: Christian Kadner <[email protected]>
Date:   2015-05-16T07:33:56Z

    Merge branch 'master' of https://github.com/apache/spark

commit e180b73e073536dd844c26997cc1cd532ae033df
Author: Christian Kadner <[email protected]>
Date:   2015-05-18T18:08:30Z

    Merge branch 'master' of https://github.com/apache/spark

commit 1f93f21eacf2846262e1511066cf841ad2140cfa
Author: Christian Kadner <[email protected]>
Date:   2015-05-18T18:59:38Z

    SPARK-6785: use Math.floor() when converting milliseconds to days in 
function millisToDays(Long) otherwise the result of calling function 
fromJavaDate(java.util.Date) will be off by one day for Dates before 1970

----

