GitHub user souravaswal opened a pull request:
https://github.com/apache/spark/pull/19541
ABCD
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise,
remove this)
Please review http://spark.apache.org/contributing.html before opening a
pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19541.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19541
----
commit 9e451bcf36151bf401f72dcd66001b9ceb079738
Author: Dongjoon Hyun <[email protected]>
Date: 2017-09-05T21:35:09Z
[MINOR][DOC] Update `Partition Discovery` section to enumerate all
available file sources
## What changes were proposed in this pull request?
All built-in file-based data sources support `Partition Discovery`. We should
update the document to state this clearly so that users can take full advantage of it.
**AFTER**
<img width="906" alt="1"
src="https://user-images.githubusercontent.com/9700541/30083628-14278908-9244-11e7-98dc-9ad45fe233a9.png">
## How was this patch tested?
```
SKIP_API=1 jekyll serve --watch
```
Author: Dongjoon Hyun <[email protected]>
Closes #19139 from dongjoon-hyun/partitiondiscovery.
commit 6a2325448000ba431ba3b982d181c017559abfe3
Author: jerryshao <[email protected]>
Date: 2017-09-06T01:39:39Z
[SPARK-18061][THRIFTSERVER] Add spnego auth support for ThriftServer
thrift/http protocol
Spark ThriftServer doesn't support SPNEGO auth for the thrift/http protocol,
which is mainly needed in the Knox + ThriftServer scenario. HiveServer2's
CLIService already has code to support it, so this copies that code into the
Spark ThriftServer.
Related Hive JIRA HIVE-6697.
Manual verification.
Author: jerryshao <[email protected]>
Closes #18628 from jerryshao/SPARK-21407.
Change-Id: I61ef0c09f6972bba982475084a6b0ae3a74e385e
commit 445f1790ade1c53cf7eee1f282395648e4d0992c
Author: jerryshao <[email protected]>
Date: 2017-09-06T04:28:54Z
[SPARK-9104][CORE] Expose Netty memory metrics in Spark
## What changes were proposed in this pull request?
This PR exposes Netty memory usage for Spark's `TransportClientFactory` and
`TransportServer`, including the detailed metrics of each direct and heap
arena as well as aggregated metrics. The purpose of adding the Netty metrics
is to better understand the memory usage of Netty in Spark shuffle, RPC and
other network communication, and to guide better configuration of executor
memory.
This PR doesn't expose these metrics to any sink; to leverage this feature,
one still needs to connect them to the MetricsSystem or collect them back to
the driver for display.
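For context only: the aggregated numbers ultimately come from Netty's pooled allocator metrics. A minimal sketch of reading them through Netty's own 4.1 API (not Spark's internal wiring) might look like:
```scala
import io.netty.buffer.PooledByteBufAllocator

// Allocator-level metrics; per-arena details are available via
// metric.directArenas() and metric.heapArenas().
val metric = PooledByteBufAllocator.DEFAULT.metric()
println(s"direct arenas: ${metric.numDirectArenas()}, heap arenas: ${metric.numHeapArenas()}")
println(s"used direct memory: ${metric.usedDirectMemory()} bytes")
println(s"used heap memory: ${metric.usedHeapMemory()} bytes")
```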
## How was this patch tested?
Add Unit test to verify it, also manually verified in real cluster.
Author: jerryshao <[email protected]>
Closes #18935 from jerryshao/SPARK-9104.
commit 4ee7dfe41b27abbd4c32074ecc8f268f6193c3f4
Author: Riccardo Corbella <[email protected]>
Date: 2017-09-06T07:22:57Z
[SPARK-21924][DOCS] Update structured streaming programming guide doc
## What changes were proposed in this pull request?
Update the line "For example, the data (12:09, cat) is out of order and
late, and it falls in windows 12:05 - 12:15 and 12:10 - 12:20." to read "For
example, the data (12:09, cat) is out of order and late, and it falls in
windows 12:00 - 12:10 and 12:05 - 12:15." in the structured streaming
programming guide.
Author: Riccardo Corbella <[email protected]>
Closes #19137 from riccardocorbella/bugfix.
commit 16c4c03c71394ab30c8edaf4418973e1a2c5ebfe
Author: Bryan Cutler <[email protected]>
Date: 2017-09-06T12:12:27Z
[SPARK-19357][ML] Adding parallel model evaluation in ML tuning
## What changes were proposed in this pull request?
Modified `CrossValidator` and `TrainValidationSplit` to be able to evaluate
models in parallel for a given parameter grid. The level of parallelism is
controlled by a parameter `numParallelEval`, which schedules a number of
models to be trained/evaluated concurrently. This is a naive approach that
does not check the cluster for available resources, so care must be taken by
the user to tune the parameter appropriately. The default value is `1`, which
trains/evaluates in serial.
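A hedged sketch of the kind of tuning job this affects; the parallelism setter is only indicated in a comment because the exact name (`numParallelEval`) comes from the description above and may differ in the final API:
```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
// With this change, the `numParallelEval` param would control how many of the
// 3 folds x 3 param maps = 9 fits run concurrently (default 1), e.g. something
// like cv.setNumParallelEval(2) -- setter name assumed from the description.

// val cvModel = cv.fit(trainingDF)  // `trainingDF` is a hypothetical labeled DataFrame
```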
## How was this patch tested?
Added unit tests for CrossValidator and TrainValidationSplit to verify that
model selection is the same when run in serial vs parallel. Manual testing to
verify tasks run in parallel when param is > 1. Added parameter usage to
relevant examples.
Author: Bryan Cutler <[email protected]>
Closes #16774 from BryanCutler/parallel-model-eval-SPARK-19357.
commit 64936c14a7ef30b9eacb129bafe6a1665887bf21
Author: hyukjinkwon <[email protected]>
Date: 2017-09-06T14:28:12Z
[SPARK-21903][BUILD][FOLLOWUP] Upgrade scalastyle-maven-plugin and
scalastyle as well in POM and SparkBuild.scala
## What changes were proposed in this pull request?
This PR proposes to match scalastyle version in POM and SparkBuild.scala
## How was this patch tested?
Manual builds.
Author: hyukjinkwon <[email protected]>
Closes #19146 from HyukjinKwon/SPARK-21903-follow-up.
commit f2e22aebfe49cdfdf20f060305772971bcea9266
Author: Liang-Chi Hsieh <[email protected]>
Date: 2017-09-06T14:42:19Z
[SPARK-21835][SQL] RewritePredicateSubquery should not produce unresolved
query plans
## What changes were proposed in this pull request?
Correlated predicate subqueries are rewritten into `Join` by the rule
`RewritePredicateSubquery` during optimization.
It is possible that the two sides of the `Join` have conflicting attributes.
The query plans produced by `RewritePredicateSubquery` then become unresolved
and break structural integrity.
We should check if there are conflicting attributes in the `Join` and
de-duplicate them by adding a `Project`.
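For illustration, a correlated `EXISTS` over the same table is the kind of query where the rewritten join's two sides can carry identical attributes (names below are made up):
```scala
import spark.implicits._

val t = Seq((1, "a"), (2, "b")).toDF("id", "v")
t.createOrReplaceTempView("t")

// After RewritePredicateSubquery turns the EXISTS into a semi join, both sides
// expose columns named `id` and `v`, so the conflicting attributes must be
// de-duplicated with a Project to keep the plan resolved.
spark.sql("SELECT * FROM t WHERE EXISTS (SELECT 1 FROM t t2 WHERE t2.id = t.id)").show()
```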
## How was this patch tested?
Added tests.
Author: Liang-Chi Hsieh <[email protected]>
Closes #19050 from viirya/SPARK-21835.
commit 36b48ee6e92661645648a001d0d83623a8e5d601
Author: Felix Cheung <[email protected]>
Date: 2017-09-06T16:53:55Z
[SPARK-21801][SPARKR][TEST] set random seed for predictable test
## What changes were proposed in this pull request?
set.seed() before running tests
## How was this patch tested?
jenkins, appveyor
Author: Felix Cheung <[email protected]>
Closes #19111 from felixcheung/rranseed.
commit acdf45fb52e29a0308cccdbef0ec0dca0815d300
Author: Jose Torres <[email protected]>
Date: 2017-09-06T18:19:46Z
[SPARK-21765] Check that optimization doesn't affect isStreaming bit.
## What changes were proposed in this pull request?
Add an assert in logical plan optimization that the isStreaming bit stays
the same, and fix empty relation rules where that wasn't happening.
## How was this patch tested?
new and existing unit tests
Author: Jose Torres <[email protected]>
Author: Jose Torres <[email protected]>
Closes #19056 from joseph-torres/SPARK-21765-followup.
commit fa0092bddf695a757f5ddaed539e55e2dc9fccb7
Author: Jacek Laskowski <[email protected]>
Date: 2017-09-06T22:48:48Z
[SPARK-21901][SS] Define toString for StateOperatorProgress
## What changes were proposed in this pull request?
Just `StateOperatorProgress.toString` + few formatting fixes
## How was this patch tested?
Local build. Waiting for OK from Jenkins.
Author: Jacek Laskowski <[email protected]>
Closes #19112 from
jaceklaskowski/SPARK-21901-StateOperatorProgress-toString.
commit aad2125475dcdeb4a0410392b6706511db17bac4
Author: Tucker Beck <[email protected]>
Date: 2017-09-07T00:38:00Z
Fixed pandoc dependency issue in python/setup.py
## Problem Description
When pyspark is listed as a dependency of another package, installing
the other package will cause an install failure in pyspark. When the
other package is being installed, pyspark's setup_requires requirements
are installed including pypandoc. Thus, the exception handling on
setup.py:152 does not work because the pypandoc module is indeed
available. However, the pypandoc.convert() function fails if pandoc
itself is not installed (in our use cases it is not). This raises an
OSError that is not handled, and setup fails.
The following is a sample failure:
```
$ which pandoc
$ pip freeze | grep pypandoc
pypandoc==1.4
$ pip install pyspark
Collecting pyspark
Downloading pyspark-2.2.0.post0.tar.gz (188.3MB)
    100% |████████████████████████████████| 188.3MB 16.8MB/s
Complete output from command python setup.py egg_info:
Maybe try:
sudo apt-get install pandoc
See http://johnmacfarlane.net/pandoc/installing.html
for installation options
---------------------------------------------------------------
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-mfnizcwa/pyspark/setup.py", line 151, in <module>
        long_description = pypandoc.convert('README.md', 'rst')
      File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 69, in convert
        outputfile=outputfile, filters=filters)
      File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 260, in _convert_input
        _ensure_pandoc_path()
      File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 544, in _ensure_pandoc_path
        raise OSError("No pandoc was found: either install pandoc and add it\n"
    OSError: No pandoc was found: either install pandoc and add it
    to your PATH or or call pypandoc.download_pandoc(...) or
    install pypandoc wheels with included pandoc.
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in
/tmp/pip-build-mfnizcwa/pyspark/
```
## What changes were proposed in this pull request?
This change simply adds an additional exception handler for the OSError
that is raised. This allows pyspark to be installed client-side without
requiring pandoc to be installed.
## How was this patch tested?
I tested this by building a wheel package of pyspark with the change
applied. Then, in a clean virtual environment with pypandoc installed but
pandoc not available on the system, I installed pyspark from the wheel.
Here is the output
```
$ pip freeze | grep pypandoc
pypandoc==1.4
$ which pandoc
$ pip install --no-cache-dir ../spark/python/dist/pyspark-2.3.0.dev0-py2.py3-none-any.whl
Processing /home/tbeck/work/spark/python/dist/pyspark-2.3.0.dev0-py2.py3-none-any.whl
Requirement already satisfied: py4j==0.10.6 in /home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages (from pyspark==2.3.0.dev0)
Installing collected packages: pyspark
Successfully installed pyspark-2.3.0.dev0
```
Author: Tucker Beck <[email protected]>
Closes #18981 from
dusktreader/dusktreader/fix-pandoc-dependency-issue-in-setup_py.
commit ce7293c150c71a872d20beda44b12dec9deca18d
Author: Liang-Chi Hsieh <[email protected]>
Date: 2017-09-07T05:15:25Z
[SPARK-21835][SQL][FOLLOW-UP] RewritePredicateSubquery should not produce
unresolved query plans
## What changes were proposed in this pull request?
This is a follow-up of #19050 to deal with `ExistenceJoin` case.
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <[email protected]>
Closes #19151 from viirya/SPARK-21835-followup.
commit eea2b877cf4e6ba4ea524bf8d782516add1b093e
Author: Dongjoon Hyun <[email protected]>
Date: 2017-09-07T05:20:48Z
[SPARK-21912][SQL] ORC/Parquet table should not create invalid column names
## What changes were proposed in this pull request?
Currently, users hit job abortions while creating or altering ORC/Parquet
tables with invalid column names. We had better prevent this by raising an
**AnalysisException** with a guide to use aliases instead, as Parquet data
source tables already do.
**BEFORE**
```scala
scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
17/09/04 13:28:21 ERROR Utils: Aborting task
java.lang.IllegalArgumentException: Error: : expected at the position 8 of
'struct<a b:int>' but ' ' is found.
17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001
aborted.
17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.spark.SparkException: Task failed while writing rows.
```
**AFTER**
```scala
scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
17/09/04 13:27:40 ERROR CreateDataSourceTableAsSelectCommand: Failed to
write to table orc1
org.apache.spark.sql.AnalysisException: Attribute name "a b" contains
invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
```
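The workaround the new message suggests, sketched here for illustration (rename the column with a valid alias):
```scala
scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 AS a_b")
```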
## How was this patch tested?
Pass the Jenkins with a new test case.
Author: Dongjoon Hyun <[email protected]>
Closes #19124 from dongjoon-hyun/SPARK-21912.
commit b9ab791a9efb0dc165ba283c91acf831fa6be5d8
Author: Sanket Chintapalli <[email protected]>
Date: 2017-09-07T16:25:24Z
[SPARK-21890] Credentials not being passed to add the tokens
I observed this while running an Oozie job trying to connect to HBase via
Spark.
It looks like the creds are not being passed in
https://github.com/apache/spark/blob/branch-2.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/security/HadoopFSCredentialProvider.scala#L53
for the 2.2 release.
More info as to why it fails on a secure grid:
The Oozie client gets the necessary tokens the application needs before
launching. It passes those tokens along to the Oozie launcher job (an MR job),
which then actually calls the Spark client to launch the Spark app and passes
the tokens along.
The Oozie launcher job cannot get any more tokens because all it has is
tokens (you can't get tokens with tokens; you need a TGT or keytab).
The error here is because the launcher job runs the Spark client to submit
the Spark job, but the Spark client doesn't see that it already has the HDFS
tokens, so it tries to get more, which ends with the exception.
There was a change in SPARK-19021 to generalize the HDFS credentials
provider; it changed things so we don't pass the existing credentials into
the call to get tokens, so it doesn't realize it already has the necessary
tokens.
https://issues.apache.org/jira/browse/SPARK-21890
Modified to pass creds to get delegation tokens
Author: Sanket Chintapalli <[email protected]>
Closes #19140 from redsanket/SPARK-21890-master.
commit e00f1a1da12be4a1fdb7b89eb5e098aa16c5c2c3
Author: Dongjoon Hyun <[email protected]>
Date: 2017-09-07T23:26:56Z
[SPARK-13656][SQL] Delete spark.sql.parquet.cacheMetadata from SQLConf and
docs
## What changes were proposed in this pull request?
Since [SPARK-15639](https://github.com/apache/spark/pull/13701),
`spark.sql.parquet.cacheMetadata` and `PARQUET_CACHE_METADATA` are not used.
This PR removes them from SQLConf and the docs.
## How was this patch tested?
Pass the existing Jenkins.
Author: Dongjoon Hyun <[email protected]>
Closes #19129 from dongjoon-hyun/SPARK-13656.
commit c26976fe148a2a59cec2f399484be73d08fb6b7f
Author: Dongjoon Hyun <[email protected]>
Date: 2017-09-08T01:31:13Z
[SPARK-21939][TEST] Use TimeLimits instead of Timeouts
Since ScalaTest 3.0.0, `org.scalatest.concurrent.Timeouts` is deprecated.
This PR replaces the deprecated one with
`org.scalatest.concurrent.TimeLimits`.
```scala
-import org.scalatest.concurrent.Timeouts._
+import org.scalatest.concurrent.TimeLimits._
```
Pass the existing test suites.
Author: Dongjoon Hyun <[email protected]>
Closes #19150 from dongjoon-hyun/SPARK-21939.
Change-Id: I1a1b07f1b97e51e2263dfb34b7eaaa099b2ded5e
commit 57bc1e9eb452284cbed090dbd5008eb2062f1b36
Author: Takuya UESHIN <[email protected]>
Date: 2017-09-08T05:26:07Z
[SPARK-21950][SQL][PYTHON][TEST] pyspark.sql.tests.SQLTests2 should stop
SparkContext.
## What changes were proposed in this pull request?
`pyspark.sql.tests.SQLTests2` doesn't stop the newly created SparkContext in
the test, which might affect the following tests.
This PR makes `pyspark.sql.tests.SQLTests2` stop the `SparkContext`.
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <[email protected]>
Closes #19158 from ueshin/issues/SPARK-21950.
commit f62b20f39c5e44ad6de535117e076060fef3f9ec
Author: liuxian <[email protected]>
Date: 2017-09-08T06:09:26Z
[SPARK-21949][TEST] Tables created in unit tests should be dropped after use
## What changes were proposed in this pull request?
Tables should be dropped after use in unit tests.
## How was this patch tested?
N/A
Author: liuxian <[email protected]>
Closes #19155 from 10110346/droptable.
commit 6e37524a1fd26bbfe5034ecf971472931d1d47a9
Author: Liang-Chi Hsieh <[email protected]>
Date: 2017-09-08T06:12:18Z
[SPARK-21726][SQL] Check for structural integrity of the plan in Optimizer
in test mode.
## What changes were proposed in this pull request?
We have many optimization rules now in `Optimizer`. Right now we don't have
any checks in the optimizer for the structural integrity of the plan (e.g.,
whether it is still resolved). When debugging, it is difficult to identify
which rules return invalid plans.
It would be great if, in test mode, we could check whether a plan is still
resolved after the execution of each rule, so we can catch rules that return
invalid plans.
## How was this patch tested?
Added tests.
Author: Liang-Chi Hsieh <[email protected]>
Closes #18956 from viirya/SPARK-21726.
commit dbb824125d4d31166d9a47c330f8d51f5d159515
Author: Wenchen Fan <[email protected]>
Date: 2017-09-08T06:21:49Z
[SPARK-21936][SQL] backward compatibility test framework for
HiveExternalCatalog
## What changes were proposed in this pull request?
`HiveExternalCatalog` is a semi-public interface. When creating tables,
`HiveExternalCatalog` converts the table metadata to the Hive table format and
saves it into the Hive metastore. It's very important to guarantee backward
compatibility here, i.e., tables created by previous Spark versions should
still be readable in newer Spark versions.
Previously we found backward compatibility issues manually, which makes it
really easy to miss bugs. This PR introduces a test framework to automatically
test `HiveExternalCatalog` backward compatibility, by downloading Spark
binaries of different versions, creating tables with those Spark versions, and
reading those tables with the current Spark version.
## How was this patch tested?
test-only change
Author: Wenchen Fan <[email protected]>
Closes #19148 from cloud-fan/test.
commit 0dfc1ec59e45c836cb968bc9b77c69bf0e917b06
Author: Liang-Chi Hsieh <[email protected]>
Date: 2017-09-08T11:21:37Z
[SPARK-21726][SQL][FOLLOW-UP] Check for structural integrity of the plan in
Optimizer in test mode
## What changes were proposed in this pull request?
The condition in `Optimizer.isPlanIntegral` is wrong. We should always
return `true` if not in test mode.
## How was this patch tested?
Manually test.
Author: Liang-Chi Hsieh <[email protected]>
Closes #19161 from viirya/SPARK-21726-followup.
commit 8a4f228dc0afed7992695486ecab6bc522f1e392
Author: Kazuaki Ishizaki <[email protected]>
Date: 2017-09-08T16:39:20Z
[SPARK-21946][TEST] fix flaky test: "alter table: rename cached table" in
InMemoryCatalogedDDLSuite
## What changes were proposed in this pull request?
This PR fixes the flaky test `InMemoryCatalogedDDLSuite "alter table: rename
cached table"`.
Since this test validates a distributed DataFrame, the result should be
checked by using `checkAnswer`. The original version used the
`df.collect().Seq` method, which does not guarantee the order of the elements
in the result.
## How was this patch tested?
Use existing test case
Author: Kazuaki Ishizaki <[email protected]>
Closes #19159 from kiszk/SPARK-21946.
commit 8598d03a00a39dd23646bf752f9fed5d28e271c6
Author: hyukjinkwon <[email protected]>
Date: 2017-09-08T18:57:33Z
[SPARK-15243][ML][SQL][PYTHON] Add missing support for unicode in Param
methods & functions in dataframe
## What changes were proposed in this pull request?
This PR proposes to support unicode strings in Param methods in ML and in
other functions in DataFrame that missed it.
For example, this fails in Python 2.x when the param name is a unicode
string:
```python
>>> from pyspark.ml.classification import LogisticRegression
>>> lr = LogisticRegression()
>>> lr.hasParam("threshold")
True
>>> lr.hasParam(u"threshold")
Traceback (most recent call last):
...
raise TypeError("hasParam(): paramName must be a string")
TypeError: hasParam(): paramName must be a string
```
This PR is based on https://github.com/apache/spark/pull/13036
## How was this patch tested?
Unit tests in `python/pyspark/ml/tests.py` and
`python/pyspark/sql/tests.py`.
Author: hyukjinkwon <[email protected]>
Author: sethah <[email protected]>
Closes #17096 from HyukjinKwon/SPARK-15243.
commit 31c74fec24ae3bc8b9eb4ecd90896de459c3cc22
Author: Xin Ren <[email protected]>
Date: 2017-09-08T19:09:00Z
[SPARK-19866][ML][PYSPARK] Add local version of Word2Vec findSynonyms for
spark.ml: Python API
https://issues.apache.org/jira/browse/SPARK-19866
## What changes were proposed in this pull request?
Add Python API for findSynonymsArray matching Scala API.
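For reference, a minimal sketch of the existing Scala `findSynonymsArray` that the new Python API mirrors (data and names are made up):
```scala
import org.apache.spark.ml.feature.Word2Vec
import spark.implicits._  // assuming a spark-shell style session

val docs = Seq("a b c".split(" "), "a b b c a".split(" ")).map(Tuple1.apply).toDF("text")
val model = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("vec")
  .setVectorSize(3)
  .setMinCount(0)
  .fit(docs)

// Returns an Array[(String, Double)] locally instead of a DataFrame.
model.findSynonymsArray("a", 2).foreach(println)
```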
## How was this patch tested?
Manual test
`./python/run-tests --python-executables=python2.7 --modules=pyspark-ml`
Author: Xin Ren <[email protected]>
Author: Xin Ren <[email protected]>
Author: Xin Ren <[email protected]>
Closes #17451 from keypointt/SPARK-19866.
commit 8a5eb5068104f527426fb2d0908f45c8eff0749f
Author: Andrew Ash <[email protected]>
Date: 2017-09-09T06:33:15Z
[SPARK-21941] Stop storing unused attemptId in SQLTaskMetrics
## What changes were proposed in this pull request?
In a driver heap dump containing 390,105 instances of SQLTaskMetrics this
would have saved me approximately 3.2MB of memory.
Since we're not getting any benefit from storing this unused value, let's
eliminate it until a future PR makes use of it.
## How was this patch tested?
Existing unit tests
Author: Andrew Ash <[email protected]>
Closes #19153 from ash211/aash/trim-sql-listener.
commit 6b45d7e941eba8a36be26116787322d9e3ae25d0
Author: Liang-Chi Hsieh <[email protected]>
Date: 2017-09-09T10:10:52Z
[SPARK-21954][SQL] JacksonUtils should verify MapType's value type instead
of key type
## What changes were proposed in this pull request?
`JacksonUtils.verifySchema` verifies if a data type can be converted to
JSON. For `MapType`, it now verifies the key type. However, in
`JacksonGenerator`, when converting a map to JSON, we only care about its
values and create a writer for the values. The keys in a map are treated as
strings by calling `toString` on the keys.
Thus, we should change `JacksonUtils.verifySchema` to verify the value type
of `MapType`.
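A small sketch of why only the value type matters when writing JSON (keys are stringified), assuming a spark-shell style session:
```scala
import spark.implicits._

// Map keys become JSON field names via toString; only the value type needs to
// be JSON-serializable, which is what verifySchema should be checking.
val df = Seq(Map(1 -> "a", 2 -> "b")).toDF("m")
df.toJSON.show(false)
// {"m":{"1":"a","2":"b"}}
```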
## How was this patch tested?
Added tests.
Author: Liang-Chi Hsieh <[email protected]>
Closes #19167 from viirya/test-jacksonutils.
commit e4d8f9a36ac27b0175f310bf5592b2881b025468
Author: Yanbo Liang <[email protected]>
Date: 2017-09-09T16:25:12Z
[MINOR][SQL] Correct DataFrame doc.
## What changes were proposed in this pull request?
Correct DataFrame doc.
## How was this patch tested?
Only doc change, no tests.
Author: Yanbo Liang <[email protected]>
Closes #19173 from yanboliang/df-doc.
commit f76790557b063edc3080d5c792167e2f8b7060d1
Author: Jane Wang <[email protected]>
Date: 2017-09-09T18:48:34Z
[SPARK-4131] Support "Writing data into the filesystem from queries"
## What changes were proposed in this pull request?
This PR implements the SQL feature:
```
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
  [ROW FORMAT row_format] [STORED AS file_format]
  SELECT ... FROM ...
```
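A hedged usage sketch of the new syntax (directory and table names are made up):
```scala
// Write the query result as Parquet files under a local directory.
spark.sql("""
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/query_out'
  STORED AS parquet
  SELECT id, name FROM people
""")
```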
## How was this patch tested?
Added new unittests and also pulled the code to fb-spark so that we could
test writing to hdfs directory.
Author: Jane Wang <[email protected]>
Closes #18975 from janewangfb/port_local_directory.
commit 520d92a191c3148498087d751aeeddd683055622
Author: Peter Szalai <[email protected]>
Date: 2017-09-10T08:47:45Z
[SPARK-20098][PYSPARK] dataType's typeName fix
## What changes were proposed in this pull request?
`typeName` classmethod has been fixed by using type -> typeName map.
## How was this patch tested?
local build
Author: Peter Szalai <[email protected]>
Closes #17435 from szalai1/datatype-gettype-fix.
commit 6273a711b69139ef0210f59759030a0b4a26b118
Author: Jen-Ming Chung <[email protected]>
Date: 2017-09-11T00:26:43Z
[SPARK-21610][SQL] Corrupt records are not handled properly when creating a
dataframe from a file
## What changes were proposed in this pull request?
```
echo '{"field": 1}
{"field": 2}
{"field": "3"}' >/tmp/sample.json
```
```scala
import org.apache.spark.sql.types._
val schema = new StructType()
.add("field", ByteType)
.add("_corrupt_record", StringType)
val file = "/tmp/sample.json"
val dfFromFile = spark.read.schema(schema).json(file)
scala> dfFromFile.show(false)
+-----+---------------+
|field|_corrupt_record|
+-----+---------------+
|1 |null |
|2 |null |
|null |{"field": "3"} |
+-----+---------------+
scala> dfFromFile.filter($"_corrupt_record".isNotNull).count()
res1: Long = 0
scala> dfFromFile.filter($"_corrupt_record".isNull).count()
res2: Long = 3
```
When the `requiredSchema` only contains `_corrupt_record`, the derived
`actualSchema` is empty and `_corrupt_record` is null for all rows.
This PR captures the above situation and raises an exception with a
reasonable workaround message so that users know what happened and how to fix
the query.
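The workaround the new message points users to is to cache (or save) the parsed result before querying only `_corrupt_record`; a small sketch continuing the example above:
```scala
// With the full parsed rows materialized first, the corrupt-record column is
// populated even when it is later queried on its own.
val cached = dfFromFile.cache()
cached.filter($"_corrupt_record".isNotNull).count()  // 1 corrupt row in this sample
```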
## How was this patch tested?
Added test case.
Author: Jen-Ming Chung <[email protected]>
Closes #18865 from jmchung/SPARK-21610.
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]