GitHub user sjjpo2002 opened a pull request:

    https://github.com/apache/spark/pull/13615

    Fix the import typo in Python example

    ## What changes were proposed in this pull request?
    
    The Python [example](https://spark.apache.org/docs/latest/mllib-data-types.html#local-matrix) for the local matrix data type in MLlib has the wrong import line:
    
    `import org.apache.spark.mllib.linalg.{Matrix, Matrices}`
    
    It appears to have been copied from the Scala example by mistake; presumably it should be the PySpark import `from pyspark.mllib.linalg import Matrix, Matrices`.
    
    I couldn't find the source file to apply the change myself.
    
    
    
    ## How was this patch tested?
    
    It's a doc-only typo fix, so no tests are needed.
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13615.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13615
    
----
commit dafcb05c2ef8e09f45edfb7eabf58116c23975a0
Author: Sameer Agarwal <[email protected]>
Date:   2016-05-23T06:32:39Z

    [SPARK-15425][SQL] Disallow cross joins by default
    
    ## What changes were proposed in this pull request?
    
    In order to prevent users from inadvertently writing queries with cartesian joins, this patch introduces a new conf `spark.sql.crossJoin.enabled` (set to `false` by default). When it is disabled, any query that contains one or more cartesian products fails with a `SparkException`.
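    
    A minimal usage sketch (assuming a `spark` session in the shell; this is an illustration, not code from the patch):
    
    ```scala
    val left  = spark.range(3).toDF("a")
    val right = spark.range(3).toDF("b")
    
    // left.join(right).count()   // fails while cross joins are disabled (the default after this patch)
    
    spark.conf.set("spark.sql.crossJoin.enabled", "true")   // opt in explicitly
    left.join(right).count()      // cartesian product now allowed: 9 rows
    ```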
    
    ## How was this patch tested?
    
    Added a test to verify the new behavior in `JoinSuite`. Additionally, 
`SQLQuerySuite` and `SQLMetricsSuite` were modified to explicitly enable 
cartesian products.
    
    Author: Sameer Agarwal <[email protected]>
    
    Closes #13209 from sameeragarwal/disallow-cartesian.

commit 80091b8a6840b562cf76341926e5b828d4def7e2
Author: Davies Liu <[email protected]>
Date:   2016-05-23T17:48:25Z

    [SPARK-14031][SQL] speedup CSV writer
    
    ## What changes were proposed in this pull request?
    
    Currently, we create a CSVWriter for every row, which is very expensive and memory hungry; it took about 15 seconds to write out 1 million rows (two columns).
    
    This PR writes the rows in batch mode, creating a CSVWriter for every 1k rows, which can write out 1 million rows in about 1 second (15X faster).
    
    ## How was this patch tested?
    
    Manually benchmark it.
    
    Author: Davies Liu <[email protected]>
    
    Closes #13229 from davies/csv_writer.

commit 07c36a2f07fcf5da6fb395f830ebbfc10eb27dcc
Author: Wenchen Fan <[email protected]>
Date:   2016-05-23T18:13:27Z

    [SPARK-15471][SQL] ScalaReflection cleanup
    
    ## What changes were proposed in this pull request?
    
    1. Simplify the logic of deserializing option types.
    2. Simplify the logic of serializing array types, and remove `silentSchemaFor`.
    3. Remove some unnecessary code.
    
    ## How was this patch tested?
    
    existing tests
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #13250 from cloud-fan/encoder.

commit 2585d2b322f3b6b85a0a12ddf7dcde957453000d
Author: Andrew Or <[email protected]>
Date:   2016-05-23T18:55:03Z

    [SPARK-15279][SQL] Catch conflicting SerDe when creating table
    
    ## What changes were proposed in this pull request?
    
    The user may do something like:
    ```
    CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS PARQUET
    CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS ... SERDE 
'myserde'
    CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ORC
    CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ... SERDE 'myserde'
    ```
    None of these should be allowed because the SerDes conflict. As of this patch:
    - `ROW FORMAT DELIMITED` is only compatible with `TEXTFILE`
    - `ROW FORMAT SERDE` is only compatible with `TEXTFILE`, `RCFILE` and 
`SEQUENCEFILE`
    
    ## How was this patch tested?
    
    New tests in `DDLCommandSuite`.
    
    Author: Andrew Or <[email protected]>
    
    Closes #13068 from andrewor14/row-format-conflict.

commit 37c617e4f580482b59e1abbe3c0c27c7125cf605
Author: Dongjoon Hyun <[email protected]>
Date:   2016-05-23T21:19:25Z

    [MINOR][SQL][DOCS] Add notes of the deterministic assumption on UDF 
functions
    
    ## What changes were proposed in this pull request?
    
    Spark assumes that UDF functions are deterministic. This PR adds explicit 
notes about that.
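    
    A small Scala sketch (mine, not part of the PR) of why the assumption matters:
    
    ```scala
    import org.apache.spark.sql.functions.udf
    import spark.implicits._
    
    // A non-deterministic UDF breaks the documented assumption: the optimizer may
    // re-evaluate or reorder it, so results can look inconsistent.
    val rand01 = udf(() => scala.util.Random.nextDouble())
    
    val df = spark.range(5).withColumn("r", rand01())
    df.filter($"r" > 0.5).show()   // the filter may re-evaluate rand01, so the
                                   // displayed r values are not guaranteed to exceed 0.5
    ```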
    
    ## How was this patch tested?
    
    It's only about docs.
    
    Author: Dongjoon Hyun <[email protected]>
    
    Closes #13087 from dongjoon-hyun/SPARK-15282.

commit 03c7b7c4b9374f0cb6a29aeaf495bd21c2563de4
Author: sureshthalamati <[email protected]>
Date:   2016-05-24T00:15:19Z

    [SPARK-15315][SQL] Adding error check to  the CSV datasource writer for 
unsupported complex data types.
    
    ## What changes were proposed in this pull request?
    
    Adds error handling to the CSV writer for unsupported complex data types. Currently, garbage gets written to the output CSV files if the data frame schema has complex data types.
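    
    A hedged illustration of the kind of write this check now rejects (toy data; the output path is hypothetical):
    
    ```scala
    import spark.implicits._
    
    // An array column has no flat CSV representation, so the writer should fail
    // fast instead of emitting garbage.
    val df = Seq((1, Seq("a", "b")), (2, Seq("c"))).toDF("id", "tags")
    
    df.write.csv("/tmp/csv_complex_types_example")
    ```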
    
    ## How was this patch tested?
    
    Added new unit test case.
    
    Author: sureshthalamati <[email protected]>
    
    Closes #13105 from sureshthalamati/csv_complex_types_SPARK-15315.

commit a8e97d17b91684e68290d9f18a43622232aa94e7
Author: hyukjinkwon <[email protected]>
Date:   2016-05-24T00:20:29Z

    [MINOR][SPARKR][DOC] Add a description for running unit tests in Windows
    
    ## What changes were proposed in this pull request?
    
    This PR adds the description for running unit tests in Windows.
    
    ## How was this patch tested?
    
    On a bare machine (Windows 7, 32-bit), this was manually built and tested.
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #13217 from HyukjinKwon/minor-r-doc.

commit 01659bc50cd3d53815d205d005c3678e714c08e0
Author: Xin Wu <[email protected]>
Date:   2016-05-24T00:32:01Z

    [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s) command natively
    
    ## What changes were proposed in this pull request?
    Currently the command `ADD FILE|JAR <filepath | jarpath>` is supported natively in Spark SQL. However, once the command has run, the added file/jar cannot be looked up with a `LIST FILE(s)|JAR(s)` command, because `LIST` is passed to the Hive command processor in Spark-SQL and is simply not supported in spark-shell. There is no way for users to find out which files/jars have been added to the Spark context.
    Refer to [Hive commands](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli).
    
    This PR is to support following commands:
    `LIST (FILE[s] [filepath ...] | JAR[s] [jarfile ...])`
    
    ### For example:
    ##### LIST FILE(s)
    ```
    scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt")
    res1: org.apache.spark.sql.DataFrame = []
    scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt")
    res2: org.apache.spark.sql.DataFrame = []
    
    scala> spark.sql("list file 
hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt").show(false)
    +----------------------------------------------+
    |result                                        |
    +----------------------------------------------+
    |hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt|
    +----------------------------------------------+
    
    scala> spark.sql("list files").show(false)
    +----------------------------------------------+
    |result                                        |
    +----------------------------------------------+
    |hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt|
    |hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt |
    +----------------------------------------------+
    ```
    
    ##### LIST JAR(s)
    ```
    scala> spark.sql("add jar 
/Users/xinwu/spark/core/src/test/resources/TestUDTF.jar")
    res9: org.apache.spark.sql.DataFrame = [result: int]
    
    scala> spark.sql("list jar TestUDTF.jar").show(false)
    +---------------------------------------------+
    |result                                       |
    +---------------------------------------------+
    |spark://192.168.1.234:50131/jars/TestUDTF.jar|
    +---------------------------------------------+
    
    scala> spark.sql("list jars").show(false)
    +---------------------------------------------+
    |result                                       |
    +---------------------------------------------+
    |spark://192.168.1.234:50131/jars/TestUDTF.jar|
    +---------------------------------------------+
    ```
    ## How was this patch tested?
    New test cases are added for the Spark-SQL, Spark-Shell and SparkContext API code paths.
    
    Author: Xin Wu <[email protected]>
    Author: xin Wu <[email protected]>
    
    Closes #13212 from xwu0226/list_command.

commit 5afd927a47aa7ede3039234f2f7262e2247aa2ae
Author: gatorsmile <[email protected]>
Date:   2016-05-24T01:03:45Z

    [SPARK-15311][SQL] Disallow DML on Regular Tables when Using In-Memory 
Catalog
    
    #### What changes were proposed in this pull request?
    So far, when using In-Memory Catalog, we allow DDL operations for the 
tables. However, the corresponding DML operations are not supported for the 
tables that are neither temporary nor data source tables. For example,
    ```SQL
    CREATE TABLE tabName(i INT, j STRING)
    SELECT * FROM tabName
    INSERT OVERWRITE TABLE tabName SELECT 1, 'a'
    ```
    In the above example, before this fix, we would get a very confusing exception message for either `SELECT` or `INSERT`:
    ```
    org.apache.spark.sql.AnalysisException: unresolved operator 
'SimpleCatalogRelation default, 
CatalogTable(`default`.`tbl`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(i,int,true,None),
 
CatalogColumn(j,string,true,None)),List(),List(),List(),-1,,1463928681802,-1,Map(),None,None,None,List()),
 None;
    ```
    
    This PR issues an appropriate exception in this case. The message will look like:
    ```
    org.apache.spark.sql.AnalysisException: Please enable Hive support when 
operating non-temporary tables: `tbl`;
    ```
    #### How was this patch tested?
    Added a test case in `DDLSuite`.
    
    Author: gatorsmile <[email protected]>
    Author: xiaoli <[email protected]>
    Author: Xiao Li <[email protected]>
    
    Closes #13093 from gatorsmile/selectAfterCreate.

commit a15ca5533db91fefaf3248255a59c4d94eeda1a9
Author: WeichenXu <[email protected]>
Date:   2016-05-24T01:14:48Z

    [SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext 
with SparkSession using builder pattern in python test code
    
    ## What changes were proposed in this pull request?
    
    Replace SQLContext and SparkContext with SparkSession using builder pattern 
in python test code.
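    
    For reference, a hedged Scala sketch of the builder pattern the Python doctests move to (app name and master here are placeholders):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    // Build (or reuse) a session via the builder instead of constructing
    // SQLContext/SparkContext directly.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("doctest-example")
      .getOrCreate()
    
    // The older entry points remain reachable from the session where still needed.
    val sc = spark.sparkContext
    ```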
    
    ## How was this patch tested?
    
    Existing test.
    
    Author: WeichenXu <[email protected]>
    
    Closes #13242 from WeichenXu123/python_doctest_update_sparksession.

commit d207716451f722c899b3845ee454f1e16c094125
Author: gatorsmile <[email protected]>
Date:   2016-05-24T04:07:14Z

    [SPARK-15485][SQL][DOCS] Spark SQL Configuration
    
    #### What changes were proposed in this pull request?
    So far, the Configuration page in the official documentation does not have a section for Spark SQL:
    http://spark.apache.org/docs/latest/configuration.html
    
    For Spark users, the information and default values of these public configuration parameters are very useful. This PR adds the missing section to configuration.html.
    
    rxin yhuai marmbrus
    
    #### How was this patch tested?
    Below is the generated webpage.
    <img width="924" alt="screenshot 2016-05-23 11 35 57" 
src="https://cloud.githubusercontent.com/assets/11567269/15480492/b08fefc4-20da-11e6-9fa2-7cd5b699ed35.png";>
    <img width="914" alt="screenshot 2016-05-23 11 37 38" 
src="https://cloud.githubusercontent.com/assets/11567269/15480499/c5f9482e-20da-11e6-95ff-10821add1af4.png";>
    <img width="923" alt="screenshot 2016-05-23 11 36 11" 
src="https://cloud.githubusercontent.com/assets/11567269/15480506/cbd81644-20da-11e6-9d27-effb716b2fac.png";>
    <img width="920" alt="screenshot 2016-05-23 11 36 18" 
src="https://cloud.githubusercontent.com/assets/11567269/15480511/d013e332-20da-11e6-854a-cf8813c46f36.png";>
    
    Author: gatorsmile <[email protected]>
    
    Closes #13263 from gatorsmile/configurationSQL.

commit fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf
Author: Kazuaki Ishizaki <[email protected]>
Date:   2016-05-24T04:12:34Z

    [SPARK-15285][SQL] Generated SpecificSafeProjection.apply method grows 
beyond 64 KB
    
    ## What changes were proposed in this pull request?
    
    This PR splits the generated code for ```SafeProjection.apply``` by using ```ctx.splitExpressions()```. This is because the large code body for ```NewInstance``` may grow beyond the 64KB bytecode size limit for the ```apply()``` method.
    
    ## How was this patch tested?
    
    Added new tests
    
    Author: Kazuaki Ishizaki <[email protected]>
    
    Closes #13243 from kiszk/SPARK-15285.

commit de726b0d533158d3ca08841bd6976bcfa26ca79d
Author: Andrew Or <[email protected]>
Date:   2016-05-24T04:43:11Z

    Revert "[SPARK-15285][SQL] Generated SpecificSafeProjection.apply method 
grows beyond 64 KB"
    
    This reverts commit fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf.

commit d642b273544bb77ef7f584326aa2d214649ac61b
Author: Daoyuan Wang <[email protected]>
Date:   2016-05-24T06:29:15Z

    [SPARK-15397][SQL] fix string udf locate as hive
    
    ## What changes were proposed in this pull request?
    
    In Hive, `locate("aa", "aaa", 0)` yields 0, `locate("aa", "aaa", 1)` yields 1 and `locate("aa", "aaa", 2)` yields 2, while in Spark, `locate("aa", "aaa", 0)` yields 1, `locate("aa", "aaa", 1)` yields 2 and `locate("aa", "aaa", 2)` yields 0. This comes from a different understanding of the third parameter of the `locate` UDF: it is the 1-based starting position, so when 0 is passed the result should always be 0.
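    
    A quick sanity check in the shell (a sketch, not part of the patch):
    
    ```scala
    // Expected after this fix, matching Hive: 0, 1 and 2.
    spark.sql("SELECT locate('aa', 'aaa', 0), locate('aa', 'aaa', 1), locate('aa', 'aaa', 2)").show()
    ```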
    
    ## How was this patch tested?
    
    tested with modified `StringExpressionsSuite` and `StringFunctionsSuite`
    
    Author: Daoyuan Wang <[email protected]>
    
    Closes #13186 from adrian-wang/locate.

commit 6075f5b4d8e98483d26c31576f58e2229024b4f4
Author: Nick Pentreath <[email protected]>
Date:   2016-05-24T08:02:10Z

    [SPARK-15442][ML][PYSPARK] Add 'relativeError' param to PySpark 
QuantileDiscretizer
    
    This PR adds the `relativeError` param to PySpark's `QuantileDiscretizer` 
to match Scala.
    
    Also cleaned up a duplication of `numBuckets` where the param is both a 
class and instance attribute (I removed the instance attr to match the style of 
params throughout `ml`).
    
    Finally, cleaned up the docs for `QuantileDiscretizer` to reflect that it 
now uses `approxQuantile`.
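    
    For context, the Scala side already exposes the param; a hedged sketch (example data and column names are mine):
    
    ```scala
    import org.apache.spark.ml.feature.QuantileDiscretizer
    import spark.implicits._
    
    val data = (1 to 100).map(_.toDouble).toDF("hour")
    
    val discretizer = new QuantileDiscretizer()
      .setInputCol("hour")
      .setOutputCol("bucket")
      .setNumBuckets(4)
      .setRelativeError(0.01)   // the param this PR exposes on the Python side
    
    discretizer.fit(data).transform(data).show(5)
    ```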
    
    ## How was this patch tested?
    
    A little doctest and built API docs locally to check HTML doc generation.
    
    Author: Nick Pentreath <[email protected]>
    
    Closes #13228 from MLnick/SPARK-15442-py-relerror-param.

commit c24b6b679c3efa053f7de19be73eb36dc70d9930
Author: Liang-Chi Hsieh <[email protected]>
Date:   2016-05-24T16:43:39Z

    [SPARK-11753][SQL][TEST-HADOOP2.2] Make allowNonNumericNumbers option work
    
    ## What changes were proposed in this pull request?
    
    Jackson supports an `allowNonNumericNumbers` option to parse non-standard non-numeric numbers such as "NaN", "Infinity", "INF". The currently used Jackson version (2.5.3) doesn't fully support it. This patch upgrades the library and makes the two ignored tests in `JsonParsingOptionsSuite` pass.
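    
    A usage sketch (the input path is hypothetical, not part of the patch):
    
    ```scala
    // With the option enabled, tokens such as NaN and Infinity parse as doubles
    // instead of failing the whole record.
    val df = spark.read
      .option("allowNonNumericNumbers", "true")
      .json("/tmp/non_numeric_numbers.json")
    
    df.printSchema()
    ```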
    
    ## How was this patch tested?
    
    `JsonParsingOptionsSuite`.
    
    Author: Liang-Chi Hsieh <[email protected]>
    Author: Liang-Chi Hsieh <[email protected]>
    
    Closes #9759 from viirya/fix-json-nonnumric.

commit f8763b80ecd9968566018396c8cdc1851e7f8a46
Author: Dongjoon Hyun <[email protected]>
Date:   2016-05-24T17:08:14Z

    [SPARK-13135] [SQL] Don't print expressions recursively in generated code
    
    ## What changes were proposed in this pull request?
    
    This PR is an up-to-date and slightly improved version of rxin's #11019 for
    - (1) preventing recursive printing of expressions in generated code.
    
    Since that is indeed the major function of this PR, he should be credited for the work he did. In addition to #11019, this PR improves the following in code generation:
    - (2) Improve multiline comment indentation.
    - (3) Reduce the number of empty lines (mainly consecutive empty lines).
    - (4) Remove all space characters on empty lines.
    
    **Example**
    ```scala
    spark.range(1, 1000).select('id+1+2+3, 'id+4+5+6)
    ```
    
    **Before**
    ```
    Generated code:
    /* 001 */ public Object generate(Object[] references) {
    ...
    /* 005 */ /**
    /* 006 */ * Codegend pipeline for
    /* 007 */ * Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 
3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
    /* 008 */ * +- Range 1, 1, 8, 999, [id#0L]
    /* 009 */ */
    ...
    /* 075 */     // PRODUCE: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 
2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
    /* 076 */
    /* 077 */     // PRODUCE: Range 1, 1, 8, 999, [id#0L]
    /* 078 */
    /* 079 */     // initialize Range
    ...
    /* 092 */       // CONSUME: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) 
+ 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
    /* 093 */
    /* 094 */       // CONSUME: WholeStageCodegen
    /* 095 */
    /* 096 */       // (((input[0, bigint, false] + 1) + 2) + 3)
    /* 097 */       // ((input[0, bigint, false] + 1) + 2)
    /* 098 */       // (input[0, bigint, false] + 1)
    ...
    /* 107 */       // (((input[0, bigint, false] + 4) + 5) + 6)
    /* 108 */       // ((input[0, bigint, false] + 4) + 5)
    /* 109 */       // (input[0, bigint, false] + 4)
    ...
    /* 126 */ }
    ```
    
    **After**
    ```
    Generated code:
    /* 001 */ public Object generate(Object[] references) {
    ...
    /* 005 */ /**
    /* 006 */  * Codegend pipeline for
    /* 007 */  * Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 
3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
    /* 008 */  * +- Range 1, 1, 8, 999, [id#0L]
    /* 009 */  */
    ...
    /* 075 */     // PRODUCE: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 
2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
    /* 076 */     // PRODUCE: Range 1, 1, 8, 999, [id#0L]
    /* 077 */     // initialize Range
    ...
    /* 090 */       // CONSUME: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) 
+ 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
    /* 091 */       // CONSUME: WholeStageCodegen
    /* 092 */       // (((input[0, bigint, false] + 1) + 2) + 3)
    ...
    /* 101 */       // (((input[0, bigint, false] + 4) + 5) + 6)
    ...
    /* 118 */ }
    ```
    
    ## How was this patch tested?
    
    Pass the Jenkins tests and see the result of the following command manually.
    ```scala
    scala> spark.range(1, 1000).select('id+1+2+3, 
'id+4+5+6).queryExecution.debug.codegen()
    ```
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    Author: Reynold Xin <rxin@databricks.com>
    
    Author: Dongjoon Hyun <[email protected]>
    
    Closes #13192 from dongjoon-hyun/SPARK-13135.

commit 695d9a0fd461070ee2684b2210fb69d0b6ed1a95
Author: Liang-Chi Hsieh <[email protected]>
Date:   2016-05-24T17:10:41Z

    [SPARK-15433] [PYSPARK] PySpark core test should not use SerDe from 
PythonMLLibAPI
    
    ## What changes were proposed in this pull request?
    
    Currently PySpark core test uses the `SerDe` from `PythonMLLibAPI` which 
includes many MLlib things. It should use `SerDeUtil` instead.
    
    ## How was this patch tested?
    Existing tests.
    
    Author: Liang-Chi Hsieh <[email protected]>
    
    Closes #13214 from viirya/pycore-use-serdeutil.

commit a313a5ae74ae4e7686283657ba56076222317595
Author: Marcelo Vanzin <[email protected]>
Date:   2016-05-24T17:26:55Z

    [SPARK-15405][YARN] Remove unnecessary upload of config archive.
    
    We only need one copy of it. The client code that was uploading the
    second copy just needs to be modified to update the metadata in the
    cache, so that the AM knows where to find the configuration.
    
    Tested by running app on YARN and verifying in the logs only one archive
    is uploaded.
    
    Author: Marcelo Vanzin <[email protected]>
    
    Closes #13232 from vanzin/SPARK-15405.

commit 784cc07d1675eb9e0a387673cf86874e1bfc10f9
Author: wangyang <[email protected]>
Date:   2016-05-24T18:03:12Z

    [SPARK-15388][SQL] Fix spark sql CREATE FUNCTION with hive 1.2.1
    
    ## What changes were proposed in this pull request?
    
    spark.sql("CREATE FUNCTION myfunc AS 'com.haizhi.bdp.udf.UDFGetGeoCode'") 
throws 
"org.apache.hadoop.hive.ql.metadata.HiveException:MetaException(message:NoSuchObjectException(message:Function
 default.myfunc does not exist))" with hive 1.2.1.
    
    I think this was introduced by PR #12853. This fixes it by catching `Exception` (not `NoSuchObjectException`) and string matching.
    
    ## How was this patch tested?
    
    added a unit test and also tested it manually
    
    Author: wangyang <[email protected]>
    
    Closes #13177 from wangyang1992/fixCreateFunc2.

commit be99a99fe7976419d727c0cc92e872aa4af58bf1
Author: Dongjoon Hyun <[email protected]>
Date:   2016-05-24T18:09:54Z

    [MINOR][CORE][TEST] Update obsolete `takeSample` test case.
    
    ## What changes were proposed in this pull request?
    
    This PR fixes some obsolete comments and assertions in the `takeSample` test case of `RDDSuite.scala`.
    
    ## How was this patch tested?
    
    This fixes the testcase only.
    
    Author: Dongjoon Hyun <[email protected]>
    
    Closes #13260 from dongjoon-hyun/SPARK-15481.

commit 20900e5feced76e87f0a12823d0e3f07e082105f
Author: Nick Pentreath <[email protected]>
Date:   2016-05-24T18:34:06Z

    [SPARK-15502][DOC][ML][PYSPARK] add guide note that ALS only supports 
integer ids
    
    This PR adds a note to clarify that the ML API for ALS only supports 
integers for user/item ids, and that other types for these columns can be used 
but the ids must fall within integer range.
    
    (Refer [SPARK-14891](https://issues.apache.org/jira/browse/SPARK-14891)).
    
    Also cleaned up a reference to `mllib` in the ML doc.
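    
    A minimal Scala sketch of the constraint the note documents (toy data, assuming a `spark` session):
    
    ```scala
    import org.apache.spark.ml.recommendation.ALS
    import spark.implicits._
    
    // user/item id columns are integers (other numeric types must still fit in Int range)
    val ratings = Seq(
      (0, 10, 4.0f),
      (0, 20, 1.0f),
      (1, 10, 5.0f)
    ).toDF("userId", "itemId", "rating")
    
    val als = new ALS()
      .setUserCol("userId")
      .setItemCol("itemId")
      .setRatingCol("rating")
    
    val model = als.fit(ratings)
    ```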
    
    ## How was this patch tested?
    Built and viewed User Guide doc locally.
    
    Author: Nick Pentreath <[email protected]>
    
    Closes #13278 from MLnick/SPARK-15502-als-int-id-doc-note.

commit e631b819fe348729aab062207a452b8f1d1511bd
Author: Tathagata Das <[email protected]>
Date:   2016-05-24T21:27:39Z

    [SPARK-15458][SQL][STREAMING] Disable schema inference for streaming 
datasets on file streams
    
    ## What changes were proposed in this pull request?
    
    If the user relies on the schema being inferred in file streams, things can break easily for multiple reasons:
    - accidentally running on a directory which has no data
    - the schema changing underneath
    - on restart, the query will infer the schema again, and may unexpectedly infer an incorrect schema, as the files in the directory may be different at the time of the restart.
    
    To avoid these complicated scenarios, for Spark 2.0, we are going to disable schema inference by default behind a config, so that the user is forced to consider explicitly what schema they want, rather than having the system try to infer it and run into weird corner cases.
    
    In this PR, I introduce a SQLConf that determines whether schema inference 
for file streams is allowed or not. It is disabled by default.
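    
    A hedged sketch of the explicit-schema usage this change steers users toward (directory path and fields are illustrative):
    
    ```scala
    import org.apache.spark.sql.types._
    
    // Provide the schema up front instead of relying on inference over the
    // stream's input directory.
    val schema = new StructType()
      .add("ts", TimestampType)
      .add("value", DoubleType)
    
    val events = spark.readStream
      .schema(schema)
      .json("/tmp/streaming-input")
    ```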
    
    ## How was this patch tested?
    Updated unit tests that test error behavior with and without schema 
inference enabled.
    
    Author: Tathagata Das <[email protected]>
    
    Closes #13238 from tdas/SPARK-15458.

commit f08bf587b1913c6cc8ecb34c45331cf4750961c9
Author: Dongjoon Hyun <[email protected]>
Date:   2016-05-25T01:55:23Z

    [SPARK-15512][CORE] repartition(0) should raise IllegalArgumentException
    
    ## What changes were proposed in this pull request?
    
    Previously, SPARK-8893 added constraints requiring a positive number of partitions for repartition/coalesce operations in general. This PR adds the one missing part of that and adds two explicit test cases.
    
    **Before**
    ```scala
    scala> sc.parallelize(1 to 5).coalesce(0)
    java.lang.IllegalArgumentException: requirement failed: Number of 
partitions (0) must be positive.
    ...
    scala> sc.parallelize(1 to 5).repartition(0).collect()
    res1: Array[Int] = Array()   // empty
    scala> spark.sql("select 1").coalesce(0)
    res2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int]
    scala> spark.sql("select 1").coalesce(0).collect()
    java.lang.IllegalArgumentException: requirement failed: Number of 
partitions (0) must be positive.
    scala> spark.sql("select 1").repartition(0)
    res3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int]
    scala> spark.sql("select 1").repartition(0).collect()
    res4: Array[org.apache.spark.sql.Row] = Array()  // empty
    ```
    
    **After**
    ```scala
    scala> sc.parallelize(1 to 5).coalesce(0)
    java.lang.IllegalArgumentException: requirement failed: Number of 
partitions (0) must be positive.
    ...
    scala> sc.parallelize(1 to 5).repartition(0)
    java.lang.IllegalArgumentException: requirement failed: Number of 
partitions (0) must be positive.
    ...
    scala> spark.sql("select 1").coalesce(0)
    java.lang.IllegalArgumentException: requirement failed: Number of 
partitions (0) must be positive.
    ...
    scala> spark.sql("select 1").repartition(0)
    java.lang.IllegalArgumentException: requirement failed: Number of 
partitions (0) must be positive.
    ...
    ```
    
    ## How was this patch tested?
    
    Pass the Jenkins tests with new testcases.
    
    Author: Dongjoon Hyun <[email protected]>
    
    Closes #13282 from dongjoon-hyun/SPARK-15512.

commit 14494da87bdf057d2d2f796b962a4d8bc4747d31
Author: Reynold Xin <[email protected]>
Date:   2016-05-25T03:55:47Z

    [SPARK-15518] Rename various scheduler backend for consistency
    
    ## What changes were proposed in this pull request?
    This patch renames various scheduler backends to make them consistent:
    
    - LocalScheduler -> LocalSchedulerBackend
    - AppClient -> StandaloneAppClient
    - AppClientListener -> StandaloneAppClientListener
    - SparkDeploySchedulerBackend -> StandaloneSchedulerBackend
    - CoarseMesosSchedulerBackend -> MesosCoarseGrainedSchedulerBackend
    - MesosSchedulerBackend -> MesosFineGrainedSchedulerBackend
    
    ## How was this patch tested?
    Updated test cases to reflect the name change.
    
    Author: Reynold Xin <[email protected]>
    
    Closes #13288 from rxin/SPARK-15518.

commit 4acababcaba567c85f3be0d5e939d99119b82d1d
Author: Parth Brahmbhatt <[email protected]>
Date:   2016-05-25T03:58:20Z

    [SPARK-15365][SQL] When table size statistics are not available from 
metastore, we should fallback to HDFS
    
    ## What changes were proposed in this pull request?
    Currently, if a table is used in a join operation we rely on the size returned by the metastore to decide whether we can convert the operation to a broadcast join. This optimization only kicks in for tables that have their statistics available in the metastore. Hive generally rolls over to HDFS if the statistics are not available directly from the metastore, and this seems like a reasonable choice to adopt given the optimization benefit of using broadcast joins.
    
    ## How was this patch tested?
    I have executed queries locally to test.
    
    Author: Parth Brahmbhatt <[email protected]>
    
    Closes #13150 from Parth-Brahmbhatt/SPARK-15365.

commit 50b660d725269dc0c11e0d350ddd7fc8b19539a0
Author: Wenchen Fan <[email protected]>
Date:   2016-05-25T04:23:39Z

    [SPARK-15498][TESTS] fix slow tests
    
    ## What changes were proposed in this pull request?
    
    This PR fixes 3 slow tests:
    
    1. `ParquetQuerySuite.read/write wide table`: This is not a good unit test as it runs for more than 5 minutes. This PR removes it and adds a new regression test in `CodeGenerationSuite`, which is more "unit".
    2. `ParquetQuerySuite.returning batch for wide table`: reduce the threshold and use a smaller data size.
    3. `DatasetSuite.SPARK-14554: Dataset.map may generate wrong java code for wide table`: improving `CodeFormatter.format` (introduced at https://github.com/apache/spark/pull/12979) dramatically speeds it up.
    
    ## How was this patch tested?
    
    N/A
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #13273 from cloud-fan/test.

commit c9c1c0e54d34773ac2cf5457fe5925559ece36c7
Author: Shixiong Zhu <[email protected]>
Date:   2016-05-25T05:01:40Z

    [SPARK-15508][STREAMING][TESTS] Fix flaky test: 
JavaKafkaStreamSuite.testKafkaStream
    
    ## What changes were proposed in this pull request?
    
    `JavaKafkaStreamSuite.testKafkaStream` assumes that when `sent.size == result.size`, the contents of `sent` and `result` are the same. However, that's not true: the content of `result` may not yet be the final content.
    
    This PR modifies the test to keep retrying the assertions even if the contents of `sent` and `result` are not the same.
    
    Here is the failure in Jenkins: 
http://spark-tests.appspot.com/tests/org.apache.spark.streaming.kafka.JavaKafkaStreamSuite/testKafkaStream
    
    ## How was this patch tested?
    
    Jenkins unit tests.
    
    Author: Shixiong Zhu <[email protected]>
    
    Closes #13281 from zsxwing/flaky-kafka-test.

commit cd9f16906cabd012b7676eb0f524e68a9cbe4db1
Author: Holden Karau <[email protected]>
Date:   2016-05-25T05:20:00Z

    [SPARK-15412][PYSPARK][SPARKR][DOCS] Improve linear isotonic regression pydoc & doc build instructions
    
    ## What changes were proposed in this pull request?
    
    PySpark: Add links to the predictors from the models in regression.py, and improve the linear and isotonic pydoc in minor ways.
    User guide / R: Switch the installed package list to be enough to build the R docs on a "fresh" Ubuntu install, and add sudo to match the rest of the commands.
    User Guide: Add a note about using gem2.0 for systems with both 1.9 and 2.0 (e.g. some Ubuntu versions, but maybe more).
    
    ## How was this patch tested?
    
    built pydocs locally, tested new user build instructions
    
    Author: Holden Karau <[email protected]>
    
    Closes #13199 from 
holdenk/SPARK-15412-improve-linear-isotonic-regression-pydoc.

commit 9082b7968ad952e05fc6f4feb499febef6aa45a7
Author: Krishna Kalyan <[email protected]>
Date:   2016-05-25T05:21:52Z

    [SPARK-12071][DOC] Document the behaviour of NA in R
    
    ## What changes were proposed in this pull request?
    
    Under the "Upgrading From SparkR 1.5.x to 1.6.x" section, added the information that Spark SQL converts `NA` in R to `null`.
    
    ## How was this patch tested?
    
    Document update, no tests.
    
    Author: Krishna Kalyan <[email protected]>
    
    Closes #13268 from krishnakalyan3/spark-12071-1.

----

