GitHub user listenLearning opened a pull request: https://github.com/apache/spark/pull/19335
mapPartitions API: Hello, I recently ran into a problem during development: when I use the mapPartitions API to write data to HBase, I get a "partition not found" error, immediately followed by a "failed to get broadcast variable" error. Why does this happen? The code and the error are below.

```scala
def ASpan(span: DataFrame, time: String): Unit = {
  try {
    span.mapPartitions(iter => {
      iter.map(line => {
        val put = new Put(Bytes.toBytes(CreateRowkey.Bit16(line.getString(0)) + "_101301"))
        put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("CALLDT_TIME1PER_30"), Bytes.toBytes(line.getString(1)))
        put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("CALLDT_TIME2PER_30"), Bytes.toBytes(line.getString(2)))
        put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("CALLDT_TIME3PER_30"), Bytes.toBytes(line.getString(3)))
        put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("CALLDT_TIME4PER_30"), Bytes.toBytes(line.getString(4)))
        put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("CALLDT_HASCALL_1"), Bytes.toBytes(line.getLong(5).toString))
        put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("CALLDT_HASCALL_3"), Bytes.toBytes(line.getLong(6).toString))
        put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("CALLDT_HASCALL_6"), Bytes.toBytes(line.getLong(7).toString))
        put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("CALLDT_NOCALL_1"), Bytes.toBytes(line.getLong(8).toString))
        put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("CALLDT_NOCALL_3"), Bytes.toBytes(line.getLong(9).toString))
        put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("CALLDT_NOCALL_6"), Bytes.toBytes(line.getLong(10).toString))
        put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("DB_TIME"), Bytes.toBytes(time))
        (new ImmutableBytesWritable, put)
      })
    }).saveAsNewAPIHadoopDataset(shuliStreaming.indexTable)
  } catch {
    case e: Exception =>
      // The log tag passed here was a Chinese string literal that was corrupted
      // in the archive's encoding; a placeholder is used in its place.
      shuliStreaming.WriteIn.writeLog("shuli", time, "<log tag>", e)
      e.printStackTrace()
      println("<log tag>" + e)
  }
}
```

Error:

```
17/09/24 23:04:17 INFO spark.CacheManager: Partition rdd_11_1 not found, computing it
17/09/24 23:04:17 INFO rdd.HadoopRDD: Input split: hdfs://nameservice1/data/input/common/phlibrary/OFFLINEPHONELIBRARY.dat:1146925+1146926
17/09/24 23:04:17 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1
17/09/24 23:04:17 ERROR executor.Executor: Exception in task 1.0 in stage 250804.0 (TID 3190467)
java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1223)
    at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
    at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
    at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
    at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
    at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
    at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:144)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:212)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19335.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19335

----

commit 0bb8d1f30a7223d2844aa6e733aaacfe2d37eb82
Author: Nick Pentreath <ni...@za.ibm.com>
Date:
2017-08-16T08:54:28Z

[SPARK-13969][ML] Add FeatureHasher transformer

This PR adds a `FeatureHasher` transformer, modeled on [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html) and [Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit/wiki/Feature-Hashing-and-Extraction). The transformer operates on multiple input columns in one pass. Current behavior is:

* for numeric columns, the values are assumed to be real-valued; the feature index is `hash(columnName)` and the feature value is `feature_value`
* for string columns, the values are assumed to be categorical; the feature index is `hash(column_name=feature_value)` and the feature value is `1.0`
* for hash collisions, feature values are summed
* `null` (missing) values are ignored

The following dataframe illustrates the basic semantics:

```
+---+------+-----+---------+------+-----------------------------------------+
|int|double|float|stringNum|string|features                                 |
+---+------+-----+---------+------+-----------------------------------------+
|3  |4.0   |5.0  |1        |foo   |(16,[0,8,11,12,15],[5.0,3.0,1.0,4.0,1.0])|
|6  |7.0   |8.0  |2        |bar   |(16,[0,8,11,12,15],[8.0,6.0,1.0,7.0,1.0])|
+---+------+-----+---------+------+-----------------------------------------+
```

## How was this patch tested?

New unit tests and manual experiments.

Author: Nick Pentreath <ni...@za.ibm.com>

Closes #18513 from MLnick/FeatureHasher.

commit adf005dabe3b0060033e1eeaedbab31a868efc8c
Author: John Lee <jl...@yahoo-inc.com>
Date: 2017-08-16T14:44:09Z

[SPARK-21656][CORE] spark dynamic allocation should not idle timeout executors when tasks still to run

## What changes were proposed in this pull request?

Right now Spark lets go of executors when they have been idle for 60s (or a configurable time). I have seen Spark let them go when they were idle but still really needed.
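The guard being proposed here, keep an idle executor if the outstanding tasks still need it, can be sketched in plain Scala (a hypothetical helper with illustrative numbers, not Spark's actual `ExecutorAllocationManager` code):

```scala
// Hypothetical sketch: an idle executor should only be released when the
// remaining executors still cover the outstanding work.
object IdleTimeoutSketch {
  /** True if removing one executor would drop us below what pending tasks need. */
  def stillNeeded(currentExecutors: Int,
                  pendingTasks: Int,
                  tasksPerExecutor: Int): Boolean = {
    val executorsNeeded = math.ceil(pendingTasks.toDouble / tasksPerExecutor).toInt
    currentExecutors - 1 < executorsNeeded
  }

  def main(args: Array[String]): Unit = {
    // 80,000 pending tasks easily justify keeping 10 executors alive.
    println(stillNeeded(currentExecutors = 10, pendingTasks = 80000, tasksPerExecutor = 4)) // true
    // With nothing left to run, an idle executor can be let go.
    println(stillNeeded(currentExecutors = 10, pendingTasks = 0, tasksPerExecutor = 4))     // false
  }
}
```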
I have seen this issue when the scheduler was waiting for node locality, which takes longer than the default idle timeout. In these jobs the number of executors drops very low (fewer than 10) while there are still around 80,000 tasks to run. We should consider not allowing executors to idle timeout if they are still needed according to the number of tasks to be run.

## How was this patch tested?

Tested by manually adding executors to the `executorIdsToBeRemoved` list and checking whether those executors were removed when there were a lot of tasks and a high `numExecutorsTarget` value.

Code used in `ExecutorAllocationManager.start()`:

```
start_time = clock.getTimeMillis()
```

In `ExecutorAllocationManager.schedule()`:

```
val executorIdsToBeRemoved = ArrayBuffer[String]()
if (now > start_time + 1000 * 60 * 2) {
  logInfo("--- REMOVING 1/2 of the EXECUTORS ---")
  start_time += 1000 * 60 * 100
  var counter = 0
  for (x <- executorIds) {
    counter += 1
    if (counter == 2) {
      counter = 0
      executorIdsToBeRemoved += x
    }
  }
}
```

Author: John Lee <jl...@yahoo-inc.com>

Closes #18874 from yoonlee95/SPARK-21656.

commit 1cce1a3b639c5c793d43fa51a8ec3e0fef622a40
Author: 10129659 <chen.yans...@zte.com.cn>
Date: 2017-08-16T16:12:20Z

[SPARK-21603][SQL] The wholestage codegen will be much slower than when it is disabled if the function is too long

## What changes were proposed in this pull request?

Disable whole-stage codegen when the generated function is longer than the maximum length set by the spark.sql.codegen.MaxFunctionLength parameter, because when the function is too long it will not be JIT-optimized.
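The proposed fallback can be sketched as a simple gate (names are illustrative; only the config key `spark.sql.codegen.MaxFunctionLength` comes from the PR itself):

```scala
// Hypothetical sketch: fall back to the non-codegen path when the generated
// function exceeds the configured limit, since the JIT refuses to compile
// very large methods and interpreted generated code is slower than no codegen.
object CodegenGateSketch {
  def useWholeStageCodegen(generatedFunctionLines: Int, maxFunctionLength: Int): Boolean =
    generatedFunctionLines <= maxFunctionLength

  def main(args: Array[String]): Unit = {
    println(useWholeStageCodegen(generatedFunctionLines = 500, maxFunctionLength = 10000))   // true
    println(useWholeStageCodegen(generatedFunctionLines = 50000, maxFunctionLength = 10000)) // false
  }
}
```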
A benchmark shows a roughly 10x slowdown when the generated function is too long:

```scala
ignore("max function length of wholestagecodegen") {
  val N = 20 << 15
  val benchmark = new Benchmark("max function length of wholestagecodegen", N)
  def f(): Unit = sparkSession.range(N)
    .selectExpr(
      "id",
      "(id & 1023) as k1",
      "cast(id & 1023 as double) as k2",
      "cast(id & 1023 as int) as k3",
      "case when id > 100 and id <= 200 then 1 else 0 end as v1",
      "case when id > 200 and id <= 300 then 1 else 0 end as v2",
      "case when id > 300 and id <= 400 then 1 else 0 end as v3",
      "case when id > 400 and id <= 500 then 1 else 0 end as v4",
      "case when id > 500 and id <= 600 then 1 else 0 end as v5",
      "case when id > 600 and id <= 700 then 1 else 0 end as v6",
      "case when id > 700 and id <= 800 then 1 else 0 end as v7",
      "case when id > 800 and id <= 900 then 1 else 0 end as v8",
      "case when id > 900 and id <= 1000 then 1 else 0 end as v9",
      "case when id > 1000 and id <= 1100 then 1 else 0 end as v10",
      "case when id > 1100 and id <= 1200 then 1 else 0 end as v11",
      "case when id > 1200 and id <= 1300 then 1 else 0 end as v12",
      "case when id > 1300 and id <= 1400 then 1 else 0 end as v13",
      "case when id > 1400 and id <= 1500 then 1 else 0 end as v14",
      "case when id > 1500 and id <= 1600 then 1 else 0 end as v15",
      "case when id > 1600 and id <= 1700 then 1 else 0 end as v16",
      "case when id > 1700 and id <= 1800 then 1 else 0 end as v17",
      "case when id > 1800 and id <= 1900 then 1 else 0 end as v18")
    .groupBy("k1", "k2", "k3")
    .sum()
    .collect()

  benchmark.addCase(s"codegen = F") { iter =>
    sparkSession.conf.set("spark.sql.codegen.wholeStage", "false")
    f()
  }

  benchmark.addCase(s"codegen = T") { iter =>
    sparkSession.conf.set("spark.sql.codegen.wholeStage", "true")
    sparkSession.conf.set("spark.sql.codegen.MaxFunctionLength", "10000")
    f()
  }

  benchmark.run()

  /*
  Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14 on Windows 7 6.1
  Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
  max function length of wholestagecodegen: Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
  ------------------------------------------------------------------------------------------------
  codegen = F                                      443 /  507         1.5         676.0       1.0X
  codegen = T                                     3279 / 3283         0.2        5002.6       0.1X
  */
}
```

## How was this patch tested?

Run the unit test.

Author: 10129659 <chen.yans...@zte.com.cn>

Closes #18810 from eatoncys/codegen.

commit 7add4e982184c46990995dcd1326e6caad7adf6e
Author: Marco Gaido <mga...@hortonworks.com>
Date: 2017-08-16T16:40:04Z

[SPARK-21738] Thriftserver doesn't cancel jobs when session is closed

## What changes were proposed in this pull request?

When a session is closed, the Thrift server doesn't cancel jobs that may still be running. This is a huge waste of resources. This PR addresses the problem by canceling the pending jobs when a session is closed.

## How was this patch tested?

The patch was tested manually.

Author: Marco Gaido <mga...@hortonworks.com>

Closes #18951 from mgaido91/SPARK-21738.

commit a0345cbebe23537df4084cf90f9d47425e550150
Author: Peng Meng <peng.m...@intel.com>
Date: 2017-08-16T18:05:20Z

[SPARK-21680][ML][MLLIB] optimize Vector compress

## What changes were proposed in this pull request?

When using Vector.compressed to convert a Vector to a SparseVector, the performance is very low compared with Vector.toSparse. This is because Vector.compressed has to scan the values three times, while Vector.toSparse only needs two passes. When the vector is long, there is a significant performance difference between these two methods.

## How was this patch tested?

The existing UT.

Author: Peng Meng <peng.m...@intel.com>

Closes #18899 from mpjlu/optVectorCompress.

commit b8ffb51055108fd606b86f034747006962cd2df3
Author: Eyal Farago <e...@nrgene.com>
Date: 2017-08-17T01:21:50Z

[SPARK-3151][BLOCK MANAGER] DiskStore.getBytes fails for files larger than 2GB

## What changes were proposed in this pull request?
Introduced `DiskBlockData`, a new implementation of `BlockData` representing a whole file. This is somewhat related to [SPARK-6236](https://issues.apache.org/jira/browse/SPARK-6236) as well.

The class follows the implementation of `EncryptedBlockData`, just without the encryption. Hence:

* `toInputStream` is implemented using a `FileInputStream` (todo: the encrypted version actually uses `Channels.newInputStream`; not sure if that is the right choice here)
* `toNetty` is implemented in terms of `io.netty.channel.DefaultFileRegion`
* `toByteBuffer` fails for files larger than 2GB (the same behavior as the original code, just postponed a bit); it also respects the same configuration keys defined by the original code to choose between memory mapping and a simple file read

## How was this patch tested?

Added tests to DiskStoreSuite and MemoryManagerSuite.

Author: Eyal Farago <e...@nrgene.com>

Closes #18855 from eyalfa/SPARK-3151.

commit a45133b826984b7856e16d754e01c82702016af7
Author: Wenchen Fan <wenc...@databricks.com>
Date: 2017-08-17T05:37:45Z

[SPARK-21743][SQL] top-most limit should not cause memory leak

## What changes were proposed in this pull request?

For a top-most limit, we use a special operator to execute it: `CollectLimitExec`. `CollectLimitExec` retrieves `n` (the limit) rows from each partition of the child plan output; see https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L311. It's very likely that we don't exhaust the child plan output. This is fine when whole-stage codegen is off, as the child plan will release its resources via a task completion listener. However, when whole-stage codegen is on, the resources can only be released if all output is consumed.

To fix this memory leak, one simple approach is: when `CollectLimitExec` retrieves `n` rows from the child plan output, the child plan output should only have `n` rows; then the output is exhausted and the resources are released.
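The idea can be sketched with plain Scala iterators (a toy stand-in, not Spark's `CollectLimitExec`): if the upstream iterator is itself capped at `n` elements, then consuming the limit fully exhausts it, so exhaustion-triggered cleanup runs.

```scala
// Toy sketch: a capped upstream plus an exhaustion callback standing in for
// Spark's completion-based resource cleanup.
object LimitLeakSketch {
  var upstreamClosed = false

  /** An iterator that runs a cleanup callback once it is exhausted. */
  def withCleanup[T](it: Iterator[T])(onExhausted: => Unit): Iterator[T] =
    new Iterator[T] {
      def hasNext: Boolean = {
        val h = it.hasNext
        if (!h) onExhausted  // fires only when nothing is left to consume
        h
      }
      def next(): T = it.next()
    }

  def main(args: Array[String]): Unit = {
    // Capping the child at 5 ("LocalLimit") means collecting 5 rows exhausts it.
    val child = withCleanup((1 to 1000000).iterator.take(5)) { upstreamClosed = true }
    val collected = child.toList
    println(collected)       // List(1, 2, 3, 4, 5)
    println(upstreamClosed)  // true: cleanup ran because the capped child was exhausted
  }
}
```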
This can be done by wrapping the child plan with `LocalLimit`.

## How was this patch tested?

A regression test.

Author: Wenchen Fan <wenc...@databricks.com>

Closes #18955 from cloud-fan/leak.

commit d695a528bef6291e0e1657f4f3583a8371abd7c8
Author: Hideaki Tanaka <tana...@amazon.com>
Date: 2017-08-17T14:02:13Z

[SPARK-21642][CORE] Use FQDN for DRIVER_HOST_ADDRESS instead of ip address

## What changes were proposed in this pull request?

The patch lets the Spark web UI use the FQDN as its hostname instead of the IP address.

In the current implementation, the IP address of the driver host is set to DRIVER_HOST_ADDRESS. This becomes a problem when we enable SSL using the "spark.ssl.enabled", "spark.ssl.trustStore" and "spark.ssl.keyStore" properties. When we configure these properties, the Spark web UI is launched with SSL enabled and the HTTPS server is configured with the custom SSL certificate set in these properties. In this case, the client gets a javax.net.ssl.SSLPeerUnverifiedException when it accesses the Spark web UI, because it fails to verify the SSL certificate (the Common Name of the SSL cert does not match DRIVER_HOST_ADDRESS). To avoid the exception, we should use the FQDN of the driver host for DRIVER_HOST_ADDRESS.

Error message the client gets when accessing the Spark web UI:

javax.net.ssl.SSLPeerUnverifiedException: Certificate for <10.102.138.239> doesn't match any of the subject alternative names: []

## How was this patch tested?

Manual tests.

Author: Hideaki Tanaka <tana...@amazon.com>

Closes #18846 from thideeeee/SPARK-21642.

commit b83b502c4189c571bda776511c6f7541c6067aae
Author: Kent Yao <yaooq...@hotmail.com>
Date: 2017-08-17T16:24:45Z

[SPARK-21428] Turn IsolatedClientLoader off while using builtin Hive jars for reusing CliSessionState

## What changes were proposed in this pull request?

Set isolated to false while using builtin Hive jars and `SessionState.get` returns a `CliSessionState` instance.

## How was this patch tested?
1. Unit tests
2. Manually verified: `hive.exec.scratchdir` was only created once because of reusing the CliSessionState

```java
➜  spark git:(SPARK-21428) ✗ bin/spark-sql --conf spark.sql.hive.metastore.jars=builtin
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/07/16 23:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/07/16 23:59:27 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
17/07/16 23:59:27 INFO ObjectStore: ObjectStore, initialize called
17/07/16 23:59:28 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
17/07/16 23:59:28 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
17/07/16 23:59:29 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
17/07/16 23:59:30 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/07/16 23:59:30 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/07/16 23:59:31 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/07/16 23:59:31 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/07/16 23:59:31 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
17/07/16 23:59:31 INFO ObjectStore: Initialized ObjectStore
17/07/16 23:59:31 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/07/16 23:59:31 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
17/07/16 23:59:32 INFO HiveMetaStore: Added admin role in metastore
17/07/16 23:59:32 INFO HiveMetaStore: Added public role in metastore
17/07/16 23:59:32 INFO HiveMetaStore: No user is added in admin role, since config is empty
17/07/16 23:59:32 INFO HiveMetaStore: 0: get_all_databases
17/07/16 23:59:32 INFO audit: ugi=Kent ip=unknown-ip-addr cmd=get_all_databases
17/07/16 23:59:32 INFO HiveMetaStore: 0: get_functions: db=default pat=*
17/07/16 23:59:32 INFO audit: ugi=Kent ip=unknown-ip-addr cmd=get_functions: db=default pat=*
17/07/16 23:59:32 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
17/07/16 23:59:32 INFO SessionState: Created local directory: /var/folders/k2/04p4k4ws73l6711h_mz2_tq00000gn/T/beea7261-221a-4711-89e8-8b12a9d37370_resources
17/07/16 23:59:32 INFO SessionState: Created HDFS directory: /tmp/hive/Kent/beea7261-221a-4711-89e8-8b12a9d37370
17/07/16 23:59:32 INFO SessionState: Created local directory: /var/folders/k2/04p4k4ws73l6711h_mz2_tq00000gn/T/Kent/beea7261-221a-4711-89e8-8b12a9d37370
17/07/16 23:59:32 INFO SessionState: Created HDFS directory: /tmp/hive/Kent/beea7261-221a-4711-89e8-8b12a9d37370/_tmp_space.db
17/07/16 23:59:32 INFO SparkContext: Running Spark version 2.3.0-SNAPSHOT
17/07/16 23:59:32 INFO SparkContext: Submitted application: SparkSQL::10.0.0.8
17/07/16 23:59:32 INFO SecurityManager: Changing view acls to: Kent
17/07/16 23:59:32 INFO SecurityManager: Changing modify acls to: Kent
17/07/16 23:59:32 INFO SecurityManager: Changing view acls groups to:
17/07/16 23:59:32 INFO SecurityManager: Changing modify acls groups to:
17/07/16 23:59:32 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Kent); groups with view permissions: Set(); users with modify permissions: Set(Kent); groups with modify permissions: Set()
17/07/16 23:59:33 INFO Utils: Successfully started service 'sparkDriver' on port 51889.
17/07/16 23:59:33 INFO SparkEnv: Registering MapOutputTracker
17/07/16 23:59:33 INFO SparkEnv: Registering BlockManagerMaster
17/07/16 23:59:33 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/07/16 23:59:33 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/07/16 23:59:33 INFO DiskBlockManager: Created local directory at /private/var/folders/k2/04p4k4ws73l6711h_mz2_tq00000gn/T/blockmgr-9cfae28a-01e9-4c73-a1f1-f76fa52fc7a5
17/07/16 23:59:33 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
17/07/16 23:59:33 INFO SparkEnv: Registering OutputCommitCoordinator
17/07/16 23:59:33 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/07/16 23:59:33 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.0.8:4040
17/07/16 23:59:33 INFO Executor: Starting executor ID driver on host localhost
17/07/16 23:59:33 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 51890.
17/07/16 23:59:33 INFO NettyBlockTransferService: Server created on 10.0.0.8:51890
17/07/16 23:59:33 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/07/16 23:59:33 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.0.8, 51890, None)
17/07/16 23:59:33 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.0.8:51890 with 366.3 MB RAM, BlockManagerId(driver, 10.0.0.8, 51890, None)
17/07/16 23:59:33 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.0.8, 51890, None)
17/07/16 23:59:33 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.0.8, 51890, None)
17/07/16 23:59:34 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/Users/Kent/Documents/spark/spark-warehouse').
17/07/16 23:59:34 INFO SharedState: Warehouse path is 'file:/Users/Kent/Documents/spark/spark-warehouse'.
17/07/16 23:59:34 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
17/07/16 23:59:34 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.2) is /user/hive/warehouse
17/07/16 23:59:34 INFO HiveMetaStore: 0: get_database: default
17/07/16 23:59:34 INFO audit: ugi=Kent ip=unknown-ip-addr cmd=get_database: default
17/07/16 23:59:34 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.2) is /user/hive/warehouse
17/07/16 23:59:34 INFO HiveMetaStore: 0: get_database: global_temp
17/07/16 23:59:34 INFO audit: ugi=Kent ip=unknown-ip-addr cmd=get_database: global_temp
17/07/16 23:59:34 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
17/07/16 23:59:34 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.2) is /user/hive/warehouse
17/07/16 23:59:34 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
spark-sql>
```

cc cloud-fan gatorsmile

Author: Kent Yao <yaooq...@hotmail.com>
Author: hzyaoqin <hzyao...@corp.netease.com>

Closes #18648 from yaooqinn/SPARK-21428.

commit ae9e42479253a9cd30423476405377f2d7952137
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2017-08-17T20:00:37Z

[SQL][MINOR][TEST] Set spark.unsafe.exceptionOnMemoryLeak to true

## What changes were proposed in this pull request?

When running in IntelliJ, we are unable to capture the exception from memory-leak detection.

> org.apache.spark.executor.Executor: Managed memory leak detected

This PR explicitly sets `spark.unsafe.exceptionOnMemoryLeak` in the SparkConf when building the SparkSession, instead of reading it from system properties.

## How was this patch tested?

N/A

Author: gatorsmile <gatorsm...@gmail.com>

Closes #18967 from gatorsmile/setExceptionOnMemoryLeak.
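The precedence change just described can be illustrated with a minimal sketch (a toy config lookup, not `SparkConf` itself): a value set explicitly on the conf wins over a JVM system property, so the test's expectation no longer depends on how the JVM was launched.

```scala
// Toy sketch of config precedence: explicit setting > system property > default.
object ConfPrecedenceSketch {
  def effectiveValue(explicit: Option[String], key: String, default: String): String =
    explicit.orElse(Option(System.getProperty(key))).getOrElse(default)

  def main(args: Array[String]): Unit = {
    // The explicit setting wins regardless of system properties.
    println(effectiveValue(Some("true"), "spark.unsafe.exceptionOnMemoryLeak", "false"))
    // With nothing set anywhere, the default applies.
    println(effectiveValue(None, "an.unset.illustrative.key", "false"))
  }
}
```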
commit 6aad02d03632df964363a144c96371e86f7b207e
Author: Takeshi Yamamuro <yamam...@apache.org>
Date: 2017-08-17T20:47:14Z

[SPARK-18394][SQL] Make an AttributeSet.toSeq output order consistent

## What changes were proposed in this pull request?

This PR sorts output attributes by name and exprId in `AttributeSet.toSeq` to make the order consistent. If the order differs, Spark may generate different code and then miss the cache in `CodeGenerator`; e.g., `GenerateColumnAccessor` generates code depending on the input attribute order.

## How was this patch tested?

Added tests in `AttributeSetSuite` and manually checked that the cache worked well for the query given in the JIRA.

Author: Takeshi Yamamuro <yamam...@apache.org>

Closes #18959 from maropu/SPARK-18394.

commit bfdc361ededb2ed4e323f075fdc40ed004b7f41d
Author: ArtRand <ar...@soe.ucsc.edu>
Date: 2017-08-17T22:47:07Z

[SPARK-16742] Mesos Kerberos Support

## What changes were proposed in this pull request?

Add Kerberos support to Mesos. This includes kinit and --keytab support, but does not include delegation token renewal.

## How was this patch tested?

Manually against a secure DC/OS Apache HDFS cluster.

Author: ArtRand <ar...@soe.ucsc.edu>
Author: Michael Gummelt <mgumm...@mesosphere.io>

Closes #18519 from mgummelt/SPARK-16742-kerberos.

commit 7ab951885fd34aa8184b70a3a39b865a239e5052
Author: Jen-Ming Chung <jenmingi...@gmail.com>
Date: 2017-08-17T22:59:45Z

[SPARK-21677][SQL] json_tuple throws NullPointException when column is null as string type

## What changes were proposed in this pull request?

```scala
scala> Seq(("""{"Hyukjin": 224, "John": 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show()
...
java.lang.NullPointerException
  at ...
```

Currently a `null` field name throws a NullPointerException. Since a null field name can't match any field name in the JSON, we just output null as its column value.
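That semantics, a null field name matches nothing and so yields a null column instead of an NPE, can be sketched in plain Scala (a hypothetical helper, not Spark's implementation):

```scala
// Toy sketch: look up a possibly-null field name in a parsed JSON object.
// Wrapping the name in Option makes null simply match nothing.
object NullFieldSketch {
  def extractField(json: Map[String, String], fieldName: String): Option[String] =
    Option(fieldName).flatMap(json.get)

  def main(args: Array[String]): Unit = {
    val row = Map("Hyukjin" -> "224", "John" -> "1225")
    println(extractField(row, "John"))  // Some(1225)
    println(extractField(row, null))    // None, rendered as a null column
  }
}
```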
This PR achieves that by returning a very unlikely column name, `__NullFieldName`, when evaluating the field names.

## How was this patch tested?

Added unit test.

Author: Jen-Ming Chung <jenmingi...@gmail.com>

Closes #18930 from jmchung/SPARK-21677.

commit 2caaed970e3e26ae59be5999516a737aff3e5c78
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2017-08-17T23:33:39Z

[SPARK-21767][TEST][SQL] Add Decimal Test For Avro in VersionSuite

## What changes were proposed in this pull request?

Decimal is a logical type of Avro. We need to ensure that Hive's Avro serde support works well in Spark.

## How was this patch tested?

N/A

Author: gatorsmile <gatorsm...@gmail.com>

Closes #18977 from gatorsmile/addAvroTest.

commit 310454be3b0ce5ff6b6ef0070c5daadf6fb16927
Author: donnyzone <wellfeng...@gmail.com>
Date: 2017-08-18T05:37:32Z

[SPARK-21739][SQL] Cast expression should initialize timezoneId when it is called statically to convert something into TimestampType

## What changes were proposed in this pull request?

https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21739

This issue is caused by the introduction of TimeZoneAwareExpression. When the **Cast** expression converts something into TimestampType, it should be resolved with `timezoneId` set. In general, this is resolved in the LogicalPlan phase. However, there are still some places that use the Cast expression statically to convert datatypes without setting `timezoneId`. In such cases, `NoSuchElementException: None.get` will be thrown for TimestampType.

This PR fixes the issue. We have checked the whole project and found two such usages (i.e., in `TableReader` and `HiveTableScanExec`).

## How was this patch tested?

Unit test.

Author: donnyzone <wellfeng...@gmail.com>

Closes #18960 from DonnyZone/spark-21739.
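The `None.get` failure mode described above is easy to reproduce with a plain Scala sketch (a toy model, not Spark's actual `Cast` expression):

```scala
// Toy sketch: an Option that is only populated during analysis throws
// NoSuchElementException ("None.get") when the value is used before being set.
object TimezoneSketch {
  final case class CastToTimestamp(timeZoneId: Option[String]) {
    def zone: String = timeZoneId.get  // throws if timeZoneId was never resolved
  }

  def main(args: Array[String]): Unit = {
    // The fix amounts to always resolving the id before static use.
    val resolved = CastToTimestamp(Some("UTC"))
    println(resolved.zone)  // UTC

    val unresolved = CastToTimestamp(None)
    val failed =
      try { unresolved.zone; false }
      catch { case _: NoSuchElementException => true }
    println(failed)  // true: the bug this PR avoids by always setting timeZoneId
  }
}
```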
commit 07a2b8738ed8e6c136545d03f91a865de05e41a0
Author: Reynold Xin <r...@databricks.com>
Date: 2017-08-18T14:58:20Z

[SPARK-21778][SQL] Simpler Dataset.sample API in Scala / Java

## What changes were proposed in this pull request?

Dataset.sample requires a boolean flag withReplacement as the first argument. However, most of the time users simply want to sample some records without replacement. This ticket introduces a new sample function that simply takes in the fraction and seed.

## How was this patch tested?

Tested manually. Not sure yet if we should add a test case for just this wrapper ...

Author: Reynold Xin <r...@databricks.com>

Closes #18988 from rxin/SPARK-21778.

commit 23ea8980809497d0372084adf5936602655e1685
Author: Masha Basmanova <mbasman...@fb.com>
Date: 2017-08-18T16:54:39Z

[SPARK-21213][SQL] Support collecting partition-level statistics: rowCount and sizeInBytes

## What changes were proposed in this pull request?

Added support for the `ANALYZE TABLE [db_name.]tablename PARTITION (partcol1[=val1], partcol2[=val2], ...) COMPUTE STATISTICS [NOSCAN]` SQL command, which calculates the total number of rows and the size in bytes for a subset of partitions. The calculated statistics are stored in the Hive Metastore as user-defined properties attached to the partition objects. The property names are the same as the ones used to store table-level statistics: spark.sql.statistics.totalSize and spark.sql.statistics.numRows.

When the partition specification contains all partition columns with values, the command collects statistics for the single partition that matches the specification. When some partition columns are missing or are listed without their values, the command collects statistics for all partitions that match the specified subset of partition column values.
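This matching rule can be sketched in plain Scala (a hypothetical helper, not Spark's implementation): a partition matches when every column-to-value pair in the spec agrees with it, and columns listed without a value (or omitted) do not constrain the match.

```scala
// Toy sketch of partial partition-spec matching.
object PartitionSpecSketch {
  type Spec = Map[String, String]

  def matches(spec: Spec, partition: Spec): Boolean =
    spec.forall { case (col, value) => partition.get(col).contains(value) }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(
      Map("ds" -> "2008-04-08", "hr" -> "11"),
      Map("ds" -> "2008-04-08", "hr" -> "12"),
      Map("ds" -> "2008-04-09", "hr" -> "11"),
      Map("ds" -> "2008-04-09", "hr" -> "12"))

    // Full spec: exactly one partition matches.
    println(partitions.count(matches(Map("ds" -> "2008-04-09", "hr" -> "11"), _)))  // 1
    // Partial spec: all partitions for that ds match.
    println(partitions.count(matches(Map("ds" -> "2008-04-09"), _)))                // 2
    // No constraining values: every partition matches.
    println(partitions.count(matches(Map.empty, _)))                                // 4
  }
}
```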
For example, table t has 4 partitions with the following specs:

* Partition1: (ds='2008-04-08', hr=11)
* Partition2: (ds='2008-04-08', hr=12)
* Partition3: (ds='2008-04-09', hr=11)
* Partition4: (ds='2008-04-09', hr=12)

'ANALYZE TABLE t PARTITION (ds='2008-04-09', hr=11)' will collect statistics only for partition 3.
'ANALYZE TABLE t PARTITION (ds='2008-04-09')' will collect statistics for partitions 3 and 4.
'ANALYZE TABLE t PARTITION (ds, hr)' will collect statistics for all four partitions.

When the optional parameter NOSCAN is specified, the command doesn't count the number of rows and only gathers the size in bytes. The statistics gathered by the ANALYZE TABLE command can be fetched using the DESC EXTENDED [db_name.]tablename PARTITION command.

## How was this patch tested?

Added tests.

Author: Masha Basmanova <mbasman...@fb.com>

Closes #18421 from mbasmanova/mbasmanova-analyze-partition.

commit 7880909c45916ab76dccac308a9b2c5225a00e89
Author: Wenchen Fan <wenc...@databricks.com>
Date: 2017-08-18T18:19:22Z

[SPARK-21743][SQL][FOLLOW-UP] top-most limit should not cause memory leak

## What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/18955, to fix a bug where whole-stage codegen was broken for `Limit`.

## How was this patch tested?

Existing tests.

Author: Wenchen Fan <wenc...@databricks.com>

Closes #18993 from cloud-fan/bug.

commit a2db5c5761b0c72babe48b79859d3b208ee8e9f6
Author: Andrew Ash <and...@andrewash.com>
Date: 2017-08-18T20:43:42Z

[MINOR][TYPO] Fix typos: runnning and Excecutors

## What changes were proposed in this pull request?

Fix typos.

## How was this patch tested?

Existing tests.

Author: Andrew Ash <and...@andrewash.com>

Closes #18996 from ash211/patch-2.

commit 10be01848ef28004a287940a4e8d8a044e14b257
Author: Andrew Ray <ray.and...@gmail.com>
Date: 2017-08-19T01:10:54Z

[SPARK-21566][SQL][PYTHON] Python method for summary

## What changes were proposed in this pull request?
Adds the recently added `summary` method to the Python DataFrame interface.

## How was this patch tested?

Additional inline doctests.

Author: Andrew Ray <ray.and...@gmail.com>

Closes #18762 from aray/summary-py.

commit 72b738d8dcdb7893003c81bf1c73bbe262852d1a
Author: Yuming Wang <wgy...@gmail.com>
Date: 2017-08-19T18:41:32Z

[SPARK-21790][TESTS] Fix Docker-based Integration Test errors.

## What changes were proposed in this pull request?

[SPARK-17701](https://github.com/apache/spark/pull/18600/files#diff-b9f96d092fb3fea76bcf75e016799678L77) removed the `metadata` function; this PR fixes the parts of the Docker-based integration test module that relied on `SparkPlan.metadata`.

## How was this patch tested?

Manual tests.

Author: Yuming Wang <wgy...@gmail.com>

Closes #19000 from wangyum/SPARK-21709.

commit 73e04ecc4f29a0fe51687ed1337c61840c976f89
Author: Cédric Pelvet <cedric.pel...@gmail.com>
Date: 2017-08-20T10:05:54Z

[MINOR] Correct validateAndTransformSchema in GaussianMixture and AFTSurvivalRegression

## What changes were proposed in this pull request?

The line `SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)` did not modify the variable `schema`; hence only the last line had any effect. A temporary variable is used to correctly append the two columns predictionCol and probabilityCol.

## How was this patch tested?

Manually.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Cédric Pelvet <cedric.pel...@gmail.com>

Closes #18980 from sharp-pixel/master.

commit 41e0eb71a63140c9a44a7d2f32821f02abd62367
Author: hyukjinkwon <gurwls...@gmail.com>
Date: 2017-08-20T10:48:04Z

[SPARK-21773][BUILD][DOCS] Installs mkdocs if missing in the path in SQL documentation build

## What changes were proposed in this pull request?

This PR proposes to install `mkdocs` by `pip install` if it is missing from the path, mainly to fix Jenkins's documentation build failure in `spark-master-docs`.
See https://amplab.cs.berkeley.edu/jenkins/job/spark-master-docs/3580/console. It also adds `mkdocs` to the requirements in `docs/README.md`.

## How was this patch tested?

I manually ran `jekyll build` under the `docs` directory after removing `mkdocs` via `pip uninstall mkdocs`. I also tested this the same way on CentOS Linux release 7.3.1611 (Core), where I had built Spark a few times but had never built the documentation before and `mkdocs` was not installed.

```
...
Moving back into docs dir.
Moving to SQL directory and building docs.
Missing mkdocs in your path, trying to install mkdocs for SQL documentation generation.
Collecting mkdocs
  Downloading mkdocs-0.16.3-py2.py3-none-any.whl (1.2MB)
    100% |████████████████████████████████| 1.2MB 574kB/s
Requirement already satisfied: PyYAML>=3.10 in /usr/lib64/python2.7/site-packages (from mkdocs)
Collecting livereload>=2.5.1 (from mkdocs)
  Downloading livereload-2.5.1-py2-none-any.whl
Collecting tornado>=4.1 (from mkdocs)
  Downloading tornado-4.5.1.tar.gz (483kB)
    100% |████████████████████████████████| 491kB 1.4MB/s
Collecting Markdown>=2.3.1 (from mkdocs)
  Downloading Markdown-2.6.9.tar.gz (271kB)
    100% |████████████████████████████████| 276kB 2.4MB/s
Collecting click>=3.3 (from mkdocs)
  Downloading click-6.7-py2.py3-none-any.whl (71kB)
    100% |████████████████████████████████| 71kB 2.8MB/s
Requirement already satisfied: Jinja2>=2.7.1 in /usr/lib/python2.7/site-packages (from mkdocs)
Requirement already satisfied: six in /usr/lib/python2.7/site-packages (from livereload>=2.5.1->mkdocs)
Requirement already satisfied: backports.ssl_match_hostname in /usr/lib/python2.7/site-packages (from tornado>=4.1->mkdocs)
Collecting singledispatch (from tornado>=4.1->mkdocs)
  Downloading singledispatch-3.4.0.3-py2.py3-none-any.whl
Collecting certifi (from tornado>=4.1->mkdocs)
  Downloading certifi-2017.7.27.1-py2.py3-none-any.whl (349kB)
    100% |████████████████████████████████| 358kB 2.1MB/s
Collecting backports_abc>=0.4 (from tornado>=4.1->mkdocs)
  Downloading backports_abc-0.5-py2.py3-none-any.whl
Requirement already satisfied: MarkupSafe>=0.23 in /usr/lib/python2.7/site-packages (from Jinja2>=2.7.1->mkdocs)
Building wheels for collected packages: tornado, Markdown
  Running setup.py bdist_wheel for tornado ... done
  Stored in directory: /root/.cache/pip/wheels/84/83/cd/6a04602633457269d161344755e6766d24307189b7a67ff4b7
  Running setup.py bdist_wheel for Markdown ... done
  Stored in directory: /root/.cache/pip/wheels/bf/46/10/c93e17ae86ae3b3a919c7b39dad3b5ccf09aeb066419e5c1e5
Successfully built tornado Markdown
Installing collected packages: singledispatch, certifi, backports-abc, tornado, livereload, Markdown, click, mkdocs
Successfully installed Markdown-2.6.9 backports-abc-0.5 certifi-2017.7.27.1 click-6.7 livereload-2.5.1 mkdocs-0.16.3 singledispatch-3.4.0.3 tornado-4.5.1
Generating markdown files for SQL documentation.
Generating HTML files for SQL documentation.
INFO - Cleaning site directory
INFO - Building documentation to directory: .../spark/sql/site
Moving back into docs dir.
Making directory api/sql
cp -r ../sql/site/. api/sql
Source: .../spark/docs
Destination: .../spark/docs/_site
Generating... done.
Auto-regeneration: disabled. Use --watch to enable.
```

Author: hyukjinkwon <gurwls...@gmail.com>

Closes #18984 from HyukjinKwon/sql-doc-mkdocs.

commit 28a6cca7df900d13613b318c07acb97a5722d2b8
Author: Liang-Chi Hsieh <vii...@gmail.com>
Date: 2017-08-20T16:45:23Z

[SPARK-21721][SQL][FOLLOWUP] Clear FileSystem deleteOnExit cache when paths are successfully removed

## What changes were proposed in this pull request?

Fix a typo in test.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <vii...@gmail.com>

Closes #19005 from viirya/SPARK-21721-followup.
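The mkdocs bootstrap step shown in the SPARK-21773 build log above ("Missing mkdocs in your path, trying to install mkdocs...") can be sketched as follows. This is an illustrative Python sketch, not the actual build logic (which lives in the docs build scripts); the `ensure_mkdocs` name and the injectable `run` parameter are assumptions made here so the pip step can be stubbed out:

```python
import shutil
import subprocess
import sys

def ensure_mkdocs(run=subprocess.check_call):
    """If mkdocs is not on the PATH, try to install it via pip.

    `run` is injectable so the pip step can be stubbed out in tests.
    """
    if shutil.which("mkdocs") is not None:
        return "present"  # already on the PATH, nothing to do
    print("Missing mkdocs in your path, trying to install mkdocs "
          "for SQL documentation generation.")
    run([sys.executable, "-m", "pip", "install", "mkdocs"])
    return "installed"
```

Checking the PATH first keeps repeat builds fast; the install is attempted only when the command is genuinely missing, which mirrors the log above.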
commit 77d046ec47a9bfa6323aa014869844c28e18e049
Author: Sergey Serebryakov <sserebrya...@tesla.com>
Date: 2017-08-21T07:21:25Z

[SPARK-21782][CORE] Repartition creates skews when numPartitions is a power of 2

## Problem

When an RDD (particularly one with a low item-per-partition ratio) is repartitioned to numPartitions = a power of 2, the resulting partitions are very uneven in size, due to using a fixed seed to initialize the PRNG and using the PRNG only once. See details in https://issues.apache.org/jira/browse/SPARK-21782.

## What changes were proposed in this pull request?

Instead of directly using the seeds `0, 1, 2, ...` to initialize `Random`, hash them with `scala.util.hashing.byteswap32()`.

## How was this patch tested?

`build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.rdd.RDDSuite test`

Author: Sergey Serebryakov <sserebrya...@tesla.com>

Closes #18990 from megaserg/repartition-skew.

commit b3a07526fe774fd64fe3a2b9a2381eff9a3c49a3
Author: Sean Owen <so...@cloudera.com>
Date: 2017-08-21T12:20:40Z

[SPARK-21718][SQL] Heavy log of type: "Skipping partition based on stats ..."

## What changes were proposed in this pull request?

Reduce the 'Skipping partitions' message to debug level.

## How was this patch tested?

Existing tests.

Author: Sean Owen <so...@cloudera.com>

Closes #19010 from srowen/SPARK-21718.

commit 988b84d7ed43bea2616527ff050dffcf20548ab2
Author: Nick Pentreath <ni...@za.ibm.com>
Date: 2017-08-21T12:35:38Z

[SPARK-21468][PYSPARK][ML] Python API for FeatureHasher

Add a Python API for the `FeatureHasher` transformer.

## How was this patch tested?

New doc test.

Author: Nick Pentreath <ni...@za.ibm.com>

Closes #18970 from MLnick/SPARK-21468-pyspark-hasher.

commit ba843292e37368e1f5e4ae5c99ba1f5f90ca6025
Author: Yuming Wang <wgy...@gmail.com>
Date: 2017-08-21T17:16:56Z

[SPARK-21790][TESTS][FOLLOW-UP] Add filter pushdown verification back.

## What changes were proposed in this pull request?
The previous PR (https://github.com/apache/spark/pull/19000) removed the filter pushdown verification; this PR adds it back.

## How was this patch tested?

Manual tests.

Author: Yuming Wang <wgy...@gmail.com>

Closes #19002 from wangyum/SPARK-21790-follow-up.

commit 84b5b16ea6c9816c70f7471a50eb5e4acb7fb1a1
Author: Marcelo Vanzin <van...@cloudera.com>
Date: 2017-08-21T22:09:02Z

[SPARK-21617][SQL] Store correct table metadata when altering schema in Hive metastore.

For Hive tables, the current "replace the schema" code is the correct path, except that an exception in that path should result in an error, not in retrying in a different way. For data source tables, Spark may generate a non-compatible Hive table; but for that to work with Hive 2.1, the detection of data source tables needs to be fixed in the Hive client to also consider the raw tables used by code such as `alterTableSchema`.

Tested with existing and added unit tests (plus internal tests with a 2.1 metastore).

Author: Marcelo Vanzin <van...@cloudera.com>

Closes #18849 from vanzin/SPARK-21617.

commit c108a5d30e821fef23709681fca7da22bc507129
Author: Yanbo Liang <yblia...@gmail.com>
Date: 2017-08-22T00:43:18Z

[SPARK-19762][ML][FOLLOWUP] Add necessary comments to L2Regularization.

## What changes were proposed in this pull request?

MLlib's `LinearRegression`/`LogisticRegression`/`LinearSVC` always standardize the data during training to improve the rate of convergence, regardless of whether _standardization_ is true or false. If _standardization_ is false, we perform reverse standardization by penalizing each component differently, to get effectively the same objective function as when the training dataset is not standardized. We should keep these comments in the code to let developers understand how we handle it correctly.

## How was this patch tested?

Existing tests; this only adds some comments to the code.

Author: Yanbo Liang <yblia...@gmail.com>

Closes #18992 from yanboliang/SPARK-19762.
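The "penalizing each component differently" reverse standardization described in SPARK-19762 boils down to a simple identity: if features are divided by their standard deviations sigma_j, the fitted coefficients scale as w'_j = w_j * sigma_j, so penalizing each standardized-space component by 1/sigma_j^2 reproduces the plain L2 penalty on the original coefficients. A minimal numeric sketch of that identity (the values below are made up for illustration; the actual weighting Spark applies lives in `L2Regularization`):

```python
# Identity check: sum((w'_j / sigma_j)^2) == sum(w_j^2) when w'_j = w_j * sigma_j.
sigmas = [0.5, 2.0, 4.0]                        # per-feature standard deviations
w = [1.0, -3.0, 0.25]                           # coefficients in the original space
w_std = [wj * sj for wj, sj in zip(w, sigmas)]  # same model fit on standardized features

plain_penalty = sum(wj ** 2 for wj in w)
reweighted_penalty = sum((wpj / sj) ** 2 for wpj, sj in zip(w_std, sigmas))

assert abs(plain_penalty - reweighted_penalty) < 1e-12
```

This is why training can always run in standardized space for better convergence while still optimizing the unstandardized objective when _standardization_ is false.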