GitHub user listenLearning opened a pull request:
https://github.com/apache/spark/pull/19335
mapPartitions Api
Hello. I recently ran into a problem during development: when I use the mapPartitions API to store data into HBase, I get a "partition not found" error, and right after that a "broadcast variable not found" error. Why does this happen? The code and the error are below.
// Imports assumed by this snippet (not shown in the original mail):
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.DataFrame

def ASpan(span: DataFrame, time: String): Unit = {
  try {
    span.mapPartitions(iter => {
      iter.map(line => {
        // Row key: Bit16 hash of the first column plus a fixed suffix.
        val put = new Put(Bytes.toBytes(CreateRowkey.Bit16(line.getString(0)) + "_101301"))
        put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("CALLDT_TIME1PER_30"), Bytes.toBytes(line.getString(1)))
        put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("CALLDT_TIME2PER_30"), Bytes.toBytes(line.getString(2)))
        put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("CALLDT_TIME3PER_30"), Bytes.toBytes(line.getString(3)))
        put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("CALLDT_TIME4PER_30"), Bytes.toBytes(line.getString(4)))
        put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("CALLDT_HASCALL_1"), Bytes.toBytes(line.getLong(5).toString))
        put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("CALLDT_HASCALL_3"), Bytes.toBytes(line.getLong(6).toString))
        put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("CALLDT_HASCALL_6"), Bytes.toBytes(line.getLong(7).toString))
        put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("CALLDT_NOCALL_1"), Bytes.toBytes(line.getLong(8).toString))
        put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("CALLDT_NOCALL_3"), Bytes.toBytes(line.getLong(9).toString))
        put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("CALLDT_NOCALL_6"), Bytes.toBytes(line.getLong(10).toString))
        put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("DB_TIME"), Bytes.toBytes(time))
        (new ImmutableBytesWritable, put)
      })
    }).saveAsNewAPIHadoopDataset(shuliStreaming.indexTable)
  } catch {
    case e: Exception =>
      // The log label below is a Chinese string that arrived garbled by the mail encoding; kept as received.
      shuliStreaming.WriteIn.writeLog("shuli", time, "é黿&è¿å ææ¯å¦éè¯å¨é误", e)
      e.printStackTrace()
      println("é黿&è¿å ææ¯å¦éè¯å¨é误" + e)
  }
}
error:
17/09/24 23:04:17 INFO spark.CacheManager: Partition rdd_11_1 not found, computing it
17/09/24 23:04:17 INFO rdd.HadoopRDD: Input split: hdfs://nameservice1/data/input/common/phlibrary/OFFLINEPHONELIBRARY.dat:1146925+1146926
17/09/24 23:04:17 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1
17/09/24 23:04:17 ERROR executor.Executor: Exception in task 1.0 in stage 250804.0 (TID 3190467)
java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1223)
    at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
    at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
    at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
    at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
    at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
    at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:144)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:212)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19335.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19335
----
commit 0bb8d1f30a7223d2844aa6e733aaacfe2d37eb82
Author: Nick Pentreath <[email protected]>
Date: 2017-08-16T08:54:28Z
[SPARK-13969][ML] Add FeatureHasher transformer
This PR adds a `FeatureHasher` transformer, modeled on
[scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html)
and [Vowpal
wabbit](https://github.com/JohnLangford/vowpal_wabbit/wiki/Feature-Hashing-and-Extraction).
The transformer operates on multiple input columns in one pass. Current
behavior is:
* for numerical columns, the values are assumed to be real values and the
feature index is `hash(columnName)` while feature value is `feature_value`
* for string columns, the values are assumed to be categorical and the
feature index is `hash(column_name=feature_value)`, while feature value is `1.0`
* For hash collisions, feature values will be summed
* `null` (missing) values are ignored
The following dataframe illustrates the basic semantics:
```
+---+------+-----+---------+------+-----------------------------------------+
|int|double|float|stringNum|string|features                                 |
+---+------+-----+---------+------+-----------------------------------------+
|3  |4.0   |5.0  |1        |foo   |(16,[0,8,11,12,15],[5.0,3.0,1.0,4.0,1.0])|
|6  |7.0   |8.0  |2        |bar   |(16,[0,8,11,12,15],[8.0,6.0,1.0,7.0,1.0])|
+---+------+-----+---------+------+-----------------------------------------+
```
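For reference, a minimal usage sketch of the transformer with the semantics above; the SparkSession setup and the small `numFeatures` value are illustrative assumptions, not part of the PR:
```scala
import org.apache.spark.ml.feature.FeatureHasher
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("FeatureHasherSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Mixed numeric and string input columns, mirroring the example dataframe above.
val df = Seq(
  (3, 4.0, 5.0f, "1", "foo"),
  (6, 7.0, 8.0f, "2", "bar")
).toDF("int", "double", "float", "stringNum", "string")

val hasher = new FeatureHasher()
  .setInputCols("int", "double", "float", "stringNum", "string")
  .setNumFeatures(16)          // small hash space so the resulting indices are easy to read
  .setOutputCol("features")

hasher.transform(df).show(false)
```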
## How was this patch tested?
New unit tests and manual experiments.
Author: Nick Pentreath <[email protected]>
Closes #18513 from MLnick/FeatureHasher.
commit adf005dabe3b0060033e1eeaedbab31a868efc8c
Author: John Lee <[email protected]>
Date: 2017-08-16T14:44:09Z
[SPARK-21656][CORE] spark dynamic allocation should not idle timeout
executors when tasks still to run
## What changes were proposed in this pull request?
Right now Spark lets go of executors when they have been idle for 60s (or a configurable time). I have seen Spark release them while they were still really needed, for example when the scheduler was waiting for node locality, which takes longer than the default idle timeout. In these jobs the number of executors drops very low (fewer than 10) while there are still around 80,000 tasks to run.
We should consider not allowing executors to idle-timeout if they are still needed according to the number of tasks left to run.
## How was this patch tested?
Tested by manually adding executors to `executorsIdsToBeRemoved` list and
seeing if those executors were removed when there are a lot of tasks and a high
`numExecutorsTarget` value.
Code used
In `ExecutorAllocationManager.start()`
```
start_time = clock.getTimeMillis()
```
In `ExecutorAllocationManager.schedule()`
```
val executorIdsToBeRemoved = ArrayBuffer[String]()
if (now > start_time + 1000 * 60 * 2) {
  logInfo("--- REMOVING 1/2 of the EXECUTORS ---")
  start_time += 1000 * 60 * 100
  var counter = 0
  for (x <- executorIds) {
    counter += 1
    if (counter == 2) {
      counter = 0
      executorIdsToBeRemoved += x
    }
  }
}
```
Author: John Lee <[email protected]>
Closes #18874 from yoonlee95/SPARK-21656.
commit 1cce1a3b639c5c793d43fa51a8ec3e0fef622a40
Author: 10129659 <[email protected]>
Date: 2017-08-16T16:12:20Z
[SPARK-21603][SQL] Whole-stage codegen can be much slower than having it disabled when the generated function is too long
## What changes were proposed in this pull request?
Disable whole-stage codegen when the generated function is longer than the maximum length set by the spark.sql.codegen.MaxFunctionLength parameter, because a function that is too long will not get JIT optimization.
A benchmark shows a roughly 10x slowdown when the generated function is too long:
ignore("max function length of wholestagecodegen") {
  val N = 20 << 15
  val benchmark = new Benchmark("max function length of wholestagecodegen", N)
  def f(): Unit = sparkSession.range(N)
    .selectExpr(
      "id",
      "(id & 1023) as k1",
      "cast(id & 1023 as double) as k2",
      "cast(id & 1023 as int) as k3",
      "case when id > 100 and id <= 200 then 1 else 0 end as v1",
      "case when id > 200 and id <= 300 then 1 else 0 end as v2",
      "case when id > 300 and id <= 400 then 1 else 0 end as v3",
      "case when id > 400 and id <= 500 then 1 else 0 end as v4",
      "case when id > 500 and id <= 600 then 1 else 0 end as v5",
      "case when id > 600 and id <= 700 then 1 else 0 end as v6",
      "case when id > 700 and id <= 800 then 1 else 0 end as v7",
      "case when id > 800 and id <= 900 then 1 else 0 end as v8",
      "case when id > 900 and id <= 1000 then 1 else 0 end as v9",
      "case when id > 1000 and id <= 1100 then 1 else 0 end as v10",
      "case when id > 1100 and id <= 1200 then 1 else 0 end as v11",
      "case when id > 1200 and id <= 1300 then 1 else 0 end as v12",
      "case when id > 1300 and id <= 1400 then 1 else 0 end as v13",
      "case when id > 1400 and id <= 1500 then 1 else 0 end as v14",
      "case when id > 1500 and id <= 1600 then 1 else 0 end as v15",
      "case when id > 1600 and id <= 1700 then 1 else 0 end as v16",
      "case when id > 1700 and id <= 1800 then 1 else 0 end as v17",
      "case when id > 1800 and id <= 1900 then 1 else 0 end as v18")
    .groupBy("k1", "k2", "k3")
    .sum()
    .collect()

  benchmark.addCase(s"codegen = F") { iter =>
    sparkSession.conf.set("spark.sql.codegen.wholeStage", "false")
    f()
  }

  benchmark.addCase(s"codegen = T") { iter =>
    sparkSession.conf.set("spark.sql.codegen.wholeStage", "true")
    sparkSession.conf.set("spark.sql.codegen.MaxFunctionLength", "10000")
    f()
  }

  benchmark.run()

  /*
  Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14 on Windows 7 6.1
  Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
  max function length of wholestagecodegen: Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
  ------------------------------------------------------------------------------------------------
  codegen = F                                      443 /  507          1.5       676.0       1.0X
  codegen = T                                     3279 / 3283          0.2      5002.6       0.1X
  */
}
## How was this patch tested?
Run the unit test
Author: 10129659 <[email protected]>
Closes #18810 from eatoncys/codegen.
commit 7add4e982184c46990995dcd1326e6caad7adf6e
Author: Marco Gaido <[email protected]>
Date: 2017-08-16T16:40:04Z
[SPARK-21738] Thriftserver doesn't cancel jobs when session is closed
## What changes were proposed in this pull request?
When a session is closed the Thriftserver doesn't cancel the jobs which may
still be running. This is a huge waste of resources.
This PR addresses the problem by canceling the pending jobs when a session is closed.
## How was this patch tested?
The patch was tested manually.
Author: Marco Gaido <[email protected]>
Closes #18951 from mgaido91/SPARK-21738.
commit a0345cbebe23537df4084cf90f9d47425e550150
Author: Peng Meng <[email protected]>
Date: 2017-08-16T18:05:20Z
[SPARK-21680][ML][MLLIB] optimize Vector compress
## What changes were proposed in this pull request?
When using Vector.compressed to convert a Vector to a SparseVector, the performance is very low compared with Vector.toSparse.
This is because Vector.compressed has to scan the values three times, while Vector.toSparse only needs two passes.
When the vector is long, there is a significant performance difference between the two methods.
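For context, a small sketch of the two conversions being compared; the vector contents are illustrative, not code from the PR:
```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// A long, mostly-zero dense vector, where converting to a sparse representation pays off.
val dense: Vector = Vectors.dense(Array.tabulate(1000000)(i => if (i % 100 == 0) 1.0 else 0.0))

val viaToSparse = dense.toSparse    // two passes over the values
val viaCompress = dense.compressed  // per the description above, previously three passes when the sparse form wins
```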
## How was this patch tested?
The existing UT
Author: Peng Meng <[email protected]>
Closes #18899 from mpjlu/optVectorCompress.
commit b8ffb51055108fd606b86f034747006962cd2df3
Author: Eyal Farago <[email protected]>
Date: 2017-08-17T01:21:50Z
[SPARK-3151][BLOCK MANAGER] DiskStore.getBytes fails for files larger than
2GB
## What changes were proposed in this pull request?
Introduced `DiskBlockData`, a new implementation of `BlockData` representing a whole file.
This is somewhat related to [SPARK-6236](https://issues.apache.org/jira/browse/SPARK-6236) as well.
This class follows the implementation of `EncryptedBlockData`, just without the encryption, hence:
* `toInputStream` is implemented using a `FileInputStream` (todo: the encrypted version actually uses `Channels.newInputStream`, not sure if that is the right choice here)
* `toNetty` is implemented in terms of `io.netty.channel.DefaultFileRegion`
* `toByteBuffer` fails for files larger than 2GB (same behavior as the original code, just postponed a bit); it also respects the same configuration keys defined by the original code to choose between memory mapping and a simple file read.
## How was this patch tested?
added test to DiskStoreSuite and MemoryManagerSuite
Author: Eyal Farago <[email protected]>
Closes #18855 from eyalfa/SPARK-3151.
commit a45133b826984b7856e16d754e01c82702016af7
Author: Wenchen Fan <[email protected]>
Date: 2017-08-17T05:37:45Z
[SPARK-21743][SQL] top-most limit should not cause memory leak
## What changes were proposed in this pull request?
For top-most limit, we will use a special operator to execute it:
`CollectLimitExec`.
`CollectLimitExec` will retrieve `n` (which is the limit) rows from each partition of the child plan output, see
https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L311.
It's very likely that we don't exhaust the child plan output.
This is fine when whole-stage codegen is off, as the child plan will release the resource via a task completion listener. However, when whole-stage codegen is on, the resource can only be released if all output is consumed.
To fix this memory leak, one simple approach is: when `CollectLimitExec` retrieves `n` rows from the child plan output, the child plan output should only have `n` rows; then the output is exhausted and the resource is released. This can be done by wrapping the child plan with `LocalLimit`.
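For illustration, a query shape that takes this top-most-limit path; a sketch assuming a SparkSession named `spark`, not code from the PR:
```scala
// A top-most LIMIT executed via collect: planned as CollectLimitExec, which
// retrieves at most n rows from each partition of the child plan output.
val df = spark.range(0, 1000000).selectExpr("id", "id % 7 as k")
val firstTen = df.limit(10).collect()
```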
## How was this patch tested?
a regression test
Author: Wenchen Fan <[email protected]>
Closes #18955 from cloud-fan/leak.
commit d695a528bef6291e0e1657f4f3583a8371abd7c8
Author: Hideaki Tanaka <[email protected]>
Date: 2017-08-17T14:02:13Z
[SPARK-21642][CORE] Use FQDN for DRIVER_HOST_ADDRESS instead of ip address
## What changes were proposed in this pull request?
The patch lets the Spark web UI use the FQDN as its hostname instead of the IP address.
In the current implementation, the IP address of the driver host is set to DRIVER_HOST_ADDRESS. This becomes a problem when we enable SSL using the "spark.ssl.enabled", "spark.ssl.trustStore" and "spark.ssl.keyStore" properties. When we configure these properties, the Spark web UI is launched with SSL enabled and the HTTPS server uses the custom SSL certificate configured in these properties.
In this case, the client gets a javax.net.ssl.SSLPeerUnverifiedException when it accesses the Spark web UI, because it fails to verify the SSL certificate (the Common Name of the SSL cert does not match DRIVER_HOST_ADDRESS).
To avoid the exception, we should use the FQDN of the driver host for DRIVER_HOST_ADDRESS.
Error message that client gets when the client accesses spark web ui:
javax.net.ssl.SSLPeerUnverifiedException: Certificate for <10.102.138.239>
doesn't match any of the subject alternative names: []
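A sketch of the SSL settings referenced above; the keystore paths and passwords are placeholders, not taken from the PR:
```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.ssl.enabled", "true")
  .set("spark.ssl.keyStore", "/path/to/keystore.jks")
  .set("spark.ssl.keyStorePassword", "changeit")
  .set("spark.ssl.trustStore", "/path/to/truststore.jks")
  .set("spark.ssl.trustStorePassword", "changeit")
// With these set, the web UI is served over HTTPS using the configured certificate,
// so the certificate's Common Name must match the hostname the client connects to.
```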
## How was this patch tested?
manual tests
Author: Hideaki Tanaka <[email protected]>
Closes #18846 from thideeeee/SPARK-21642.
commit b83b502c4189c571bda776511c6f7541c6067aae
Author: Kent Yao <[email protected]>
Date: 2017-08-17T16:24:45Z
[SPARK-21428] Turn IsolatedClientLoader off while using builtin Hive jars
for reusing CliSessionState
## What changes were proposed in this pull request?
Set isolated to false while using builtin hive jars and `SessionState.get`
returns a `CliSessionState` instance.
## How was this patch tested?
1 Unit Tests
2 Manually verified: `hive.exec.scratchdir` was only created once because of reusing the CliSessionState
```java
➜  spark git:(SPARK-21428) ✗ bin/spark-sql --conf spark.sql.hive.metastore.jars=builtin
log4j:WARN No appenders could be found for logger
(org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
more info.
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
17/07/16 23:59:27 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
17/07/16 23:59:27 INFO HiveMetaStore: 0: Opening raw store with
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
17/07/16 23:59:27 INFO ObjectStore: ObjectStore, initialize called
17/07/16 23:59:28 INFO Persistence: Property
hive.metastore.integral.jdo.pushdown unknown - will be ignored
17/07/16 23:59:28 INFO Persistence: Property datanucleus.cache.level2
unknown - will be ignored
17/07/16 23:59:29 INFO ObjectStore: Setting MetaStore object pin classes
with
hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
17/07/16 23:59:30 INFO Datastore: The class
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
"embedded-only" so does not have its own datastore table.
17/07/16 23:59:30 INFO Datastore: The class
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so
does not have its own datastore table.
17/07/16 23:59:31 INFO Datastore: The class
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
"embedded-only" so does not have its own datastore table.
17/07/16 23:59:31 INFO Datastore: The class
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so
does not have its own datastore table.
17/07/16 23:59:31 INFO MetaStoreDirectSql: Using direct SQL, underlying DB
is DERBY
17/07/16 23:59:31 INFO ObjectStore: Initialized ObjectStore
17/07/16 23:59:31 WARN ObjectStore: Version information not found in
metastore. hive.metastore.schema.verification is not enabled so recording the
schema version 1.2.0
17/07/16 23:59:31 WARN ObjectStore: Failed to get database default,
returning NoSuchObjectException
17/07/16 23:59:32 INFO HiveMetaStore: Added admin role in metastore
17/07/16 23:59:32 INFO HiveMetaStore: Added public role in metastore
17/07/16 23:59:32 INFO HiveMetaStore: No user is added in admin role, since
config is empty
17/07/16 23:59:32 INFO HiveMetaStore: 0: get_all_databases
17/07/16 23:59:32 INFO audit: ugi=Kent ip=unknown-ip-addr
cmd=get_all_databases
17/07/16 23:59:32 INFO HiveMetaStore: 0: get_functions: db=default pat=*
17/07/16 23:59:32 INFO audit: ugi=Kent ip=unknown-ip-addr
cmd=get_functions: db=default pat=*
17/07/16 23:59:32 INFO Datastore: The class
"org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as
"embedded-only" so does not have its own datastore table.
17/07/16 23:59:32 INFO SessionState: Created local directory:
/var/folders/k2/04p4k4ws73l6711h_mz2_tq00000gn/T/beea7261-221a-4711-89e8-8b12a9d37370_resources
17/07/16 23:59:32 INFO SessionState: Created HDFS directory:
/tmp/hive/Kent/beea7261-221a-4711-89e8-8b12a9d37370
17/07/16 23:59:32 INFO SessionState: Created local directory:
/var/folders/k2/04p4k4ws73l6711h_mz2_tq00000gn/T/Kent/beea7261-221a-4711-89e8-8b12a9d37370
17/07/16 23:59:32 INFO SessionState: Created HDFS directory:
/tmp/hive/Kent/beea7261-221a-4711-89e8-8b12a9d37370/_tmp_space.db
17/07/16 23:59:32 INFO SparkContext: Running Spark version 2.3.0-SNAPSHOT
17/07/16 23:59:32 INFO SparkContext: Submitted application:
SparkSQL::10.0.0.8
17/07/16 23:59:32 INFO SecurityManager: Changing view acls to: Kent
17/07/16 23:59:32 INFO SecurityManager: Changing modify acls to: Kent
17/07/16 23:59:32 INFO SecurityManager: Changing view acls groups to:
17/07/16 23:59:32 INFO SecurityManager: Changing modify acls groups to:
17/07/16 23:59:32 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(Kent); groups
with view permissions: Set(); users with modify permissions: Set(Kent); groups
with modify permissions: Set()
17/07/16 23:59:33 INFO Utils: Successfully started service 'sparkDriver' on
port 51889.
17/07/16 23:59:33 INFO SparkEnv: Registering MapOutputTracker
17/07/16 23:59:33 INFO SparkEnv: Registering BlockManagerMaster
17/07/16 23:59:33 INFO BlockManagerMasterEndpoint: Using
org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/07/16 23:59:33 INFO BlockManagerMasterEndpoint:
BlockManagerMasterEndpoint up
17/07/16 23:59:33 INFO DiskBlockManager: Created local directory at
/private/var/folders/k2/04p4k4ws73l6711h_mz2_tq00000gn/T/blockmgr-9cfae28a-01e9-4c73-a1f1-f76fa52fc7a5
17/07/16 23:59:33 INFO MemoryStore: MemoryStore started with capacity 366.3
MB
17/07/16 23:59:33 INFO SparkEnv: Registering OutputCommitCoordinator
17/07/16 23:59:33 INFO Utils: Successfully started service 'SparkUI' on
port 4040.
17/07/16 23:59:33 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at
http://10.0.0.8:4040
17/07/16 23:59:33 INFO Executor: Starting executor ID driver on host
localhost
17/07/16 23:59:33 INFO Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port 51890.
17/07/16 23:59:33 INFO NettyBlockTransferService: Server created on
10.0.0.8:51890
17/07/16 23:59:33 INFO BlockManager: Using
org.apache.spark.storage.RandomBlockReplicationPolicy for block replication
policy
17/07/16 23:59:33 INFO BlockManagerMaster: Registering BlockManager
BlockManagerId(driver, 10.0.0.8, 51890, None)
17/07/16 23:59:33 INFO BlockManagerMasterEndpoint: Registering block
manager 10.0.0.8:51890 with 366.3 MB RAM, BlockManagerId(driver, 10.0.0.8,
51890, None)
17/07/16 23:59:33 INFO BlockManagerMaster: Registered BlockManager
BlockManagerId(driver, 10.0.0.8, 51890, None)
17/07/16 23:59:33 INFO BlockManager: Initialized BlockManager:
BlockManagerId(driver, 10.0.0.8, 51890, None)
17/07/16 23:59:34 INFO SharedState: Setting hive.metastore.warehouse.dir
('null') to the value of spark.sql.warehouse.dir
('file:/Users/Kent/Documents/spark/spark-warehouse').
17/07/16 23:59:34 INFO SharedState: Warehouse path is
'file:/Users/Kent/Documents/spark/spark-warehouse'.
17/07/16 23:59:34 INFO HiveUtils: Initializing HiveMetastoreConnection
version 1.2.1 using Spark classes.
17/07/16 23:59:34 INFO HiveClientImpl: Warehouse location for Hive client
(version 1.2.2) is /user/hive/warehouse
17/07/16 23:59:34 INFO HiveMetaStore: 0: get_database: default
17/07/16 23:59:34 INFO audit: ugi=Kent ip=unknown-ip-addr
cmd=get_database: default
17/07/16 23:59:34 INFO HiveClientImpl: Warehouse location for Hive client
(version 1.2.2) is /user/hive/warehouse
17/07/16 23:59:34 INFO HiveMetaStore: 0: get_database: global_temp
17/07/16 23:59:34 INFO audit: ugi=Kent ip=unknown-ip-addr
cmd=get_database: global_temp
17/07/16 23:59:34 WARN ObjectStore: Failed to get database global_temp,
returning NoSuchObjectException
17/07/16 23:59:34 INFO HiveClientImpl: Warehouse location for Hive client
(version 1.2.2) is /user/hive/warehouse
17/07/16 23:59:34 INFO StateStoreCoordinatorRef: Registered
StateStoreCoordinator endpoint
spark-sql>
```
cc cloud-fan gatorsmile
Author: Kent Yao <[email protected]>
Author: hzyaoqin <[email protected]>
Closes #18648 from yaooqinn/SPARK-21428.
commit ae9e42479253a9cd30423476405377f2d7952137
Author: gatorsmile <[email protected]>
Date: 2017-08-17T20:00:37Z
[SQL][MINOR][TEST] Set spark.unsafe.exceptionOnMemoryLeak to true
## What changes were proposed in this pull request?
When running in IntelliJ, we are unable to capture the exception from memory leak detection.
> org.apache.spark.executor.Executor: Managed memory leak detected
This PR explicitly sets `spark.unsafe.exceptionOnMemoryLeak` in SparkConf when building the SparkSession, instead of reading it from system properties.
## How was this patch tested?
N/A
Author: gatorsmile <[email protected]>
Closes #18967 from gatorsmile/setExceptionOnMemoryLeak.
commit 6aad02d03632df964363a144c96371e86f7b207e
Author: Takeshi Yamamuro <[email protected]>
Date: 2017-08-17T20:47:14Z
[SPARK-18394][SQL] Make an AttributeSet.toSeq output order consistent
## What changes were proposed in this pull request?
This PR sorts output attributes by name and exprId in `AttributeSet.toSeq` to make the order consistent. If the order differs, Spark may generate different code and then miss the cache in `CodeGenerator`; e.g., `GenerateColumnAccessor` generates code that depends on the input attribute order.
## How was this patch tested?
Added tests in `AttributeSetSuite` and manually checked if the cache worked
well in the given query of the JIRA.
Author: Takeshi Yamamuro <[email protected]>
Closes #18959 from maropu/SPARK-18394.
commit bfdc361ededb2ed4e323f075fdc40ed004b7f41d
Author: ArtRand <[email protected]>
Date: 2017-08-17T22:47:07Z
[SPARK-16742] Mesos Kerberos Support
## What changes were proposed in this pull request?
Add Kerberos Support to Mesos. This includes kinit and --keytab support,
but does not include delegation token renewal.
## How was this patch tested?
Manually against a Secure DC/OS Apache HDFS cluster.
Author: ArtRand <[email protected]>
Author: Michael Gummelt <[email protected]>
Closes #18519 from mgummelt/SPARK-16742-kerberos.
commit 7ab951885fd34aa8184b70a3a39b865a239e5052
Author: Jen-Ming Chung <[email protected]>
Date: 2017-08-17T22:59:45Z
[SPARK-21677][SQL] json_tuple throws NullPointException when column is null
as string type
## What changes were proposed in this pull request?
``` scala
scala> Seq(("""{"Hyukjin": 224, "John": 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show()
...
java.lang.NullPointerException
  at ...
```
Currently a `null` field name throws a NullPointerException. Since a null field name cannot match any field name in the JSON, we just output null as its column value. This PR achieves that by returning a very unlikely column name, `__NullFieldName`, in the evaluation of the field names.
## How was this patch tested?
Added unit test.
Author: Jen-Ming Chung <[email protected]>
Closes #18930 from jmchung/SPARK-21677.
commit 2caaed970e3e26ae59be5999516a737aff3e5c78
Author: gatorsmile <[email protected]>
Date: 2017-08-17T23:33:39Z
[SPARK-21767][TEST][SQL] Add Decimal Test For Avro in VersionSuite
## What changes were proposed in this pull request?
Decimal is a logical type in Avro. We need to ensure that the support for Hive's Avro serde works well in Spark.
## How was this patch tested?
N/A
Author: gatorsmile <[email protected]>
Closes #18977 from gatorsmile/addAvroTest.
commit 310454be3b0ce5ff6b6ef0070c5daadf6fb16927
Author: donnyzone <[email protected]>
Date: 2017-08-18T05:37:32Z
[SPARK-21739][SQL] Cast expression should initialize timezoneId when it is
called statically to convert something into TimestampType
## What changes were proposed in this pull request?
https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21739
This issue is caused by introducing TimeZoneAwareExpression.
When the **Cast** expression converts something into TimestampType, it
should be resolved with setting `timezoneId`. In general, it is resolved in
LogicalPlan phase.
However, there are still some places that use Cast expression statically to
convert datatypes without setting `timezoneId`. In such cases,
`NoSuchElementException: None.get` will be thrown for TimestampType.
This PR is proposed to fix the issue. We have checked the whole project and found two such usages (i.e., in `TableReader` and `HiveTableScanExec`).
## How was this patch tested?
unit test
Author: donnyzone <[email protected]>
Closes #18960 from DonnyZone/spark-21739.
commit 07a2b8738ed8e6c136545d03f91a865de05e41a0
Author: Reynold Xin <[email protected]>
Date: 2017-08-18T14:58:20Z
[SPARK-21778][SQL] Simpler Dataset.sample API in Scala / Java
## What changes were proposed in this pull request?
Dataset.sample requires a boolean flag withReplacement as the first
argument. However, most of the time users simply want to sample some records
without replacement. This ticket introduces a new sample function that simply
takes in the fraction and seed.
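A short sketch of the proposed overloads next to the existing one, assuming a SparkSession named `spark`:
```scala
val df = spark.range(0, 1000).toDF("id")

df.sample(withReplacement = false, fraction = 0.1, seed = 42)  // existing API
df.sample(fraction = 0.1, seed = 42)                           // new: fraction and seed only
df.sample(fraction = 0.1)                                      // new: fraction only
```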
## How was this patch tested?
Tested manually. Not sure yet if we should add a test case for just this
wrapper ...
Author: Reynold Xin <[email protected]>
Closes #18988 from rxin/SPARK-21778.
commit 23ea8980809497d0372084adf5936602655e1685
Author: Masha Basmanova <[email protected]>
Date: 2017-08-18T16:54:39Z
[SPARK-21213][SQL] Support collecting partition-level statistics: rowCount
and sizeInBytes
## What changes were proposed in this pull request?
Added support for ANALYZE TABLE [db_name].tablename PARTITION
(partcol1[=val1], partcol2[=val2], ...) COMPUTE STATISTICS [NOSCAN] SQL command
to calculate total number of rows and size in bytes for a subset of partitions.
Calculated statistics are stored in Hive Metastore as user-defined properties
attached to partition objects. Property names are the same as the ones used to
store table-level statistics: spark.sql.statistics.totalSize and
spark.sql.statistics.numRows.
When partition specification contains all partition columns with values,
the command collects statistics for a single partition that matches the
specification. When some partition columns are missing or listed without their
values, the command collects statistics for all partitions which match a subset
of partition column values specified.
For example, table t has 4 partitions with the following specs:
* Partition1: (ds='2008-04-08', hr=11)
* Partition2: (ds='2008-04-08', hr=12)
* Partition3: (ds='2008-04-09', hr=11)
* Partition4: (ds='2008-04-09', hr=12)
'ANALYZE TABLE t PARTITION (ds='2008-04-09', hr=11)' command will collect
statistics only for partition 3.
'ANALYZE TABLE t PARTITION (ds='2008-04-09')' command will collect
statistics for partitions 3 and 4.
'ANALYZE TABLE t PARTITION (ds, hr)' command will collect statistics for
all four partitions.
When the optional parameter NOSCAN is specified, the command doesn't count
number of rows and only gathers size in bytes.
The statistics gathered by ANALYZE TABLE command can be fetched using DESC
EXTENDED [db_name.]tablename PARTITION command.
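The examples above, expressed as calls through `spark.sql`, assuming a SparkSession named `spark` and the partitioned table `t` from the example:
```scala
// Statistics for partition 3 only.
spark.sql("ANALYZE TABLE t PARTITION (ds='2008-04-09', hr=11) COMPUTE STATISTICS")

// Statistics for partitions 3 and 4 (hr is left unspecified).
spark.sql("ANALYZE TABLE t PARTITION (ds='2008-04-09') COMPUTE STATISTICS")

// All four partitions; NOSCAN gathers size in bytes but does not count rows.
spark.sql("ANALYZE TABLE t PARTITION (ds, hr) COMPUTE STATISTICS NOSCAN")

// Fetch the gathered statistics for one partition.
spark.sql("DESC EXTENDED t PARTITION (ds='2008-04-09', hr=11)").show(truncate = false)
```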
## How was this patch tested?
Added tests.
Author: Masha Basmanova <[email protected]>
Closes #18421 from mbasmanova/mbasmanova-analyze-partition.
commit 7880909c45916ab76dccac308a9b2c5225a00e89
Author: Wenchen Fan <[email protected]>
Date: 2017-08-18T18:19:22Z
[SPARK-21743][SQL][FOLLOW-UP] top-most limit should not cause memory leak
## What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/18955, fixing a bug where we broke whole-stage codegen for `Limit`.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <[email protected]>
Closes #18993 from cloud-fan/bug.
commit a2db5c5761b0c72babe48b79859d3b208ee8e9f6
Author: Andrew Ash <[email protected]>
Date: 2017-08-18T20:43:42Z
[MINOR][TYPO] Fix typos: runnning and Excecutors
## What changes were proposed in this pull request?
Fix typos
## How was this patch tested?
Existing tests
Author: Andrew Ash <[email protected]>
Closes #18996 from ash211/patch-2.
commit 10be01848ef28004a287940a4e8d8a044e14b257
Author: Andrew Ray <[email protected]>
Date: 2017-08-19T01:10:54Z
[SPARK-21566][SQL][PYTHON] Python method for summary
## What changes were proposed in this pull request?
Adds the recently added `summary` method to the python dataframe interface.
## How was this patch tested?
Additional inline doctests.
Author: Andrew Ray <[email protected]>
Closes #18762 from aray/summary-py.
commit 72b738d8dcdb7893003c81bf1c73bbe262852d1a
Author: Yuming Wang <[email protected]>
Date: 2017-08-19T18:41:32Z
[SPARK-21790][TESTS] Fix Docker-based Integration Test errors.
## What changes were proposed in this pull request?
[SPARK-17701](https://github.com/apache/spark/pull/18600/files#diff-b9f96d092fb3fea76bcf75e016799678L77)
removed the `metadata` function; this PR removes the code in the Docker-based integration test module that relied on `SparkPlan.metadata`.
## How was this patch tested?
manual tests
Author: Yuming Wang <[email protected]>
Closes #19000 from wangyum/SPARK-21709.
commit 73e04ecc4f29a0fe51687ed1337c61840c976f89
Author: Cédric Pelvet <[email protected]>
Date: 2017-08-20T10:05:54Z
[MINOR] Correct validateAndTransformSchema in GaussianMixture and
AFTSurvivalRegression
## What changes were proposed in this pull request?
The line SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
did not modify the variable schema, hence only the last line had any effect. A
temporary variable is used to correctly append the two columns predictionCol
and probabilityCol.
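An illustrative sketch of the bug pattern and the fix described above; `appendColumn` here is a simplified local stand-in, not the project's `SchemaUtils`, and the column names are placeholders:
```scala
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types._

// Stand-in for SchemaUtils.appendColumn: it returns a NEW schema rather than mutating its argument.
def appendColumn(schema: StructType, name: String, dataType: DataType): StructType =
  StructType(schema.fields :+ StructField(name, dataType, nullable = false))

def validateAndTransformSchema(schema: StructType): StructType = {
  // Buggy version: calling appendColumn(schema, "prediction", IntegerType) on its own line
  // had no effect, because the returned schema was discarded.
  val withPrediction = appendColumn(schema, "prediction", IntegerType)   // fixed: capture the result
  appendColumn(withPrediction, "probability", SQLDataTypes.VectorType)
}
```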
## How was this patch tested?
Manually.
Please review http://spark.apache.org/contributing.html before opening a
pull request.
Author: Cédric Pelvet <[email protected]>
Closes #18980 from sharp-pixel/master.
commit 41e0eb71a63140c9a44a7d2f32821f02abd62367
Author: hyukjinkwon <[email protected]>
Date: 2017-08-20T10:48:04Z
[SPARK-21773][BUILD][DOCS] Installs mkdocs if missing in the path in SQL
documentation build
## What changes were proposed in this pull request?
This PR proposes to install `mkdocs` by `pip install` if missing in the
path. Mainly to fix Jenkins's documentation build failure in
`spark-master-docs`. See
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-docs/3580/console.
It also adds `mkdocs` as requirements in `docs/README.md`.
## How was this patch tested?
I manually ran `jekyll build` under the `docs` directory after removing `mkdocs` via `pip uninstall mkdocs`.
I also tested this the same way on CentOS Linux release 7.3.1611 (Core), where I had built Spark a few times but had never built the documentation before and `mkdocs` was not installed.
```
...
Moving back into docs dir.
Moving to SQL directory and building docs.
Missing mkdocs in your path, trying to install mkdocs for SQL documentation
generation.
Collecting mkdocs
Downloading mkdocs-0.16.3-py2.py3-none-any.whl (1.2MB)
100% |████████████████████████████████| 1.2MB 574kB/s
Requirement already satisfied: PyYAML>=3.10 in
/usr/lib64/python2.7/site-packages (from mkdocs)
Collecting livereload>=2.5.1 (from mkdocs)
Downloading livereload-2.5.1-py2-none-any.whl
Collecting tornado>=4.1 (from mkdocs)
Downloading tornado-4.5.1.tar.gz (483kB)
100% |████████████████████████████████| 491kB 1.4MB/s
Collecting Markdown>=2.3.1 (from mkdocs)
Downloading Markdown-2.6.9.tar.gz (271kB)
100% |████████████████████████████████| 276kB 2.4MB/s
Collecting click>=3.3 (from mkdocs)
Downloading click-6.7-py2.py3-none-any.whl (71kB)
100% |████████████████████████████████| 71kB 2.8MB/s
Requirement already satisfied: Jinja2>=2.7.1 in
/usr/lib/python2.7/site-packages (from mkdocs)
Requirement already satisfied: six in /usr/lib/python2.7/site-packages
(from livereload>=2.5.1->mkdocs)
Requirement already satisfied: backports.ssl_match_hostname in
/usr/lib/python2.7/site-packages (from tornado>=4.1->mkdocs)
Collecting singledispatch (from tornado>=4.1->mkdocs)
Downloading singledispatch-3.4.0.3-py2.py3-none-any.whl
Collecting certifi (from tornado>=4.1->mkdocs)
Downloading certifi-2017.7.27.1-py2.py3-none-any.whl (349kB)
100% |████████████████████████████████| 358kB 2.1MB/s
Collecting backports_abc>=0.4 (from tornado>=4.1->mkdocs)
Downloading backports_abc-0.5-py2.py3-none-any.whl
Requirement already satisfied: MarkupSafe>=0.23 in
/usr/lib/python2.7/site-packages (from Jinja2>=2.7.1->mkdocs)
Building wheels for collected packages: tornado, Markdown
Running setup.py bdist_wheel for tornado ... done
Stored in directory:
/root/.cache/pip/wheels/84/83/cd/6a04602633457269d161344755e6766d24307189b7a67ff4b7
Running setup.py bdist_wheel for Markdown ... done
Stored in directory:
/root/.cache/pip/wheels/bf/46/10/c93e17ae86ae3b3a919c7b39dad3b5ccf09aeb066419e5c1e5
Successfully built tornado Markdown
Installing collected packages: singledispatch, certifi, backports-abc,
tornado, livereload, Markdown, click, mkdocs
Successfully installed Markdown-2.6.9 backports-abc-0.5 certifi-2017.7.27.1
click-6.7 livereload-2.5.1 mkdocs-0.16.3 singledispatch-3.4.0.3 tornado-4.5.1
Generating markdown files for SQL documentation.
Generating HTML files for SQL documentation.
INFO - Cleaning site directory
INFO - Building documentation to directory: .../spark/sql/site
Moving back into docs dir.
Making directory api/sql
cp -r ../sql/site/. api/sql
Source: .../spark/docs
Destination: .../spark/docs/_site
Generating...
done.
Auto-regeneration: disabled. Use --watch to enable.
```
Author: hyukjinkwon <[email protected]>
Closes #18984 from HyukjinKwon/sql-doc-mkdocs.
commit 28a6cca7df900d13613b318c07acb97a5722d2b8
Author: Liang-Chi Hsieh <[email protected]>
Date: 2017-08-20T16:45:23Z
[SPARK-21721][SQL][FOLLOWUP] Clear FileSystem deleteOnExit cache when paths
are successfully removed
## What changes were proposed in this pull request?
Fix a typo in test.
## How was this patch tested?
Jenkins tests.
Author: Liang-Chi Hsieh <[email protected]>
Closes #19005 from viirya/SPARK-21721-followup.
commit 77d046ec47a9bfa6323aa014869844c28e18e049
Author: Sergey Serebryakov <[email protected]>
Date: 2017-08-21T07:21:25Z
[SPARK-21782][CORE] Repartition creates skews when numPartitions is a power
of 2
## Problem
When an RDD (particularly one with a low item-per-partition ratio) is repartitioned to numPartitions = power of 2, the resulting partitions are very uneven in size, due to using a fixed seed to initialize the PRNG and using the PRNG only once. See details in https://issues.apache.org/jira/browse/SPARK-21782
## What changes were proposed in this pull request?
Instead of directly using `0, 1, 2,...` seeds to initialize `Random`, hash
them with `scala.util.hashing.byteswap32()`.
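A minimal sketch of the seeding change described above; illustrative only, not the exact project code:
```scala
import scala.util.Random
import scala.util.hashing.byteswap32

val partitionIndex = 7

val before = new Random(partitionIndex)              // raw partition index as the seed
val after  = new Random(byteswap32(partitionIndex))  // hashed index as the seed, as proposed here
```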
## How was this patch tested?
`build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.rdd.RDDSuite test`
Author: Sergey Serebryakov <[email protected]>
Closes #18990 from megaserg/repartition-skew.
commit b3a07526fe774fd64fe3a2b9a2381eff9a3c49a3
Author: Sean Owen <[email protected]>
Date: 2017-08-21T12:20:40Z
[SPARK-21718][SQL] Heavy log of type: "Skipping partition based on stats
..."
## What changes were proposed in this pull request?
Reduce 'Skipping partitions' message to debug
## How was this patch tested?
Existing tests
Author: Sean Owen <[email protected]>
Closes #19010 from srowen/SPARK-21718.
commit 988b84d7ed43bea2616527ff050dffcf20548ab2
Author: Nick Pentreath <[email protected]>
Date: 2017-08-21T12:35:38Z
[SPARK-21468][PYSPARK][ML] Python API for FeatureHasher
Add Python API for `FeatureHasher` transformer.
## How was this patch tested?
New doc test.
Author: Nick Pentreath <[email protected]>
Closes #18970 from MLnick/SPARK-21468-pyspark-hasher.
commit ba843292e37368e1f5e4ae5c99ba1f5f90ca6025
Author: Yuming Wang <[email protected]>
Date: 2017-08-21T17:16:56Z
[SPARK-21790][TESTS][FOLLOW-UP] Add filter pushdown verification back.
## What changes were proposed in this pull request?
The previous PR (https://github.com/apache/spark/pull/19000) removed the filter pushdown verification; this PR adds it back.
## How was this patch tested?
manual tests
Author: Yuming Wang <[email protected]>
Closes #19002 from wangyum/SPARK-21790-follow-up.
commit 84b5b16ea6c9816c70f7471a50eb5e4acb7fb1a1
Author: Marcelo Vanzin <[email protected]>
Date: 2017-08-21T22:09:02Z
[SPARK-21617][SQL] Store correct table metadata when altering schema in
Hive metastore.
For Hive tables, the current "replace the schema" code is the correct
path, except that an exception in that path should result in an error, and
not in retrying in a different way.
For data source tables, Spark may generate a non-compatible Hive table;
but for that to work with Hive 2.1, the detection of data source tables
needs
to be fixed in the Hive client, to also consider the raw tables used by code
such as `alterTableSchema`.
Tested with existing and added unit tests (plus internal tests with a 2.1
metastore).
Author: Marcelo Vanzin <[email protected]>
Closes #18849 from vanzin/SPARK-21617.
commit c108a5d30e821fef23709681fca7da22bc507129
Author: Yanbo Liang <[email protected]>
Date: 2017-08-22T00:43:18Z
[SPARK-19762][ML][FOLLOWUP] Add necessary comments to L2Regularization.
## What changes were proposed in this pull request?
MLlib ```LinearRegression/LogisticRegression/LinearSVC``` always standardize the data during training to improve the rate of convergence, regardless of whether _standardization_ is true or false. If _standardization_ is false, we perform reverse standardization by penalizing each component differently to get effectively the same objective function as if the training dataset were not standardized. We should keep these comments in the code to let developers understand how we handle it correctly.
## How was this patch tested?
Existing tests, only adding some comments in code.
Author: Yanbo Liang <[email protected]>
Closes #18992 from yanboliang/SPARK-19762.
----
---