GitHub user NamelessAnalyst opened a pull request:
https://github.com/apache/spark/pull/2568
SPARK-3716 [GraphX] Update Analytics.scala for partitionStrategy assignment
Previously, the val partitionStrategy was assigned by calling a function
in the Analytics object that was a copy of the PartitionStrategy.fromString()
method. That duplicate function has been removed, and partitionStrategy is
now assigned using the PartitionStrategy.fromString method directly. This
better matches the declarations of the edge/vertex StorageLevel variables.
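The fromString pattern being deduplicated here can be sketched outside Spark. This is a hypothetical Python stand-in, not GraphX code: the strategy names mirror GraphX's PartitionStrategy variants, but the classes and lookup are illustrative only.

```python
# Hypothetical sketch of a fromString-style lookup: one shared
# name-to-strategy mapping instead of a duplicated local copy.
class RandomVertexCut: pass
class EdgePartition1D: pass
class EdgePartition2D: pass
class CanonicalRandomVertexCut: pass

def from_string(name):
    # Build the dispatch table from the strategy classes themselves,
    # so adding a strategy needs no change to callers.
    strategies = {cls.__name__: cls for cls in
                  (RandomVertexCut, EdgePartition1D,
                   EdgePartition2D, CanonicalRandomVertexCut)}
    if name not in strategies:
        raise ValueError(f"Invalid PartitionStrategy: {name}")
    return strategies[name]()

assert isinstance(from_string("EdgePartition2D"), EdgePartition2D)
```

Keeping a single lookup like this is what lets the assignment site shrink to one call, matching the StorageLevel declarations.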
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/NamelessAnalyst/spark branch-1.1
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2568.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2568
----
commit 8c79574462eed113fc59d4323eedfc55c6e95c06
Author: Cheng Lian <[email protected]>
Date: 2014-08-16T18:26:51Z
[SQL] Using safe floating-point numbers in doctest
Test code in `sql.py` tries to compare two floating-point numbers directly,
which caused [build
failure(s)](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18365/consoleFull).
[Doctest
documentation](https://docs.python.org/3/library/doctest.html#warnings)
recommends using numbers in the form of `I/2**J` to avoid the precision issue.
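The I/2**J recommendation can be shown with a small self-contained doctest (plain Python, not the `sql.py` code in question): such fractions are exactly representable in binary floating point, so their printed form is stable.

```python
import doctest

def safe_fractions():
    """Floats of the form I/2**J are exactly representable in binary,
    so doctests that print them pass reliably across platforms.

    >>> 1/2 + 1/4
    0.75
    >>> 3/2**5
    0.09375
    """

# By contrast, 0.1 is not exactly representable, so direct comparison
# of decimal-looking floats is fragile:
assert 0.1 + 0.2 != 0.3
assert 1/2 + 1/4 == 0.75

failures = doctest.testmod().failed
assert failures == 0
```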
Author: Cheng Lian <[email protected]>
Closes #1925 from liancheng/fix-pysql-fp-test and squashes the following
commits:
0fbf584 [Cheng Lian] Removed unnecessary `...' from inferSchema doctest
e8059d4 [Cheng Lian] Using safe floating-point numbers in doctest
(cherry picked from commit b4a05928e95c0f6973fd21e60ff9c108f226e38c)
Signed-off-by: Michael Armbrust <[email protected]>
commit bd3ce2ffb8964abb4d59918ebb2c230fe4614aa2
Author: Kousuke Saruta <[email protected]>
Date: 2014-08-16T21:15:58Z
[SPARK-2677] BasicBlockFetchIterator#next can wait forever
Author: Kousuke Saruta <[email protected]>
Closes #1632 from sarutak/SPARK-2677 and squashes the following commits:
cddbc7b [Kousuke Saruta] Removed Exception throwing when
ConnectionManager#handleMessage receives ack for non-referenced message
d3bd2a8 [Kousuke Saruta] Modified configuration.md for
spark.core.connection.ack.timeout
e85f88b [Kousuke Saruta] Removed useless synchronized blocks
7ed48be [Kousuke Saruta] Modified ConnectionManager to use
ackTimeoutMonitor ConnectionManager-wide
9b620a6 [Kousuke Saruta] Merge branch 'master' of
git://git.apache.org/spark into SPARK-2677
0dd9ad3 [Kousuke Saruta] Modified typo in ConnectionManagerSuite.scala
7cbb8ca [Kousuke Saruta] Modified to match with scalastyle
8a73974 [Kousuke Saruta] Merge branch 'master' of
git://git.apache.org/spark into SPARK-2677
ade279a [Kousuke Saruta] Merge branch 'master' of
git://git.apache.org/spark into SPARK-2677
0174d6a [Kousuke Saruta] Modified ConnectionManager.scala to handle the
case remote Executor cannot ack
a454239 [Kousuke Saruta] Merge branch 'master' of
git://git.apache.org/spark into SPARK-2677
9b7b7c1 [Kousuke Saruta] (WIP) Modifying ConnectionManager.scala
(cherry picked from commit 76fa0eaf515fd6771cdd69422b1259485debcae5)
Signed-off-by: Josh Rosen <[email protected]>
commit 0b354be2f9ec35547a60591acf4f4773a4869690
Author: Xiangrui Meng <[email protected]>
Date: 2014-08-16T22:13:34Z
[SPARK-3048][MLLIB] add LabeledPoint.parse and remove
loadStreamingLabeledPoints
Move `parse()` from `LabeledPointParser` to `LabeledPoint` and make it
public. This breaks binary compatibility only when a user uses synthesized
methods like `tupled` and `curried`, which is rare.
`LabeledPoint.parse` is more consistent with `Vectors.parse`, which is why
`LabeledPointParser` is not preferred.
freeman-lab tdas
Author: Xiangrui Meng <[email protected]>
Closes #1952 from mengxr/labelparser and squashes the following commits:
c818fb2 [Xiangrui Meng] merge master
ce20e6f [Xiangrui Meng] update mima excludes
b386b8d [Xiangrui Meng] fix tests
2436b3d [Xiangrui Meng] add parse() to LabeledPoint
(cherry picked from commit 7e70708a99949549adde00cb6246a9582bbc4929)
Signed-off-by: Xiangrui Meng <[email protected]>
commit a12d3ae3223535e6e4c774e4a289b8b2f2e5228b
Author: Xiangrui Meng <[email protected]>
Date: 2014-08-16T22:14:43Z
[SPARK-3081][MLLIB] rename RandomRDDGenerators to RandomRDDs
`RandomRDDGenerators` means factory for `RandomRDDGenerator`. However, its
methods return RDDs but not RDDGenerators. So a more proper (and shorter) name
would be `RandomRDDs`.
dorx brkyvz
Author: Xiangrui Meng <[email protected]>
Closes #1979 from mengxr/randomrdds and squashes the following commits:
b161a2d [Xiangrui Meng] rename RandomRDDGenerators to RandomRDDs
(cherry picked from commit ac6411c6e75906997c78de23dfdbc8d225b87cfd)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 721f2fdc95032132af3d4a00dbc8399d356f8faf
Author: iAmGhost <[email protected]>
Date: 2014-08-16T23:48:38Z
[SPARK-3035] Wrong example with SparkContext.addFile
https://issues.apache.org/jira/browse/SPARK-3035
fix for wrong document.
Author: iAmGhost <[email protected]>
Closes #1942 from iAmGhost/master and squashes the following commits:
487528a [iAmGhost] [SPARK-3035] Wrong example with SparkContext.addFile fix
for wrong document.
(cherry picked from commit 379e7585c356f20bf8b4878ecba9401e2195da12)
Signed-off-by: Josh Rosen <[email protected]>
commit 5dd571c29ef97cadd23a54fcf4d5de869e3c56bc
Author: Davies Liu <[email protected]>
Date: 2014-08-16T23:59:34Z
[SPARK-1065] [PySpark] improve supporting for large broadcast
Passing large objects through py4j is very slow (and costs a lot of
memory), so broadcast objects are now passed via files (similar to
parallelize()). An option was added to keep the object in the driver
(False by default) to save memory in the driver.
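The file-based hand-off can be sketched in a few lines. This is an illustration of the idea only, not PySpark's broadcast implementation; the function names and file layout are invented for the sketch.

```python
# Sketch: pass a large object through a file so that only a short path
# string, not the serialized data, crosses the process boundary.
import os
import pickle
import tempfile

def write_broadcast(value):
    # Serialize the value to a temp file and hand back only the path.
    fd, path = tempfile.mkstemp(suffix=".broadcast")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(value, f, protocol=pickle.HIGHEST_PROTOCOL)
    return path

def read_broadcast(path):
    # The worker side loads the value from disk on first use.
    with open(path, "rb") as f:
        return pickle.load(f)

path = write_broadcast(list(range(10000)))
assert read_broadcast(path) == list(range(10000))
os.remove(path)
```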
Author: Davies Liu <[email protected]>
Closes #1912 from davies/broadcast and squashes the following commits:
e06df4a [Davies Liu] load broadcast from disk in driver automatically
db3f232 [Davies Liu] fix serialization of accumulator
631a827 [Davies Liu] Merge branch 'master' into broadcast
c7baa8c [Davies Liu] compress serrialized broadcast and command
9a7161f [Davies Liu] fix doc tests
e93cf4b [Davies Liu] address comments: add test
6226189 [Davies Liu] improve large broadcast
(cherry picked from commit 2fc8aca086a2679b854038b7e2c488f19039ecbd)
Signed-off-by: Josh Rosen <[email protected]>
commit f02e327f0bc975e7f33092e449bc0edd95f95580
Author: GuoQiang Li <[email protected]>
Date: 2014-08-17T03:05:55Z
In the stop method of ConnectionManager to cancel the ackTimeoutMonitor
cc JoshRosen sarutak
Author: GuoQiang Li <[email protected]>
Closes #1989 from witgo/cancel_ackTimeoutMonitor and squashes the following
commits:
4a700fa [GuoQiang Li] In the stop method of ConnectionManager to cancel the
ackTimeoutMonitor
(cherry picked from commit bc95fe08dff62a0abea314ab4ab9275c8f119598)
Signed-off-by: Josh Rosen <[email protected]>
commit 413a329e186de2ec96f80f614c36678bee6f332f
Author: Xiangrui Meng <[email protected]>
Date: 2014-08-17T04:16:27Z
[SPARK-3077][MLLIB] fix some chisq-test
- promote nullHypothesis field in ChiSqTestResult to TestResult. Every test
should have a null hypothesis
- correct null hypothesis statement for independence test
- p-value: 0.01 -> 0.1
Author: Xiangrui Meng <[email protected]>
Closes #1982 from mengxr/fix-chisq and squashes the following commits:
5f0de02 [Xiangrui Meng] make ChiSqTestResult constructor package private
bc74ea1 [Xiangrui Meng] update chisq-test
(cherry picked from commit fbad72288d8b6e641b00417a544cae6e8bfef2d7)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 91af120b4391656cb8f7b2300202dc622c032c33
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-17T06:53:14Z
[SPARK-3042] [mllib] DecisionTree Filter top-down instead of bottom-up
DecisionTree needs to match each example to a node at each iteration. It
currently does this with a set of filters very inefficiently: For each example,
it examines each node at the current level and traces up to the root to see if
that example should be handled by that node.
Fix: Filter top-down using the partly built tree itself.
Major changes:
* Eliminated Filter class, findBinsForLevel() method.
* Set up node parent links in main loop over levels in train().
* Added predictNodeIndex() for filtering top-down.
* Added DTMetadata class
Other changes:
* Pre-compute set of unorderedFeatures.
Notes for following expected PR based on
[https://issues.apache.org/jira/browse/SPARK-3043]:
* The unorderedFeatures set will next be stored in a metadata structure to
simplify function calls (to store other items such as the data in strategy).
I've done initial tests indicating that this speeds things up, but am only
now running large-scale ones.
CC: mengxr manishamde chouqin Any comments are welcome---thanks!
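The top-down filtering idea, finding an example's node by following splits from the root through the partly built tree, can be sketched with a toy tree. The dict-based node shape here is invented for illustration and does not mirror MLlib's actual Node class.

```python
# Sketch of predictNodeIndex-style top-down assignment: instead of
# checking every node's filter chain per example, walk splits from the
# root until reaching a node that has not been split yet.
def predict_node_index(node, features):
    if node.get("split") is None:
        return node["id"]  # leaf / not-yet-split node: example lands here
    feature, threshold = node["split"]
    child = "left" if features[feature] <= threshold else "right"
    return predict_node_index(node[child], features)

tree = {
    "id": 0, "split": (0, 5.0),
    "left": {"id": 1, "split": None},
    "right": {"id": 2, "split": (1, 2.0),
              "left": {"id": 3, "split": None},
              "right": {"id": 4, "split": None}},
}
assert predict_node_index(tree, [3.0, 9.9]) == 1
assert predict_node_index(tree, [7.0, 1.0]) == 3
```

Each example costs one root-to-frontier walk per iteration, rather than a per-node scan of filter ancestry.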
Author: Joseph K. Bradley <[email protected]>
Closes #1975 from jkbradley/dt-opt2 and squashes the following commits:
a0ed0da [Joseph K. Bradley] Renamed DTMetadata to DecisionTreeMetadata.
Small doc updates.
3726d20 [Joseph K. Bradley] Small code improvements based on code review.
ac0b9f8 [Joseph K. Bradley] Small updates based on code review. Main
change: Now using << instead of math.pow.
db0d773 [Joseph K. Bradley] scala style fix
6a38f48 [Joseph K. Bradley] Added DTMetadata class for cleaner code
931a3a7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-opt2
797f68a [Joseph K. Bradley] Fixed DecisionTreeSuite bug for training second
level. Needed to update treePointToNodeIndex with groupShift.
f40381c [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
6b5651e [Joseph K. Bradley] Updates based on code review. 1 major change:
persisting to memory + disk, not just memory.
2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-opt1
26d10dd [Joseph K. Bradley] Removed tree/model/Filter.scala since no longer
used. Removed debugging println calls in DecisionTree.scala.
356daba [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
430d782 [Joseph K. Bradley] Added more debug info on binning error. Added
some docs.
d036089 [Joseph K. Bradley] Print timing info to logDebug.
e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods
private
8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own
file, and cleaned it up. Removed debugging println calls from DecisionTree.
Made TreePoint extend Serialiable
a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-opt1
c1565a5 [Joseph K. Bradley] Small DecisionTree updates: * Simplification:
Updated calculateGainForSplit to take aggregates for a single (feature, split)
pair. * Internal doc: findAggForOrderedFeatureClassification
b914f3b [Joseph K. Bradley] DecisionTree optimization: eliminated filters +
small changes
b2ed1f3 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-opt
0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint
representation to avoid calling findBin multiple times. * (not working yet, but
debugging)
f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-timing
bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-timing
511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-timing
a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
(cherry picked from commit 73ab7f141c205df277c6ac19252e590d6806c41f)
Signed-off-by: Xiangrui Meng <[email protected]>
commit d411f4190252546b0ea99c1934efd5e5f84be50c
Author: Patrick Wendell <[email protected]>
Date: 2014-08-17T22:48:39Z
SPARK-2881: Upgrade to Snappy 1.0.5.3 to avoid SPARK-2881.
This version of Snappy was released with a backported fix specifically
for Spark. This fixes an issue where names collide in the snappy .so
file when users are submitting jobs as different users on the same
cluster.
Author: Patrick Wendell <[email protected]>
Closes #1999 from pwendell/snappy-upgrade and squashes the following
commits:
38974ff [Patrick Wendell] SPARK-2881: Upgrade to Snappy 1.0.5.3 to avoid
SPARK-2881.
commit c6a0091ea401e0bec58d7607eb42be89cc090868
Author: Michael Armbrust <[email protected]>
Date: 2014-08-18T01:10:45Z
Revert "[SPARK-2970] [SQL] spark-sql script ends with IOException when
EventLogging is enabled"
Revert #1891 due to issues with hadoop 1 compatibility.
Author: Michael Armbrust <[email protected]>
Closes #2007 from marmbrus/revert1891 and squashes the following commits:
68706c0 [Michael Armbrust] Revert "[SPARK-2970] [SQL] spark-sql script ends
with IOException when EventLogging is enabled"
(cherry picked from commit 5ecb08ea063166564178885b7515abef0d76eecb)
Signed-off-by: Michael Armbrust <[email protected]>
commit 4f776dfab726f54c948a83a7157b958903c15ecf
Author: Michael Armbrust <[email protected]>
Date: 2014-08-18T02:00:38Z
[SQL] Improve debug logging and toStrings.
Author: Michael Armbrust <[email protected]>
Closes #2004 from marmbrus/codgenDebugging and squashes the following
commits:
b7a7e41 [Michael Armbrust] Improve debug logging and toStrings.
(cherry picked from commit bfa09b01d7eddc572cd22ca2e418a735b4ccc826)
Signed-off-by: Michael Armbrust <[email protected]>
commit 826356725ffb3189180f7879d3f9c449924785f3
Author: Chris Fregly <[email protected]>
Date: 2014-08-18T02:33:15Z
[SPARK-1981] updated streaming-kinesis.md
Fixed markup, separated out sections more clearly, added more thorough
explanations.
Author: Chris Fregly <[email protected]>
Closes #1757 from cfregly/master and squashes the following commits:
9b1c71a [Chris Fregly] better explained why spark checkpoints are disabled
in the example (due to no stateful operations being used)
0f37061 [Chris Fregly] SPARK-1981: (Kinesis streaming support) updated
streaming-kinesis.md
862df67 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
8e1ae2e [Chris Fregly] Merge remote-tracking branch 'upstream/master'
4774581 [Chris Fregly] updated docs, renamed retry to retryRandom to be
more clear, removed retries around store() method
0393795 [Chris Fregly] moved Kinesis examples out of examples/ and back
into extras/kinesis-asl
691a6be [Chris Fregly] fixed tests and formatting, fixed a bug with
JavaKinesisWordCount during union of streams
0e1c67b [Chris Fregly] Merge remote-tracking branch 'upstream/master'
74e5c7c [Chris Fregly] updated per TD's feedback. simplified examples,
updated docs
e33cbeb [Chris Fregly] Merge remote-tracking branch 'upstream/master'
bf614e9 [Chris Fregly] per matei's feedback: moved the kinesis examples
into the examples/ dir
d17ca6d [Chris Fregly] per TD's feedback: updated docs, simplified the
KinesisUtils api
912640c [Chris Fregly] changed the foundKinesis class to be a
publically-avail class
db3eefd [Chris Fregly] Merge remote-tracking branch 'upstream/master'
21de67f [Chris Fregly] Merge remote-tracking branch 'upstream/master'
6c39561 [Chris Fregly] parameterized the versions of the aws java sdk and
kinesis client
338997e [Chris Fregly] improve build docs for kinesis
828f8ae [Chris Fregly] more cleanup
e7c8978 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
cd68c0d [Chris Fregly] fixed typos and backward compatibility
d18e680 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
b3b0ff1 [Chris Fregly] [SPARK-1981] Add AWS Kinesis streaming support
(cherry picked from commit 99243288b049f4a4fb4ba0505ea2310be5eb4bd2)
Signed-off-by: Tathagata Das <[email protected]>
commit 8438daf2c2a04e48465fc2681d142ca5a6dec747
Author: Xiangrui Meng <[email protected]>
Date: 2014-08-18T03:53:18Z
[SPARK-3087][MLLIB] fix col indexing bug in chi-square and add a check for
number of distinct values
There is a bug determining the column index. dorx
Author: Xiangrui Meng <[email protected]>
Closes #1997 from mengxr/chisq-index and squashes the following commits:
8fc2ab2 [Xiangrui Meng] fix col indexing bug and add a check for number of
distinct values
(cherry picked from commit c77f40668fbb5b8bca9a9b25c039895cb7a4a80c)
Signed-off-by: Xiangrui Meng <[email protected]>
commit a5ae720745d744ec29741b49d2d362f362d53fa4
Author: Patrick Wendell <[email protected]>
Date: 2014-08-18T05:29:58Z
SPARK-2884: Create binary builds in parallel with release script.
commit 0506539b0e853d474183078814fb0f550bfbbd67
Author: Sandy Ryza <[email protected]>
Date: 2014-08-18T05:39:06Z
SPARK-2900. aggregate inputBytes per stage
Author: Sandy Ryza <[email protected]>
Closes #1826 from sryza/sandy-spark-2900 and squashes the following commits:
43f9091 [Sandy Ryza] SPARK-2900
(cherry picked from commit df652ea02a3e42d987419308ef14874300347373)
Signed-off-by: Patrick Wendell <[email protected]>
commit 708cde99a142c90f5a06c7aa326b622d80022e3d
Author: Liquan Pei <[email protected]>
Date: 2014-08-18T06:29:44Z
[SPARK-3097][MLlib] Word2Vec performance improvement
mengxr Please review the code. Adding weights in reduceByKey soon.
Only output model entries for words that appeared in the partition before
merging, and use reduceByKey to combine models. In general, this
implementation is 30s or so faster than the implementation using a big array.
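The merge strategy can be sketched as a commutative reduce over per-partition dicts, keeping each partial model small by emitting only words actually seen. This is a plain-Python analogy (functools.reduce standing in for reduceByKey), not the Word2Vec code itself.

```python
# Sketch: combine per-partition partial models word-by-word. A word
# missing from one side keeps the other side's vector, so sparse
# per-partition dicts stay small until the final merge.
from functools import reduce

def merge_models(a, b):
    merged = dict(a)
    for word, vec in b.items():
        if word in merged:
            merged[word] = [x + y for x, y in zip(merged[word], vec)]
        else:
            merged[word] = vec
    return merged

partitions = [
    {"spark": [1.0, 0.0]},
    {"spark": [0.5, 0.5], "graph": [0.0, 1.0]},
]
model = reduce(merge_models, partitions)
assert model == {"spark": [1.5, 0.5], "graph": [0.0, 1.0]}
```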
Author: Liquan Pei <[email protected]>
Closes #1932 from Ishiihara/Word2Vec-improve2 and squashes the following
commits:
d5377a9 [Liquan Pei] use syn0Global and syn1Global to represent model
cad2011 [Liquan Pei] bug fix for synModify array out of bound
083aa66 [Liquan Pei] update synGlobal in place and reduce synOut size
9075e1c [Liquan Pei] combine syn0Global and syn1Global to synGlobal
aa2ab36 [Liquan Pei] use reduceByKey to combine models
(cherry picked from commit 3c8fa505900ac158d57de36f6b0fd6da05f8893b)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 518258f1ba4d79a72e1a97ebebb1b51cd392c503
Author: Liquan Pei <[email protected]>
Date: 2014-08-18T06:30:47Z
[SPARK-2842][MLlib]Word2Vec documentation
mengxr
Documentation for Word2Vec
Author: Liquan Pei <[email protected]>
Closes #2003 from Ishiihara/Word2Vec-doc and squashes the following commits:
4ff11d4 [Liquan Pei] minor fix
8d7458f [Liquan Pei] code reformat
6df0dcb [Liquan Pei] add Word2Vec documentation
(cherry picked from commit eef779b8d631de971d440051cae21040f4de558f)
Signed-off-by: Xiangrui Meng <[email protected]>
commit e0bc333b6ad36feac5397600fe6948dcb37a8e44
Author: Liquan Pei <[email protected]>
Date: 2014-08-18T08:15:45Z
[MLlib] Remove transform(dataset: RDD[String]) from Word2Vec public API
mengxr
Remove transform(dataset: RDD[String]) from public API.
Author: Liquan Pei <[email protected]>
Closes #2010 from Ishiihara/Word2Vec-api and squashes the following commits:
17b1031 [Liquan Pei] remove transform(dataset: RDD[String]) from public API
(cherry picked from commit 9306b8c6c8c412b9d0d5cffb6bd7a87784f0f6bf)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 12f16ba3fa1f3cde9f43c094029017f4192b1bac
Author: Chandan Kumar <[email protected]>
Date: 2014-08-18T16:52:25Z
[SPARK-2862] histogram method fails on some choices of bucketCount
Author: Chandan Kumar <[email protected]>
Closes #1787 from nrchandan/spark-2862 and squashes the following commits:
a76bbf6 [Chandan Kumar] [SPARK-2862] Fix for a broken test case and add new
test cases
4211eea [Chandan Kumar] [SPARK-2862] Add Scala bug id
13854f1 [Chandan Kumar] [SPARK-2862] Use shorthand range notation to avoid
Scala bug
(cherry picked from commit f45efbb8aaa65bc46d65e77e93076fbc29f4455d)
Signed-off-by: Xiangrui Meng <[email protected]>
commit ec0b91edd592cf89be349e0e5ad7553e02f70cd3
Author: Patrick Wendell <[email protected]>
Date: 2014-08-18T17:00:46Z
SPARK-3096: Include parquet hive serde by default in build
A small change - we should just add this dependency. It doesn't have any
recursive deps and it's needed for reading hive parquet tables.
Author: Patrick Wendell <[email protected]>
Closes #2009 from pwendell/parquet and squashes the following commits:
e411f9f [Patrick Wendell] SPARk-309: Include parquet hive serde by default
in build
(cherry picked from commit 7ae28d1247e4756219016206c51fec1656e3917b)
Signed-off-by: Michael Armbrust <[email protected]>
commit 55e9dd637bdef3a2acf56af95410219e23c9502a
Author: Matei Zaharia <[email protected]>
Date: 2014-08-18T17:05:52Z
[SPARK-3084] [SQL] Collect broadcasted tables in parallel in joins
BroadcastHashJoin has a broadcastFuture variable that tries to collect
the broadcasted table in a separate thread, but this doesn't help
because it's a lazy val that only gets initialized when you attempt to
build the RDD. Thus queries that broadcast multiple tables would collect
and broadcast them sequentially. I changed this to a val to let it start
collecting right when the operator is created.
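The val vs. lazy val difference can be demonstrated outside Spark. Here Python futures stand in for Scala's eager `val` and plain calls stand in for `lazy val`; the timings and table names are illustrative only.

```python
# Sketch: eager initialization lets independent collections overlap;
# lazy initialization serializes them.
import time
from concurrent.futures import ThreadPoolExecutor

def collect_table(name):
    time.sleep(0.2)  # stand-in for collecting one broadcasted table
    return name + "-collected"

with ThreadPoolExecutor(max_workers=2) as pool:
    # Eager (like a plain `val`): both collections start immediately.
    start = time.time()
    futures = [pool.submit(collect_table, t) for t in ("a", "b")]
    eager_results = [f.result() for f in futures]
    eager_seconds = time.time() - start

# Lazy (like a `lazy val`): each collection starts only when its result
# is first demanded, so the two run back to back.
start = time.time()
lazy_results = [collect_table(t) for t in ("a", "b")]
lazy_seconds = time.time() - start

assert eager_results == lazy_results == ["a-collected", "b-collected"]
assert eager_seconds < lazy_seconds
```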
Author: Matei Zaharia <[email protected]>
Closes #1990 from mateiz/spark-3084 and squashes the following commits:
f468766 [Matei Zaharia] [SPARK-3084] Collect broadcasted tables in parallel
in joins
(cherry picked from commit 6a13dca12fac06f3af892ffcc8922cc84f91b786)
Signed-off-by: Michael Armbrust <[email protected]>
commit 4da76fc81c224b04bd652c4a72fb77516a32de0c
Author: Matei Zaharia <[email protected]>
Date: 2014-08-18T17:45:24Z
[SPARK-3085] [SQL] Use compact data structures in SQL joins
This reuses the CompactBuffer from Spark Core to save memory and pointer
dereferences. I also tried AppendOnlyMap instead of java.util.HashMap
but unfortunately that slows things down because it seems to do more
equals() calls and the equals on GenericRow, and especially JoinedRow,
is pretty expensive.
Author: Matei Zaharia <[email protected]>
Closes #1993 from mateiz/spark-3085 and squashes the following commits:
188221e [Matei Zaharia] Remove unneeded import
5f903ee [Matei Zaharia] [SPARK-3085] [SQL] Use compact data structures in
SQL joins
(cherry picked from commit 4bf3de71074053af94f077c99e9c65a1962739e1)
Signed-off-by: Michael Armbrust <[email protected]>
commit 496f62d9a98067256d8a51fd1e7a485ff6492fa8
Author: Patrick Wendell <[email protected]>
Date: 2014-08-18T17:52:20Z
SPARK-3025 [SQL]: Allow JDBC clients to set a fair scheduler pool
This definitely needs review as I am not familiar with this part of Spark.
I tested this locally and it did seem to work.
Author: Patrick Wendell <[email protected]>
Closes #1937 from pwendell/scheduler and squashes the following commits:
b858e33 [Patrick Wendell] SPARK-3025: Allow JDBC clients to set a fair
scheduler pool
(cherry picked from commit 6bca8898a1aa4ca7161492229bac1748b3da2ad7)
Signed-off-by: Michael Armbrust <[email protected]>
commit 2ae2857986e94d5a8bd5f4660eabe5689463bd21
Author: Matei Zaharia <[email protected]>
Date: 2014-08-18T18:00:10Z
[SPARK-3091] [SQL] Add support for caching metadata on Parquet files
For larger Parquet files, reading the file footers (which is done in
parallel on up to 5 threads) and HDFS block locations (which is serial) can
take multiple seconds. We can add an option to cache this data within
FilteringParquetInputFormat. Unfortunately ParquetInputFormat only caches
footers within each instance of ParquetInputFormat, not across them.
Note: this PR leaves this turned off by default for 1.1, but I believe it's
safe to turn it on after. The keys in the hash maps are FileStatus objects that
include a modification time, so this will work fine if files are modified. The
location cache could become invalid if files have moved within HDFS, but that's
rare so I just made it invalidate entries every 15 minutes.
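The time-based invalidation described above (entries expiring after 15 minutes) can be sketched as a minimal TTL cache. The class and names here are invented for illustration; the real change uses Guava caches inside FilteringParquetInputFormat.

```python
# Sketch: cache expensive lookups (e.g. file footers, block locations)
# and expire entries after a fixed TTL so stale data self-corrects.
import time

class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (inserted_at, value)

    def get(self, key, loader):
        now = time.time()
        hit = self.entries.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]          # fresh entry: skip the expensive load
        value = loader()           # e.g. read footer / block locations
        self.entries[key] = (now, value)
        return value

cache = TTLCache(ttl_seconds=15 * 60)
calls = []
loader = lambda: calls.append(1) or "locations"
assert cache.get("part-0.parquet", loader) == "locations"
assert cache.get("part-0.parquet", loader) == "locations"
assert len(calls) == 1  # second lookup served from cache
```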
Author: Matei Zaharia <[email protected]>
Closes #2005 from mateiz/parquet-cache and squashes the following commits:
dae8efe [Matei Zaharia] Bug fix
c71e9ed [Matei Zaharia] Handle empty statuses directly
22072b0 [Matei Zaharia] Use Guava caches and add a config option for
caching metadata
8fb56ce [Matei Zaharia] Cache file block locations too
453bd21 [Matei Zaharia] Bug fix
4094df6 [Matei Zaharia] First attempt at caching Parquet footers
(cherry picked from commit 9eb74c7d2cbe127dd4c32bf1a8318497b2fb55b6)
Signed-off-by: Michael Armbrust <[email protected]>
commit cc4015d2fa3785b92e6ab079b3abcf17627f7c56
Author: Michael Armbrust <[email protected]>
Date: 2014-08-18T20:17:10Z
[SPARK-2406][SQL] Initial support for using ParquetTableScan to read
HiveMetaStore tables.
This PR adds an experimental flag `spark.sql.hive.convertMetastoreParquet`
that, when true, causes the planner to detect tables that use Hive's Parquet
SerDe and instead plan them using Spark SQL's native `ParquetTableScan`.
Author: Michael Armbrust <[email protected]>
Author: Yin Huai <[email protected]>
Closes #1819 from marmbrus/parquetMetastore and squashes the following
commits:
1620079 [Michael Armbrust] Revert "remove hive parquet bundle"
cc30430 [Michael Armbrust] Merge remote-tracking branch 'origin/master'
into parquetMetastore
4f3d54f [Michael Armbrust] fix style
41ebc5f [Michael Armbrust] remove hive parquet bundle
a43e0da [Michael Armbrust] Merge remote-tracking branch 'origin/master'
into parquetMetastore
4c4dc19 [Michael Armbrust] Fix bug with tree splicing.
ebb267e [Michael Armbrust] include parquet hive to tests pass (Remove this
later).
c0d9b72 [Michael Armbrust] Avoid creating a HadoopRDD per partition. Add
dirty hacks to retrieve partition values from the InputSplit.
8cdc93c [Michael Armbrust] Merge pull request #8 from yhuai/parquetMetastore
a0baec7 [Yin Huai] Partitioning columns can be resolved.
1161338 [Michael Armbrust] Add a test to make sure conversion is actually
happening
212d5cd [Michael Armbrust] Initial support for using ParquetTableScan to
read HiveMetaStore tables.
(cherry picked from commit 3abd0c1cda09bb575adc99847a619bc84af37fd0)
Signed-off-by: Michael Armbrust <[email protected]>
commit e083334634ca0d7a25dee864fb2b9558ee92a2f7
Author: Davies Liu <[email protected]>
Date: 2014-08-18T20:58:35Z
[SPARK-3103] [PySpark] fix saveAsTextFile() with utf-8
Bugfix: it raised an exception when it tried to encode non-ASCII strings
into unicode. It should only encode unicode as "utf-8".
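The fix's logic can be sketched in Python 3 terms: encode only text, and pass already-encoded bytes through untouched, instead of blindly encoding every record. This is an illustration of the principle, not the PySpark code itself.

```python
# Sketch: encode unicode text as utf-8; leave bytes alone. Blindly
# calling .encode on everything raises for non-ASCII byte data.
def to_utf8(record):
    if isinstance(record, str):
        return record.encode("utf-8")
    return record  # already bytes: pass through unchanged

assert to_utf8("héllo") == "héllo".encode("utf-8")
assert to_utf8(b"raw bytes") == b"raw bytes"
```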
Author: Davies Liu <[email protected]>
Closes #2018 from davies/fix_utf8 and squashes the following commits:
4db7967 [Davies Liu] fix saveAsTextFile() with utf-8
(cherry picked from commit d1d0ee41c27f1d07fed0c5d56ba26c723cc3dc26)
Signed-off-by: Josh Rosen <[email protected]>
commit 25cabd7eec6e499fce94bce0d45087e9d8726a50
Author: Marcelo Vanzin <[email protected]>
Date: 2014-08-18T21:10:10Z
[SPARK-2718] [yarn] Handle quotes and other characters in user args.
Due to the way Yarn runs things through bash, normal quoting doesn't
work as expected. This change applies the necessary voodoo to the user
args to avoid issues with bash and special characters.
The change also uncovered an issue with the event logger app name
sanitizing code; it wasn't cleaning up all "bad" characters, so
sometimes it would fail to create the log dirs. I just added some
more bad character replacements.
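The kind of escaping needed here can be shown with Python's standard shell escaper as an analogy; shlex.quote plays the role of the quoting the change applies to Yarn user args before bash interprets them. The example arg is invented.

```python
# Sketch: an argument containing quotes, spaces, variables, and
# backslashes survives shell tokenization only if properly escaped.
import shlex

user_arg = 'it\'s a "test" with $HOME and \\backslashes\\'

# Quoted, the arg comes back from shell word-splitting as a single,
# literal word; unquoted, the shell would split and expand it.
quoted = shlex.quote(user_arg)
assert shlex.split(quoted) == [user_arg]
```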
Author: Marcelo Vanzin <[email protected]>
Closes #1724 from vanzin/SPARK-2718 and squashes the following commits:
cc84b89 [Marcelo Vanzin] Review feedback.
c1a257a [Marcelo Vanzin] Add test for backslashes.
55571d4 [Marcelo Vanzin] Unbreak yarn-client.
515613d [Marcelo Vanzin] [SPARK-2718] [yarn] Handle quotes and other
characters in user args.
(cherry picked from commit 6201b27643023569e19b68aa9d5c4e4e59ce0d79)
Signed-off-by: Andrew Or <[email protected]>
commit 98778fffdb4e11593149eb7770071a0728653f19
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-18T21:40:05Z
[mllib] DecisionTree: treeAggregate + Python example bug fix
Small DecisionTree updates:
* Changed main DecisionTree aggregate to treeAggregate.
* Fixed bug in python example decision_tree_runner.py with missing argument
(since categoricalFeaturesInfo is no longer an optional argument for
trainClassifier).
* Fixed same bug in python doc tests, and added tree.py to doc tests.
CC: mengxr
Author: Joseph K. Bradley <[email protected]>
Closes #2015 from jkbradley/dt-opt2 and squashes the following commits:
b5114fa [Joseph K. Bradley] Fixed python tree.py doc test (extra newline)
8e4665d [Joseph K. Bradley] Added tree.py to python doc tests. Fixed bug
from missing categoricalFeaturesInfo argument.
b7b2922 [Joseph K. Bradley] Fixed bug in python example
decision_tree_runner.py with missing argument. Changed main DecisionTree
aggregate to treeAggregate.
85bbc1f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-opt2
66d076f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-opt2
a0ed0da [Joseph K. Bradley] Renamed DTMetadata to DecisionTreeMetadata.
Small doc updates.
3726d20 [Joseph K. Bradley] Small code improvements based on code review.
ac0b9f8 [Joseph K. Bradley] Small updates based on code review. Main
change: Now using << instead of math.pow.
db0d773 [Joseph K. Bradley] scala style fix
6a38f48 [Joseph K. Bradley] Added DTMetadata class for cleaner code
931a3a7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-opt2
797f68a [Joseph K. Bradley] Fixed DecisionTreeSuite bug for training second
level. Needed to update treePointToNodeIndex with groupShift.
f40381c [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
6b5651e [Joseph K. Bradley] Updates based on code review. 1 major change:
persisting to memory + disk, not just memory.
2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-opt1
26d10dd [Joseph K. Bradley] Removed tree/model/Filter.scala since no longer
used. Removed debugging println calls in DecisionTree.scala.
356daba [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
430d782 [Joseph K. Bradley] Added more debug info on binning error. Added
some docs.
d036089 [Joseph K. Bradley] Print timing info to logDebug.
e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods
private
8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own
file, and cleaned it up. Removed debugging println calls from DecisionTree.
Made TreePoint extend Serialiable
a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-opt1
c1565a5 [Joseph K. Bradley] Small DecisionTree updates: * Simplification:
Updated calculateGainForSplit to take aggregates for a single (feature, split)
pair. * Internal doc: findAggForOrderedFeatureClassification
b914f3b [Joseph K. Bradley] DecisionTree optimization: eliminated filters +
small changes
b2ed1f3 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-opt
0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint
representation to avoid calling findBin multiple times. * (not working yet, but
debugging)
f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-timing
bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-timing
511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-timing
a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
(cherry picked from commit 115eeb30dd9c9dd10685a71f2c23ca23794d3142)
Signed-off-by: Xiangrui Meng <[email protected]>
commit e3f89e971b117e11d15e4b9b47e63da55f4e488b
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-19T01:01:39Z
[SPARK-2850] [SPARK-2626] [mllib] MLlib stats examples + small fixes
Added examples for statistical summarization:
* Scala: StatisticalSummary.scala
** Tests: correlation, MultivariateOnlineSummarizer
* python: statistical_summary.py
** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
Added examples for random and sampled RDDs:
* Scala: RandomAndSampledRDDs.scala
* python: random_and_sampled_rdds.py
* Both test:
** RandomRDDGenerators.normalRDD, normalVectorRDD
** RDD.sample, takeSample, sampleByKey
Added sc.stop() to all examples.
CorrelationSuite.scala
* Added 1 test for RDDs with only 1 value
RowMatrix.scala
* numCols(): Added check for numRows = 0, with error message.
* computeCovariance(): Added check for numRows <= 1, with error message.
Python SparseVector (pyspark/mllib/linalg.py)
* Added toDense() function
python/run-tests script
* Added stat.py (doc test)
CC: mengxr dorx Main changes were examples to show usage across APIs.
Author: Joseph K. Bradley <[email protected]>
Closes #1878 from jkbradley/mllib-stats-api-check and squashes the
following commits:
ea5c047 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into mllib-stats-api-check
dafebe2 [Joseph K. Bradley] Bug fixes for examples SampledRDDs.scala and
sampled_rdds.py: Check for division by 0 and for missing key in maps.
8d1e555 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into mllib-stats-api-check
60c72d9 [Joseph K. Bradley] Fixed stat.py doc test to work for Python
versions printing nan or NaN.
b20d90a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into mllib-stats-api-check
4e5d15e [Joseph K. Bradley] Changed pyspark/mllib/stat.py doc tests to use
NaN instead of nan.
32173b7 [Joseph K. Bradley] Stats examples update.
c8c20dc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into mllib-stats-api-check
cf70b07 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into mllib-stats-api-check
0b7cec3 [Joseph K. Bradley] Small updates based on code review. Renamed
statistical_summary.py to correlations.py
ab48f6e [Joseph K. Bradley] RowMatrix.scala * numCols(): Added check for
numRows = 0, with error message. * computeCovariance(): Added check for numRows
<= 1, with error message.
65e4ebc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into mllib-stats-api-check
8195c78 [Joseph K. Bradley] Added examples for random and sampled RDDs: *
Scala: RandomAndSampledRDDs.scala * python: random_and_sampled_rdds.py * Both
test: ** RandomRDDGenerators.normalRDD, normalVectorRDD ** RDD.sample,
takeSample, sampleByKey
064985b [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into mllib-stats-api-check
ee918e9 [Joseph K. Bradley] Added examples for statistical summarization: *
Scala: StatisticalSummary.scala ** Tests: correlation,
MultivariateOnlineSummarizer * python: statistical_summary.py ** Tests:
correlation (since MultivariateOnlineSummarizer has no Python API)
(cherry picked from commit c8b16ca0d86cc60fb960eebf0cb383f159a88b03)
Signed-off-by: Xiangrui Meng <[email protected]>
----