GitHub user changkaibo opened a pull request:
https://github.com/apache/spark/pull/8221
Update to the latest code
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark branch-1.5
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/8221.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #8221
----
commit 4c4f638c7333b44049c75ae34486148ab74db333
Author: Patrick Wendell <[email protected]>
Date: 2015-08-03T23:54:50Z
Preparing Spark release v1.5.0-snapshot-20150803
commit bc49ca468d3abe4949382a32de92f963f454d36a
Author: Patrick Wendell <[email protected]>
Date: 2015-08-03T23:54:56Z
Preparing development version 1.5.0-SNAPSHOT
commit 7e7147f3b8fee3ac4f2f1d14c3e6776a4d76bb3a
Author: Patrick Wendell <[email protected]>
Date: 2015-08-03T23:59:13Z
Preparing Spark release v1.5.0-snapshot-20150803
commit 74792e71cb0584637041cb81660ec3ac4ea10c0b
Author: Patrick Wendell <[email protected]>
Date: 2015-08-03T23:59:19Z
Preparing development version 1.5.0-SNAPSHOT
commit 73c863ac8e8f6cf664f51c64da1da695f341b273
Author: Matthew Brandyberry <[email protected]>
Date: 2015-08-04T00:36:56Z
[SPARK-9483] Fix UTF8String.getPrefix for big-endian.
Previous code assumed little-endian.
Author: Matthew Brandyberry <[email protected]>
Closes #7902 from mtbrandy/SPARK-9483 and squashes the following commits:
ec31df8 [Matthew Brandyberry] [SPARK-9483] Changes from review comments.
17d54c6 [Matthew Brandyberry] [SPARK-9483] Fix UTF8String.getPrefix for
big-endian.
(cherry picked from commit b79b4f5f2251ed7efeec1f4b26e45a8ea6b85a6a)
Signed-off-by: Davies Liu <[email protected]>
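The byte-order issue fixed in this commit can be illustrated with a small Python sketch (not Spark's actual code): the prefix trick packs the first up-to-8 bytes of a string into a 64-bit integer so that integer comparison agrees with lexicographic byte order, which only holds when the bytes are interpreted big-endian.

```python
def get_prefix(data: bytes) -> int:
    """Pack the first 8 bytes into an unsigned 64-bit int, zero-padded.

    Interpreting the bytes big-endian preserves lexicographic order;
    code that reads raw memory as a native long must byte-swap on
    little-endian hosts -- the assumption the pre-fix code baked in.
    """
    return int.from_bytes(data[:8].ljust(8, b"\x00"), byteorder="big")

# Integer comparison of prefixes matches byte-wise string comparison.
assert (get_prefix(b"abc") < get_prefix(b"abd")) == (b"abc" < b"abd")
assert get_prefix(b"apple") < get_prefix(b"banana")
```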
commit 34335719a372c1951fdb4dd25b75b086faf1076f
Author: Burak Yavuz <[email protected]>
Date: 2015-08-04T00:42:03Z
[SPARK-9263] Added flags to exclude dependencies when using --packages
While the functionality is there to exclude packages, there are no flags
that allow users to exclude dependencies, in case of dependency conflicts. We
should provide users with a flag to add dependency exclusions in case the
packages are not resolved properly (or not available due to licensing).
The flag I added was --packages-exclude, but I'm open to renaming it. I
also added property flags in case people would like to use a conf file to
provide dependencies, which is possible if there is a long list of dependencies
or exclusions.
cc andrewor14 vanzin pwendell
Author: Burak Yavuz <[email protected]>
Closes #7599 from brkyvz/packages-exclusions and squashes the following
commits:
636f410 [Burak Yavuz] addressed nits
6e54ede [Burak Yavuz] is this the culprit
b5e508e [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into
packages-exclusions
154f5db [Burak Yavuz] addressed initial comments
1536d7a [Burak Yavuz] Added flags to exclude packages using
--packages-exclude
(cherry picked from commit 1633d0a2612d94151f620c919425026150e69ae1)
Signed-off-by: Marcelo Vanzin <[email protected]>
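A hypothetical invocation using the flag as it is named in this PR (the option may have been renamed before merging, and the Maven coordinates below are purely illustrative; check `spark-submit --help` on your version):

```shell
# Pull a package but exclude a conflicting transitive dependency.
# Flag name as proposed in this PR; coordinates are examples only.
spark-submit \
  --packages com.example:some-connector_2.10:1.0.0 \
  --packages-exclude org.example:conflicting-dep \
  my_app.py
```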
commit 93076ae39b58ba8c4a459f2b3a8590c492dc5c4e
Author: CodingCat <[email protected]>
Date: 2015-08-04T01:20:40Z
[SPARK-8416] highlight and topping the executor threads in thread dumping
page
https://issues.apache.org/jira/browse/SPARK-8416
To facilitate debugging, I made this patch with three changes:
* render the executor-thread and non executor-thread entries with different
background colors
* put the executor threads on the top of the list
* sort the threads alphabetically
Author: CodingCat <[email protected]>
Closes #7808 from CodingCat/SPARK-8416 and squashes the following commits:
34fc708 [CodingCat] fix className
d7b79dd [CodingCat] lowercase threadName
d032882 [CodingCat] sort alphabetically and change the css class name
f0513b1 [CodingCat] change the color & group threads by name
2da6e06 [CodingCat] small fix
3fc9f36 [CodingCat] define classes in webui.css
8ee125e [CodingCat] highlight and put on top the executor threads in thread
dumping page
(cherry picked from commit 3b0e44490aebfba30afc147e4a34a63439d985c6)
Signed-off-by: Josh Rosen <[email protected]>
commit ebe42b98c8fa0cac6ec267e895402cebe8a670a9
Author: Reynold Xin <[email protected]>
Date: 2015-08-04T01:47:02Z
[SPARK-9577][SQL] Surface concrete iterator types in various sort classes.
We often return abstract iterator types in various sort-related classes
(e.g. UnsafeKVExternalSorter). It is actually better to return a more concrete
type, so the callsite uses that type and JIT can inline the iterator calls.
Author: Reynold Xin <[email protected]>
Closes #7911 from rxin/surface-concrete-type and squashes the following
commits:
0422add [Reynold Xin] [SPARK-9577][SQL] Surface concrete iterator types in
various sort classes.
(cherry picked from commit 5eb89f67e323dcf9fa3d5b30f9b5cb8f10ca1e8c)
Signed-off-by: Reynold Xin <[email protected]>
commit 1f7dbcd6fdeee22c7b670ea98dcb4e794f84a8cd
Author: Sean Owen <[email protected]>
Date: 2015-08-04T04:48:22Z
[SPARK-9521] [DOCS] Addendum. Require Maven 3.3.3+ in the build
Follow on for #7852: Building Spark doc needs to refer to new Maven
requirement too
Author: Sean Owen <[email protected]>
Closes #7905 from srowen/SPARK-9521.2 and squashes the following commits:
73285df [Sean Owen] Follow on for #7852: Building Spark doc needs to refer
to new Maven requirement too
(cherry picked from commit 0afa6fbf525723e97c6dacfdba3ad1762637ffa9)
Signed-off-by: Sean Owen <[email protected]>
commit 29f2d5a065254e7ed44fb204a1deecf9d44d338c
Author: Ankur Dave <[email protected]>
Date: 2015-08-04T06:07:32Z
[SPARK-3190] [GRAPHX] Fix VertexRDD.count() overflow regression
SPARK-3190 was originally fixed by
96df92906978c5f58e0cc8ff5eebe5b35a08be3b, but
a5ef58113667ff73562ce6db381cff96a0b354b0 introduced a regression during
refactoring. This commit fixes the regression.
Author: Ankur Dave <[email protected]>
Closes #7923 from ankurdave/SPARK-3190-reopening and squashes the following
commits:
a3e1b23 [Ankur Dave] Fix VertexRDD.count() overflow regression
(cherry picked from commit 9e952ecbce670e9b532a1c664a4d03b66e404112)
Signed-off-by: Reynold Xin <[email protected]>
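The overflow class of bug fixed here is easy to see in a sketch: summing per-partition counts in a 32-bit Int wraps past 2**31 - 1, while a 64-bit Long accumulator does not. Python ints do not overflow, so we emulate 32-bit wraparound (this illustrates the failure mode, not the actual GraphX code):

```python
def to_int32(x: int) -> int:
    """Emulate Java's 32-bit signed Int wraparound."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

partition_counts = [1_500_000_000, 1_500_000_000]  # 3 billion total

broken = 0
for c in partition_counts:
    broken = to_int32(broken + c)   # Int accumulator: wraps around

fixed = sum(partition_counts)       # Long-style accumulator: exact

assert broken < 0                   # overflowed into negative territory
assert fixed == 3_000_000_000
```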
commit 5ae675360d883483e509788b8867c1c98b4820fd
Author: Sean Owen <[email protected]>
Date: 2015-08-04T11:02:26Z
[SPARK-9534] [BUILD] Enable javac lint for scalac parity; fix a lot of
build warnings, 1.5.0 edition
Enable most javac lint warnings; fix a lot of build warnings. In a few
cases, touch up surrounding code in the process.
I'll explain several of the changes inline in comments.
Author: Sean Owen <[email protected]>
Closes #7862 from srowen/SPARK-9534 and squashes the following commits:
ea51618 [Sean Owen] Enable most javac lint warnings; fix a lot of build
warnings. In a few cases, touch up surrounding code in the process.
(cherry picked from commit 76d74090d60f74412bd45487e8db6aff2e8343a2)
Signed-off-by: Sean Owen <[email protected]>
commit bd9b7521343c34c42be40ee05a01c8a976ed2307
Author: tedyu <[email protected]>
Date: 2015-08-04T11:22:53Z
[SPARK-8064] [BUILD] Follow-up. Undo change from SPARK-9507 that was
accidentally reverted
This PR removes the dependency reduced POM hack brought back by #7191
Author: tedyu <[email protected]>
Closes #7919 from tedyu/master and squashes the following commits:
1bfbd7b [tedyu] [BUILD] Remove dependency reduced POM hack
(cherry picked from commit b211cbc7369af5eb2cb65d93c4c57c4db7143f47)
Signed-off-by: Sean Owen <[email protected]>
commit 45c8d2bb872bb905a402cf3aa78b1c4efaac07cf
Author: Carson Wang <[email protected]>
Date: 2015-08-04T13:12:30Z
[SPARK-2016] [WEBUI] RDD partition table pagination for the RDD Page
Add pagination for the RDD page to avoid unresponsive UI when the number of
the RDD partitions is large.
Before: [screenshot omitted]
After: [screenshot omitted]
Author: Carson Wang <[email protected]>
Closes #7692 from carsonwang/SPARK-2016 and squashes the following commits:
03c7168 [Carson Wang] Fix style issues
612c18c [Carson Wang] RDD partition table pagination for the RDD Page
(cherry picked from commit cb7fa0aa93dae5a25a8e7e387dbd6b55a5a23fb0)
Signed-off-by: Kousuke Saruta <[email protected]>
commit f44b27a2b92da2325ed9389cd27b6e2cfd9ec486
Author: Marcelo Vanzin <[email protected]>
Date: 2015-08-04T13:19:11Z
[SPARK-9583] [BUILD] Do not print mvn debug messages to stdout.
This allows build/mvn to be used by make-distribution.sh.
Author: Marcelo Vanzin <[email protected]>
Closes #7915 from vanzin/SPARK-9583 and squashes the following commits:
6469e60 [Marcelo Vanzin] [SPARK-9583] [build] Do not print mvn debug
messages to stdout.
(cherry picked from commit d702d53732b44e8242448ce5302738bd130717d8)
Signed-off-by: Kousuke Saruta <[email protected]>
commit 945da3534762a73fe7ffc52c868ff07a0783502b
Author: Tarek Auel <[email protected]>
Date: 2015-08-04T15:59:42Z
[SPARK-8244] [SQL] string function: find in set
This PR is based on #7186 (just fix the conflict), thanks to tarekauel .
find_in_set(string str, string strList): int
Returns the first occurrence of str in strList where strList is a
comma-delimited string. Returns null if either argument is null. Returns 0 if
the first argument contains any commas. For example, find_in_set('ab',
'abc,b,ab,c,def') returns 3.
Only add this to SQL, not DataFrame.
Closes #7186
Author: Tarek Auel <[email protected]>
Author: Davies Liu <[email protected]>
Closes #7900 from davies/find_in_set and squashes the following commits:
4334209 [Davies Liu] Merge branch 'master' of github.com:apache/spark into
find_in_set
8f00572 [Davies Liu] Merge branch 'master' of github.com:apache/spark into
find_in_set
243ede4 [Tarek Auel] [SPARK-8244][SQL] hive compatibility
1aaf64e [Tarek Auel] [SPARK-8244][SQL] unit test fix
e4093a4 [Tarek Auel] [SPARK-8244][SQL] final modifier for COMMA_UTF8
0d05df5 [Tarek Auel] Merge branch 'master' into SPARK-8244
208d710 [Tarek Auel] [SPARK-8244] address comments & bug fix
71b2e69 [Tarek Auel] [SPARK-8244] find_in_set
66c7fda [Tarek Auel] Merge branch 'master' into SPARK-8244
61b8ca2 [Tarek Auel] [SPARK-8224] removed loop and split; use unsafe String
comparison
4f75a65 [Tarek Auel] Merge branch 'master' into SPARK-8244
e3b20c8 [Tarek Auel] [SPARK-8244] added type check
1c2bbb7 [Tarek Auel] [SPARK-8244] findInSet
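The find_in_set contract described above can be sketched in plain Python (an illustration of the SQL function's semantics, not Spark's implementation; the not-found case returning 0 follows the usual Hive contract, which the PR text does not spell out):

```python
def find_in_set(s, str_list):
    """1-based position of s in the comma-delimited str_list.

    None if either argument is None; 0 if s contains a comma
    or (by the assumed Hive contract) is not present at all.
    """
    if s is None or str_list is None:
        return None
    if "," in s:
        return 0
    parts = str_list.split(",")
    return parts.index(s) + 1 if s in parts else 0

assert find_in_set("ab", "abc,b,ab,c,def") == 3   # example from the PR
assert find_in_set("a,b", "abc,b,ab,c,def") == 0  # comma in first arg
assert find_in_set(None, "abc") is None
```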
commit b42e13dca38c6e9ff9cf879bcb52efa681437120
Author: Davies Liu <[email protected]>
Date: 2015-08-04T16:07:09Z
[SPARK-8246] [SQL] Implement get_json_object
This is based on #7485 , thanks to NathanHowell
Tests were copied from Hive, but do not seem to be super comprehensive.
I've generally replicated Hive's unusual behavior rather than following a
JSONPath reference, except for one case (as noted in the comments). I don't
know if there is a way of fully replicating Hive's behavior without a slower
TreeNode implementation, so I've erred on the side of performance instead.
Author: Davies Liu <[email protected]>
Author: Yin Huai <[email protected]>
Author: Nathan Howell <[email protected]>
Closes #7901 from davies/get_json_object and squashes the following commits:
3ace9b9 [Davies Liu] Merge branch 'get_json_object' of
github.com:davies/spark into get_json_object
98766fc [Davies Liu] Merge branch 'master' of github.com:apache/spark into
get_json_object
a7dc6d0 [Davies Liu] Update JsonExpressionsSuite.scala
c818519 [Yin Huai] new results.
18ce26b [Davies Liu] fix tests
6ac29fb [Yin Huai] Golden files.
25eebef [Davies Liu] use HiveQuerySuite
e0ac6ec [Yin Huai] Golden answer files.
940c060 [Davies Liu] tweat code style
44084c5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into
get_json_object
9192d09 [Nathan Howell] Match Hive's behavior for unwrapping arrays of
one element
8dab647 [Nathan Howell] [SPARK-8246] [SQL] Implement get_json_object
(cherry picked from commit 73dedb589d06f7c7a525cc4f07721a77f480c434)
Signed-off-by: Davies Liu <[email protected]>
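For simple "$.field" paths, what get_json_object computes can be sketched with Python's json module; Spark's implementation streams JSON tokens and replicates Hive's quirks, which this toy version does not attempt:

```python
import json

def get_json_object(json_str, path):
    """Extract a value by a dotted path like '$.a.b'; None if missing.

    Toy subset: object field access only, no array indexing or
    wildcards, and no attempt to mimic Hive's edge-case behavior.
    """
    if not path.startswith("$."):
        return None
    obj = json.loads(json_str)
    for key in path[2:].split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj

assert get_json_object('{"a": {"b": 7}}', "$.a.b") == 7
assert get_json_object('{"a": 1}', "$.missing") is None
```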
commit d875368edd7265cedf808c921c0af0deb4895a67
Author: Yijie Shen <[email protected]>
Date: 2015-08-04T16:09:52Z
[SPARK-9541] [SQL] DataTimeUtils cleanup
JIRA: https://issues.apache.org/jira/browse/SPARK-9541
Author: Yijie Shen <[email protected]>
Closes #7870 from yjshen/datetime_cleanup and squashes the following
commits:
9203e33 [Yijie Shen] revert getMonth & getDayOfMonth
5cad119 [Yijie Shen] rebase code
7d62a74 [Yijie Shen] remove tmp tuple inside split date
e98aaac [Yijie Shen] DataTimeUtils cleanup
(cherry picked from commit b5034c9c59947f20423faa46bc6606aad56836b0)
Signed-off-by: Davies Liu <[email protected]>
commit aa8390dfcbb45eeff3d5894cf9b2edbd245b7320
Author: Shivaram Venkataraman <[email protected]>
Date: 2015-08-04T16:40:07Z
[SPARK-9562] Change reference to amplab/spark-ec2 from mesos/
cc srowen pwendell nchammas
Author: Shivaram Venkataraman <[email protected]>
Closes #7899 from shivaram/spark-ec2-move and squashes the following
commits:
7cc22c9 [Shivaram Venkataraman] Change reference to amplab/spark-ec2 from
mesos/
(cherry picked from commit 6a0f8b994de36b7a7bdfb9958d39dbd011776107)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit a9277cd5aedd570f550e2a807768c8ffada9576f
Author: Michael Armbrust <[email protected]>
Date: 2015-08-04T17:07:53Z
[SPARK-9512][SQL] Revert SPARK-9251, Allow evaluation while sorting
The analysis rule has a bug and we ended up making the sorter still capable
of doing evaluation, so let's revert this for now.
Author: Michael Armbrust <[email protected]>
Closes #7906 from marmbrus/revertSortProjection and squashes the following
commits:
2da6972 [Michael Armbrust] unrevert unrelated changes
4f2b00c [Michael Armbrust] Revert "[SPARK-9251][SQL] do not order by
expressions which still need evaluation"
(cherry picked from commit 34a0eb2e89d59b0823efc035ddf2dc93f19540c1)
Signed-off-by: Michael Armbrust <[email protected]>
commit c5250ddc5242a071549e980f69fa8bd785168979
Author: Holden Karau <[email protected]>
Date: 2015-08-04T17:12:22Z
[SPARK-8069] [ML] Add multiclass thresholds for ProbabilisticClassifier
This PR replaces the old "threshold" with a generalized "thresholds" Param.
We keep getThreshold,setThreshold for backwards compatibility for binary
classification.
Note that the primary author of this PR is holdenk
Author: Holden Karau <[email protected]>
Author: Joseph K. Bradley <[email protected]>
Closes #7909 from
jkbradley/holdenk-SPARK-8069-add-cutoff-aka-threshold-to-random-forest and
squashes the following commits:
3952977 [Joseph K. Bradley] fixed pyspark doc test
85febc8 [Joseph K. Bradley] made python unit tests a little more robust
7eb1d86 [Joseph K. Bradley] small cleanups
6cc2ed8 [Joseph K. Bradley] Fixed remaining merge issues.
0255e44 [Joseph K. Bradley] Many cleanups for thresholds, some more tests
7565a60 [Holden Karau] fix pep8 style checks, add a getThreshold method
similar to our LogisticRegression.scala one for API compat
be87f26 [Holden Karau] Convert threshold to thresholds in the python code,
add specialized support for Array[Double] to shared parems codegen, etc.
6747dad [Holden Karau] Override raw2prediction for ProbabilisticClassifier,
fix some tests
25df168 [Holden Karau] Fix handling of thresholds in LogisticRegression
c02d6c0 [Holden Karau] No default for thresholds
5e43628 [Holden Karau] CR feedback and fixed the renamed test
f3fbbd1 [Holden Karau] revert the changes to random forest :(
51f581c [Holden Karau] Add explicit types to public methods, fix long line
f7032eb [Holden Karau] Fix a java test bug, remove some unecessary changes
adf15b4 [Holden Karau] rename the classifier suite test to
ProbabilisticClassifierSuite now that we only have it in Probabilistic
398078a [Holden Karau] move the thresholding around a bunch based on the
design doc
4893bdc [Holden Karau] Use numtrees of 3 since previous result was tied
(one tree for each) and the switch from different max methods picked a
different element (since they were equal I think this is ok)
638854c [Holden Karau] Add a scala RandomForestClassifierSuite test based
on corresponding python test
e09919c [Holden Karau] Fix return type, I need more coffee....
8d92cac [Holden Karau] Use ClassifierParams as the head
3456ed3 [Holden Karau] Add explicit return types even though just test
a0f3b0c [Holden Karau] scala style fixes
6f14314 [Holden Karau] Since hasthreshold/hasthresholds is in root
classifier now
ffc8dab [Holden Karau] Update the sharedParams
0420290 [Holden Karau] Allow us to override the get methods selectively
978e77a [Holden Karau] Move HasThreshold into classifier params and start
defining the overloaded getThreshold/getThresholds functions
1433e52 [Holden Karau] Revert "try and hide threshold but chainges the API
so no dice there"
1f09a2e [Holden Karau] try and hide threshold but chainges the API so no
dice there
efb9084 [Holden Karau] move setThresholds only to where its used
6b34809 [Holden Karau] Add a test with thresholding for the RFCS
74f54c3 [Holden Karau] Fix creation of vote array
1986fa8 [Holden Karau] Setting the thresholds only makes sense if the
underlying class hasn't overridden predict, so lets push it down.
2f44b18 [Holden Karau] Add a global default of null for thresholds param
f338cfc [Holden Karau] Wait that wasn't a good idea, Revert "Some progress
towards unifying threshold and thresholds"
634b06f [Holden Karau] Some progress towards unifying threshold and
thresholds
85c9e01 [Holden Karau] Test passes again... little fnur
099c0f3 [Holden Karau] Move thresholds around some more (set on model not
trainer)
0f46836 [Holden Karau] Start adding a classifiersuite
f70eb5e [Holden Karau] Fix test compile issues
a7d59c8 [Holden Karau] Move thresholding into Classifier trait
5d999d2 [Holden Karau] Some more progress, start adding a test (maybe try
and see if we can find a better thing to use for the base of the test)
1fed644 [Holden Karau] Use thresholds to scale scores in random forest
classifcation
31d6bf2 [Holden Karau] Start threading the threshold info through
0ef228c [Holden Karau] Add hasthresholds
(cherry picked from commit 5a23213c148bfe362514f9c71f5273ebda0a848a)
Signed-off-by: Joseph K. Bradley <[email protected]>
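The generalized thresholds semantics this PR introduces can be sketched as follows: for a multiclass probabilistic classifier, the predicted class is the one maximizing p[i] / thresholds[i] (assumed semantics based on how Spark ML documents the param; this toy function is not Spark's code):

```python
def predict(probabilities, thresholds=None):
    """Pick the class index maximizing p[i] / thresholds[i].

    With thresholds=None (the PR's global default of no thresholds),
    this reduces to a plain argmax over the probabilities.
    """
    if thresholds is None:
        return max(range(len(probabilities)),
                   key=lambda i: probabilities[i])
    return max(range(len(probabilities)),
               key=lambda i: probabilities[i] / thresholds[i])

assert predict([0.2, 0.5, 0.3]) == 1
# A high threshold on class 1 shifts the decision to class 2.
assert predict([0.2, 0.5, 0.3], thresholds=[1.0, 10.0, 1.0]) == 2
```

With two classes, setting thresholds=[1 - t, t] recovers the old binary getThreshold/setThreshold behavior the PR keeps for backwards compatibility.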
commit be37b1bd3edd8583180dc1a41ecf4d80990216c7
Author: Michael Armbrust <[email protected]>
Date: 2015-08-04T19:19:52Z
[SPARK-9606] [SQL] Ignore flaky thrift server tests
Author: Michael Armbrust <[email protected]>
Closes #7939 from marmbrus/turnOffThriftTests and squashes the following
commits:
80d618e [Michael Armbrust] [SPARK-9606][SQL] Ignore flaky thrift server
tests
(cherry picked from commit a0cc01759b0c2cecf340c885d391976eb4e3fad6)
Signed-off-by: Michael Armbrust <[email protected]>
commit 43f6b021e5f14b9126e4291f989a076085367c2c
Author: Wenchen Fan <[email protected]>
Date: 2015-08-04T21:40:46Z
[SPARK-9553][SQL] remove the no-longer-necessary createCode and
createStructCode, and replace the usage
Author: Wenchen Fan <[email protected]>
Closes #7890 from cloud-fan/minor and squashes the following commits:
c3b1be3 [Wenchen Fan] fix style
b0cbe2e [Wenchen Fan] remove the createCode and createStructCode, and
replace the usage of them by createStructCode
(cherry picked from commit f4b1ac08a1327e6d0ddc317cdf3997a0f68dec72)
Signed-off-by: Reynold Xin <[email protected]>
commit f771a83f4090e979f72d01989e6693d7fbc05c05
Author: Josh Rosen <[email protected]>
Date: 2015-08-04T21:42:11Z
[SPARK-9452] [SQL] Support records larger than page size in
UnsafeExternalSorter
This patch extends UnsafeExternalSorter to support records larger than the
page size. The basic strategy is the same as in #7762: store large records in
their own overflow pages.
Author: Josh Rosen <[email protected]>
Closes #7891 from JoshRosen/large-records-in-sql-sorter and squashes the
following commits:
967580b [Josh Rosen] Merge remote-tracking branch 'origin/master' into
large-records-in-sql-sorter
948c344 [Josh Rosen] Add large records tests for KV sorter.
3c17288 [Josh Rosen] Combine memory and disk cleanup into general
cleanupResources() method
380f217 [Josh Rosen] Merge remote-tracking branch 'origin/master' into
large-records-in-sql-sorter
27eafa0 [Josh Rosen] Fix page size in PackedRecordPointerSuite
a49baef [Josh Rosen] Address initial round of review comments
3edb931 [Josh Rosen] Remove accidentally-committed debug statements.
2b164e2 [Josh Rosen] Support large records in UnsafeExternalSorter.
(cherry picked from commit ab8ee1a3b93286a62949569615086ef5030e9fae)
Signed-off-by: Reynold Xin <[email protected]>
commit 560b2da783bc25bd8767f6888665dadecac916d8
Author: CodingCat <[email protected]>
Date: 2015-08-04T21:54:11Z
[SPARK-9602] remove "Akka/Actor" words from comments
https://issues.apache.org/jira/browse/SPARK-9602
Although we have hidden Akka behind RPC interface, I found that the
Akka/Actor-related comments are still spreading everywhere. To make it
consistent, we shall remove "actor"/"akka" words from the comments...
Author: CodingCat <[email protected]>
Closes #7936 from CodingCat/SPARK-9602 and squashes the following commits:
e8296a3 [CodingCat] remove actor words from comments
(cherry picked from commit 9d668b73687e697cad2ef7fd3c3ba405e9795593)
Signed-off-by: Reynold Xin <[email protected]>
commit e682ee25477374737f3b1dfc08c98829564b26d4
Author: Joseph K. Bradley <[email protected]>
Date: 2015-08-04T21:54:26Z
[SPARK-9447] [ML] [PYTHON] Added HasRawPredictionCol, HasProbabilityCol to
RandomForestClassifier
Added HasRawPredictionCol, HasProbabilityCol to RandomForestClassifier,
plus doc tests for those columns.
CC: holdenk yanboliang
Author: Joseph K. Bradley <[email protected]>
Closes #7903 from jkbradley/rf-prob-python and squashes the following
commits:
c62a83f [Joseph K. Bradley] made unit test more robust
14eeba2 [Joseph K. Bradley] added HasRawPredictionCol, HasProbabilityCol to
RandomForestClassifier in PySpark
(cherry picked from commit e375456063617cd7000d796024f41e5927f21edd)
Signed-off-by: Joseph K. Bradley <[email protected]>
commit fe4a4f41ad8b686455d58fc2fda9494e8dba5636
Author: Joseph K. Bradley <[email protected]>
Date: 2015-08-04T22:43:13Z
[SPARK-9582] [ML] LDA cleanups
Small cleanups to recent LDA additions and docs.
CC: feynmanliang
Author: Joseph K. Bradley <[email protected]>
Closes #7916 from jkbradley/lda-cleanups and squashes the following commits:
f7021d9 [Joseph K. Bradley] broadcasting large matrices for LDA in local
model and online learning
97947aa [Joseph K. Bradley] a few more cleanups
5b03f88 [Joseph K. Bradley] reverted split of lda log likelihood
c566915 [Joseph K. Bradley] small edit to make review easier
63f6c7d [Joseph K. Bradley] clarified log likelihood for lda models
(cherry picked from commit 1833d9c08f021d991334424d0a6d5ec21d1fccb2)
Signed-off-by: Joseph K. Bradley <[email protected]>
commit f4e125acf36023425722abb0fb74be63a425aa7b
Author: Mike Dusenberry <[email protected]>
Date: 2015-08-04T23:30:03Z
[SPARK-6485] [MLLIB] [PYTHON] Add
CoordinateMatrix/RowMatrix/IndexedRowMatrix to PySpark.
This PR adds the RowMatrix, IndexedRowMatrix, and CoordinateMatrix
distributed matrices to PySpark. Each distributed matrix class acts as a
wrapper around the Scala/Java counterpart by maintaining a reference to the
Java object. New distributed matrices can be created using factory methods
added to DistributedMatrices, which creates the Java distributed matrix and
then wraps it with the corresponding PySpark class. This design allows for
simple conversion between the various distributed matrices, and lets us re-use
the Scala code. Serialization between Python and Java is implemented using
DataFrames as needed for IndexedRowMatrix and CoordinateMatrix for simplicity.
Associated documentation and unit-tests have also been added. To facilitate
code review, this PR implements access to the rows/entries as RDDs, the number
of rows & columns, and conversions between the various distributed matrices
(not including BlockMatrix), and does not implement the other linear algebra
functions of the matrices, although this will be very simple to add now.
Author: Mike Dusenberry <[email protected]>
Closes #7554 from
dusenberrymw/SPARK-6485_Add_CoordinateMatrix_RowMatrix_IndexedMatrix_to_PySpark
and squashes the following commits:
bb039cb [Mike Dusenberry] Minor documentation update.
b887c18 [Mike Dusenberry] Updating the matrix conversion logic again to
make it even cleaner. Now, we allow the 'rows' parameter in the constructors
to be either an RDD or the Java matrix object. If 'rows' is an RDD, we create a
Java matrix object, wrap it, and then store that. If 'rows' is a Java matrix
object of the correct type, we just wrap and store that directly. This is only
for internal usage, and publicly, we still require 'rows' to be an RDD. We no
longer store the 'rows' RDD, and instead just compute it from the Java object
when needed. The point of this is that when we do matrix conversions, we do
the conversion on the Scala/Java side, which returns a Java object, so we
should use that directly, but exposing 'java_matrix' parameter in the public
API is not ideal. This non-public feature of allowing 'rows' to be a Java
matrix object is documented in the '__init__' constructor docstrings, which are
not part of the generated public API, and doctests are also included.
7f0dcb6 [Mike Dusenberry] Updating module docstring.
cfc1be5 [Mike Dusenberry] Use 'new SQLContext(matrix.rows.sparkContext)'
rather than 'SQLContext.getOrCreate', as the latter doesn't guarantee that the
SparkContext will be the same as for the matrix.rows data.
687e345 [Mike Dusenberry] Improving conversion performance. This adds an
optional 'java_matrix' parameter to the constructors, and pulls the conversion
logic out into a '_create_from_java' function. Now, if the constructors are
given a valid Java distributed matrix object as 'java_matrix', they will store
those internally, rather than create a new one on the Scala/Java side.
3e50b6e [Mike Dusenberry] Moving the distributed matrices to
pyspark.mllib.linalg.distributed.
308f197 [Mike Dusenberry] Using properties for better documentation.
1633f86 [Mike Dusenberry] Minor documentation cleanup.
f0c13a7 [Mike Dusenberry] CoordinateMatrix should inherit from
DistributedMatrix.
ffdd724 [Mike Dusenberry] Updating doctests to make documentation cleaner.
3fd4016 [Mike Dusenberry] Updating docstrings.
27cd5f6 [Mike Dusenberry] Simplifying input conversions in the constructors
for each distributed matrix.
a409cf5 [Mike Dusenberry] Updating doctests to be less verbose by using
lists instead of DenseVectors explicitly.
d19b0ba [Mike Dusenberry] Updating code and documentation to note that a
vector-like object (numpy array, list, etc.) can be used in place of explicit
Vector object, and adding conversions when necessary to RowMatrix construction.
4bd756d [Mike Dusenberry] Adding param documentation to IndexedRow and
MatrixEntry.
c6bded5 [Mike Dusenberry] Move conversion logic from tuples to IndexedRow
or MatrixEntry types from within the IndexedRowMatrix and CoordinateMatrix
constructors to separate _convert_to_indexed_row and _convert_to_matrix_entry
functions.
329638b [Mike Dusenberry] Moving the Experimental tag to the top of each
docstring.
0be6826 [Mike Dusenberry] Simplifying doctests by removing duplicated
rows/entries RDDs within the various tests.
c0900df [Mike Dusenberry] Adding the colons that were accidentally not
inserted.
4ad6819 [Mike Dusenberry] Documenting the and parameters.
3b854b9 [Mike Dusenberry] Minor updates to documentation.
10046e8 [Mike Dusenberry] Updating documentation to use class constructors
instead of the removed DistributedMatrices factory methods.
119018d [Mike Dusenberry] Adding static methods to each of the distributed
matrix classes to consolidate conversion logic.
4d7af86 [Mike Dusenberry] Adding type checks to the constructors. Although
it is slightly verbose, it is better for the user to have a good error message
than a cryptic stacktrace.
93b6a3d [Mike Dusenberry] Pulling the DistributedMatrices Python class out
of this pull request.
f6f3c68 [Mike Dusenberry] Pulling the DistributedMatrices Scala class out
of this pull request.
6a3ecb7 [Mike Dusenberry] Updating pattern matching.
08f287b [Mike Dusenberry] Slight reformatting of the documentation.
a245dc0 [Mike Dusenberry] Updating Python doctests for compatability
between Python 2 & 3. Since Python 3 removed the idea of a separate 'long'
type, all values that would have been outputted as a 'long' (ex: '4L') will now
be treated as an 'int' and outputed as one (ex: '4'). The doctests now
explicitly convert to ints so that both Python 2 and 3 will have the same
output. This is fine since the values are all small, and thus can be easily
represented as ints.
4d3a37e [Mike Dusenberry] Reformatting a few long Python doctest lines.
7e3ca16 [Mike Dusenberry] Fixing long lines.
f721ead [Mike Dusenberry] Updating documentation for each of the
distributed matrices.
ab0e8b6 [Mike Dusenberry] Updating unit test to be more useful.
dda2f89 [Mike Dusenberry] Added wrappers for the conversions between the
various distributed matrices. Added logic to be able to access the
rows/entries of the distributed matrices, which requires serialization through
DataFrames for IndexedRowMatrix and CoordinateMatrix types. Added unit tests.
0cd7166 [Mike Dusenberry] Implemented the CoordinateMatrix API in PySpark,
following the idea of the IndexedRowMatrix API, including using DataFrames for
serialization.
3c369cb [Mike Dusenberry] Updating the architecture a bit to make
conversions between the various distributed matrix types easier. The different
distributed matrix classes are now only wrappers around the Java objects, and
take the Java object as an argument during construction. This way, we can call
for example on an , which returns a reference to a Java RowMatrix object, and
then construct a PySpark RowMatrix object wrapped around the Java object. This
is analogous to the behavior of PySpark RDDs and DataFrames. We now delegate
creation of the various distributed matrices from scratch in PySpark to the
factory methods on .
4bdd09b [Mike Dusenberry] Implemented the IndexedRowMatrix API in PySpark,
following the idea of the RowMatrix API. Note that for the IndexedRowMatrix,
we use DataFrames to serialize the data between Python and Scala/Java, so we
accept PySpark RDDs, then convert to a DataFrame, then convert back to RDDs on
the Scala/Java side before constructing the IndexedRowMatrix.
23bf1ec [Mike Dusenberry] Updating documentation to add PySpark RowMatrix.
Inserting newline above doctest so that it renders properly in API docs.
b194623 [Mike Dusenberry] Updating design to have a PySpark RowMatrix
simply create and keep a reference to a wrapper over a Java RowMatrix.
Updating DistributedMatrices factory methods to accept numRows and numCols with
default values. Updating PySpark DistributedMatrices factory method to simply
create a PySpark RowMatrix. Adding additional doctests for numRows and numCols
parameters.
bc2d220 [Mike Dusenberry] Adding unit tests for RowMatrix methods.
d7e316f [Mike Dusenberry] Implemented the RowMatrix API in PySpark by doing
the following: Added a DistributedMatrices class to contain factory methods for
creating the various distributed matrices. Added a factory method for creating
a RowMatrix from an RDD of Vectors. Added a createRowMatrix function to the
PythonMLlibAPI to interface with the factory method. Added DistributedMatrix,
DistributedMatrices, and RowMatrix classes to the pyspark.mllib.linalg api.
(cherry picked from commit 571d5b5363ff4dbbce1f7019ab8e86cbc3cba4d5)
Signed-off-by: Xiangrui Meng <[email protected]>
commit cff0fe291aa470ef5cf4e5087c7114fb6360572f
Author: Joseph K. Bradley <[email protected]>
Date: 2015-08-04T23:52:43Z
[SPARK-9586] [ML] Update BinaryClassificationEvaluator to use
setRawPredictionCol
Update BinaryClassificationEvaluator to use setRawPredictionCol, rather
than setScoreCol. Deprecated setScoreCol.
I don't think setScoreCol was actually used anywhere (based on search).
CC: mengxr
Author: Joseph K. Bradley <[email protected]>
Closes #7921 from jkbradley/binary-eval-rawpred and squashes the following
commits:
e5d7dfa [Joseph K. Bradley] Update BinaryClassificationEvaluator to use
setRawPredictionCol
(cherry picked from commit b77d3b9688d56d33737909375d1d0db07da5827b)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 1954a7bb175122b776870530217159cad366ca6c
Author: Wenchen Fan <[email protected]>
Date: 2015-08-05T00:05:19Z
[SPARK-9598][SQL] do not expose generic getter in internal row
Author: Wenchen Fan <[email protected]>
Closes #7932 from cloud-fan/generic-getter and squashes the following
commits:
c60de4c [Wenchen Fan] do not expose generic getter in internal row
(cherry picked from commit 7c8fc1f7cb837ff5c32811fdeb3ee2b84de2dea4)
Signed-off-by: Reynold Xin <[email protected]>
commit 33509754843fe8eba303c720e6c0f6853b861e7e
Author: Feynman Liang <[email protected]>
Date: 2015-08-05T01:13:18Z
[SPARK-9609] [MLLIB] Fix spelling of Strategy.defaultStrategy
jkbradley
Author: Feynman Liang <[email protected]>
Closes #7941 from feynmanliang/SPARK-9609-stategy-spelling and squashes the
following commits:
d2aafb1 [Feynman Liang] Add deprecated backwards compatibility
aa090a8 [Feynman Liang] Fix spelling
(cherry picked from commit 629e26f7ee916e70f59b017cb6083aa441b26b2c)
Signed-off-by: Joseph K. Bradley <[email protected]>
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]