GitHub user oza opened a pull request:
https://github.com/apache/spark/pull/3512
[SPARK-4651] Adding -Phadoop-2.5 and -Phadoop-2.6 to compile Spark with
newer versions of Hadoop
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/oza/spark SPARK-4651
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3512.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3512
----
commit 44856654c81ceb92ef6380691027744d4bf76589
Author: Hari Shreedharan <[email protected]>
Date: 2014-08-18T02:50:31Z
[HOTFIX][STREAMING] Allow the JVM/Netty to decide which port to bind to in
Flume Polling Tests.
Author: Hari Shreedharan <[email protected]>
Closes #1820 from harishreedharan/use-free-ports and squashes the following
commits:
b939067 [Hari Shreedharan] Remove unused import.
67856a8 [Hari Shreedharan] Remove findFreePort.
0ea51d1 [Hari Shreedharan] Make some changes to getPort to use map on the
serverOpt.
1fb0283 [Hari Shreedharan] Merge branch 'master' of
https://github.com/apache/spark into use-free-ports
b351651 [Hari Shreedharan] Allow Netty to choose port, and query it to
decide the port to bind to. Leaving findFreePort as is, if other tests want to
use it at some point.
e6c9620 [Hari Shreedharan] Making sure the second sink uses the correct
port.
11c340d [Hari Shreedharan] Add info about race condition to scaladoc.
e89d135 [Hari Shreedharan] Adding Scaladoc.
6013bb0 [Hari Shreedharan] [STREAMING] Find free ports to use before
attempting to create Flume Sink in Flume Polling Suite
commit 1d5e84a99076d3e0168dd2f4626c7911e7ba49e7
Author: Patrick Wendell <[email protected]>
Date: 2014-08-21T05:24:22Z
HOTFIX:Temporarily removing flume sink test in 1.1 branch
commit e1535ad3c6f7400f2b7915ea91da9c60510557ba
Author: Patrick Wendell <[email protected]>
Date: 2014-08-21T05:54:41Z
[maven-release-plugin] prepare release v1.1.0-snapshot2
commit 9af3fb7385d1f9f221962f1d2d725ff79bd82033
Author: Patrick Wendell <[email protected]>
Date: 2014-08-21T05:54:48Z
[maven-release-plugin] prepare for next development iteration
commit da0a701204ae057581ed2d41eba5bb610e36c864
Author: Patrick Wendell <[email protected]>
Date: 2014-08-20T19:18:41Z
BUILD: Bump Hadoop versions in the release build.
Also, minor modifications to the MapR profile.
commit 1e5d9cbb499199304aa8820114fa77dc7a3f0224
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-21T07:17:29Z
[SPARK-2840] [mllib] DecisionTree doc update (Java, Python examples)
Updated DecisionTree documentation with examples for Java and Python.
Added the same Java example to the code as well.
CC: @mengxr @manishamde @atalwalkar
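As an aside, a minimal PySpark sketch of the kind of example this doc update
describes (the toy data, parameter values, and the existing SparkContext `sc`
are assumptions for illustration, not the doc's actual code):
```
# Illustrative only: train a tiny decision tree classifier in PySpark.
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

# Toy data: label is 1.0 when the single feature is >= 0.5.
data = sc.parallelize([LabeledPoint(0.0, [0.1]),
                       LabeledPoint(0.0, [0.4]),
                       LabeledPoint(1.0, [0.6]),
                       LabeledPoint(1.0, [0.9])])

model = DecisionTree.trainClassifier(data, numClasses=2,
                                     categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5)
print(model.predict([0.7]))  # expected: 1.0
```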
Author: Joseph K. Bradley <[email protected]>
Closes #2063 from jkbradley/dt-docs and squashes the following commits:
2dd2c19 [Joseph K. Bradley] Last updates based on github review.
9dd1b6b [Joseph K. Bradley] Updated decision tree doc.
d802369 [Joseph K. Bradley] Updates based on comments: cache data,
corrected doc text.
b9bee04 [Joseph K. Bradley] Updated DT examples
57eee9f [Joseph K. Bradley] Created JavaDecisionTree example from example
in docs, and corrected doc example as needed.
d939a92 [Joseph K. Bradley] Updated DecisionTree documentation. Added
Java, Python examples.
(cherry picked from commit 050f8d01e47b9b67b02ce50d83fb7b4e528b7204)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 385c4f2af5996844b9761942643f71a6544e1dd8
Author: Patrick Wendell <[email protected]>
Date: 2014-08-23T04:31:52Z
Revert "HOTFIX:Temporarily removing flume sink test in 1.1 branch"
This reverts commit 1d5e84a99076d3e0168dd2f4626c7911e7ba49e7.
commit cd73631b15f080405e04203bf15fbd31c65eb64a
Author: Tathagata Das <[email protected]>
Date: 2014-08-23T04:34:48Z
[SPARK-3169] Removed dependency on spark streaming test from spark flume
sink
Due to maven bug https://jira.codehaus.org/browse/MNG-1378, maven could not
resolve spark streaming classes required by the spark-streaming test-jar
dependency of external/flume-sink. There is no particular reason that the
external/flume-sink has to depend on Spark Streaming at all, so I am
eliminating this dependency. Also I have removed the exclusions present in the
Flume dependencies, as there is no reason to exclude them (they were excluded
in the external/flume module to prevent dependency collisions with Spark).
Since Jenkins will test the sbt build and the unit test, I only tested
maven compilation locally.
Author: Tathagata Das <[email protected]>
Closes #2101 from tdas/spark-sink-pom-fix and squashes the following
commits:
8f42621 [Tathagata Das] Added Flume sink exclusions back, and added netty
to test dependencies
93b559f [Tathagata Das] Removed dependency on spark streaming test from
spark flume sink
(cherry picked from commit 3004074152b7261c2a968bb8e94ec7c41a7b43c1)
Signed-off-by: Patrick Wendell <[email protected]>
commit 568966018bff437f1d73cd59eb4681b2d3e87b48
Author: Kousuke Saruta <[email protected]>
Date: 2014-08-23T05:28:05Z
[SPARK-2963] REGRESSION - The description about how to build for using CLI
and Thrift JDBC server is absent in proper document -
The most important points I mentioned in #1885 are as follows.
* People who build Spark are not always programmers.
* If the person building Spark is not a programmer, he/she won't read the
programmer's guide before building.
So, instructions on how to build for the CLI and the Thrift JDBC server should
not live only in the programmer's guide.
Author: Kousuke Saruta <[email protected]>
Closes #2080 from sarutak/SPARK-2963 and squashes the following commits:
ee07c76 [Kousuke Saruta] Modified regression of the description about
building for using Thrift JDBC server and CLI
ed53329 [Kousuke Saruta] Modified description and notation of proper nouns
07c59fc [Kousuke Saruta] Added a description about how to build to use
HiveServer and CLI for SparkSQL to building-with-maven.md
6e6645a [Kousuke Saruta] Merge branch 'master' of
git://git.apache.org/spark into SPARK-2963
c88fa93 [Kousuke Saruta] Added a description about building to use
HiveServer and CLI for SparkSQL
commit 9309786416c83b2f3401724fdeb19c2be07c0431
Author: Yin Huai <[email protected]>
Date: 2014-08-23T19:46:41Z
[SQL] Make functionRegistry in HiveContext transient.
Seems we missed `transient` for the `functionRegistry` in `HiveContext`.
cc: marmbrus
Author: Yin Huai <[email protected]>
Closes #2074 from yhuai/makeFunctionRegistryTransient and squashes the
following commits:
6534e7d [Yin Huai] Make functionRegistry transient.
(cherry picked from commit 2fb1c72ea21e137c8b60a72e5aecd554c71b16e1)
Signed-off-by: Michael Armbrust <[email protected]>
commit 7112da8fe8d382a1180118f206db78f8e610d83f
Author: Michael Armbrust <[email protected]>
Date: 2014-08-23T23:19:10Z
[SPARK-2554][SQL] CountDistinct partial aggregation and object allocation
improvements
Author: Michael Armbrust <[email protected]>
Author: Gregory Owen <[email protected]>
Closes #1935 from marmbrus/countDistinctPartial and squashes the following
commits:
5c7848d [Michael Armbrust] turn off caching in the constructor
8074a80 [Michael Armbrust] fix tests
32d216f [Michael Armbrust] reynolds comments
c122cca [Michael Armbrust] Address comments, add tests
b2e8ef3 [Michael Armbrust] Merge remote-tracking branch 'origin/master'
into countDistinctPartial
fae38f4 [Michael Armbrust] Fix style
fdca896 [Michael Armbrust] cleanup
93d0f64 [Michael Armbrust] metastore concurrency fix.
db44a30 [Michael Armbrust] JIT hax.
3868f6c [Michael Armbrust] Merge pull request #9 from
GregOwen/countDistinctPartial
c9e67de [Gregory Owen] Made SpecificRow and types serializable by Kryo
2b46c4b [Michael Armbrust] Merge remote-tracking branch 'origin/master'
into countDistinctPartial
8ff6402 [Michael Armbrust] Add specific row.
58d15f1 [Michael Armbrust] disable codegen logging
87d101d [Michael Armbrust] Fix isNullAt bug
abee26d [Michael Armbrust] WIP
27984d0 [Michael Armbrust] Merge remote-tracking branch 'origin/master'
into countDistinctPartial
57ae3b1 [Michael Armbrust] Fix order dependent test
b3d0f64 [Michael Armbrust] Add golden files.
c1f7114 [Michael Armbrust] Improve tests / fix serialization.
f31b8ad [Michael Armbrust] more fixes
38c7449 [Michael Armbrust] comments and style
9153652 [Michael Armbrust] better toString
d494598 [Michael Armbrust] Fix tests now that the planner is better
41fbd1d [Michael Armbrust] Never try and create an empty hash set.
050bb97 [Michael Armbrust] Skip no-arg constructors for kryo,
bd08239 [Michael Armbrust] WIP
213ada8 [Michael Armbrust] First draft of partially aggregated and code
generated count distinct / max
(cherry picked from commit 7e191fe29bb09a8560cd75d453c4f7f662dff406)
Signed-off-by: Michael Armbrust <[email protected]>
commit e23f0bc0177a83dfee3f5579ae6eb12033ae5f90
Author: Michael Armbrust <[email protected]>
Date: 2014-08-23T23:21:08Z
[SPARK-2967][SQL] Follow-up: Also copy hash expressions in sort based
shuffle fix.
Follow-up to #2066
Author: Michael Armbrust <[email protected]>
Closes #2072 from marmbrus/sortShuffle and squashes the following commits:
2ff8114 [Michael Armbrust] Fix bug
(cherry picked from commit 3519b5e8e55b4530d7f7c0bcab254f863dbfa814)
Signed-off-by: Michael Armbrust <[email protected]>
commit ce14cd11f099e46532074bc23a7ffb1bad0969e6
Author: Kousuke Saruta <[email protected]>
Date: 2014-08-24T16:43:44Z
[SPARK-3192] Some scripts have 2 space indentation but other scripts have 4
space indentation.
Author: Kousuke Saruta <[email protected]>
Closes #2104 from sarutak/SPARK-3192 and squashes the following commits:
db78419 [Kousuke Saruta] Modified indentation of spark-shell
(cherry picked from commit ded6796bf54f5c005b27135d7dec19634038a1c6)
Signed-off-by: Patrick Wendell <[email protected]>
commit a4db81a55f266f904052525aa290b7ffcf9a613c
Author: DB Tsai <[email protected]>
Date: 2014-08-25T00:33:33Z
[SPARK-2841][MLlib] Documentation for feature transformations
Documentation for newly added feature transformations:
1. TF-IDF
2. StandardScaler
3. Normalizer
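The transformers above are Scala MLlib APIs; purely as a rough illustration of
what Normalizer and StandardScaler compute, here is a small NumPy sketch (an
assumption for illustration, not the documented API):
```
import numpy as np

x = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Normalizer: rescale each row (feature vector) to unit L2 norm.
row_norms = np.sqrt((x ** 2).sum(axis=1))[:, None]
normalized = x / row_norms

# StandardScaler (roughly): center each column and divide by its std deviation.
scaled = (x - x.mean(axis=0)) / x.std(axis=0)

print(normalized)
print(scaled)
```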
Author: DB Tsai <[email protected]>
Closes #2068 from dbtsai/transformer-documentation and squashes the
following commits:
109f324 [DB Tsai] address feedback
(cherry picked from commit 572952ae615895efaaabcd509d582262000c0852)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 749bddc85e76e0d1ded8d79058819335bd580741
Author: Reza Zadeh <[email protected]>
Date: 2014-08-25T00:35:54Z
[MLlib][SPARK-2997] Update SVD documentation to reflect roughly square
Update the documentation to reflect the fact that we can handle roughly
square matrices.
Author: Reza Zadeh <[email protected]>
Closes #2070 from rezazadeh/svddocs and squashes the following commits:
826b8fe [Reza Zadeh] left singular vectors
3f34fc6 [Reza Zadeh] PCA is still TS
7ffa2aa [Reza Zadeh] better title
aeaf39d [Reza Zadeh] More docs
788ed13 [Reza Zadeh] add computational cost explanation
6429c59 [Reza Zadeh] Add link to rowmatrix docs
1eeab8b [Reza Zadeh] Update SVD documentation to reflect roughly square
(cherry picked from commit b1b20301b3a1b35564d61e58eb5964d5ad5e4d7d)
Signed-off-by: Xiangrui Meng <[email protected]>
commit b82da3d6924a5bd2139434ab05c2fd44914fda45
Author: Davies Liu <[email protected]>
Date: 2014-08-25T04:16:05Z
[SPARK-2871] [PySpark] add zipWithIndex() and zipWithUniqueId()
RDD.zipWithIndex()
Zips this RDD with its element indices.
The ordering is first based on the partition index and then the
ordering of items within each partition. So the first item in
the first partition gets index 0, and the last item in the last
partition receives the largest index.
This method needs to trigger a Spark job when this RDD contains
more than one partition.
>>> sc.parallelize(range(4), 2).zipWithIndex().collect()
[(0, 0), (1, 1), (2, 2), (3, 3)]
RDD.zipWithUniqueId()
Zips this RDD with generated unique Long ids.
Items in the kth partition will get ids k, n+k, 2*n+k, ..., where
n is the number of partitions. So there may be gaps, but this
method won't trigger a Spark job, which is different from
zipWithIndex().
>>> sc.parallelize(range(4), 2).zipWithUniqueId().collect()
[(0, 0), (2, 1), (1, 2), (3, 3)]
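A further hypothetical usage sketch (assuming an existing SparkContext `sc`;
not part of the patch): zipWithIndex() can attach stable row numbers that you
then filter on, e.g. to drop a leading header record.
```
# Illustrative: number the records, then drop the first one.
rdd = sc.parallelize(["header", "row1", "row2"], 2)
numbered = rdd.zipWithIndex()          # [('header', 0), ('row1', 1), ('row2', 2)]
body = numbered.filter(lambda kv: kv[1] > 0).keys()
print(body.collect())                  # ['row1', 'row2']
```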
Author: Davies Liu <[email protected]>
Closes #2092 from davies/zipWith and squashes the following commits:
cebe5bf [Davies Liu] improve test cases, reverse the order of index
0d2a128 [Davies Liu] add zipWithIndex() and zipWithUniqueId()
(cherry picked from commit fb0db772421b6902b80137bf769db3b418ab2ccf)
Signed-off-by: Josh Rosen <[email protected]>
commit 69a17f119758e786ef080cfbf52d484334c8d9d9
Author: Xiangrui Meng <[email protected]>
Date: 2014-08-25T19:30:02Z
[SPARK-2495][MLLIB] make KMeans constructor public
to re-construct k-means models freeman-lab
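The change itself is to the Scala constructor; as a loose PySpark analogue of
re-constructing a model from known cluster centers (the centers and calls below
are illustrative assumptions):
```
# Illustrative: rebuild a k-means model from previously computed centers.
from numpy import array
from pyspark.mllib.clustering import KMeansModel

centers = [array([0.0, 0.0]), array([10.0, 10.0])]
model = KMeansModel(centers)
print(model.predict(array([9.0, 9.5])))  # nearest to the second center -> 1
```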
Author: Xiangrui Meng <[email protected]>
Closes #2112 from mengxr/public-constructors and squashes the following
commits:
18d53a9 [Xiangrui Meng] make KMeans constructor public
(cherry picked from commit 220f413686ae922bd11776576bf37610cce92c23)
Signed-off-by: Xiangrui Meng <[email protected]>
commit ff616fd7b4b56c34bd473f85fab3524b842da404
Author: Sean Owen <[email protected]>
Date: 2014-08-25T20:29:07Z
SPARK-2798 [BUILD] Correct several small errors in Flume module pom.xml
files
(EDIT) The scalatest issue has since been resolved, so this is now about a
few small problems in the Flume Sink `pom.xml`:
- `scalatest` is not declared as a test-scope dependency
- Its Avro version doesn't match the rest of the build
- Its Flume version is not synced with the other Flume module
- The other Flume module declares its dependency on Flume Sink slightly
incorrectly, hard-coding the Scala 2.10 version
- It depends on Scala Lang directly, which it shouldn't
Author: Sean Owen <[email protected]>
Closes #1726 from srowen/SPARK-2798 and squashes the following commits:
a46e2c6 [Sean Owen] scalatest to test scope, harmonize Avro and Flume
versions, remove direct Scala dependency, fix '2.10' in Flume dependency
(cherry picked from commit cd30db566a327ddf63cd242c758e46ce2d9479df)
Signed-off-by: Tathagata Das <[email protected]>
commit d892062cca16bd9d977e1cf51723135a481edf57
Author: Xiangrui Meng <[email protected]>
Date: 2014-08-25T21:55:20Z
[FIX] fix error message in sendMessageReliably
rxin
Author: Xiangrui Meng <[email protected]>
Closes #2120 from mengxr/sendMessageReliably and squashes the following
commits:
b14400c [Xiangrui Meng] fix error message in sendMessageReliably
(cherry picked from commit fd8ace2d9a796f69ce34ad202907008cd6e4d274)
Signed-off-by: Josh Rosen <[email protected]>
commit 8d33a6d3de9184ee33ebe5f30fef6a1fda281e9d
Author: Cheng Lian <[email protected]>
Date: 2014-08-25T21:56:51Z
Fixed a typo in docs/running-on-mesos.md
It should be `spark-env.sh` rather than `spark.env.sh`.
Author: Cheng Lian <[email protected]>
Closes #2119 from liancheng/fix-mesos-doc and squashes the following
commits:
f360548 [Cheng Lian] Fixed a typo in docs/running-on-mesos.md
(cherry picked from commit 805fec845b7aa8b4763e3e0e34bec6c3872469f4)
Signed-off-by: Josh Rosen <[email protected]>
commit 19b01d6f79f2919257fcd14524bc8267c57eb3d9
Author: Takuya UESHIN <[email protected]>
Date: 2014-08-25T23:27:00Z
[SPARK-3204][SQL] MaxOf would be foldable if both left and right are
foldable.
Author: Takuya UESHIN <[email protected]>
Closes #2116 from ueshin/issues/SPARK-3204 and squashes the following
commits:
7d9b107 [Takuya UESHIN] Make MaxOf foldable if both left and right are
foldable.
(cherry picked from commit d299e2bf2f6733a6267b7ce85e2b288608b17db3)
Signed-off-by: Michael Armbrust <[email protected]>
commit 292f28d4f7cbfdb8b90809926a6d69df7ed817e7
Author: Cheng Lian <[email protected]>
Date: 2014-08-25T23:29:59Z
[SPARK-2929][SQL] Refactored Thrift server and CLI suites
Removed most hard-coded timeouts, timing assumptions, and all `Thread.sleep`
calls. Simplified IPC and synchronization with `scala.sys.process` and
future/promise so that the test suites run more robustly and faster.
Author: Cheng Lian <[email protected]>
Closes #1856 from liancheng/thriftserver-tests and squashes the following
commits:
2d914ca [Cheng Lian] Minor refactoring
0e12e71 [Cheng Lian] Cleaned up test output
0ee921d [Cheng Lian] Refactored Thrift server and CLI suites
(cherry picked from commit cae9414d3805c6cf00eab6a6144d8f90cd0212f8)
Signed-off-by: Michael Armbrust <[email protected]>
commit f8ac8ed7f88d2ee976b38d4a156f64efb3740650
Author: Cheng Hao <[email protected]>
Date: 2014-08-26T00:43:56Z
[SPARK-3058] [SQL] Support EXTENDED for EXPLAIN
Provide `extended` keyword support for the `explain` command in SQL, e.g.:
```
explain extended select key as a1, value as a2 from src where key=1;
== Parsed Logical Plan ==
Project ['key AS a1#3,'value AS a2#4]
Filter ('key = 1)
UnresolvedRelation None, src, None
== Analyzed Logical Plan ==
Project [key#8 AS a1#3,value#9 AS a2#4]
Filter (CAST(key#8, DoubleType) = CAST(1, DoubleType))
MetastoreRelation default, src, None
== Optimized Logical Plan ==
Project [key#8 AS a1#3,value#9 AS a2#4]
Filter (CAST(key#8, DoubleType) = 1.0)
MetastoreRelation default, src, None
== Physical Plan ==
Project [key#8 AS a1#3,value#9 AS a2#4]
Filter (CAST(key#8, DoubleType) = 1.0)
HiveTableScan [key#8,value#9], (MetastoreRelation default, src, None),
None
Code Generation: false
== RDD ==
(2) MappedRDD[14] at map at HiveContext.scala:350
MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:42
MapPartitionsRDD[12] at mapPartitions at basicOperators.scala:57
MapPartitionsRDD[11] at mapPartitions at TableReader.scala:112
MappedRDD[10] at map at TableReader.scala:240
HadoopRDD[9] at HadoopRDD at TableReader.scala:230
```
It's a sub-task of #1847, but it can go in without any dependency.
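A hypothetical way to issue the new command from PySpark (the HiveContext
setup, table, and query are assumptions; an existing SparkContext `sc` is
assumed):
```
# Illustrative: print the extended plan for a simple Hive query.
from pyspark.sql import HiveContext

hc = HiveContext(sc)
plan = hc.sql("EXPLAIN EXTENDED SELECT key AS a1, value AS a2 FROM src WHERE key = 1")
for row in plan.collect():
    print(row)
```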
Author: Cheng Hao <[email protected]>
Closes #1962 from chenghao-intel/explain_extended and squashes the
following commits:
295db74 [Cheng Hao] Fix bug in printing the simple execution plan
48bc989 [Cheng Hao] Support EXTENDED for EXPLAIN
(cherry picked from commit 156eb3966176de02ec3ec90ae10e50a7ebfbbf4f)
Signed-off-by: Michael Armbrust <[email protected]>
commit 957b356576caa2ab38d1e758c2d3190421894557
Author: wangfei <[email protected]>
Date: 2014-08-26T00:46:43Z
[SQL] logWarning should be logInfo in getResultSetSchema
Author: wangfei <[email protected]>
Closes #1939 from scwf/patch-5 and squashes the following commits:
f952d10 [wangfei] [SQL] logWarning should be logInfo in getResultSetSchema
(cherry picked from commit 507a1b520063ad3e10b909767d9e3fd72d24415b)
Signed-off-by: Michael Armbrust <[email protected]>
commit b5dc9b43bcdcbdb5ffddbda6235443f3d7411b7a
Author: Chia-Yung Su <[email protected]>
Date: 2014-08-26T01:20:19Z
[SPARK-3011][SQL] _temporary directory should be filtered out by
sqlContext.parquetFile
Fix compile error on Hadoop 0.23 for pull request #1924.
Author: Chia-Yung Su <[email protected]>
Closes #1959 from joesu/bugfix-spark3011 and squashes the following commits:
be30793 [Chia-Yung Su] remove .* and _* except _metadata
8fe2398 [Chia-Yung Su] add note to explain
40ea9bd [Chia-Yung Su] fix hadoop-0.23 compile error
c7e44f2 [Chia-Yung Su] match syntax
f8fc32a [Chia-Yung Su] filter out tmp dir
(cherry picked from commit 4243bb6634aca5b9ddf6d42778aa7b4866ce6256)
Signed-off-by: Michael Armbrust <[email protected]>
commit 4d6a0e920974a5d50348ba9f7377b48e43c2da16
Author: witgo <[email protected]>
Date: 2014-08-26T02:22:27Z
SPARK-2481: The environment variable SPARK_HISTORY_OPTS is covered in
spark-env.sh
Author: witgo <[email protected]>
Author: GuoQiang Li <[email protected]>
Closes #1341 from witgo/history_env and squashes the following commits:
b4fd9f8 [GuoQiang Li] review commit
0ebe401 [witgo] *-history-server.sh load spark-config.sh
(cherry picked from commit 9f04db17e50568d5580091add9100693177d7c4f)
Signed-off-by: Andrew Or <[email protected]>
commit 48a07490fdd0e79a34e66e5c1baad0b1558bbda5
Author: Daoyuan Wang <[email protected]>
Date: 2014-08-26T05:56:35Z
[Spark-3222] [SQL] Cross join support in HiveQL
We can simply treat a cross join as an inner join without join conditions.
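A minimal hypothetical example of the new syntax via a HiveContext (the table
name and an existing SparkContext `sc` are assumptions):
```
from pyspark.sql import HiveContext

hc = HiveContext(sc)
pairs = hc.sql("SELECT a.key, b.key FROM src a CROSS JOIN src b")
print(pairs.count())  # |src| * |src| rows: a cross join has no join condition
```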
Author: Daoyuan Wang <[email protected]>
Author: adrian-wang <[email protected]>
Closes #2124 from adrian-wang/crossjoin and squashes the following commits:
8c9b7c5 [Daoyuan Wang] add a test
7d47bbb [adrian-wang] add cross join support for hql
(cherry picked from commit 52fbdc2deddcdba02bf5945a36e15870021ec890)
Signed-off-by: Michael Armbrust <[email protected]>
commit 0f947f1239831a6ed3b47af65816715999bbe57b
Author: Andrew Or <[email protected]>
Date: 2014-08-26T06:36:09Z
[SPARK-2886] Use more specific actor system name than "spark"
As of #1777 we log the name of the actor system when it binds to a port.
The current name "spark" is super general and does not convey any meaning. For
instance, the following line is taken from my driver log after setting
`spark.driver.port` to 5001.
```
14/08/13 19:33:29 INFO Remoting: Remoting started; listening on addresses:
[akka.tcp://sparkandrews-mbp:5001]
14/08/13 19:33:29 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://sparkandrews-mbp:5001]
14/08/06 13:40:05 INFO Utils: Successfully started service 'spark' on port
5001.
```
This commit renames this to "sparkDriver" and "sparkExecutor". The goal of
this unambitious PR is simply to make the logged information more explicit
without introducing any change in functionality.
Author: Andrew Or <[email protected]>
Closes #1810 from andrewor14/service-name and squashes the following
commits:
8c459ed [Andrew Or] Use a common variable for driver/executor actor system
names
3a92843 [Andrew Or] Change actor name to sparkDriver and sparkExecutor
921363e [Andrew Or] Merge branch 'master' of github.com:apache/spark into
service-name
c8c6a62 [Andrew Or] Do not include hyphens in actor name
1c1b42e [Andrew Or] Avoid spaces in akka system name
f644b55 [Andrew Or] Use more specific service name
(cherry picked from commit b21ae5bbb9baa966f69303a30659aa8bbb2098da)
Signed-off-by: Andrew Or <[email protected]>
commit 3a9d874d7a46ab8b015631d91ba479d9a0ba827f
Author: chutium <[email protected]>
Date: 2014-08-26T18:51:26Z
[SPARK-3131][SQL] Allow user to set parquet compression codec for writing
ParquetFile in SQLContext
There are 4 different compression codecs available for
```ParquetOutputFormat```
in Spark SQL; the codec was set as a hard-coded value in
```ParquetRelation.defaultCompression```.
original discussion:
https://github.com/apache/spark/pull/195#discussion-diff-11002083
I added a new config property in SQLConf to allow users to change this
compression codec, and I used a similar short-name syntax as described in
SPARK-2953 #1873 (https://github.com/apache/spark/pull/1873/files#diff-0).
By the way, which codec should we use as the default? It was set to GZIP
(https://github.com/apache/spark/pull/195/files#diff-4), but I think maybe we
should change this to SNAPPY, since SNAPPY is already the default codec for
shuffling in spark-core (SPARK-2469, #1415), and parquet-mr supports the Snappy
codec natively
(https://github.com/Parquet/parquet-mr/commit/e440108de57199c12d66801ca93804086e7f7632).
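A hedged PySpark sketch of how the new property might be used when writing a
Parquet file (the property name, the SET-command route, and the paths are
assumptions based on the description above):
```
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`

# Hypothetical: pick the parquet compression codec by its short name.
sqlContext.sql("SET spark.sql.parquet.compression.codec=snappy")

people = sqlContext.inferSchema(sc.parallelize([{"name": "alice", "age": 1}]))
people.saveAsParquetFile("/tmp/people_snappy.parquet")
```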
Author: chutium <[email protected]>
Closes #2039 from chutium/parquet-compression and squashes the following
commits:
2f44964 [chutium] [SPARK-3131][SQL] parquet compression default codec set
to snappy, also in test suite
e578e21 [chutium] [SPARK-3131][SQL] compression codec config property name
and default codec set to snappy
21235dc [chutium] [SPARK-3131][SQL] Allow user to set parquet compression
codec for writing ParquetFile in SQLContext
(cherry picked from commit 8856c3d86009295be871989a5dc7270f31b420cd)
Signed-off-by: Michael Armbrust <[email protected]>
commit 83d273023b03faa0ceacd69956a132f40d247bc1
Author: Davies Liu <[email protected]>
Date: 2014-08-26T20:04:30Z
[SPARK-2871] [PySpark] add histogram() API
RDD.histogram(buckets)
Compute a histogram using the provided buckets. The buckets
are all open to the right except for the last, which is closed.
E.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50],
i.e. 1<=x<10, 10<=x<20, 20<=x<=50. On the input of 1
and 50 we would have a histogram of 1,0,1.
If your buckets are evenly spaced (e.g. [0, 10, 20, 30]),
this can be switched from an O(log n) insertion to O(1) per
element (where n = # buckets).
Buckets must be sorted, must not contain any duplicates, and
must have at least two elements.
If `buckets` is a number, it will generate buckets that are
evenly spaced between the minimum and maximum of the RDD. For
example, if the min value is 0 and the max is 100, given buckets
as 2, the resulting buckets will be [0,50) [50,100]. buckets must
be at least 1. If the RDD contains infinity or NaN, an exception
is thrown. If the elements in the RDD do not vary (max == min),
a single bucket is always returned.
It returns a tuple of buckets and histogram.
>>> rdd = sc.parallelize(range(51))
>>> rdd.histogram(2)
([0, 25, 50], [25, 26])
>>> rdd.histogram([0, 5, 25, 50])
([0, 5, 25, 50], [5, 20, 26])
>>> rdd.histogram([0, 15, 30, 45, 60], True)
([0, 15, 30, 45, 60], [15, 15, 15, 6])
>>> rdd = sc.parallelize(["ab", "ac", "b", "bd", "ef"])
>>> rdd.histogram(("a", "b", "c"))
(('a', 'b', 'c'), [2, 2])
Closes #122, which is a duplicate.
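One more hedged usage sketch (assuming an existing SparkContext `sc`), showing
the evenly spaced fast path and the integer bucket-count form on floats:
```
# Illustrative: evenly spaced buckets enable the O(1)-per-element fast path.
values = sc.parallelize([0.5, 3.0, 7.5, 12.0, 19.9, 25.0])
buckets, counts = values.histogram([0.0, 10.0, 20.0, 30.0])
print(buckets, counts)      # [0.0, 10.0, 20.0, 30.0] [3, 2, 1]

# Passing an int asks histogram() to build that many even buckets itself.
print(values.histogram(3))  # buckets span min (0.5) to max (25.0)
```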
Author: Davies Liu <[email protected]>
Closes #2091 from davies/histgram and squashes the following commits:
a322f8a [Davies Liu] fix deprecation of e.message
84e85fa [Davies Liu] remove evenBuckets, add more tests (including str)
d9a0722 [Davies Liu] address comments
0e18a2d [Davies Liu] add histgram() API
(cherry picked from commit 3cedc4f4d78e093fd362085e0a077bb9e4f28ca5)
Signed-off-by: Josh Rosen <[email protected]>
----