GitHub user rxin opened a pull request:
https://github.com/apache/spark/pull/1631
[SPARK-2726] and [SPARK-2727] Remove SortOrder and do in-place sort.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rxin/spark sortOrder
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1631.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1631
----
commit 2b8d89e30ebfe2272229a1eddd7542d7437c9924
Author: Cheng Hao <[email protected]>
Date: 2014-07-28T17:59:53Z
[SPARK-2523] [SQL] Hadoop table scan bug fixing
In HiveTableScan.scala, a single ObjectInspector was created and reused for
records from all partitions, which can cause a ClassCastException when the
object inspector is not identical across the table and its partitions.
This is the follow up with:
https://github.com/apache/spark/pull/1408
https://github.com/apache/spark/pull/1390
I've run a micro benchmark locally with 15,000,000 records in total, and
got the results below:
With This Patch | Partition-Based Table | Non-Partition-Based Table
------------ | ------------- | -------------
No | 1927 ms | 1885 ms
Yes | 1541 ms | 1524 ms
The results show this patch also improves scan performance, by roughly 20% in both cases.
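As a rough sanity check on the table above, the implied relative improvement can be computed directly (an illustrative snippet, not part of the patch):

```scala
// Rough arithmetic on the benchmark numbers above (illustrative only).
object ScanBenchmarkMath extends App {
  def improvementPct(beforeMs: Double, afterMs: Double): Double =
    (beforeMs - afterMs) / beforeMs * 100

  // Partitioned: 1927 ms -> 1541 ms; non-partitioned: 1885 ms -> 1524 ms.
  println(f"Partitioned:     ${improvementPct(1927, 1541)}%.1f%%")
  println(f"Non-partitioned: ${improvementPct(1885, 1524)}%.1f%%")
}
```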
PS: the benchmark code is attached below (thanks liancheng).
```
package org.apache.spark.sql.hive

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql._

object HiveTableScanPrepare extends App {
  case class Record(key: String, value: String)

  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  val rdd = sparkContext.parallelize((1 to 3000000).map(i => Record(s"$i", s"val_$i")))

  import hiveContext._

  hql("SHOW TABLES")
  hql("DROP TABLE if exists part_scan_test")
  hql("DROP TABLE if exists scan_test")
  hql("DROP TABLE if exists records")
  rdd.registerAsTable("records")

  hql("""CREATE TABLE part_scan_test (key STRING, value STRING)
        | PARTITIONED BY (part1 STRING, part2 STRING)
        | ROW FORMAT SERDE
        |   'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
        | STORED AS RCFILE
      """.stripMargin)
  hql("""CREATE TABLE scan_test (key STRING, value STRING)
        | ROW FORMAT SERDE
        |   'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
        | STORED AS RCFILE
      """.stripMargin)

  for (part1 <- 2000 until 2001) {
    for (part2 <- 1 to 5) {
      hql(s"""from records
             | insert into table part_scan_test PARTITION (part1='$part1', part2='2010-01-$part2')
             | select key, value
           """.stripMargin)
      hql(s"""from records
             | insert into table scan_test select key, value
           """.stripMargin)
    }
  }
}

object HiveTableScanTest extends App {
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  hql("SHOW TABLES")
  val part_scan_test = hql("select key, value from part_scan_test")
  val scan_test = hql("select key, value from scan_test")

  // Run each scan six times; record the duration and row count of each run.
  val r_part_scan_test = (0 to 5).map(i => benchmark(part_scan_test))
  val r_scan_test = (0 to 5).map(i => benchmark(scan_test))

  println("Scanning Partition-Based Table")
  r_part_scan_test.foreach(printResult)
  println("Scanning Non-Partition-Based Table")
  r_scan_test.foreach(printResult)

  def printResult(result: (Long, Long)) {
    println(s"Duration: ${result._1} ms Result: ${result._2}")
  }

  def benchmark(srdd: SchemaRDD) = {
    val begin = System.currentTimeMillis()
    val result = srdd.count()
    val end = System.currentTimeMillis()
    ((end - begin), result)
  }
}
```
Author: Cheng Hao <[email protected]>
Closes #1439 from chenghao-intel/hadoop_table_scan and squashes the
following commits:
888968f [Cheng Hao] Fix issues in code style
27540ba [Cheng Hao] Fix the TableScan Bug while partition serde differs
40a24a7 [Cheng Hao] Add Unit Test
commit 255b56f9f530e8594a7e6055ae07690454c66799
Author: DB Tsai <[email protected]>
Date: 2014-07-28T18:34:19Z
[SPARK-2479][MLlib] Comparing floating-point numbers using relative error
in UnitTests
Floating point math is not exact, and most floating-point numbers end up
being slightly imprecise due to rounding errors.
Simple values like 0.1 cannot be precisely represented using binary
floating point numbers, and the limited precision of floating point numbers
means that slight changes in the order of operations or the precision of
intermediates can change the result.
That means that comparing two floats for exact equality is usually
not what we want. As long as the imprecision stays small, it can usually be
ignored.
Based on discussion in the community, we have implemented two different
APIs: one for relative tolerance and one for absolute tolerance. Test
writers should know which one they need depending on their circumstances.
Developers also need to specify the eps explicitly; there is no default
value, since a default would sometimes cause confusion.
When comparing against zero using relative tolerance, an exception is
raised to warn users that the comparison is meaningless.
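The semantics described above can be sketched roughly as follows. This is a hypothetical standalone sketch; the actual implementation in Spark's MLlib test utilities uses implicit operators (`~==`, `!~==`, `relTol`, `absTol`) and may differ in detail:

```scala
// Hypothetical sketch of the two comparison modes described above.
object ToleranceSketch {
  // Relative tolerance: the difference is measured against the magnitude of
  // the larger operand. Comparing against (near) zero is rejected because
  // the ratio is meaningless there.
  def relTolEq(a: Double, b: Double, eps: Double): Boolean = {
    val norm = math.max(math.abs(a), math.abs(b))
    require(norm > eps, "Comparing against zero with relative tolerance is meaningless")
    math.abs(a - b) < eps * norm
  }

  // Absolute tolerance: the difference must simply be below a fixed eps.
  def absTolEq(a: Double, b: Double, eps: Double): Boolean =
    math.abs(a - b) < eps
}
```

With these definitions, `relTolEq(23.1, 23.52, 0.02)` holds while `relTolEq(23.1, 22.34, 0.02)` does not, matching the examples below.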
For relative tolerance, users can now write
assert(23.1 ~== 23.52 relTol 0.02)
assert(23.1 ~== 22.74 relTol 0.02)
assert(23.1 ~= 23.52 relTol 0.02)
assert(23.1 ~= 22.74 relTol 0.02)
assert(!(23.1 !~= 23.52 relTol 0.02))
assert(!(23.1 !~= 22.74 relTol 0.02))
// This will throw an exception with the following message:
// "Did not expect 23.1 and 23.52 to be within 0.02 using relative tolerance."
assert(23.1 !~== 23.52 relTol 0.02)
// "Expected 23.1 and 22.34 to be within 0.02 using relative tolerance."
assert(23.1 ~== 22.34 relTol 0.02)
For absolute error,
assert(17.8 ~== 17.99 absTol 0.2)
assert(17.8 ~== 17.61 absTol 0.2)
assert(17.8 ~= 17.99 absTol 0.2)
assert(17.8 ~= 17.61 absTol 0.2)
assert(!(17.8 !~= 17.99 absTol 0.2))
assert(!(17.8 !~= 17.61 absTol 0.2))
// This will throw an exception with the following message:
// "Did not expect 17.8 and 17.99 to be within 0.2 using absolute error."
assert(17.8 !~== 17.99 absTol 0.2)
// "Expected 17.8 and 17.59 to be within 0.2 using absolute error."
assert(17.8 ~== 17.59 absTol 0.2)
Authors:
DB Tsai <dbtsai@alpinenow.com>
Marek Kolodziej <marek@alpinenow.com>
Author: DB Tsai <[email protected]>
Closes #1425 from dbtsai/SPARK-2479_comparing_floating_point and squashes
the following commits:
8c7cbcc [DB Tsai] Alpine Data Labs
commit a7a9d14479ea6421513a962ff0f45cb969368bab
Author: Cheng Lian <[email protected]>
Date: 2014-07-28T19:07:30Z
[SPARK-2410][SQL] Merging Hive Thrift/JDBC server (with Maven profile fix)
JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
Another try for #1399 & #1600. Those two PRs broke Jenkins builds because
we made a separate profile `hive-thriftserver` in sub-project `assembly`, but
the `hive-thriftserver` module was defined outside the `hive-thriftserver`
profile. Thus every pull request, even one that doesn't touch SQL code, would
also execute the test suites defined in `hive-thriftserver`, and those tests
fail because the related .class files are not included in the assembly jar.
In the most recent commit, module `hive-thriftserver` is moved into its own
profile to fix this problem. All previous commits are squashed for clarity.
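For illustration, the fix amounts to declaring the module inside its own profile in the POM, roughly like this (a sketch only; the exact element contents are assumptions, not the literal patch):

```
<!-- Hypothetical sketch: the module is listed inside the profile, so it is
     only built and tested when -Phive-thriftserver is enabled. -->
<profile>
  <id>hive-thriftserver</id>
  <modules>
    <module>sql/hive-thriftserver</module>
  </modules>
</profile>
```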
Author: Cheng Lian <[email protected]>
Closes #1620 from liancheng/jdbc-with-maven-fix and squashes the following
commits:
629988e [Cheng Lian] Moved hive-thriftserver module definition into its own
profile
ec3c7a7 [Cheng Lian] Cherry picked the Hive Thrift server
commit 39ab87b924ad65b6b9b7aa6831f3e9ddc2b76dd7
Author: Aaron Davidson <[email protected]>
Date: 2014-07-28T20:37:44Z
Use commons-lang3 in SignalLogger rather than commons-lang
Spark only transitively depends on the latter, based on the Hadoop version.
Author: Aaron Davidson <[email protected]>
Closes #1621 from aarondav/lang3 and squashes the following commits:
93c93bf [Aaron Davidson] Use commons-lang3 in SignalLogger rather than
commons-lang
commit 16ef4d110f15dfe66852802fdadfe2ed7574ddc2
Author: Yadong Qi <[email protected]>
Date: 2014-07-29T04:39:02Z
Excess judgment
Author: Yadong Qi <[email protected]>
Closes #1629 from watermen/bug-fix2 and squashes the following commits:
59b7237 [Yadong Qi] Update HiveQl.scala
commit c9d37e1bacaff2be9ee9174a2965fdc2e9a04245
Author: Reynold Xin <[email protected]>
Date: 2014-07-29T05:15:05Z
[SPARK-2726] and [SPARK-2727] Remove SortOrder and do in-place sort.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---