spark git commit: SPARK-5308 [BUILD] MD5 / SHA1 hash format doesn't match standard Maven output
Repository: spark
Updated Branches: refs/heads/master 914267484 -> ff356e2a2

SPARK-5308 [BUILD] MD5 / SHA1 hash format doesn't match standard Maven output

Here's one way to make the hashes match what Maven's plugins would create. It takes a little extra footwork since OS X doesn't have the same command-line tools. An alternative is just to make Maven output these, of course; would that be better? I ask in case there is a reason I'm missing, such as needing to hash files that Maven doesn't build.

Author: Sean Owen <so...@cloudera.com>

Closes #4161 from srowen/SPARK-5308 and squashes the following commits:

70d09d0 [Sean Owen] Use $(...) syntax
e25eff8 [Sean Owen] Generate MD5, SHA1 hashes in a format like Maven's plugin

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ff356e2a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ff356e2a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ff356e2a

Branch: refs/heads/master
Commit: ff356e2a21e31998cda3062e560a276a3bfaa7ab
Parents: 9142674
Author: Sean Owen <so...@cloudera.com>
Authored: Tue Jan 27 10:22:50 2015 -0800
Committer: Patrick Wendell <patr...@databricks.com>
Committed: Tue Jan 27 10:22:50 2015 -0800

----------------------------------------------------------------------
 dev/create-release/create-release.sh | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/ff356e2a/dev/create-release/create-release.sh
----------------------------------------------------------------------
diff --git a/dev/create-release/create-release.sh b/dev/create-release/create-release.sh
index b1b8cb4..b2a7e09 100755
--- a/dev/create-release/create-release.sh
+++ b/dev/create-release/create-release.sh
@@ -122,8 +122,14 @@ if [[ ! "$@" =~ --package-only ]]; then
     for file in $(find . -type f)
     do
       echo $GPG_PASSPHRASE | gpg --passphrase-fd 0 --output $file.asc --detach-sig --armour $file;
-      gpg --print-md MD5 $file > $file.md5;
-      gpg --print-md SHA1 $file > $file.sha1
+      if [ $(command -v md5) ]; then
+        # Available on OS X; -q to keep only hash
+        md5 -q $file > $file.md5
+      else
+        # Available on Linux; cut to keep only hash
+        md5sum $file | cut -f1 -d' ' > $file.md5
+      fi
+      shasum -a 1 $file | cut -f1 -d' ' > $file.sha1
     done
     nexus_upload=$NEXUS_ROOT/deployByRepositoryId/$staged_repo_id
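The mismatch here is purely one of rendering: both tools compute the same MD5, but gpg's --print-md prints grouped uppercase hex pairs prefixed with the file name, while Maven's .md5 files contain only the bare lowercase hex digest. A minimal Scala sketch of the two renderings, illustrative only (the gpg grouping shown is approximate; the release script itself stays in shell):

import java.nio.file.{Files, Paths}
import java.security.MessageDigest

object HashFormats {
  def main(args: Array[String]): Unit = {
    val file   = args(0)
    val digest = MessageDigest.getInstance("MD5").digest(Files.readAllBytes(Paths.get(file)))

    // Maven-style .md5 content: the bare lowercase hex digest, nothing else.
    val mavenStyle = digest.map("%02x".format(_)).mkString
    println(s"Maven-style: $mavenStyle")

    // Roughly what `gpg --print-md MD5 file` prints: the file name, then
    // uppercase hex pairs in space-separated groups.
    val gpgStyle = digest.map("%02X".format(_)).grouped(8).map(_.mkString(" ")).mkString("  ")
    println(s"gpg-style:   $file: $gpgStyle")
  }
}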
spark git commit: [SPARK-5419][Mllib] Fix the logic in Vectors.sqdist
Repository: spark
Updated Branches: refs/heads/master d6894b1c5 -> 7b0ed7979

[SPARK-5419][Mllib] Fix the logic in Vectors.sqdist

The current implementation of Vectors.sqdist is inefficient because it allocates temporary arrays. There is also a bug in the guard `v1.indices.length / v1.size < 0.5`: both operands are Int, so the division is integer division and the guard does not test the intended density ratio. This PR fixes the bug and refactors sqdist without allocating new arrays.

Author: Liang-Chi Hsieh <vii...@gmail.com>

Closes #4217 from viirya/fix_sqdist and squashes the following commits:

e8b0b3d [Liang-Chi Hsieh] For review comments.
314c424 [Liang-Chi Hsieh] Fix sqdist bug.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7b0ed797
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7b0ed797
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7b0ed797

Branch: refs/heads/master
Commit: 7b0ed797958a91cda73baa7aa49ce66bfcb6b64b
Parents: d6894b1
Author: Liang-Chi Hsieh <vii...@gmail.com>
Authored: Tue Jan 27 01:29:14 2015 -0800
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Tue Jan 27 01:29:14 2015 -0800

----------------------------------------------------------------------
 .../org/apache/spark/mllib/linalg/Vectors.scala | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/7b0ed797/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala b/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala
index b3022ad..2834ea7 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala
@@ -371,18 +371,23 @@ object Vectors {
           squaredDistance += score * score
         }
 
-      case (v1: SparseVector, v2: DenseVector) if v1.indices.length / v1.size < 0.5 =>
+      case (v1: SparseVector, v2: DenseVector) =>
         squaredDistance = sqdist(v1, v2)
 
-      case (v1: DenseVector, v2: SparseVector) if v2.indices.length / v2.size < 0.5 =>
+      case (v1: DenseVector, v2: SparseVector) =>
         squaredDistance = sqdist(v2, v1)
 
-      // When a SparseVector is approximately dense, we treat it as a DenseVector
-      case (v1, v2) =>
-        squaredDistance = v1.toArray.zip(v2.toArray).foldLeft(0.0){ (distance, elems) =>
-          val score = elems._1 - elems._2
-          distance + score * score
+      case (DenseVector(vv1), DenseVector(vv2)) =>
+        var kv = 0
+        val sz = vv1.size
+        while (kv < sz) {
+          val score = vv1(kv) - vv2(kv)
+          squaredDistance += score * score
+          kv += 1
         }
+      case _ =>
+        throw new IllegalArgumentException("Do not support vector type " + v1.getClass +
+          " and " + v2.getClass)
     }
     squaredDistance
   }
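The sparse/dense cases now always dispatch to the private sqdist helper referenced in the diff. A sketch of the underlying idea, with plain arrays standing in for Spark's SparseVector fields (illustrative, not the exact merged code): walk every dense position once, consuming the sparse entries in order, so no temporary arrays are materialized.

// Squared distance between a sparse vector (indices/values sorted by index)
// and a dense vector, in a single pass with no temp arrays.
def sqdistSparseDense(indices: Array[Int], values: Array[Double],
                      dense: Array[Double]): Double = {
  var dist = 0.0
  var kv1 = 0  // cursor into the sparse entries
  var kv2 = 0  // cursor over every dense position
  while (kv2 < dense.length) {
    var score = dense(kv2)
    if (kv1 < indices.length && indices(kv1) == kv2) {
      score -= values(kv1)  // this position is stored in the sparse vector
      kv1 += 1
    }
    dist += score * score
    kv2 += 1
  }
  dist
}

// Example: sparse (1.0 at 0, 3.0 at 2) vs dense (1.0, 2.0, 0.0)
// gives 0^2 + 2^2 + 3^2 = 13.0
// sqdistSparseDense(Array(0, 2), Array(1.0, 3.0), Array(1.0, 2.0, 0.0))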
[5/5] spark git commit: [SPARK-5097][SQL] DataFrame
[SPARK-5097][SQL] DataFrame

This pull request redesigns the existing Spark SQL DSL, which already provides data-frame-like functionality.

TODOs: with the exception of Python support, the other tasks can be done in separate, follow-up PRs.

- [ ] Audit of the API
- [ ] Documentation
- [ ] More test cases to cover the new API
- [x] Python support
- [ ] Type alias SchemaRDD

Author: Reynold Xin <r...@databricks.com>
Author: Davies Liu <dav...@databricks.com>

Closes #4173 from rxin/df1 and squashes the following commits:

0a1a73b [Reynold Xin] Merge branch 'df1' of github.com:rxin/spark into df1
23b4427 [Reynold Xin] Mima.
828f70d [Reynold Xin] Merge pull request #7 from davies/df
257b9e6 [Davies Liu] add repartition
6bf2b73 [Davies Liu] fix collect with UDT and tests
e971078 [Reynold Xin] Missing quotes.
b9306b4 [Reynold Xin] Remove removeColumn/updateColumn for now.
a728bf2 [Reynold Xin] Example rename.
e8aa3d3 [Reynold Xin] groupby -> groupBy.
9662c9e [Davies Liu] improve DataFrame Python API
4ae51ea [Davies Liu] python API for dataframe
1e5e454 [Reynold Xin] Fixed a bug with symbol conversion.
2ca74db [Reynold Xin] Couple minor fixes.
ea98ea1 [Reynold Xin] Documentation literal expressions.
2b22684 [Reynold Xin] Got rid of IntelliJ problems.
02bbfbc [Reynold Xin] Tightening imports.
ffbce66 [Reynold Xin] Fixed compilation error.
59b6d8b [Reynold Xin] Style violation.
b85edfb [Reynold Xin] ALS.
8c37f0a [Reynold Xin] Made MLlib and examples compile
6d53134 [Reynold Xin] Hive module.
d35efd5 [Reynold Xin] Fixed compilation error.
ce4a5d2 [Reynold Xin] Fixed test cases in SQL except ParquetIOSuite.
66d5ef1 [Reynold Xin] SQLContext minor patch.
c9bcdc0 [Reynold Xin] Checkpoint: SQL module compiles!

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/119f45d6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/119f45d6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/119f45d6

Branch: refs/heads/master
Commit: 119f45d61d7b48d376cca05e1b4f0c7fcf65bfa8
Parents: b1b35ca
Author: Reynold Xin <r...@databricks.com>
Authored: Tue Jan 27 16:08:24 2015 -0800
Committer: Reynold Xin <r...@databricks.com>
Committed: Tue Jan 27 16:08:24 2015 -0800

----------------------------------------------------------------------
 .../examples/ml/JavaCrossValidatorExample.java  |  10 +-
 .../examples/ml/JavaSimpleParamsExample.java    |  12 +-
 .../JavaSimpleTextClassificationPipeline.java   |  10 +-
 .../apache/spark/examples/sql/JavaSparkSQL.java |  36 +-
 .../src/main/python/mllib/dataset_example.py    |   2 +-
 examples/src/main/python/sql.py                 |  16 +-
 .../examples/ml/CrossValidatorExample.scala     |   3 +-
 .../apache/spark/examples/ml/MovieLensALS.scala |   2 +-
 .../spark/examples/ml/SimpleParamsExample.scala |   5 +-
 .../ml/SimpleTextClassificationPipeline.scala   |   3 +-
 .../spark/examples/mllib/DatasetExample.scala   |  28 +-
 .../apache/spark/examples/sql/RDDRelation.scala |   6 +-
 .../scala/org/apache/spark/ml/Estimator.scala   |   8 +-
 .../scala/org/apache/spark/ml/Evaluator.scala   |   4 +-
 .../scala/org/apache/spark/ml/Pipeline.scala    |   6 +-
 .../scala/org/apache/spark/ml/Transformer.scala |  17 +-
 .../ml/classification/LogisticRegression.scala  |  14 +-
 .../BinaryClassificationEvaluator.scala         |   7 +-
 .../spark/ml/feature/StandardScaler.scala       |  15 +-
 .../apache/spark/ml/recommendation/ALS.scala    |  37 +-
 .../apache/spark/ml/tuning/CrossValidator.scala |   8 +-
 .../org/apache/spark/ml/JavaPipelineSuite.java  |   6 +-
 .../JavaLogisticRegressionSuite.java            |   8 +-
 .../ml/tuning/JavaCrossValidatorSuite.java      |   4 +-
 .../org/apache/spark/ml/PipelineSuite.scala     |  14 +-
 .../LogisticRegressionSuite.scala               |  16 +-
 .../spark/ml/recommendation/ALSSuite.scala      |   4 +-
 .../spark/ml/tuning/CrossValidatorSuite.scala   |   4 +-
 project/MimaExcludes.scala                      |  15 +-
 python/pyspark/java_gateway.py                  |   7 +-
 python/pyspark/sql.py                           | 967 ++-
 python/pyspark/tests.py                         | 155 +--
 .../analysis/MultiInstanceRelation.scala        |   2 +-
 .../catalyst/expressions/namedExpressions.scala |   3 +
 .../spark/sql/catalyst/plans/joinTypes.scala    |  15 +
 .../catalyst/plans/logical/TestRelation.scala   |   8 +-
 .../org/apache/spark/sql/CacheManager.scala     |   8 +-
 .../scala/org/apache/spark/sql/Column.scala     | 528 ++
 .../scala/org/apache/spark/sql/DataFrame.scala  | 596 ++++
 .../org/apache/spark/sql/GroupedDataFrame.scala | 139 +++
 .../scala/org/apache/spark/sql/Literal.scala    |  98 ++
 .../scala/org/apache/spark/sql/SQLContext.scala |  85 +-
 .../scala/org/apache/spark/sql/SchemaRDD.scala  | 511 --
 .../org/apache/spark/sql/SchemaRDDLike.scala    | 139 ---
 .../main/scala/org/apache/spark/sql/api.scala   | 289 ++
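As a quick orientation to the API being introduced, here is the headline usage example from the new DataFrame scaladoc (reproduced from the [3/5] part of this patch, with string quotes restored), wrapped in a method so the assumed context is explicit. That avg/max come from this commit's dsl package is an assumption here, so treat this as a sketch rather than compile-checked code:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.dsl._  // assumed home of avg/max at this commit

def example(sqlContext: SQLContext): Unit = {
  val people = sqlContext.parquetFile("...")
  val department = sqlContext.parquetFile("...")

  people.filter("age > 30")
    .join(department, people("deptId") === department("id"))
    .groupBy(department("name"), "gender")
    .agg(avg(people("salary")), max(people("age")))
}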
[1/5] spark git commit: [SPARK-5097][SQL] DataFrame
Repository: spark
Updated Branches: refs/heads/master b1b35ca2e -> 119f45d61

http://git-wip-us.apache.org/repos/asf/spark/blob/119f45d6/sql/core/src/test/scala/org/apache/spark/sql/execution/TgfSuite.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/TgfSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/TgfSuite.scala
deleted file mode 100644
index 272c0d4..0000000
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/TgfSuite.scala
+++ /dev/null
@@ -1,65 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *    http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.sql.execution
-
-import org.apache.spark.sql.QueryTest
-import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.plans._
-
-/* Implicit conversions */
-import org.apache.spark.sql.test.TestSQLContext._
-
-/**
- * This is an example TGF that uses UnresolvedAttributes 'name and 'age to access specific columns
- * from the input data. These will be replaced during analysis with specific AttributeReferences
- * and then bound to specific ordinals during query planning. While TGFs could also access specific
- * columns using hand-coded ordinals, doing so violates data independence.
- *
- * Note: this is only a rough example of how TGFs can be expressed, the final version will likely
- * involve a lot more sugar for cleaner use in Scala/Java/etc.
- */
-case class ExampleTGF(input: Seq[Expression] = Seq('name, 'age)) extends Generator {
-  def children = input
-  protected def makeOutput() = 'nameAndAge.string :: Nil
-
-  val Seq(nameAttr, ageAttr) = input
-
-  override def eval(input: Row): TraversableOnce[Row] = {
-    val name = nameAttr.eval(input)
-    val age = ageAttr.eval(input).asInstanceOf[Int]
-
-    Iterator(
-      new GenericRow(Array[Any](s"$name is $age years old")),
-      new GenericRow(Array[Any](s"Next year, $name will be ${age + 1} years old")))
-  }
-}
-
-class TgfSuite extends QueryTest {
-  val inputData =
-    logical.LocalRelation('name.string, 'age.int).loadData(
-      ("michael", 29) :: Nil
-    )
-
-  test("simple tgf example") {
-    checkAnswer(
-      inputData.generate(ExampleTGF()),
-      Seq(
-        Row("michael is 29 years old"),
-        Row("Next year, michael will be 30 years old")))
-  }
-}

http://git-wip-us.apache.org/repos/asf/spark/blob/119f45d6/sql/core/src/test/scala/org/apache/spark/sql/json/JsonSuite.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/json/JsonSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/json/JsonSuite.scala
index 94d14ac..ef198f8 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/json/JsonSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/json/JsonSuite.scala
@@ -21,11 +21,12 @@ import java.sql.{Date, Timestamp}
 
 import org.apache.spark.sql.TestData._
 import org.apache.spark.sql.catalyst.util._
+import org.apache.spark.sql.dsl._
 import org.apache.spark.sql.json.JsonRDD.{compatibleType, enforceCorrectType}
 import org.apache.spark.sql.test.TestSQLContext
 import org.apache.spark.sql.test.TestSQLContext._
 import org.apache.spark.sql.types._
-import org.apache.spark.sql.{QueryTest, Row, SQLConf}
+import org.apache.spark.sql.{Literal, QueryTest, Row, SQLConf}
 
 class JsonSuite extends QueryTest {
   import org.apache.spark.sql.json.TestJsonData._
@@ -463,8 +464,8 @@ class JsonSuite extends QueryTest {
     // in the Project.
     checkAnswer(
       jsonSchemaRDD.
-        where('num_str > BigDecimal("92233720368547758060")).
-        select('num_str + 1.2 as Symbol("num")),
+        where('num_str > Literal(BigDecimal("92233720368547758060"))).
+        select(('num_str + Literal(1.2)).as("num")),
       Row(new java.math.BigDecimal("92233720368547758061.2"))
     )
 
@@ -820,7 +821,7 @@ class JsonSuite extends QueryTest {
     val schemaRDD1 = applySchema(rowRDD1, schema1)
     schemaRDD1.registerTempTable("applySchema1")
 
-    val schemaRDD2 = schemaRDD1.toSchemaRDD
+    val schemaRDD2 = schemaRDD1.toDF
     val result = schemaRDD2.toJSON.collect()
 
     assert(result(0) ==
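The JsonSuite hunk above illustrates a mechanical consequence of the new DSL: bare Scala values on the right-hand side of a column expression are now wrapped in Literal explicitly. A self-contained toy sketch of that call shape (Literal and Col here are illustrative stand-ins, not Spark's classes):

// Toy model of the explicit-Literal style seen in the JsonSuite change.
final case class Literal(value: Any)

final case class Col(name: String) {
  // Building a string "expression" is enough to show the call shape.
  def >(rhs: Literal): String = s"($name > ${rhs.value})"
  def +(rhs: Literal): String = s"($name + ${rhs.value})"
  def as(alias: String): String = s"$name AS $alias"
}

object LiteralStyleDemo extends App {
  val numStr = Col("num_str")
  println(numStr > Literal(BigDecimal("92233720368547758060")))  // (num_str > 92233720368547758060)
  println(Col(numStr + Literal(1.2)).as("num"))                  // (num_str + 1.2) AS num
}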
[4/5] spark git commit: [SPARK-5097][SQL] DataFrame
http://git-wip-us.apache.org/repos/asf/spark/blob/119f45d6/python/pyspark/sql.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql.py b/python/pyspark/sql.py
index 1990323..7d7550c 100644
--- a/python/pyspark/sql.py
+++ b/python/pyspark/sql.py
@@ -20,15 +20,19 @@ public classes of Spark SQL:
     - L{SQLContext}
       Main entry point for SQL functionality.
-    - L{SchemaRDD}
+    - L{DataFrame}
       A Resilient Distributed Dataset (RDD) with Schema information for the data contained. In
-      addition to normal RDD operations, SchemaRDDs also support SQL.
+      addition to normal RDD operations, DataFrames also support SQL.
+    - L{GroupedDataFrame}
+    - L{Column}
+      Column is a DataFrame with a single column.
     - L{Row}
       A Row of data returned by a Spark SQL query.
     - L{HiveContext}
       Main entry point for accessing data stored in Apache Hive.
 """
 
+import sys
 import itertools
 import decimal
 import datetime
@@ -36,6 +40,9 @@ import keyword
 import warnings
 import json
 import re
+import random
+import os
+from tempfile import NamedTemporaryFile
 from array import array
 from operator import itemgetter
 from itertools import imap
@@ -43,6 +50,7 @@ from itertools import imap
 from py4j.protocol import Py4JError
 from py4j.java_collections import ListConverter, MapConverter
 
+from pyspark.context import SparkContext
 from pyspark.rdd import RDD
 from pyspark.serializers import BatchedSerializer, AutoBatchedSerializer, PickleSerializer, \
     CloudPickleSerializer, UTF8Deserializer
@@ -54,7 +62,8 @@ __all__ = [
     "StringType", "BinaryType", "BooleanType", "DateType", "TimestampType", "DecimalType",
     "DoubleType", "FloatType", "ByteType", "IntegerType", "LongType", "ShortType",
     "ArrayType", "MapType", "StructField", "StructType",
-    "SQLContext", "HiveContext", "SchemaRDD", "Row"]
+    "SQLContext", "HiveContext", "DataFrame", "GroupedDataFrame", "Column", "Row",
+    "SchemaRDD"]
 
 
 class DataType(object):
@@ -1171,7 +1180,7 @@ def _create_cls(dataType):
 
     class Row(tuple):
 
-        """ Row in SchemaRDD """
+        """ Row in DataFrame """
         __DATATYPE__ = dataType
         __FIELDS__ = tuple(f.name for f in dataType.fields)
         __slots__ = ()
@@ -1198,7 +1207,7 @@ class SQLContext(object):
 
     """Main entry point for Spark SQL functionality.
 
-    A SQLContext can be used create L{SchemaRDD}, register L{SchemaRDD} as
+    A SQLContext can be used create L{DataFrame}, register L{DataFrame} as
     tables, execute SQL over tables, cache tables, and read parquet files.
     """
 
@@ -1209,8 +1218,8 @@
         :param sqlContext: An optional JVM Scala SQLContext. If set, we do not instatiate a new
             SQLContext in the JVM, instead we make all calls to this object.
 
-        >>> srdd = sqlCtx.inferSchema(rdd)
-        >>> sqlCtx.inferSchema(srdd)  # doctest: +IGNORE_EXCEPTION_DETAIL
+        >>> df = sqlCtx.inferSchema(rdd)
+        >>> sqlCtx.inferSchema(df)  # doctest: +IGNORE_EXCEPTION_DETAIL
         Traceback (most recent call last):
             ...
         TypeError:...
@@ -1225,12 +1234,12 @@
        >>> allTypes = sc.parallelize([Row(i=1, s="string", d=1.0, l=1L,
        ...     b=True, list=[1, 2, 3], dict={"s": 0}, row=Row(a=1),
        ...     time=datetime(2014, 8, 1, 14, 1, 5))])
-        >>> srdd = sqlCtx.inferSchema(allTypes)
-        >>> srdd.registerTempTable("allTypes")
+        >>> df = sqlCtx.inferSchema(allTypes)
+        >>> df.registerTempTable("allTypes")
        >>> sqlCtx.sql('select i+1, d+1, not b, list[1], dict["s"], time, row.a '
        ...            'from allTypes where b and i > 0').collect()
        [Row(c0=2, c1=2.0, c2=False, c3=2, c4=0...8, 1, 14, 1, 5), a=1)]
-        >>> srdd.map(lambda x: (x.i, x.s, x.d, x.l, x.b, x.time,
+        >>> df.map(lambda x: (x.i, x.s, x.d, x.l, x.b, x.time,
        ...                     x.row.a, x.list)).collect()
        [(1, u'string', 1.0, 1, True, ...(2014, 8, 1, 14, 1, 5), 1, [1, 2, 3])]
@@ -1309,23 +1318,23 @@ class SQLContext(object):
        ...     [Row(field1=1, field2="row1"),
        ...      Row(field1=2, field2="row2"),
        ...      Row(field1=3, field2="row3")])
-        >>> srdd = sqlCtx.inferSchema(rdd)
-        >>> srdd.collect()[0]
+        >>> df = sqlCtx.inferSchema(rdd)
+        >>> df.collect()[0]
        Row(field1=1, field2=u'row1')
 
        >>> NestedRow = Row("f1", "f2")
        >>> nestedRdd1 = sc.parallelize([
        ...     NestedRow(array('i', [1, 2]), {"row1": 1.0}),
        ...     NestedRow(array('i', [2, 3]), {"row2": 2.0})])
-        >>> srdd = sqlCtx.inferSchema(nestedRdd1)
-        >>> srdd.collect()
+        >>> df = sqlCtx.inferSchema(nestedRdd1)
+        >>> df.collect()
        [Row(f1=[1, 2], f2={u'row1': 1.0}), ..., f2={u'row2': 2.0})]
 
        >>> nestedRdd2 = sc.parallelize([
        ...     NestedRow([[1, 2], [2, 3]], [1, 2]),
        ...     NestedRow([[2, 3], [3, 4]], [2, 3])])
-        >>> srdd = sqlCtx.inferSchema(nestedRdd2)
-
[3/5] spark git commit: [SPARK-5097][SQL] DataFrame
http://git-wip-us.apache.org/repos/asf/spark/blob/119f45d6/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
new file mode 100644
index 0000000..d0bb364
--- /dev/null
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
@@ -0,0 +1,596 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import scala.language.implicitConversions
+import scala.reflect.ClassTag
+import scala.collection.JavaConversions._
+
+import java.util.{ArrayList, List => JList}
+
+import com.fasterxml.jackson.core.JsonFactory
+import net.razorvine.pickle.Pickler
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.api.python.SerDeUtil
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.sql.catalyst.ScalaReflection
+import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.{Literal => LiteralExpr}
+import org.apache.spark.sql.catalyst.plans.{JoinType, Inner}
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.execution.{LogicalRDD, EvaluatePython}
+import org.apache.spark.sql.json.JsonRDD
+import org.apache.spark.sql.types.{NumericType, StructType}
+import org.apache.spark.util.Utils
+
+
+/**
+ * A collection of rows that have the same columns.
+ *
+ * A [[DataFrame]] is equivalent to a relational table in Spark SQL, and can be created using
+ * various functions in [[SQLContext]].
+ * {{{
+ *   val people = sqlContext.parquetFile("...")
+ * }}}
+ *
+ * Once created, it can be manipulated using the various domain-specific-language (DSL) functions
+ * defined in: [[DataFrame]] (this class), [[Column]], and [[dsl]] for Scala DSL.
+ *
+ * To select a column from the data frame, use the apply method:
+ * {{{
+ *   val ageCol = people("age")  // in Scala
+ *   Column ageCol = people.apply("age")  // in Java
+ * }}}
+ *
+ * Note that the [[Column]] type can also be manipulated through its various functions.
+ * {{{
+ *   // The following creates a new column that increases everybody's age by 10.
+ *   people("age") + 10  // in Scala
+ * }}}
+ *
+ * A more concrete example:
+ * {{{
+ *   // To create DataFrame using SQLContext
+ *   val people = sqlContext.parquetFile("...")
+ *   val department = sqlContext.parquetFile("...")
+ *
+ *   people.filter("age > 30")
+ *     .join(department, people("deptId") === department("id"))
+ *     .groupBy(department("name"), "gender")
+ *     .agg(avg(people("salary")), max(people("age")))
+ * }}}
+ */
+// TODO: Improve documentation.
+class DataFrame protected[sql](
+    val sqlContext: SQLContext,
+    private val baseLogicalPlan: LogicalPlan,
+    operatorsEnabled: Boolean)
+  extends DataFrameSpecificApi with RDDApi[Row] {
+
+  protected[sql] def this(sqlContext: Option[SQLContext], plan: Option[LogicalPlan]) =
+    this(sqlContext.orNull, plan.orNull, sqlContext.isDefined && plan.isDefined)
+
+  protected[sql] def this(sqlContext: SQLContext, plan: LogicalPlan) = this(sqlContext, plan, true)
+
+  @transient protected[sql] lazy val queryExecution = sqlContext.executePlan(baseLogicalPlan)
+
+  @transient protected[sql] val logicalPlan: LogicalPlan = baseLogicalPlan match {
+    // For various commands (like DDL) and queries with side effects, we force query optimization to
+    // happen right away to let these side effects take place eagerly.
+    case _: Command | _: InsertIntoTable | _: CreateTableAsSelect[_] | _: WriteToFile =>
+      LogicalRDD(queryExecution.analyzed.output, queryExecution.toRdd)(sqlContext)
+    case _ =>
+      baseLogicalPlan
+  }
+
+  /**
+   * An implicit conversion function internal to this class for us to avoid doing
+   * "new DataFrame(...)" everywhere.
+   */
+  private[this] implicit def toDataFrame(logicalPlan: LogicalPlan): DataFrame = {
+    new DataFrame(sqlContext, logicalPlan, true)
+  }
+
+  /** Return the list of numeric columns, useful
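An aside on the toDataFrame implicit visible in the excerpt above: making the conversion private[this] lets every transformation method in the class build a bare LogicalPlan and still declare a DataFrame return type, without sprinkling "new DataFrame(...)" constructors through the file. A toy sketch of the idiom (Plan and Frame are illustrative names, not Spark's):

import scala.language.implicitConversions

final case class Plan(steps: List[String])

final class Frame(val plan: Plan) {
  // The private implicit lifts a raw Plan into a Frame, so the methods
  // below can return the plan they build and let the compiler wrap it.
  private[this] implicit def toFrame(p: Plan): Frame = new Frame(p)

  def select(cols: String*): Frame = Plan(plan.steps :+ s"select(${cols.mkString(",")})")
  def limit(n: Int): Frame = Plan(plan.steps :+ s"limit($n)")
}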
spark git commit: SPARK-5199. FS read metrics should support CombineFileSplits and track bytes from all FSs
Repository: spark
Updated Branches: refs/heads/master fdaad4eb0 -> b1b35ca2e

SPARK-5199. FS read metrics should support CombineFileSplits and track bytes from all FSs

...mbineFileSplits

Author: Sandy Ryza <sa...@cloudera.com>

Closes #4050 from sryza/sandy-spark-5199 and squashes the following commits:

864514b [Sandy Ryza] Add tests and fix bug
0d504f1 [Sandy Ryza] Prettify
915c7e6 [Sandy Ryza] Get metrics from all filesystems
cdbc3e8 [Sandy Ryza] SPARK-5199. Input metrics should show up for InputFormats that return CombineFileSplits

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b1b35ca2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b1b35ca2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b1b35ca2

Branch: refs/heads/master
Commit: b1b35ca2e440df40b253bf967bb93705d355c1c0
Parents: fdaad4e
Author: Sandy Ryza <sa...@cloudera.com>
Authored: Tue Jan 27 15:42:55 2015 -0800
Committer: Patrick Wendell <patr...@databricks.com>
Committed: Tue Jan 27 15:42:55 2015 -0800

----------------------------------------------------------------------
 .../apache/spark/deploy/SparkHadoopUtil.scala   | 16 ++--
 .../org/apache/spark/executor/TaskMetrics.scala |  1 -
 .../scala/org/apache/spark/rdd/HadoopRDD.scala  | 12 ++-
 .../org/apache/spark/rdd/NewHadoopRDD.scala     | 15 +--
 .../org/apache/spark/rdd/PairRDDFunctions.scala | 11 +--
 .../spark/metrics/InputOutputMetricsSuite.scala | 97 +++-
 6 files changed, 120 insertions(+), 32 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/b1b35ca2/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
----------------------------------------------------------------------
diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
index 57f9faf..211e3ed 100644
--- a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
@@ -133,10 +133,9 @@ class SparkHadoopUtil extends Logging {
    * statistics are only available as of Hadoop 2.5 (see HADOOP-10688).
    * Returns None if the required method can't be found.
    */
-  private[spark] def getFSBytesReadOnThreadCallback(path: Path, conf: Configuration)
-    : Option[() => Long] = {
+  private[spark] def getFSBytesReadOnThreadCallback(): Option[() => Long] = {
     try {
-      val threadStats = getFileSystemThreadStatistics(path, conf)
+      val threadStats = getFileSystemThreadStatistics()
       val getBytesReadMethod = getFileSystemThreadStatisticsMethod("getBytesRead")
       val f = () => threadStats.map(getBytesReadMethod.invoke(_).asInstanceOf[Long]).sum
       val baselineBytesRead = f()
@@ -156,10 +155,9 @@ class SparkHadoopUtil extends Logging {
    * statistics are only available as of Hadoop 2.5 (see HADOOP-10688).
    * Returns None if the required method can't be found.
    */
-  private[spark] def getFSBytesWrittenOnThreadCallback(path: Path, conf: Configuration)
-    : Option[() => Long] = {
+  private[spark] def getFSBytesWrittenOnThreadCallback(): Option[() => Long] = {
     try {
-      val threadStats = getFileSystemThreadStatistics(path, conf)
+      val threadStats = getFileSystemThreadStatistics()
       val getBytesWrittenMethod = getFileSystemThreadStatisticsMethod("getBytesWritten")
       val f = () => threadStats.map(getBytesWrittenMethod.invoke(_).asInstanceOf[Long]).sum
       val baselineBytesWritten = f()
@@ -172,10 +170,8 @@ class SparkHadoopUtil extends Logging {
     }
   }
 
-  private def getFileSystemThreadStatistics(path: Path, conf: Configuration): Seq[AnyRef] = {
-    val qualifiedPath = path.getFileSystem(conf).makeQualified(path)
-    val scheme = qualifiedPath.toUri().getScheme()
-    val stats = FileSystem.getAllStatistics().filter(_.getScheme().equals(scheme))
+  private def getFileSystemThreadStatistics(): Seq[AnyRef] = {
+    val stats = FileSystem.getAllStatistics()
     stats.map(Utils.invoke(classOf[Statistics], _, "getThreadStatistics"))
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/b1b35ca2/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala
----------------------------------------------------------------------
diff --git a/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala b/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala
index ddb5903..97912c6 100644
--- a/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala
+++ b/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala
@@ -19,7 +19,6 @@ package org.apache.spark.executor
 
 import java.util.concurrent.atomic.AtomicLong
 
-import org.apache.spark.executor.DataReadMethod
 import org.apache.spark.executor.DataReadMethod.DataReadMethod
 
 import scala.collection.mutable.ArrayBuffer
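The callback pattern in the SparkHadoopUtil hunk is worth calling out: f sums the bytes-read counters of every filesystem's thread-local statistics, a baseline is captured when the callback is created, and the returned closure reports the delta, so each task sees only its own reads. A minimal sketch of that shape (generic counter function, not the reflective Hadoop calls):

// Returns a callback reporting how much `counter` has grown since the
// callback was created -- the shape used by getFSBytesReadOnThreadCallback.
def deltaCallback(counter: () => Long): () => Long = {
  val baseline = counter()    // snapshot at task start
  () => counter() - baseline  // bytes attributable to this task
}

At task start an RDD asks for such a callback, and on completion it simply invokes it to record input metrics.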
[1/2] spark git commit: Preparing development version 1.2.2-SNAPSHOT
Repository: spark
Updated Branches: refs/heads/branch-1.2 4026bba79 -> 0a16abadc

Preparing development version 1.2.2-SNAPSHOT

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0a16abad
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0a16abad
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0a16abad

Branch: refs/heads/branch-1.2
Commit: 0a16abadc59082b7d3a24d7f3625236658632813
Parents: b77f876
Author: Patrick Wendell <patr...@databricks.com>
Authored: Wed Jan 28 07:48:55 2015 +0000
Committer: Patrick Wendell <patr...@databricks.com>
Committed: Wed Jan 28 07:48:55 2015 +0000

----------------------------------------------------------------------
 assembly/pom.xml                  | 2 +-
 bagel/pom.xml                     | 2 +-
 core/pom.xml                      | 2 +-
 examples/pom.xml                  | 2 +-
 external/flume-sink/pom.xml       | 2 +-
 external/flume/pom.xml            | 2 +-
 external/kafka/pom.xml            | 2 +-
 external/mqtt/pom.xml             | 2 +-
 external/twitter/pom.xml          | 2 +-
 external/zeromq/pom.xml           | 2 +-
 extras/java8-tests/pom.xml        | 2 +-
 extras/kinesis-asl/pom.xml        | 2 +-
 extras/spark-ganglia-lgpl/pom.xml | 2 +-
 graphx/pom.xml                    | 2 +-
 mllib/pom.xml                     | 2 +-
 network/common/pom.xml            | 2 +-
 network/shuffle/pom.xml           | 2 +-
 network/yarn/pom.xml              | 2 +-
 pom.xml                           | 2 +-
 repl/pom.xml                      | 2 +-
 sql/catalyst/pom.xml              | 2 +-
 sql/core/pom.xml                  | 2 +-
 sql/hive-thriftserver/pom.xml     | 2 +-
 sql/hive/pom.xml                  | 2 +-
 streaming/pom.xml                 | 2 +-
 tools/pom.xml                     | 2 +-
 yarn/alpha/pom.xml                | 2 +-
 yarn/pom.xml                      | 2 +-
 yarn/stable/pom.xml               | 2 +-
 29 files changed, 29 insertions(+), 29 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/0a16abad/assembly/pom.xml
----------------------------------------------------------------------
diff --git a/assembly/pom.xml b/assembly/pom.xml
index d731003..6889a6c 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent</artifactId>
-    <version>1.2.1</version>
+    <version>1.2.2-SNAPSHOT</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/0a16abad/bagel/pom.xml
----------------------------------------------------------------------
diff --git a/bagel/pom.xml b/bagel/pom.xml
index 8374612..f785cf6 100644
--- a/bagel/pom.xml
+++ b/bagel/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent</artifactId>
-    <version>1.2.1</version>
+    <version>1.2.2-SNAPSHOT</version>
    <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/0a16abad/core/pom.xml
----------------------------------------------------------------------
diff --git a/core/pom.xml b/core/pom.xml
index ceeabd6..9e202f3 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent</artifactId>
-    <version>1.2.1</version>
+    <version>1.2.2-SNAPSHOT</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/0a16abad/examples/pom.xml
----------------------------------------------------------------------
diff --git a/examples/pom.xml b/examples/pom.xml
index c7ad4dc..df6975f 100644
--- a/examples/pom.xml
+++ b/examples/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent</artifactId>
-    <version>1.2.1</version>
+    <version>1.2.2-SNAPSHOT</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/0a16abad/external/flume-sink/pom.xml
----------------------------------------------------------------------
diff --git a/external/flume-sink/pom.xml b/external/flume-sink/pom.xml
index e0b3eae..0002bf2 100644
--- a/external/flume-sink/pom.xml
+++ b/external/flume-sink/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent</artifactId>
-    <version>1.2.1</version>
+    <version>1.2.2-SNAPSHOT</version>
     <relativePath>../../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/0a16abad/external/flume/pom.xml
----------------------------------------------------------------------
diff --git a/external/flume/pom.xml b/external/flume/pom.xml
index f9559c1..e783d39 100644
--- a/external/flume/pom.xml
+++ b/external/flume/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent</artifactId>
-    <version>1.2.1</version>
+    <version>1.2.2-SNAPSHOT</version>
[1/2] spark git commit: Revert "Preparing development version 1.2.2-SNAPSHOT"
Repository: spark
Updated Branches: refs/heads/branch-1.2 fea9b43ef -> 4026bba79

Revert "Preparing development version 1.2.2-SNAPSHOT"

This reverts commit f53a4319ba5f0843c077e64ae5a41e2fac835a5b.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/063a4c50
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/063a4c50
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/063a4c50

Branch: refs/heads/branch-1.2
Commit: 063a4c5037812739ce4a6370543c6e54d8956104
Parents: fea9b43
Author: Patrick Wendell <patr...@databricks.com>
Authored: Tue Jan 27 23:46:57 2015 -0800
Committer: Patrick Wendell <patr...@databricks.com>
Committed: Tue Jan 27 23:46:57 2015 -0800

----------------------------------------------------------------------
 assembly/pom.xml                  | 2 +-
 bagel/pom.xml                     | 2 +-
 core/pom.xml                      | 2 +-
 examples/pom.xml                  | 2 +-
 external/flume-sink/pom.xml       | 2 +-
 external/flume/pom.xml            | 2 +-
 external/kafka/pom.xml            | 2 +-
 external/mqtt/pom.xml             | 2 +-
 external/twitter/pom.xml          | 2 +-
 external/zeromq/pom.xml           | 2 +-
 extras/java8-tests/pom.xml        | 2 +-
 extras/kinesis-asl/pom.xml        | 2 +-
 extras/spark-ganglia-lgpl/pom.xml | 2 +-
 graphx/pom.xml                    | 2 +-
 mllib/pom.xml                     | 2 +-
 network/common/pom.xml            | 2 +-
 network/shuffle/pom.xml           | 2 +-
 network/yarn/pom.xml              | 2 +-
 pom.xml                           | 2 +-
 repl/pom.xml                      | 2 +-
 sql/catalyst/pom.xml              | 2 +-
 sql/core/pom.xml                  | 2 +-
 sql/hive-thriftserver/pom.xml     | 2 +-
 sql/hive/pom.xml                  | 2 +-
 streaming/pom.xml                 | 2 +-
 tools/pom.xml                     | 2 +-
 yarn/alpha/pom.xml                | 2 +-
 yarn/pom.xml                      | 2 +-
 yarn/stable/pom.xml               | 2 +-
 29 files changed, 29 insertions(+), 29 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/063a4c50/assembly/pom.xml
----------------------------------------------------------------------
diff --git a/assembly/pom.xml b/assembly/pom.xml
index 6889a6c..d731003 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent</artifactId>
-    <version>1.2.2-SNAPSHOT</version>
+    <version>1.2.1</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/063a4c50/bagel/pom.xml
----------------------------------------------------------------------
diff --git a/bagel/pom.xml b/bagel/pom.xml
index f785cf6..8374612 100644
--- a/bagel/pom.xml
+++ b/bagel/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent</artifactId>
-    <version>1.2.2-SNAPSHOT</version>
+    <version>1.2.1</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/063a4c50/core/pom.xml
----------------------------------------------------------------------
diff --git a/core/pom.xml b/core/pom.xml
index 9e202f3..ceeabd6 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent</artifactId>
-    <version>1.2.2-SNAPSHOT</version>
+    <version>1.2.1</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/063a4c50/examples/pom.xml
----------------------------------------------------------------------
diff --git a/examples/pom.xml b/examples/pom.xml
index df6975f..c7ad4dc 100644
--- a/examples/pom.xml
+++ b/examples/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent</artifactId>
-    <version>1.2.2-SNAPSHOT</version>
+    <version>1.2.1</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/063a4c50/external/flume-sink/pom.xml
----------------------------------------------------------------------
diff --git a/external/flume-sink/pom.xml b/external/flume-sink/pom.xml
index 0002bf2..e0b3eae 100644
--- a/external/flume-sink/pom.xml
+++ b/external/flume-sink/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent</artifactId>
-    <version>1.2.2-SNAPSHOT</version>
+    <version>1.2.1</version>
     <relativePath>../../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/063a4c50/external/flume/pom.xml
----------------------------------------------------------------------
diff --git a/external/flume/pom.xml b/external/flume/pom.xml
index e783d39..f9559c1 100644
--- a/external/flume/pom.xml
+++ b/external/flume/pom.xml
@@ -21,7 +21,7 @@
   <parent>
Git Push Summary
Repository: spark
Updated Tags: refs/tags/v1.2.1-rc2 [created] b77f87673
[2/2] spark git commit: Preparing Spark release v1.2.1-rc2
Preparing Spark release v1.2.1-rc2

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b77f8767
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b77f8767
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b77f8767

Branch: refs/heads/branch-1.2
Commit: b77f87673d1f9f03d4c83cf583158227c551359b
Parents: 4026bba
Author: Patrick Wendell <patr...@databricks.com>
Authored: Wed Jan 28 07:48:55 2015 +0000
Committer: Patrick Wendell <patr...@databricks.com>
Committed: Wed Jan 28 07:48:55 2015 +0000

----------------------------------------------------------------------
 assembly/pom.xml                  | 2 +-
 bagel/pom.xml                     | 2 +-
 core/pom.xml                      | 2 +-
 examples/pom.xml                  | 2 +-
 external/flume-sink/pom.xml       | 2 +-
 external/flume/pom.xml            | 2 +-
 external/kafka/pom.xml            | 2 +-
 external/mqtt/pom.xml             | 2 +-
 external/twitter/pom.xml          | 2 +-
 external/zeromq/pom.xml           | 2 +-
 extras/java8-tests/pom.xml        | 2 +-
 extras/kinesis-asl/pom.xml        | 2 +-
 extras/spark-ganglia-lgpl/pom.xml | 2 +-
 graphx/pom.xml                    | 2 +-
 mllib/pom.xml                     | 2 +-
 network/common/pom.xml            | 2 +-
 network/shuffle/pom.xml           | 2 +-
 network/yarn/pom.xml              | 2 +-
 pom.xml                           | 2 +-
 repl/pom.xml                      | 2 +-
 sql/catalyst/pom.xml              | 2 +-
 sql/core/pom.xml                  | 2 +-
 sql/hive-thriftserver/pom.xml     | 2 +-
 sql/hive/pom.xml                  | 2 +-
 streaming/pom.xml                 | 2 +-
 tools/pom.xml                     | 2 +-
 yarn/alpha/pom.xml                | 2 +-
 yarn/pom.xml                      | 2 +-
 yarn/stable/pom.xml               | 2 +-
 29 files changed, 29 insertions(+), 29 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/b77f8767/assembly/pom.xml
----------------------------------------------------------------------
diff --git a/assembly/pom.xml b/assembly/pom.xml
index 65e3ddf..d731003 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent</artifactId>
-    <version>1.2.1-SNAPSHOT</version>
+    <version>1.2.1</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/b77f8767/bagel/pom.xml
----------------------------------------------------------------------
diff --git a/bagel/pom.xml b/bagel/pom.xml
index 4ead7aa..8374612 100644
--- a/bagel/pom.xml
+++ b/bagel/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent</artifactId>
-    <version>1.2.1-SNAPSHOT</version>
+    <version>1.2.1</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/b77f8767/core/pom.xml
----------------------------------------------------------------------
diff --git a/core/pom.xml b/core/pom.xml
index 155b4c9..ceeabd6 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent</artifactId>
-    <version>1.2.1-SNAPSHOT</version>
+    <version>1.2.1</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/b77f8767/examples/pom.xml
----------------------------------------------------------------------
diff --git a/examples/pom.xml b/examples/pom.xml
index f5a7ed2..c7ad4dc 100644
--- a/examples/pom.xml
+++ b/examples/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent</artifactId>
-    <version>1.2.1-SNAPSHOT</version>
+    <version>1.2.1</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/b77f8767/external/flume-sink/pom.xml
----------------------------------------------------------------------
diff --git a/external/flume-sink/pom.xml b/external/flume-sink/pom.xml
index fe1c8fb..e0b3eae 100644
--- a/external/flume-sink/pom.xml
+++ b/external/flume-sink/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent</artifactId>
-    <version>1.2.1-SNAPSHOT</version>
+    <version>1.2.1</version>
     <relativePath>../../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/b77f8767/external/flume/pom.xml
----------------------------------------------------------------------
diff --git a/external/flume/pom.xml b/external/flume/pom.xml
index da4bd70..f9559c1 100644
--- a/external/flume/pom.xml
+++ b/external/flume/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent</artifactId>
-    <version>1.2.1-SNAPSHOT</version>
+    <version>1.2.1</version>
     <relativePath>../../pom.xml</relativePath>
   </parent>