Repository: spark Updated Branches: refs/heads/branch-2.4 71a6a9ce8 -> 715355164
http://git-wip-us.apache.org/repos/asf/spark/blob/71535516/docs/sql-pyspark-pandas-with-arrow.md
----------------------------------------------------------------------
diff --git a/docs/sql-pyspark-pandas-with-arrow.md b/docs/sql-pyspark-pandas-with-arrow.md
new file mode 100644
index 0000000..e8e9f55
--- /dev/null
+++ b/docs/sql-pyspark-pandas-with-arrow.md
@@ -0,0 +1,166 @@
---
layout: global
title: PySpark Usage Guide for Pandas with Apache Arrow
displayTitle: PySpark Usage Guide for Pandas with Apache Arrow
---

* Table of contents
{:toc}

## Apache Arrow in Spark

Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer
data between JVM and Python processes. This is currently most beneficial to Python users who
work with Pandas/NumPy data. Its usage is not automatic and might require some minor
changes to configuration or code to take full advantage and ensure compatibility. This guide
gives a high-level description of how to use Arrow in Spark and highlights any differences when
working with Arrow-enabled data.

### Ensure PyArrow Installed

If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the
SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow
is installed and available on all cluster nodes. The currently supported version is 0.8.0.
You can install it using pip, or with conda from the conda-forge channel. See the PyArrow
[installation guide](https://arrow.apache.org/docs/python/install.html) for details.

## Enabling for Conversion to/from Pandas

Arrow is available as an optimization when converting a Spark DataFrame to a Pandas DataFrame
with `toPandas()` and when creating a Spark DataFrame from a Pandas DataFrame with
`createDataFrame(pandas_df)`. To use Arrow when executing these calls, users first need to set
the Spark configuration `spark.sql.execution.arrow.enabled` to `true`. This is disabled by default.

In addition, optimizations enabled by `spark.sql.execution.arrow.enabled` can fall back automatically
to a non-Arrow implementation if an error occurs before the actual computation within Spark.
This behavior is controlled by `spark.sql.execution.arrow.fallback.enabled`.

<div class="codetabs">
<div data-lang="python" markdown="1">
{% include_example dataframe_with_arrow python/sql/arrow.py %}
</div>
</div>

Using the above optimizations with Arrow will produce the same results as when Arrow is not
enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the
DataFrame to the driver program and should be done on a small subset of the data. Not all Spark
data types are currently supported, and an error can be raised if a column has an unsupported type;
see [Supported SQL Types](#supported-sql-types). If an error occurs during `createDataFrame()`,
Spark will fall back to creating the DataFrame without Arrow.
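The example referenced above ships in `python/sql/arrow.py` and is not reproduced in this patch; as a
minimal sketch along the same lines, assuming an active `SparkSession` named `spark`:

{% highlight python %}
import numpy as np
import pandas as pd

# Enable Arrow-based columnar data transfers (disabled by default)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Generate a Pandas DataFrame
pdf = pd.DataFrame(np.random.rand(100, 3))

# Create a Spark DataFrame from a Pandas DataFrame; Arrow speeds up the transfer
df = spark.createDataFrame(pdf)

# Convert the Spark DataFrame back to a Pandas DataFrame using Arrow
result_pdf = df.select("*").toPandas()
{% endhighlight %}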
## Pandas UDFs (a.k.a. Vectorized UDFs)

Pandas UDFs are user-defined functions that are executed by Spark using Arrow to transfer data and
Pandas to work with the data. A Pandas UDF is defined using `pandas_udf`, either as a decorator
or to wrap the function; no additional configuration is required. Currently, there are three types of
Pandas UDF: Scalar, Grouped Map, and Grouped Aggregate.

### Scalar

Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such
as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return
a `pandas.Series` of the same length. Internally, Spark executes a Pandas UDF by splitting
columns into batches, calling the function for each batch as a subset of the data, then
concatenating the results together.

The following example shows how to create a scalar Pandas UDF that computes the product of two columns.

<div class="codetabs">
<div data-lang="python" markdown="1">
{% include_example scalar_pandas_udf python/sql/arrow.py %}
</div>
</div>
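As a minimal sketch of the scalar API just described (assuming an active `SparkSession` named
`spark`; names like `multiply_func` are illustrative only):

{% highlight python %}
import pandas as pd

from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType

# Declare the function and create the UDF
def multiply_func(a, b):
    # a and b are pandas.Series; the result is a pandas.Series of the same length
    return a * b

multiply = pandas_udf(multiply_func, returnType=LongType())

df = spark.createDataFrame(pd.DataFrame([1, 2, 3], columns=["x"]))

# Execute the function as a Spark vectorized UDF
df.select(multiply(col("x"), col("x"))).show()
{% endhighlight %}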
### Grouped Map

Grouped map Pandas UDFs are used with `groupBy().apply()`, which implements the "split-apply-combine" pattern.
Split-apply-combine consists of three steps:
* Split the data into groups by using `DataFrame.groupBy`.
* Apply a function on each group. The input and output of the function are both `pandas.DataFrame`. The
  input data contains all the rows and columns for each group.
* Combine the results into a new `DataFrame`.

To use `groupBy().apply()`, the user needs to define the following:
* A Python function that defines the computation for each group.
* A `StructType` object or a string that defines the schema of the output `DataFrame`.

The column labels of the returned `pandas.DataFrame` must either match the field names in the
defined output schema if specified as strings, or match the field data types by position if not
strings, e.g. integer indices. See [pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#pandas.DataFrame)
for how to label columns when constructing a `pandas.DataFrame`.

Note that all data for a group will be loaded into memory before the function is applied. This can
lead to out-of-memory exceptions, especially if the group sizes are skewed. The configuration for
[maxRecordsPerBatch](#setting-arrow-batch-size) is not applied to groups, and it is up to the user
to ensure that the grouped data will fit into the available memory.

The following example shows how to use `groupby().apply()` to subtract the mean from each value in the group.

<div class="codetabs">
<div data-lang="python" markdown="1">
{% include_example grouped_map_pandas_udf python/sql/arrow.py %}
</div>
</div>

For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf) and
[`pyspark.sql.GroupedData.apply`](api/python/pyspark.sql.html#pyspark.sql.GroupedData.apply).
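A minimal sketch under the same assumptions (an existing `spark` session; the sample data is
arbitrary):

{% highlight python %}
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

# The string "id long, v double" is the schema of the returned pandas.DataFrame
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame holding all rows and columns for one group
    v = pdf.v
    return pdf.assign(v=v - v.mean())

df.groupby("id").apply(subtract_mean).show()
{% endhighlight %}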
### Grouped Aggregate

Grouped aggregate Pandas UDFs are similar to Spark aggregate functions. They are used with `groupBy().agg()` and
[`pyspark.sql.Window`](api/python/pyspark.sql.html#pyspark.sql.Window). A grouped aggregate Pandas UDF defines an aggregation from one or more `pandas.Series`
to a scalar value, where each `pandas.Series` represents a column within the group or window.

Note that this type of UDF does not support partial aggregation, and all data for a group or window will be loaded into memory. Also,
only unbounded windows are currently supported with grouped aggregate Pandas UDFs.

The following example shows how to use this type of UDF to compute the mean with groupBy and window operations:

<div class="codetabs">
<div data-lang="python" markdown="1">
{% include_example grouped_agg_pandas_udf python/sql/arrow.py %}
</div>
</div>

For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf).
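A minimal sketch under the same assumptions as the grouped map sketch above:

{% highlight python %}
from pyspark.sql import Window
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def mean_udf(v):
    # v is a pandas.Series of all values of one column within the group or window
    return v.mean()

# Aggregation per group
df.groupby("id").agg(mean_udf(df["v"])).show()

# The same UDF over an unbounded window
w = Window.partitionBy("id") \
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn("mean_v", mean_udf(df["v"]).over(w)).show()
{% endhighlight %}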
## Usage Notes

### Supported SQL Types

Currently, all Spark SQL data types are supported by Arrow-based conversion except `BinaryType`, `MapType`,
`ArrayType` of `TimestampType`, and nested `StructType`.

### Setting Arrow Batch Size

Data partitions in Spark are converted into Arrow record batches, which can temporarily lead to
high memory usage in the JVM. To avoid possible out-of-memory exceptions, the size of the Arrow
record batches can be adjusted by setting the conf `spark.sql.execution.arrow.maxRecordsPerBatch`
to an integer that determines the maximum number of rows for each batch. The default value is
10,000 records per batch. If the number of columns is large, the value should be adjusted
accordingly. Using this limit, each data partition is made into one or more record batches for
processing.

### Timestamp with Time Zone Semantics

Spark internally stores timestamps as UTC values, and timestamp data that is brought in without
a specified time zone is converted as local time to UTC with microsecond resolution. When timestamp
data is exported or displayed in Spark, the session time zone is used to localize the timestamp
values. The session time zone is set with the configuration `spark.sql.session.timeZone` and
defaults to the JVM system local time zone if not set. Pandas uses a `datetime64` type with nanosecond
resolution, `datetime64[ns]`, with an optional time zone on a per-column basis.

When timestamp data is transferred from Spark to Pandas, it will be converted to nanoseconds
and each column will be converted to the Spark session time zone then localized to that time
zone, which removes the time zone and displays values as local time. This occurs
when calling `toPandas()` or `pandas_udf` with timestamp columns.

When timestamp data is transferred from Pandas to Spark, it will be converted to UTC microseconds. This
occurs when calling `createDataFrame` with a Pandas DataFrame or when returning a timestamp from a
`pandas_udf`. These conversions are done automatically to ensure Spark will have data in the
expected format, so it is not necessary to do any of these conversions yourself. Any nanosecond
values will be truncated.

Note that a standard UDF (non-Pandas) loads timestamp data as Python datetime objects, which is
different from a Pandas timestamp. It is recommended to use Pandas time series functionality when
working with timestamps in `pandas_udf`s to get the best performance; see
[the Pandas time series documentation](https://pandas.pydata.org/pandas-docs/stable/timeseries.html) for details.
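As a hedged illustration of the two settings discussed in these notes (the values below are
arbitrary examples, not recommendations):

{% highlight python %}
# Cap each Arrow record batch at 1,000 rows instead of the default 10,000,
# e.g. for DataFrames with very many columns
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1000")

# Pin the session time zone so timestamp localization is deterministic
# rather than depending on the JVM system time zone
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
{% endhighlight %}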
http://git-wip-us.apache.org/repos/asf/spark/blob/71535516/docs/sql-reference.md
----------------------------------------------------------------------
diff --git a/docs/sql-reference.md b/docs/sql-reference.md
new file mode 100644
index 0000000..9e4239b
--- /dev/null
+++ b/docs/sql-reference.md
@@ -0,0 +1,641 @@
---
layout: global
title: Reference
displayTitle: Reference
---

* Table of contents
{:toc}

## Data Types

Spark SQL and DataFrames support the following data types:

* Numeric types
  - `ByteType`: Represents 1-byte signed integer numbers.
    The range of numbers is from `-128` to `127`.
  - `ShortType`: Represents 2-byte signed integer numbers.
    The range of numbers is from `-32768` to `32767`.
  - `IntegerType`: Represents 4-byte signed integer numbers.
    The range of numbers is from `-2147483648` to `2147483647`.
  - `LongType`: Represents 8-byte signed integer numbers.
    The range of numbers is from `-9223372036854775808` to `9223372036854775807`.
  - `FloatType`: Represents 4-byte single-precision floating point numbers.
  - `DoubleType`: Represents 8-byte double-precision floating point numbers.
  - `DecimalType`: Represents arbitrary-precision signed decimal numbers. Backed internally by `java.math.BigDecimal`. A `BigDecimal` consists of an arbitrary-precision integer unscaled value and a 32-bit integer scale.
* String type
  - `StringType`: Represents character string values.
* Binary type
  - `BinaryType`: Represents byte sequence values.
* Boolean type
  - `BooleanType`: Represents boolean values.
* Datetime type
  - `TimestampType`: Represents values comprising the fields year, month, day,
    hour, minute, and second.
  - `DateType`: Represents values comprising the fields year, month, and day.
* Complex types
  - `ArrayType(elementType, containsNull)`: Represents values comprising a sequence of
    elements with the type of `elementType`. `containsNull` is used to indicate if
    elements in an `ArrayType` value can have `null` values.
  - `MapType(keyType, valueType, valueContainsNull)`:
    Represents values comprising a set of key-value pairs. The data type of keys is
    described by `keyType` and the data type of values is described by `valueType`.
    For a `MapType` value, keys are not allowed to have `null` values. `valueContainsNull`
    is used to indicate if values of a `MapType` value can have `null` values.
  - `StructType(fields)`: Represents values with the structure described by
    a sequence of `StructField`s (`fields`).
    * `StructField(name, dataType, nullable)`: Represents a field in a `StructType`.
      The name of a field is indicated by `name`. The data type of a field is indicated
      by `dataType`. `nullable` is used to indicate if values of this field can have
      `null` values.

<div class="codetabs">
<div data-lang="scala" markdown="1">

All data types of Spark SQL are located in the package `org.apache.spark.sql.types`.
You can access them by doing

{% include_example data_types scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}

<table class="table">
<tr>
  <th style="width:20%">Data type</th>
  <th style="width:40%">Value type in Scala</th>
  <th>API to access or create a data type</th>
</tr>
<tr><td> <b>ByteType</b> </td><td> Byte </td><td> ByteType </td></tr>
<tr><td> <b>ShortType</b> </td><td> Short </td><td> ShortType </td></tr>
<tr><td> <b>IntegerType</b> </td><td> Int </td><td> IntegerType </td></tr>
<tr><td> <b>LongType</b> </td><td> Long </td><td> LongType </td></tr>
<tr><td> <b>FloatType</b> </td><td> Float </td><td> FloatType </td></tr>
<tr><td> <b>DoubleType</b> </td><td> Double </td><td> DoubleType </td></tr>
<tr><td> <b>DecimalType</b> </td><td> java.math.BigDecimal </td><td> DecimalType </td></tr>
<tr><td> <b>StringType</b> </td><td> String </td><td> StringType </td></tr>
<tr><td> <b>BinaryType</b> </td><td> Array[Byte] </td><td> BinaryType </td></tr>
<tr><td> <b>BooleanType</b> </td><td> Boolean </td><td> BooleanType </td></tr>
<tr><td> <b>TimestampType</b> </td><td> java.sql.Timestamp </td><td> TimestampType </td></tr>
<tr><td> <b>DateType</b> </td><td> java.sql.Date </td><td> DateType </td></tr>
<tr>
  <td> <b>ArrayType</b> </td>
  <td> scala.collection.Seq </td>
  <td>
    ArrayType(<i>elementType</i>, [<i>containsNull</i>])<br />
    <b>Note:</b> The default value of <i>containsNull</i> is <i>true</i>.
  </td>
</tr>
<tr>
  <td> <b>MapType</b> </td>
  <td> scala.collection.Map </td>
  <td>
    MapType(<i>keyType</i>, <i>valueType</i>, [<i>valueContainsNull</i>])<br />
    <b>Note:</b> The default value of <i>valueContainsNull</i> is <i>true</i>.
  </td>
</tr>
<tr>
  <td> <b>StructType</b> </td>
  <td> org.apache.spark.sql.Row </td>
  <td>
    StructType(<i>fields</i>)<br />
    <b>Note:</b> <i>fields</i> is a Seq of StructFields. Two fields with the same
    name are not allowed.
  </td>
</tr>
<tr>
  <td> <b>StructField</b> </td>
  <td> The value type in Scala of the data type of this field
    (for example, Int for a StructField with the data type IntegerType) </td>
  <td>
    StructField(<i>name</i>, <i>dataType</i>, [<i>nullable</i>])<br />
    <b>Note:</b> The default value of <i>nullable</i> is <i>true</i>.
  </td>
</tr>
</table>

</div>

<div data-lang="java" markdown="1">

All data types of Spark SQL are located in the package
`org.apache.spark.sql.types`. To access or create a data type,
please use the factory methods provided in
`org.apache.spark.sql.types.DataTypes`.

<table class="table">
<tr>
  <th style="width:20%">Data type</th>
  <th style="width:40%">Value type in Java</th>
  <th>API to access or create a data type</th>
</tr>
<tr><td> <b>ByteType</b> </td><td> byte or Byte </td><td> DataTypes.ByteType </td></tr>
<tr><td> <b>ShortType</b> </td><td> short or Short </td><td> DataTypes.ShortType </td></tr>
<tr><td> <b>IntegerType</b> </td><td> int or Integer </td><td> DataTypes.IntegerType </td></tr>
<tr><td> <b>LongType</b> </td><td> long or Long </td><td> DataTypes.LongType </td></tr>
<tr><td> <b>FloatType</b> </td><td> float or Float </td><td> DataTypes.FloatType </td></tr>
<tr><td> <b>DoubleType</b> </td><td> double or Double </td><td> DataTypes.DoubleType </td></tr>
<tr>
  <td> <b>DecimalType</b> </td>
  <td> java.math.BigDecimal </td>
  <td>
    DataTypes.createDecimalType()<br />
    DataTypes.createDecimalType(<i>precision</i>, <i>scale</i>)
  </td>
</tr>
<tr><td> <b>StringType</b> </td><td> String </td><td> DataTypes.StringType </td></tr>
<tr><td> <b>BinaryType</b> </td><td> byte[] </td><td> DataTypes.BinaryType </td></tr>
<tr><td> <b>BooleanType</b> </td><td> boolean or Boolean </td><td> DataTypes.BooleanType </td></tr>
<tr><td> <b>TimestampType</b> </td><td> java.sql.Timestamp </td><td> DataTypes.TimestampType </td></tr>
<tr><td> <b>DateType</b> </td><td> java.sql.Date </td><td> DataTypes.DateType </td></tr>
<tr>
  <td> <b>ArrayType</b> </td>
  <td> java.util.List </td>
  <td>
    DataTypes.createArrayType(<i>elementType</i>)<br />
    <b>Note:</b> The value of <i>containsNull</i> will be <i>true</i>.<br />
    DataTypes.createArrayType(<i>elementType</i>, <i>containsNull</i>)
  </td>
</tr>
<tr>
  <td> <b>MapType</b> </td>
  <td> java.util.Map </td>
  <td>
    DataTypes.createMapType(<i>keyType</i>, <i>valueType</i>)<br />
    <b>Note:</b> The value of <i>valueContainsNull</i> will be <i>true</i>.<br />
    DataTypes.createMapType(<i>keyType</i>, <i>valueType</i>, <i>valueContainsNull</i>)
  </td>
</tr>
<tr>
  <td> <b>StructType</b> </td>
  <td> org.apache.spark.sql.Row </td>
  <td>
    DataTypes.createStructType(<i>fields</i>)<br />
    <b>Note:</b> <i>fields</i> is a List or an array of StructFields.
    Two fields with the same name are not allowed.
  </td>
</tr>
<tr>
  <td> <b>StructField</b> </td>
  <td> The value type in Java of the data type of this field
    (for example, int for a StructField with the data type IntegerType) </td>
  <td>
    DataTypes.createStructField(<i>name</i>, <i>dataType</i>, <i>nullable</i>)
  </td>
</tr>
</table>

</div>

<div data-lang="python" markdown="1">

All data types of Spark SQL are located in the package `pyspark.sql.types`.
You can access them by doing

{% highlight python %}
from pyspark.sql.types import *
{% endhighlight %}

<table class="table">
<tr>
  <th style="width:20%">Data type</th>
  <th style="width:40%">Value type in Python</th>
  <th>API to access or create a data type</th>
</tr>
<tr>
  <td> <b>ByteType</b> </td>
  <td>
    int or long<br />
    <b>Note:</b> Numbers will be converted to 1-byte signed integer numbers at runtime.
    Please make sure that numbers are within the range of -128 to 127.
  </td>
  <td> ByteType() </td>
</tr>
<tr>
  <td> <b>ShortType</b> </td>
  <td>
    int or long<br />
    <b>Note:</b> Numbers will be converted to 2-byte signed integer numbers at runtime.
    Please make sure that numbers are within the range of -32768 to 32767.
  </td>
  <td> ShortType() </td>
</tr>
<tr><td> <b>IntegerType</b> </td><td> int or long </td><td> IntegerType() </td></tr>
<tr>
  <td> <b>LongType</b> </td>
  <td>
    long<br />
    <b>Note:</b> Numbers will be converted to 8-byte signed integer numbers at runtime.
    Please make sure that numbers are within the range of
    -9223372036854775808 to 9223372036854775807.
    Otherwise, please convert data to decimal.Decimal and use DecimalType.
  </td>
  <td> LongType() </td>
</tr>
<tr>
  <td> <b>FloatType</b> </td>
  <td>
    float<br />
    <b>Note:</b> Numbers will be converted to 4-byte single-precision floating
    point numbers at runtime.
  </td>
  <td> FloatType() </td>
</tr>
<tr><td> <b>DoubleType</b> </td><td> float </td><td> DoubleType() </td></tr>
<tr><td> <b>DecimalType</b> </td><td> decimal.Decimal </td><td> DecimalType() </td></tr>
<tr><td> <b>StringType</b> </td><td> string </td><td> StringType() </td></tr>
<tr><td> <b>BinaryType</b> </td><td> bytearray </td><td> BinaryType() </td></tr>
<tr><td> <b>BooleanType</b> </td><td> bool </td><td> BooleanType() </td></tr>
<tr><td> <b>TimestampType</b> </td><td> datetime.datetime </td><td> TimestampType() </td></tr>
<tr><td> <b>DateType</b> </td><td> datetime.date </td><td> DateType() </td></tr>
<tr>
  <td> <b>ArrayType</b> </td>
  <td> list, tuple, or array </td>
  <td>
    ArrayType(<i>elementType</i>, [<i>containsNull</i>])<br />
    <b>Note:</b> The default value of <i>containsNull</i> is <i>True</i>.
  </td>
</tr>
<tr>
  <td> <b>MapType</b> </td>
  <td> dict </td>
  <td>
    MapType(<i>keyType</i>, <i>valueType</i>, [<i>valueContainsNull</i>])<br />
    <b>Note:</b> The default value of <i>valueContainsNull</i> is <i>True</i>.
  </td>
</tr>
<tr>
  <td> <b>StructType</b> </td>
  <td> list or tuple </td>
  <td>
    StructType(<i>fields</i>)<br />
    <b>Note:</b> <i>fields</i> is a list of StructFields. Two fields with the same
    name are not allowed.
  </td>
</tr>
<tr>
  <td> <b>StructField</b> </td>
  <td> The value type in Python of the data type of this field
    (for example, int for a StructField with the data type IntegerType) </td>
  <td>
    StructField(<i>name</i>, <i>dataType</i>, [<i>nullable</i>])<br />
    <b>Note:</b> The default value of <i>nullable</i> is <i>True</i>.
  </td>
</tr>
</table>
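As a brief sketch of combining the constructors in the table above (assuming an active
`SparkSession` named `spark`; the schema itself is an arbitrary example):

{% highlight python %}
from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                               StructField, StructType)

# Build a schema using the constructors listed above
schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=True),
    StructField("scores", ArrayType(IntegerType()), nullable=True),
])

df = spark.createDataFrame([("Alice", 2, [1, 2, 3])], schema)
df.printSchema()
{% endhighlight %}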
</div>

<div data-lang="r" markdown="1">

<table class="table">
<tr>
  <th style="width:20%">Data type</th>
  <th style="width:40%">Value type in R</th>
  <th>API to access or create a data type</th>
</tr>
<tr>
  <td> <b>ByteType</b> </td>
  <td>
    integer<br />
    <b>Note:</b> Numbers will be converted to 1-byte signed integer numbers at runtime.
    Please make sure that numbers are within the range of -128 to 127.
  </td>
  <td> "byte" </td>
</tr>
<tr>
  <td> <b>ShortType</b> </td>
  <td>
    integer<br />
    <b>Note:</b> Numbers will be converted to 2-byte signed integer numbers at runtime.
    Please make sure that numbers are within the range of -32768 to 32767.
  </td>
  <td> "short" </td>
</tr>
<tr><td> <b>IntegerType</b> </td><td> integer </td><td> "integer" </td></tr>
<tr>
  <td> <b>LongType</b> </td>
  <td>
    integer<br />
    <b>Note:</b> Numbers will be converted to 8-byte signed integer numbers at runtime.
    Please make sure that numbers are within the range of
    -9223372036854775808 to 9223372036854775807.
  </td>
  <td> "long" </td>
</tr>
<tr>
  <td> <b>FloatType</b> </td>
  <td>
    numeric<br />
    <b>Note:</b> Numbers will be converted to 4-byte single-precision floating
    point numbers at runtime.
  </td>
  <td> "float" </td>
</tr>
<tr><td> <b>DoubleType</b> </td><td> numeric </td><td> "double" </td></tr>
<tr><td> <b>DecimalType</b> </td><td> Not supported </td><td> Not supported </td></tr>
<tr><td> <b>StringType</b> </td><td> character </td><td> "string" </td></tr>
<tr><td> <b>BinaryType</b> </td><td> raw </td><td> "binary" </td></tr>
<tr><td> <b>BooleanType</b> </td><td> logical </td><td> "bool" </td></tr>
<tr><td> <b>TimestampType</b> </td><td> POSIXct </td><td> "timestamp" </td></tr>
<tr><td> <b>DateType</b> </td><td> Date </td><td> "date" </td></tr>
<tr>
  <td> <b>ArrayType</b> </td>
  <td> vector or list </td>
  <td>
    list(type="array", elementType=<i>elementType</i>, containsNull=[<i>containsNull</i>])<br />
    <b>Note:</b> The default value of <i>containsNull</i> is <i>TRUE</i>.
  </td>
</tr>
<tr>
  <td> <b>MapType</b> </td>
  <td> environment </td>
  <td>
    list(type="map", keyType=<i>keyType</i>, valueType=<i>valueType</i>, valueContainsNull=[<i>valueContainsNull</i>])<br />
    <b>Note:</b> The default value of <i>valueContainsNull</i> is <i>TRUE</i>.
  </td>
</tr>
<tr>
  <td> <b>StructType</b> </td>
  <td> named list </td>
  <td>
    list(type="struct", fields=<i>fields</i>)<br />
    <b>Note:</b> <i>fields</i> is a list of StructFields. Two fields with the same
    name are not allowed.
  </td>
</tr>
<tr>
  <td> <b>StructField</b> </td>
  <td> The value type in R of the data type of this field
    (for example, integer for a StructField with the data type IntegerType) </td>
  <td>
    list(name=<i>name</i>, type=<i>dataType</i>, nullable=[<i>nullable</i>])<br />
    <b>Note:</b> The default value of <i>nullable</i> is <i>TRUE</i>.
  </td>
</tr>
</table>

</div>

</div>

## NaN Semantics

There is special handling for not-a-number (NaN) values when dealing with `float` or `double` types,
and it does not exactly match standard floating-point semantics.
Specifically, as illustrated by the sketch after this list:

 - NaN = NaN returns true.
 - In aggregations, all NaN values are grouped together.
 - NaN is treated as a normal value in join keys.
 - NaN values sort last in ascending order, larger than any other numeric value.
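A short sketch of the aggregation and ordering rules (assuming an active `SparkSession` named
`spark`; the data is arbitrary):

{% highlight python %}
df = spark.createDataFrame(
    [(float("nan"),), (float("nan"),), (1.0,), (2.0,)], ["v"])

# All NaN values end up in a single group
df.groupBy("v").count().show()

# NaN sorts last in ascending order, above every other numeric value
df.orderBy("v").show()
{% endhighlight %}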
## Arithmetic operations

Operations performed on numeric types (with the exception of `decimal`) are not checked for overflow.
This means that when an operation causes an overflow, the result is the same as what that operation
returns in a Java/Scala program (e.g., if the sum of two integers is higher than the maximum
representable value, the result is a negative number).
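For instance, a one-line illustration of the wrap-around behavior (assuming an active
`SparkSession` named `spark`):

{% highlight python %}
# Adding 1 to the maximum 4-byte integer wraps around to -2147483648
spark.sql("SELECT CAST(2147483647 AS INT) + CAST(1 AS INT) AS overflowed").show()
{% endhighlight %}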
E.g. for "parquet" format options see <code>DataStreamReader.parquet()</code>. <br/><br/> - In addition, there are session configurations that affect certain file-formats. See the <a href="sql-programming-guide.html">SQL Programming Guide</a> for more details. E.g., for "parquet", see <a href="sql-programming-guide.html#configuration">Parquet configuration</a> section. + In addition, there are session configurations that affect certain file-formats. See the <a href="sql-programming-guide.html">SQL Programming Guide</a> for more details. E.g., for "parquet", see <a href="sql-data-sources-parquet.html#configuration">Parquet configuration</a> section. </td> <td>Yes</td> <td>Supports glob paths, but does not support multiple comma-separated paths/globs.</td> --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org