This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 0deebd3  [SPARK-26016][DOCS] Clarify that text DataSource read/write, and RDD methods that read text, always use UTF-8
0deebd3 is described below

commit 0deebd382037a53cd63a0207009c2d21b7dd0a70
Author: Sean Owen <sean.o...@databricks.com>
AuthorDate: Tue Mar 5 08:03:39 2019 +0900

    [SPARK-26016][DOCS] Clarify that text DataSource read/write, and RDD methods that read text, always use UTF-8
    
    ## What changes were proposed in this pull request?
    
    Clarify that text DataSource read/write, and RDD methods that read text, always
    use UTF-8 as they use Hadoop's implementation underneath. I think these are all
    the places that this needs a mention in the user-facing docs.
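
    As a minimal Scala sketch of the documented behavior (hypothetical paths,
    for illustration only, not part of this patch):

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().master("local[*]").appName("utf8-text").getOrCreate()
        val df = spark.read.text("/tmp/in")    // text source: input is read as UTF-8
        df.write.text("/tmp/out")              // text sink: output is written as UTF-8
        val lines = spark.sparkContext.textFile("/tmp/in")  // RDD read path: also UTF-8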
    
    ## How was this patch tested?
    
    Doc tests.
    
    Closes #23962 from srowen/SPARK-26016.
    
    Authored-by: Sean Owen <sean.o...@databricks.com>
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 R/pkg/R/DataFrame.R                                                   | 2 +-
 R/pkg/R/SQLContext.R                                                  | 2 +-
 R/pkg/R/context.R                                                     | 2 +-
 core/src/main/scala/org/apache/spark/SparkContext.scala               | 3 +++
 core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala  | 4 ++++
 python/pyspark/context.py                                             | 2 ++
 python/pyspark/sql/readwriter.py                                      | 2 ++
 python/pyspark/sql/streaming.py                                       | 1 +
 python/pyspark/streaming/context.py                                   | 1 +
 sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala    | 2 ++
 sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala    | 1 +
 .../apache/spark/sql/execution/datasources/text/TextFileFormat.scala  | 2 +-
 .../main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala  | 2 ++
 .../src/main/scala/org/apache/spark/streaming/StreamingContext.scala  | 2 ++
 .../org/apache/spark/streaming/api/java/JavaStreamingContext.scala    | 2 ++
 15 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index 789c5d4..5908a13 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -950,7 +950,7 @@ setMethod("write.parquet",
 #'
 #' Save the content of the SparkDataFrame in a text file at the specified path.
 #' The SparkDataFrame must have only one column of string type with the name "value".
-#' Each row becomes a new line in the output file.
+#' Each row becomes a new line in the output file. The text files will be encoded as UTF-8.
 #'
 #' @param x A SparkDataFrame
 #' @param path The directory where the file is saved
diff --git a/R/pkg/R/SQLContext.R b/R/pkg/R/SQLContext.R
index 2e5506a..5686912 100644
--- a/R/pkg/R/SQLContext.R
+++ b/R/pkg/R/SQLContext.R
@@ -469,7 +469,7 @@ read.parquet <- function(path, ...) {
 #'
 #' Loads text files and returns a SparkDataFrame whose schema starts with
 #' a string column named "value", and followed by partitioned columns if
-#' there are any.
+#' there are any. The text files must be encoded as UTF-8.
 #'
 #' Each line in the text file is a new row in the resulting SparkDataFrame.
 #'
diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R
index bac3efd..405a3d6 100644
--- a/R/pkg/R/context.R
+++ b/R/pkg/R/context.R
@@ -29,7 +29,7 @@ getMinPartitions <- function(sc, minPartitions) {
 #'
 #' This function reads a text file from HDFS, a local file system (available on all
 #' nodes), or any Hadoop-supported file system URI, and creates an
-#' RDD of strings from it.
+#' RDD of strings from it. The text files must be encoded as UTF-8.
 #'
 #' @param sc SparkContext to use
 #' @param path Path of file to read. A vector of multiple paths is allowed.
diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala
index 6042420..dc0ea24 100644
--- a/core/src/main/scala/org/apache/spark/SparkContext.scala
+++ b/core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -827,6 +827,8 @@ class SparkContext(config: SparkConf) extends Logging {
   /**
   * Read a text file from HDFS, a local file system (available on all nodes), or any
    * Hadoop-supported file system URI, and return it as an RDD of Strings.
+   * The text files must be encoded as UTF-8.
+   *
    * @param path path to the text file on a supported file system
   * @param minPartitions suggested minimum number of partitions for the resulting RDD
    * @return RDD of lines of the text file
@@ -843,6 +845,7 @@ class SparkContext(config: SparkConf) extends Logging {
   * Read a directory of text files from HDFS, a local file system (available on all nodes), or any
   * Hadoop-supported file system URI. Each file is read as a single record and returned in a
   * key-value pair, where the key is the path of each file, the value is the content of each file.
+   * The text files must be encoded as UTF-8.
    *
    * <p> For example, if you have the following files:
    * {{{
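
A minimal sketch of the two RDD-level readers documented above, assuming a local
SparkContext and hypothetical paths; both decode input as UTF-8 because they read
through Hadoop's text input formats:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("utf8-demo"))
    val lines = sc.textFile("/tmp/data")        // RDD[String], one element per line
    val files = sc.wholeTextFiles("/tmp/dir")   // RDD[(path, content)], one pair per file
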
diff --git a/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala b/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala
index 2f74d09..c5ef190 100644
--- a/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala
+++ b/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala
@@ -167,12 +167,14 @@ class JavaSparkContext(val sc: SparkContext) extends Closeable {
   /**
   * Read a text file from HDFS, a local file system (available on all nodes), or any
    * Hadoop-supported file system URI, and return it as an RDD of Strings.
+   * The text files must be encoded as UTF-8.
    */
   def textFile(path: String): JavaRDD[String] = sc.textFile(path)
 
   /**
   * Read a text file from HDFS, a local file system (available on all nodes), or any
    * Hadoop-supported file system URI, and return it as an RDD of Strings.
+   * The text files must be encoded as UTF-8.
    */
   def textFile(path: String, minPartitions: Int): JavaRDD[String] =
     sc.textFile(path, minPartitions)
@@ -183,6 +185,7 @@ class JavaSparkContext(val sc: SparkContext) extends Closeable {
   * Read a directory of text files from HDFS, a local file system (available on all nodes), or any
   * Hadoop-supported file system URI. Each file is read as a single record and returned in a
   * key-value pair, where the key is the path of each file, the value is the content of each file.
+   * The text files must be encoded as UTF-8.
    *
    * <p> For example, if you have the following files:
    * {{{
@@ -216,6 +219,7 @@ class JavaSparkContext(val sc: SparkContext) extends Closeable {
   * Read a directory of text files from HDFS, a local file system (available on all nodes), or any
   * Hadoop-supported file system URI. Each file is read as a single record and returned in a
   * key-value pair, where the key is the path of each file, the value is the content of each file.
+   * The text files must be encoded as UTF-8.
    *
    * @see `wholeTextFiles(path: String, minPartitions: Int)`.
    */
diff --git a/python/pyspark/context.py b/python/pyspark/context.py
index 94c6f4a..5a4bd57 100644
--- a/python/pyspark/context.py
+++ b/python/pyspark/context.py
@@ -584,6 +584,7 @@ class SparkContext(object):
         Read a text file from HDFS, a local file system (available on all
         nodes), or any Hadoop-supported file system URI, and return it as an
         RDD of Strings.
+        The text files must be encoded as UTF-8.
 
         If use_unicode is False, the strings will be kept as `str` (encoding
         as `utf-8`), which is faster and smaller than unicode. (Added in
@@ -608,6 +609,7 @@ class SparkContext(object):
         URI. Each file is read as a single record and returned in a
         key-value pair, where the key is the path of each file, the
         value is the content of each file.
+        The text files must be encoded as UTF-8.
 
         If use_unicode is False, the strings will be kept as `str` (encoding
         as `utf-8`), which is faster and smaller than unicode. (Added in
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 3da0523..d555bad 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -327,6 +327,7 @@ class DataFrameReader(OptionUtils):
         Loads text files and returns a :class:`DataFrame` whose schema starts with a
         string column named "value", and followed by partitioned columns if there
         are any.
+        The text files must be encoded as UTF-8.
 
         By default, each line in the text file is a new row in the resulting DataFrame.
 
@@ -856,6 +857,7 @@ class DataFrameWriter(OptionUtils):
     @since(1.6)
     def text(self, path, compression=None, lineSep=None):
         """Saves the content of the DataFrame in a text file at the specified 
path.
+        The text files will be encoded as UTF-8.
 
         :param path: the path in any Hadoop supported file system
         :param compression: compression codec to use when saving to file. This can be one of the
diff --git a/python/pyspark/sql/streaming.py b/python/pyspark/sql/streaming.py
index b981fdc..fa25267 100644
--- a/python/pyspark/sql/streaming.py
+++ b/python/pyspark/sql/streaming.py
@@ -548,6 +548,7 @@ class DataStreamReader(OptionUtils):
         Loads a text file stream and returns a :class:`DataFrame` whose schema starts with a
         string column named "value", and followed by partitioned columns if there
         are any.
+        The text files must be encoded as UTF-8.
 
         By default, each line in the text file is a new row in the resulting DataFrame.
 
diff --git a/python/pyspark/streaming/context.py b/python/pyspark/streaming/context.py
index 2d84373..6fbe26b6 100644
--- a/python/pyspark/streaming/context.py
+++ b/python/pyspark/streaming/context.py
@@ -258,6 +258,7 @@ class StreamingContext(object):
         for new files and reads them as text files. Files must be written to the
         monitored directory by "moving" them from another location within the same
         file system. File names starting with . are ignored.
+        The text files must be encoded as UTF-8.
         """
         return DStream(self._jssc.textFileStream(directory), self, UTF8Deserializer())
 
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index ff295b8..a856258 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -716,6 +716,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
   /**
   * Loads text files and returns a `DataFrame` whose schema starts with a string column named
    * "value", and followed by partitioned columns if there are any.
+   * The text files must be encoded as UTF-8.
    *
   * By default, each line in the text files is a new row in the resulting DataFrame. For example:
    * {{{
@@ -753,6 +754,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
   /**
   * Loads text files and returns a [[Dataset]] of String. The underlying schema of the Dataset
    * contains a single string column named "value".
+   * The text files must be encoded as UTF-8.
    *
   * If the directory structure of the text files contains partitioning information, those are
   * ignored in the resulting Dataset. To include partitioning information as columns, use `text`.
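
A sketch of the two readers documented above, assuming a SparkSession named
`spark` and a hypothetical path; both expect UTF-8 input:

    val df = spark.read.text("/tmp/logs")       // DataFrame with a "value" column
    val ds = spark.read.textFile("/tmp/logs")   // Dataset[String]
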
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
index 4508281..8d4d60e 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
@@ -629,6 +629,7 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
    *   // Java:
    *   df.write().text("/path/to/output")
    * }}}
+   * The text files will be encoded as UTF-8.
    *
    * You can set the following option(s) for writing text files:
    * <ul>
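
The writer side, under the same assumptions; output is always UTF-8, with an
optional compression codec per write:

    df.write
      .option("compression", "gzip")   // optional codec; the text itself stays UTF-8
      .text("/tmp/out")                // one output line per row of the "value" column
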
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
index f8a24eb..60756e7 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
@@ -36,7 +36,7 @@ import org.apache.spark.sql.types.{DataType, StringType, StructType}
 import org.apache.spark.util.SerializableConfiguration
 
 /**
- * A data source for reading text files.
+ * A data source for reading text files. The text files must be encoded as UTF-8.
  */
 class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
 
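
Since this class registers the short name "text" via DataSourceRegister, the same
UTF-8-only source is also reachable through the generic format API; a
hypothetical-path sketch:

    val df = spark.read.format("text").load("/tmp/notes")
    df.write.format("text").save("/tmp/notes-out")
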
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala
index ef21caa..96b3a86 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala
@@ -387,6 +387,7 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo
   /**
   * Loads text files and returns a `DataFrame` whose schema starts with a string column named
    * "value", and followed by partitioned columns if there are any.
+   * The text files must be encoded as UTF-8.
    *
   * By default, each line in the text files is a new row in the resulting DataFrame. For example:
    * {{{
@@ -414,6 +415,7 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo
   /**
   * Loads text file(s) and returns a `Dataset` of String. The underlying schema of the Dataset
    * contains a single string column named "value".
+   * The text files must be encoded as UTF-8.
    *
   * If the directory structure of the text files contains partitioning information, those are
   * ignored in the resulting Dataset. To include partitioning information as columns, use `text`.
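
The streaming counterpart, assuming a SparkSession named `spark` and a monitored
directory that receives UTF-8 files:

    val streamDf = spark.readStream.text("/tmp/incoming")       // streaming DataFrame
    val streamDs = spark.readStream.textFile("/tmp/incoming")   // streaming Dataset[String]
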
diff --git a/streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala b/streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
index c09cbb3..15ebef2 100644
--- a/streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
+++ b/streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
@@ -410,6 +410,8 @@ class StreamingContext private[streaming] (
    * as Text and input format as TextInputFormat). Files must be written to the
    * monitored directory by "moving" them from another location within the same
    * file system. File names starting with . are ignored.
+   * The text files must be encoded as UTF-8.
+   *
    * @param directory HDFS directory to monitor for new file
    */
  def textFileStream(directory: String): DStream[String] = withNamedScope("text file stream") {
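
The DStream equivalent, assuming an existing StreamingContext `ssc`; files moved
into the monitored directory must already be UTF-8-encoded:

    import org.apache.spark.streaming.dstream.DStream

    val lines: DStream[String] = ssc.textFileStream("/tmp/incoming")
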
diff --git a/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala b/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
index e61c0d4..d4f03be 100644
--- a/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
+++ b/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
@@ -207,6 +207,8 @@ class JavaStreamingContext(val ssc: StreamingContext) extends Closeable {
    * as Text and input format as TextInputFormat). Files must be written to the
    * monitored directory by "moving" them from another location within the same
    * file system. File names starting with . are ignored.
+   * The text files must be encoded as UTF-8.
+   *
    * @param directory HDFS directory to monitor for new file
    */
   def textFileStream(directory: String): JavaDStream[String] = {

