[ 
https://issues.apache.org/jira/browse/FLINK-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691372#comment-16691372
 ] 

ASF GitHub Bot commented on FLINK-8714:
---------------------------------------

zentol closed pull request #5536: [FLINK-8714][Documentation] Added either 
charsetName) or "utf-8" value in examples of readTextFile
URL: https://github.com/apache/flink/pull/5536
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/docs/dev/api_concepts.md b/docs/dev/api_concepts.md
index c4215074683..517d01290fc 100644
--- a/docs/dev/api_concepts.md
+++ b/docs/dev/api_concepts.md
@@ -112,7 +112,7 @@ a text file as a sequence of lines, you can use:
 {% highlight java %}
 final StreamExecutionEnvironment env = 
StreamExecutionEnvironment.getExecutionEnvironment();
 
-DataStream<String> text = env.readTextFile("file:///path/to/file");
+DataStream<String> text = env.readTextFile("file:///path/to/file", 
StandardCharsets.UTF_8.name());
 {% endhighlight %}
 
 This will give you a DataStream on which you can then apply transformations to 
create new
@@ -181,7 +181,7 @@ a text file as a sequence of lines, you can use:
 {% highlight scala %}
 val env = StreamExecutionEnvironment.getExecutionEnvironment()
 
-val text: DataStream[String] = env.readTextFile("file:///path/to/file")
+val text: DataStream[String] = env.readTextFile("file:///path/to/file", 
StandardCharsets.UTF_8.name())
 {% endhighlight %}
 
 This will give you a DataStream on which you can then apply transformations to 
create new
diff --git a/docs/dev/batch/examples.md b/docs/dev/batch/examples.md
index a4b282688ee..bbaacb34946 100644
--- a/docs/dev/batch/examples.md
+++ b/docs/dev/batch/examples.md
@@ -68,7 +68,7 @@ WordCount is the "Hello World" of Big Data processing 
systems. It computes the f
 ~~~java
 ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
 
-DataSet<String> text = env.readTextFile("/path/to/file");
+DataSet<String> text = env.readTextFile("/path/to/file", 
StandardCharsets.UTF_8.name());
 
 DataSet<Tuple2<String, Integer>> counts =
         // split up the lines in pairs (2-tuples) containing: (word,1)
@@ -106,7 +106,7 @@ The {% gh_link 
/flink-examples/flink-examples-batch/src/main/java/org/apache/fli
 val env = ExecutionEnvironment.getExecutionEnvironment
 
 // get input data
-val text = env.readTextFile("/path/to/file")
+val text = env.readTextFile("/path/to/file", StandardCharsets.UTF_8.name())
 
 val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
   .map { (_, 1) }
diff --git a/docs/dev/batch/index.md b/docs/dev/batch/index.md
index f0fab8bab6c..1b513e14415 100644
--- a/docs/dev/batch/index.md
+++ b/docs/dev/batch/index.md
@@ -809,9 +809,9 @@ shortcut methods on the *ExecutionEnvironment*.
 
 File-based:
 
-- `readTextFile(path)` / `TextInputFormat` - Reads files line wise and returns 
them as Strings.
+- `readTextFile(path, charsetName)` / `TextInputFormat` - Reads files line 
wise and returns them as Strings.
 
-- `readTextFileWithValue(path)` / `TextValueInputFormat` - Reads files line 
wise and returns them as
+- `readTextFileWithValue(path, charsetName)` / `TextValueInputFormat` - Reads 
files line wise and returns them as
   StringValues. StringValues are mutable strings.
 
 - `readCsvFile(path)` / `CsvInputFormat` - Parses files of comma (or another 
char) delimited fields.
@@ -860,10 +860,10 @@ Generic:
 ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
 
 // read text file from local files system
-DataSet<String> localLines = env.readTextFile("file:///path/to/my/textfile");
+DataSet<String> localLines = env.readTextFile("file:///path/to/my/textfile", 
StandardCharsets.UTF_8.name());
 
 // read text file from a HDFS running at nnHost:nnPort
-DataSet<String> hdfsLines = 
env.readTextFile("hdfs://nnHost:nnPort/path/to/my/textfile");
+DataSet<String> hdfsLines = 
env.readTextFile("hdfs://nnHost:nnPort/path/to/my/textfile", 
StandardCharsets.UTF_8.name());
 
 // read a CSV file with three fields
 DataSet<Tuple3<Integer, String, Double>> csvInput = 
env.readCsvFile("hdfs:///the/CSV/file")
@@ -946,7 +946,7 @@ Configuration parameters = new Configuration();
 parameters.setBoolean("recursive.file.enumeration", true);
 
 // pass the configuration to the data source
-DataSet<String> logs = env.readTextFile("file:///path/with.nested/files")
+DataSet<String> logs = env.readTextFile("file:///path/with.nested/files", 
StandardCharsets.UTF_8.name())
                          .withParameters(parameters);
 {% endhighlight %}
 
@@ -962,9 +962,9 @@ shortcut methods on the *ExecutionEnvironment*.
 
 File-based:
 
-- `readTextFile(path)` / `TextInputFormat` - Reads files line wise and returns 
them as Strings.
+- `readTextFile(path, charsetName)` / `TextInputFormat` - Reads files line 
wise and returns them as Strings.
 
-- `readTextFileWithValue(path)` / `TextValueInputFormat` - Reads files line 
wise and returns them as
+- `readTextFileWithValue(path, charsetName)` / `TextValueInputFormat` - Reads 
files line wise and returns them as
   StringValues. StringValues are mutable strings.
 
 - `readCsvFile(path)` / `CsvInputFormat` - Parses files of comma (or another 
char) delimited fields.
@@ -1009,10 +1009,10 @@ Generic:
 val env  = ExecutionEnvironment.getExecutionEnvironment
 
 // read text file from local files system
-val localLines = env.readTextFile("file:///path/to/my/textfile")
+val localLines = env.readTextFile("file:///path/to/my/textfile", 
StandardCharsets.UTF_8.name())
 
 // read text file from a HDFS running at nnHost:nnPort
-val hdfsLines = env.readTextFile("hdfs://nnHost:nnPort/path/to/my/textfile")
+val hdfsLines = env.readTextFile("hdfs://nnHost:nnPort/path/to/my/textfile", 
StandardCharsets.UTF_8.name())
 
 // read a CSV file with three fields
 val csvInput = env.readCsvFile[(Int, String, Double)]("hdfs:///the/CSV/file")
@@ -1084,7 +1084,7 @@ val parameters = new Configuration
 parameters.setBoolean("recursive.file.enumeration", true)
 
 // pass the configuration to the data source
-env.readTextFile("file:///path/with.nested/files").withParameters(parameters)
+env.readTextFile("file:///path/with.nested/files", 
StandardCharsets.UTF_8.name()).withParameters(parameters)
 {% endhighlight %}
 
 </div>
@@ -1679,7 +1679,7 @@ A LocalEnvironment is created and used as follows:
 {% highlight java %}
 final ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment();
 
-DataSet<String> lines = env.readTextFile(pathToTextFile);
+DataSet<String> lines = env.readTextFile(pathToTextFile, charsetName);
 // build your program
 
 env.execute();
@@ -1690,7 +1690,7 @@ env.execute();
 {% highlight scala %}
 val env = ExecutionEnvironment.createLocalEnvironment()
 
-val lines = env.readTextFile(pathToTextFile)
+val lines = env.readTextFile(pathToTextFile, charsetName)
 // build your program
 
 env.execute()
diff --git a/docs/dev/cluster_execution.md b/docs/dev/cluster_execution.md
index f1d84e1b67b..125bbfa1f8e 100644
--- a/docs/dev/cluster_execution.md
+++ b/docs/dev/cluster_execution.md
@@ -64,7 +64,7 @@ public static void main(String[] args) throws Exception {
     ExecutionEnvironment env = ExecutionEnvironment
         .createRemoteEnvironment("flink-master", 6123, "/home/user/udfs.jar");
 
-    DataSet<String> data = env.readTextFile("hdfs://path/to/file");
+    DataSet<String> data = env.readTextFile("hdfs://path/to/file", 
StandardCharsets.UTF_8.name());
 
     data
         .filter(new FilterFunction<String>() {
diff --git a/docs/dev/datastream_api.md b/docs/dev/datastream_api.md
index 3cce5be1c01..7eef8b0c788 100644
--- a/docs/dev/datastream_api.md
+++ b/docs/dev/datastream_api.md
@@ -153,7 +153,7 @@ There are several predefined stream sources accessible from 
the `StreamExecution
 
 File-based:
 
-- `readTextFile(path)` - Reads text files, i.e. files that respect the 
`TextInputFormat` specification, line-by-line and returns them as Strings.
+- `readTextFile(path, charsetName)` - Reads text files, i.e. files that 
respect the `TextInputFormat` specification, line-by-line and returns them as 
Strings.
 
 - `readFile(fileInputFormat, path)` - Reads (once) files as dictated by the 
specified file input format.
 
@@ -211,7 +211,7 @@ There are several predefined stream sources accessible from 
the `StreamExecution
 
 File-based:
 
-- `readTextFile(path)` - Reads text files, i.e. files that respect the 
`TextInputFormat` specification, line-by-line and returns them as Strings.
+- `readTextFile(path, charsetName)` - Reads text files, i.e. files that 
respect the `TextInputFormat` specification, line-by-line and returns them as 
Strings.
 
 - `readFile(fileInputFormat, path)` - Reads (once) files as dictated by the 
specified file input format.
 
diff --git a/docs/dev/libs/storm_compatibility.md 
b/docs/dev/libs/storm_compatibility.md
index 853b8e119ce..f9327923505 100644
--- a/docs/dev/libs/storm_compatibility.md
+++ b/docs/dev/libs/storm_compatibility.md
@@ -147,7 +147,7 @@ The generic type declarations `IN` and `OUT` specify the 
type of the operator's
 <div data-lang="java" markdown="1">
 ~~~java
 StreamExecutionEnvironment env = 
StreamExecutionEnvironment.getExecutionEnvironment();
-DataStream<String> text = env.readTextFile(localFilePath);
+DataStream<String> text = env.readTextFile(localFilePath, 
StandardCharsets.UTF_8.name());
 
 DataStream<Tuple2<String, Integer>> counts = text.transform(
        "tokenizer", // operator name
diff --git a/docs/dev/local_execution.md b/docs/dev/local_execution.md
index 326d5157fe3..257ef724464 100644
--- a/docs/dev/local_execution.md
+++ b/docs/dev/local_execution.md
@@ -63,7 +63,7 @@ In most cases, calling 
`ExecutionEnvironment.getExecutionEnvironment()` is the e
 public static void main(String[] args) throws Exception {
     ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment();
 
-    DataSet<String> data = env.readTextFile("file:///path/to/file");
+    DataSet<String> data = env.readTextFile("file:///path/to/file", 
StandardCharsets.UTF_8.name());
 
     data
         .filter(new FilterFunction<String>() {
diff --git a/docs/ops/deployment/aws.md b/docs/ops/deployment/aws.md
index 7ef95e7399c..e7376f5ea80 100644
--- a/docs/ops/deployment/aws.md
+++ b/docs/ops/deployment/aws.md
@@ -78,7 +78,7 @@ The endpoint can either be a single file or a directory, for 
example:
 
 ```java
 // Read from S3 bucket
-env.readTextFile("s3://<bucket>/<endpoint>");
+env.readTextFile("s3://<bucket>/<endpoint>", StandardCharsets.UTF_8.name());
 
 // Write to S3 bucket
 stream.writeAsText("s3://<bucket>/<endpoint>");


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Suggest new users to use env.readTextFile method with 2 arguments (using the 
> charset), not to rely on system charset (which varies across environments)
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-8714
>                 URL: https://issues.apache.org/jira/browse/FLINK-8714
>             Project: Flink
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 1.4.0
>            Reporter: Michal Klempa
>            Priority: Trivial
>              Labels: easyfix, newbie, patch-available, pull-request-available
>
> When a newcomer (like me), goes through the docs, there are several places 
> where examples encourage to read the input data using the 
> {{env.readTextFile()}} method.
>  
> This method variant does not take a second argument - character set (see 
> [https://ci.apache.org/projects/flink/flink-docs-release-1.4/api/java/org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.html#readTextFile-java.lang.String-]).
>  This version relies (according to the Javadoc) on "The file will be read 
> with the system's default character set."
>  
> This behavior is also the default in Java, as in the 
> {{java.util.String.getBytes()}} method, where not supplying the charset means 
> using the system locale or the one the JVM was started with (see 
> [https://stackoverflow.com/questions/64038/setting-java-locale-settings]). 
> There are two ways to set the locale prior to JVM start (-D arguments or the 
> LC_ALL environment variable).
>  
> Given this is something a new Flink user may not know about, nor want to 
> spend hours tracking down an environment-related bug over (it works on 
> localhost, but in production the locale is different), I would kindly suggest 
> a change in the documentation: let's migrate the examples to use the 
> two-argument version, {{readTextFile(filePath, charsetName)}}.
>  
> I am open to criticism and suggestions. The listing of {{readTextFile}} I was 
> able to grep in docs is:
> {code:java}
> ./dev/datastream_api.md:- `readTextFile(path)` - Reads text files, i.e. files 
> that respect the `TextInputFormat` specification, line-by-line and returns 
> them as Strings.
> ./dev/datastream_api.md:- `readTextFile(path)` - Reads text files, i.e. files 
> that respect the `TextInputFormat` specification, line-by-line and returns 
> them as Strings.
> ./dev/libs/storm_compatibility.md:DataStream<String> text = 
> env.readTextFile(localFilePath);
> ./dev/cluster_execution.md:    DataSet<String> data = 
> env.readTextFile("hdfs://path/to/file");
> ./dev/batch/index.md:- `readTextFile(path)` / `TextInputFormat` - Reads files 
> line wise and returns them as Strings.
> ./dev/batch/index.md:- `readTextFileWithValue(path)` / `TextValueInputFormat` 
> - Reads files line wise and returns them as
> ./dev/batch/index.md:DataSet<String> localLines = 
> env.readTextFile("file:///path/to/my/textfile");
> ./dev/batch/index.md:DataSet<String> hdfsLines = 
> env.readTextFile("hdfs://nnHost:nnPort/path/to/my/textfile");
> ./dev/batch/index.md:DataSet<String> logs = 
> env.readTextFile("file:///path/with.nested/files")
> ./dev/batch/index.md:- `readTextFile(path)` / `TextInputFormat` - Reads files 
> line wise and returns them as Strings.
> ./dev/batch/index.md:- `readTextFileWithValue(path)` / `TextValueInputFormat` 
> - Reads files line wise and returns them as
> ./dev/batch/index.md:val localLines = 
> env.readTextFile("file:///path/to/my/textfile")
> ./dev/batch/index.md:val hdfsLines = 
> env.readTextFile("hdfs://nnHost:nnPort/path/to/my/textfile")
> ./dev/batch/index.md:env.readTextFile("file:///path/with.nested/files").withParameters(parameters)
> ./dev/batch/index.md:DataSet<String> lines = env.readTextFile(pathToTextFile);
> ./dev/batch/index.md:val lines = env.readTextFile(pathToTextFile)
> ./dev/batch/examples.md:DataSet<String> text = 
> env.readTextFile("/path/to/file");
> ./dev/batch/examples.md:val text = env.readTextFile("/path/to/file")
> ./dev/api_concepts.md:DataStream<String> text = 
> env.readTextFile("file:///path/to/file");
> ./dev/api_concepts.md:val text: DataStream[String] = 
> env.readTextFile("file:///path/to/file")
> ./dev/local_execution.md:    DataSet<String> data = 
> env.readTextFile("file:///path/to/file");
> ./ops/deployment/aws.md:env.readTextFile("s3://<bucket>/<endpoint>");{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to