morsapaes commented on a change in pull request #14041:
URL: https://github.com/apache/flink/pull/14041#discussion_r521871549
##########
File path: docs/dev/python/table-api-users-guide/conversion_of_pandas.md
##########
@@ -60,11 +61,11 @@ table = t_env.from_pandas(pdf,
## Convert PyFlink Table to Pandas DataFrame
-It also supports converting a PyFlink Table to a Pandas DataFrame. Internally, it will materialize the results of the
-table and serialize them into multiple Arrow batches of Arrow columnar format at client side. The maximum Arrow batch size
-is determined by the config option [python.fn-execution.arrow.batch.size]({% link dev/python/table-api-users-guide/python_config.md %}#python-fn-execution-arrow-batch-size).
-The serialized data will then be converted to Pandas DataFrame. It will collect the content of the table to
-the client side and so please make sure that the content of the table could fit in memory before calling this method.
+PyFlink Tables can additionally be converted into a Pandas DataFrame.
+The resulting rows will materialized into multiple Arrow batches of Arrow columnar format on the client.
+The maximum Arrow batch size is configured via the option [python.fn-execution.arrow.batch.size]({% link dev/python/table-api-users-guide/python_config.md %}#python-fn-execution-arrow-batch-size).
+The serialized data will then be converted to a Pandas DataFrame.
+Because the contents of the table will be collected on the client, please ensure that the results of the table can fit in memory before calling this method.
Review comment:
```suggestion
PyFlink Tables can additionally be converted into a Pandas DataFrame.
The resulting rows will be serialized as multiple Arrow batches of Arrow columnar format on the client.
The maximum Arrow batch size is configured via the option [python.fn-execution.arrow.batch.size]({% link dev/python/table-api-users-guide/python_config.md %}#python-fn-execution-arrow-batch-size).
The serialized data will then be converted to a Pandas DataFrame.
Because the contents of the table will be collected on the client, please ensure that the results can fit in memory before calling this method.
```
##########
File path: docs/dev/python/table-api-users-guide/conversion_of_pandas.md
##########
@@ -22,17 +22,18 @@ specific language governing permissions and limitations
under the License.
-->
-It supports to convert between PyFlink Table and Pandas DataFrame.
+PyFlink Table API supports conversion to and from Pandas DataFrame.
* This will be replaced by the TOC
{:toc}
## Convert Pandas DataFrame to PyFlink Table
-It supports creating a PyFlink Table from a Pandas DataFrame. Internally, it will serialize the Pandas DataFrame
-using Arrow columnar format at client side and the serialized data will be processed and deserialized in Arrow source
-during execution. The Arrow source could also be used in streaming jobs and it will properly handle the checkpoint
-and provides the exactly once guarantees.
+Pandas DataFrames can be converted into a PyFlink TAble.
+Internally, PyFlink will serialize the Pandas DataFrame using Arrow columnar format on the client.
+The serialized data will be processed and deserialized in Arrow source during execution.
+The Arrow source can also be used in streaming jobs, and is integrated with checkpointing to
+and provide the exactly once guarantees.
Review comment:
```suggestion
provide the exactly once guarantees.
```
##########
File path: docs/dev/python/table-api-users-guide/conversion_of_pandas.md
##########
@@ -22,17 +22,18 @@ specific language governing permissions and limitations
under the License.
-->
-It supports to convert between PyFlink Table and Pandas DataFrame.
+PyFlink Table API supports conversion to and from Pandas DataFrame.
* This will be replaced by the TOC
{:toc}
## Convert Pandas DataFrame to PyFlink Table
-It supports creating a PyFlink Table from a Pandas DataFrame. Internally, it will serialize the Pandas DataFrame
-using Arrow columnar format at client side and the serialized data will be processed and deserialized in Arrow source
-during execution. The Arrow source could also be used in streaming jobs and it will properly handle the checkpoint
-and provides the exactly once guarantees.
+Pandas DataFrames can be converted into a PyFlink TAble.
Review comment:
```suggestion
Pandas DataFrames can be converted into a PyFlink Table.
```
##########
File path: docs/dev/python/table-api-users-guide/index.md
##########
@@ -24,7 +24,6 @@ under the License.
-->
Python Table API allows users to develop [Table API]({% link dev/table/tableApi.md %}) programs using the Python language.
Review comment:
```suggestion
The Python Table API allows users to develop [Table API]({% link dev/table/tableApi.md %}) programs using the Python language.
```
##########
File path: docs/dev/python/index.md
##########
@@ -22,3 +23,43 @@ KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
+
+<img src="{% link /fig/pyflink.svg %}" alt="PyFlink" class="offset" width="50%" />
+
+PyFlink is a language for building unified batch and streaming workloads.
+This means real-time streaming pipelines, performing exploratory data
+analysis at scale, building machine learning pipelines, and creating ETLs for a data platform.
+If you're already familiar with Python and libraries such as Pandas, then PyFlink makes it simple
+to leverage the full capabilities of the Apache Flink ecosystem.
+
+The PyFlink Table API makes it simple to write powerful relational queries for building reports and
+ETL pipelines.
+At the same time, the PyFlink DataStream API gives developers access to low-level control over
+state and time, unlocking the full power of stream processing.
Review comment:
```suggestion
* The **PyFlink Table API** allows you to write powerful relational queries in a way that is similar to using SQL or working with tabular data in Python.
* At the same time, the **PyFlink DataStream API** gives you lower-level control over the core building blocks of Flink ([state](https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/stateful-stream-processing.html#what-is-state) and [time](https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/timely-stream-processing.html#introduction)) to build more complex stream processing use cases.
```
##########
File path: docs/dev/python/table-api-users-guide/index.zh.md
##########
@@ -25,8 +25,6 @@ under the License.
Python Table API allows users to develop [Table API]({% link dev/table/tableApi.zh.md %}) programs using the Python language.
Review comment:
```suggestion
The Python Table API allows users to develop [Table API]({% link dev/table/tableApi.zh.md %}) programs using the Python language.
```
##########
File path: docs/dev/python/index.md
##########
@@ -22,3 +23,43 @@ KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
+
+<img src="{% link /fig/pyflink.svg %}" alt="PyFlink" class="offset" width="50%" />
+
+PyFlink is a language for building unified batch and streaming workloads.
+This means real-time streaming pipelines, performing exploratory data
+analysis at scale, building machine learning pipelines, and creating ETLs for a data platform.
+If you're already familiar with Python and libraries such as Pandas, then PyFlink makes it simple
+to leverage the full capabilities of the Apache Flink ecosystem.
Review comment:
```suggestion
If you're already familiar with Python and libraries such as Pandas, then PyFlink makes it simpler to leverage the full capabilities of the Flink ecosystem. Depending on the level of abstraction you need, there are two different APIs that can be used in PyFlink:
```
##########
File path: docs/dev/python/index.md
##########
@@ -22,3 +23,43 @@ KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
+
+<img src="{% link /fig/pyflink.svg %}" alt="PyFlink" class="offset" width="50%" />
+
+PyFlink is a language for building unified batch and streaming workloads.
+This means real-time streaming pipelines, performing exploratory data
+analysis at scale, building machine learning pipelines, and creating ETLs for a data platform.
+If you're already familiar with Python and libraries such as Pandas, then PyFlink makes it simple
+to leverage the full capabilities of the Apache Flink ecosystem.
+
+The PyFlink Table API makes it simple to write powerful relational queries for building reports and
+ETL pipelines.
+At the same time, the PyFlink DataStream API gives developers access to low-level control over
+state and time, unlocking the full power of stream processing.
Review comment:
Think it's important here to give users something to identify with, like
SQL and Pandas. Also, I don't think state and time in the way that we think
about them are straightforward for Python users that don't know Flink.
##########
File path: docs/dev/python/index.md
##########
@@ -22,3 +23,43 @@ KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
+
+<img src="{% link /fig/pyflink.svg %}" alt="PyFlink" class="offset" width="50%" />
+
+PyFlink is a language for building unified batch and streaming workloads.
+This means real-time streaming pipelines, performing exploratory data
+analysis at scale, building machine learning pipelines, and creating ETLs for a data platform.
Review comment:
```suggestion
PyFlink is a Python API for Apache Flink that allows you to build scalable batch and streaming workloads, such as real-time data processing pipelines, large-scale exploratory data analysis, Machine Learning (ML) pipelines and ETL processes.
```
##########
File path: docs/dev/python/index.md
##########
@@ -22,3 +23,43 @@ KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
+
+<img src="{% link /fig/pyflink.svg %}" alt="PyFlink" class="offset" width="50%" />
+
+PyFlink is a language for building unified batch and streaming workloads.
+This means real-time streaming pipelines, performing exploratory data
+analysis at scale, building machine learning pipelines, and creating ETLs for a data platform.
Review comment:
The way it's phrased makes it sound a bit strict on the use cases. Since
this is in the Flink official docs, I'd also prefer not to call it a
"language", but rather a Python API for Flink.
Will comment with a suggestion.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]