Re: [PR] Update managed-io.md for release 2.74.0-RC1 [beam]

via GitHub Mon, 18 May 2026 10:04:15 -0700


gemini-code-assist[bot] commented on code in PR #38527:
URL: https://github.com/apache/beam/pull/38527#discussion_r3260669271



##########
website/www/site/content/en/documentation/io/managed-io.md:
##########
@@ -418,228 +421,228 @@ and Beam SQL is invoked via the Managed API under the hood.
         <code style="color: green">str</code>
       </td>
       <td>
-        A list of host/port pairs to use for establishing the initial 
connection to the Kafka cluster. The client will make use of all servers 
irrespective of which servers are specified here for bootstrapping—this list 
only impacts the initial hosts used to discover the full set of servers. | 
Format: host1:port1,host2:port2,...
+        A list of host/port pairs to use for establishing the initial 
connection to the Kafka cluster. The client will make use of all servers 
irrespective of which servers are specified here for bootstrapping—this list 
only impacts the initial hosts used to discover the full set of servers. This 
list should be in the form `host1:port1,host2:port2,...`

Review Comment:
   ![medium](https://www.gstatic.com/codereviewagent/medium-priority.svg)
   
   The `bootstrap_servers` description here uses a different format from the 
one in the `KAFKA Write` section (line 590). Use one format consistently 
across the documentation.
   
   ```suggestion
           A list of host/port pairs to use for establishing the initial 
connection to the Kafka cluster. The client will make use of all servers 
irrespective of which servers are specified here for bootstrapping—this list 
only impacts the initial hosts used to discover the full set of servers. | 
Format: host1:port1,host2:port2,...
   ```
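
For illustration only, here is a minimal sketch of what either wording of the format note implies for a `bootstrap_servers` value: a comma-separated list of `host:port` pairs. The helper name `parse_bootstrap_servers` is hypothetical and is not part of Beam or the Kafka client; it only shows the shape of the string.

```python
def parse_bootstrap_servers(value: str) -> list[tuple[str, int]]:
    """Split 'host1:port1,host2:port2,...' into (host, port) pairs.

    Hypothetical helper for illustration; not a Beam or Kafka API.
    """
    pairs = []
    for entry in value.split(","):
        entry = entry.strip()
        # rpartition splits on the last ':' so the port is always the tail.
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {entry!r}")
        pairs.append((host, int(port)))
    return pairs


print(parse_bootstrap_servers("broker1:9092,broker2:9092"))
# prints [('broker1', 9092), ('broker2', 9092)]
```

An entry without a port, such as `broker1`, raises `ValueError` in this sketch.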



##########
website/www/site/content/en/documentation/io/managed-io.md:
##########
@@ -418,228 +421,228 @@ and Beam SQL is invoked via the Managed API under the hood.
         <code style="color: green">str</code>
       </td>
       <td>
-        A list of host/port pairs to use for establishing the initial 
connection to the Kafka cluster. The client will make use of all servers 
irrespective of which servers are specified here for bootstrapping—this list 
only impacts the initial hosts used to discover the full set of servers. | 
Format: host1:port1,host2:port2,...
+        A list of host/port pairs to use for establishing the initial 
connection to the Kafka cluster. The client will make use of all servers 
irrespective of which servers are specified here for bootstrapping—this list 
only impacts the initial hosts used to discover the full set of servers. This 
list should be in the form `host1:port1,host2:port2,...`
       </td>
     </tr>
     <tr>
       <td>
-        <strong>format</strong>
+        <strong>topic</strong>
       </td>
       <td>
         <code style="color: green">str</code>
       </td>
       <td>
-        The encoding format for the data stored in Kafka. Valid options are: 
RAW,JSON,AVRO,PROTO
+        n/a

Review Comment:
   ![medium](https://www.gstatic.com/codereviewagent/medium-priority.svg)
   
   The description for the required field `topic` is "n/a". This should be 
updated with a meaningful description.
   
   ```suggestion
           The Kafka topic to read from.
   ```
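
Since several rows in this diff carry "n/a" descriptions, a small check could flag them before the next release. The sketch below is an assumption-laden illustration, not an existing Beam tool: it assumes each configuration row is a `<tr>` with exactly three `<td>` cells (name, type, description) and that required fields are the ones wrapped in `<strong>`, as in the hunks above. The function name `find_undocumented` is hypothetical.

```python
import re

# Assumed row layout: <tr> with three <td> cells (name, type, description).
ROW = re.compile(r"<tr>(.*?)</tr>", re.DOTALL)
CELL = re.compile(r"<td>(.*?)</td>", re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def find_undocumented(html: str) -> list[tuple[str, bool]]:
    """Return (field_name, is_required) for rows whose description is 'n/a'."""
    findings = []
    for row in ROW.findall(html):
        cells = CELL.findall(row)
        if len(cells) != 3:
            continue  # header rows use <th> and have no <td> cells
        name = TAG.sub("", cells[0]).strip()
        required = "<strong>" in cells[0]  # assumption: bold marks required
        if cells[2].strip().lower() == "n/a":
            findings.append((name, required))
    return findings
```

Running it over the rendered tables in `managed-io.md` would list every field that still needs a description, with the required ones marked `True`.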



##########
website/www/site/content/en/documentation/io/managed-io.md:
##########
@@ -809,6 +823,19 @@ and Beam SQL is invoked via the Managed API under the hood.
         For a streaming pipeline, sets the limit for lifting bundles into the 
direct write path.
       </td>
     </tr>
+    <tr>
+      <td>
+        distribution_mode
+      </td>
+      <td>
+        <code style="color: green">str</code>
+      </td>
+      <td>
+        Defines distribution of write data. Supported distributions:
+- none: don't shuffle rows (default)
+- hash: shuffle rows by partition key before writing data

Review Comment:
   ![medium](https://www.gstatic.com/codereviewagent/medium-priority.svg)
   
   The markdown list inside the HTML `<td>` tag might not render correctly 
because it lacks proper indentation and line breaks. For consistency with other 
entries in this file, consider using `<br>` for line breaks and ensuring the 
list items are indented.
   
   ```suggestion
           Defines distribution of write data. Supported distributions:<br>
           - none: don't shuffle rows (default)<br>
           - hash: shuffle rows by partition key before writing data
   ```



##########
website/www/site/content/en/documentation/io/managed-io.md:
##########
@@ -864,6 +891,18 @@ and Beam SQL is invoked via the Managed API under the hood.
 For more information on partition transforms, please visit 
https://iceberg.apache.org/spec/#partition-transforms.
       </td>
     </tr>
+    <tr>
+      <td>
+        sort_fields
+      </td>
+      <td>
+        <code>list[<span style="color: green;">str</span>]</code>
+      </td>
+      <td>
+        Fields used to set the table's sort order, applied when the table is 
created. Each entry has the form `<term> [asc|desc] [nulls first|nulls last]`, 
where `<term>` is a field name or one of the partition transforms (e.g. 
`bucket(col, 4)`, `day(ts)`). Direction defaults to ascending; null order 
defaults to nulls-first for ascending and nulls-last for descending. Note: this 
sets the table's declared sort order as metadata; it does not cause Beam to 
physically sort records before writing.
+For more information on sort orders, please visit 
https://iceberg.apache.org/spec/#sort-orders.

Review Comment:
   ![medium](https://www.gstatic.com/codereviewagent/medium-priority.svg)
   
   There is a raw newline between the description and the link. This can cause 
rendering issues within an HTML table cell in some markdown processors. 
Consider using a `<br>` and maintaining consistent indentation.
   
   ```suggestion
           Fields used to set the table's sort order, applied when the table is 
created. Each entry has the form `<term> [asc|desc] [nulls first|nulls last]`, 
where `<term>` is a field name or one of the partition transforms (e.g. 
`bucket(col, 4)`, `day(ts)`). Direction defaults to ascending; null order 
defaults to nulls-first for ascending and nulls-last for descending. Note: this 
sets the table's declared sort order as metadata; it does not cause Beam to 
physically sort records before writing.<br>
           For more information on sort orders, please visit 
https://iceberg.apache.org/spec/#sort-orders.
   ```
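
To make the entry grammar and its defaults concrete, here is a minimal sketch that parses one `sort_fields` entry of the form `<term> [asc|desc] [nulls first|nulls last]` and applies the defaults stated in the description (ascending; nulls-first for ascending, nulls-last for descending). The names `SortField` and `parse_sort_field` are hypothetical, and this is not how Beam or Iceberg parses the value; it only mirrors the documented rules.

```python
import re
from typing import NamedTuple


class SortField(NamedTuple):
    term: str
    direction: str
    null_order: str


# <term> [asc|desc] [nulls first|nulls last]; the term may contain spaces,
# as in "bucket(col, 4)", so it is matched lazily up to the optional suffixes.
_PATTERN = re.compile(
    r"^(?P<term>.+?)"
    r"(?:\s+(?P<direction>asc|desc))?"
    r"(?:\s+nulls\s+(?P<nulls>first|last))?$",
    re.IGNORECASE,
)


def parse_sort_field(entry: str) -> SortField:
    """Parse one sort_fields entry and fill in the documented defaults."""
    text = entry.strip()
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"invalid sort field: {entry!r}")
    direction = (match["direction"] or "asc").lower()
    if match["nulls"]:
        null_order = f"nulls {match['nulls'].lower()}"
    else:
        # Default: nulls-first for ascending, nulls-last for descending.
        null_order = "nulls first" if direction == "asc" else "nulls last"
    return SortField(match["term"], direction, null_order)
```

For example, `parse_sort_field("day(ts) desc")` yields the term `day(ts)` with direction `desc` and null order `nulls last`.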



##########
website/www/site/content/en/documentation/io/managed-io.md:
##########
@@ -1076,17 +1104,6 @@ For more information on table properties, please visit 
https://iceberg.apache.or
         n/a

Review Comment:
   ![medium](https://www.gstatic.com/codereviewagent/medium-priority.svg)
   
   The description for the required field `jdbc_url` is "n/a". This should be 
updated with a meaningful description.
   
   ```suggestion
           Connection URL for the JDBC sink.
   ```



##########
website/www/site/content/en/documentation/io/managed-io.md:
##########
@@ -650,7 +653,7 @@ and Beam SQL is invoked via the Managed API under the hood.
         <code style="color: green">str</code>
       </td>
       <td>
-        The schema in which the data is encoded in the Kafka topic. For AVRO 
data, this is a schema defined with AVRO schema syntax 
(https://avro.apache.org/docs/1.10.2/spec.html#schemas). For JSON data, this is 
a schema defined with JSON-schema syntax (https://json-schema.org/). If a URL 
to Confluent Schema Registry is provided, then this field is ignored, and the 
schema is fetched from Confluent Schema Registry.
+        n/a

Review Comment:
   ![medium](https://www.gstatic.com/codereviewagent/medium-priority.svg)
   
   The description for `schema` has been replaced with "n/a", which is a 
regression from the previous version. It should be restored to provide 
necessary context for users.
   
   ```suggestion
           The schema in which the data is encoded in the Kafka topic. For AVRO 
data, this is a schema defined with AVRO schema syntax 
(https://avro.apache.org/docs/1.10.2/spec.html#schemas). For JSON data, this is 
a schema defined with JSON-schema syntax (https://json-schema.org/). If a URL 
to Confluent Schema Registry is provided, then this field is ignored, and the 
schema is fetched from Confluent Schema Registry.
   ```



##########
website/www/site/content/en/documentation/io/managed-io.md:
##########
@@ -418,228 +421,228 @@ and Beam SQL is invoked via the Managed API under the hood.
         <code style="color: green">str</code>
       </td>
       <td>
-        A list of host/port pairs to use for establishing the initial 
connection to the Kafka cluster. The client will make use of all servers 
irrespective of which servers are specified here for bootstrapping—this list 
only impacts the initial hosts used to discover the full set of servers. | 
Format: host1:port1,host2:port2,...
+        A list of host/port pairs to use for establishing the initial 
connection to the Kafka cluster. The client will make use of all servers 
irrespective of which servers are specified here for bootstrapping—this list 
only impacts the initial hosts used to discover the full set of servers. This 
list should be in the form `host1:port1,host2:port2,...`
       </td>
     </tr>
     <tr>
       <td>
-        <strong>format</strong>
+        <strong>topic</strong>
       </td>
       <td>
         <code style="color: green">str</code>
       </td>
       <td>
-        The encoding format for the data stored in Kafka. Valid options are: 
RAW,JSON,AVRO,PROTO
+        n/a
       </td>
     </tr>
     <tr>
       <td>
-        <strong>topic</strong>
+        allow_duplicates
       </td>
       <td>
-        <code style="color: green">str</code>
+        <code style="color: orange">boolean</code>
       </td>
       <td>
-        n/a
+        If the Kafka read allows duplicates.
       </td>
     </tr>
     <tr>
       <td>
-        file_descriptor_path
+        confluent_schema_registry_subject
       </td>
       <td>
         <code style="color: green">str</code>
       </td>
       <td>
-        The path to the Protocol Buffer File Descriptor Set file. This file is 
used for schema definition and message serialization.
+        n/a
       </td>
     </tr>
     <tr>
       <td>
-        message_name
+        confluent_schema_registry_url
       </td>
       <td>
         <code style="color: green">str</code>
       </td>
       <td>
-        The name of the Protocol Buffer message to be used for schema 
extraction and data conversion.
+        n/a
       </td>
     </tr>
     <tr>
       <td>
-        producer_config_updates
+        consumer_config_updates
       </td>
       <td>
         <code>map[<span style="color: green;">str</span>, <span style="color: 
green;">str</span>]</code>
       </td>
       <td>
-        A list of key-value pairs that act as configuration parameters for 
Kafka producers. Most of these configurations will not be needed, but if you 
need to customize your Kafka producer, you may use this. See a detailed list: 
https://docs.confluent.io/platform/current/installation/configuration/producer-configs.html
+        A list of key-value pairs that act as configuration parameters for 
Kafka consumers. Most of these configurations will not be needed, but if you 
need to customize your Kafka consumer, you may use this. See a detailed list: 
https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html
       </td>
     </tr>
     <tr>
       <td>
-        schema
+        file_descriptor_path
       </td>
       <td>
         <code style="color: green">str</code>
       </td>
       <td>
-        n/a
+        The path to the Protocol Buffer File Descriptor Set file. This file is 
used for schema definition and message serialization.
       </td>
     </tr>
-  </table>
-</div>
-
-### `KAFKA` Read
-
-<div class="table-container-wrapper">
-  <table class="table table-bordered">
-    <tr>
-      <th>Configuration</th>
-      <th>Type</th>
-      <th>Description</th>
-    </tr>
     <tr>
       <td>
-        <strong>bootstrap_servers</strong>
+        format
       </td>
       <td>
         <code style="color: green">str</code>
       </td>
       <td>
-        A list of host/port pairs to use for establishing the initial 
connection to the Kafka cluster. The client will make use of all servers 
irrespective of which servers are specified here for bootstrapping—this list 
only impacts the initial hosts used to discover the full set of servers. This 
list should be in the form `host1:port1,host2:port2,...`
+        The encoding format for the data stored in Kafka. Valid options are: 
RAW,STRING,AVRO,JSON,PROTO
       </td>
     </tr>
     <tr>
       <td>
-        <strong>topic</strong>
+        message_name
       </td>
       <td>
         <code style="color: green">str</code>
       </td>
       <td>
-        n/a
+        The name of the Protocol Buffer message to be used for schema 
extraction and data conversion.
       </td>
     </tr>
     <tr>
       <td>
-        allow_duplicates
+        offset_deduplication
       </td>
       <td>
         <code style="color: orange">boolean</code>
       </td>
       <td>
-        If the Kafka read allows duplicates.
+        If the redistribute is using offset deduplication mode.
       </td>
     </tr>
     <tr>
       <td>
-        confluent_schema_registry_subject
+        redistribute_by_record_key
       </td>
       <td>
-        <code style="color: green">str</code>
+        <code style="color: orange">boolean</code>
       </td>
       <td>
-        n/a
+        If the redistribute keys by the Kafka record key.
       </td>
     </tr>
     <tr>
       <td>
-        confluent_schema_registry_url
+        redistribute_num_keys
       </td>
       <td>
-        <code style="color: green">str</code>
+        <code style="color: #f54251">int32</code>
       </td>
       <td>
-        n/a
+        The number of keys for redistributing Kafka inputs.
       </td>
     </tr>
     <tr>
       <td>
-        consumer_config_updates
+        redistributed
       </td>
       <td>
-        <code>map[<span style="color: green;">str</span>, <span style="color: 
green;">str</span>]</code>
+        <code style="color: orange">boolean</code>
       </td>
       <td>
-        A list of key-value pairs that act as configuration parameters for 
Kafka consumers. Most of these configurations will not be needed, but if you 
need to customize your Kafka consumer, you may use this. See a detailed list: 
https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html
+        If the Kafka read should be redistributed.
       </td>
     </tr>
     <tr>
       <td>
-        file_descriptor_path
+        schema
       </td>
       <td>
         <code style="color: green">str</code>
       </td>
       <td>
-        The path to the Protocol Buffer File Descriptor Set file. This file is 
used for schema definition and message serialization.
+        The schema in which the data is encoded in the Kafka topic. For AVRO 
data, this is a schema defined with AVRO schema syntax 
(https://avro.apache.org/docs/1.10.2/spec.html#schemas). For JSON data, this is 
a schema defined with JSON-schema syntax (https://json-schema.org/). If a URL 
to Confluent Schema Registry is provided, then this field is ignored, and the 
schema is fetched from Confluent Schema Registry.
       </td>
     </tr>
+  </table>
+</div>
+
+### `KAFKA` Write
+
+<div class="table-container-wrapper">
+  <table class="table table-bordered">
+    <tr>
+      <th>Configuration</th>
+      <th>Type</th>
+      <th>Description</th>
+    </tr>
     <tr>
       <td>
-        format
+        <strong>bootstrap_servers</strong>
       </td>
       <td>
         <code style="color: green">str</code>
       </td>
       <td>
-        The encoding format for the data stored in Kafka. Valid options are: 
RAW,STRING,AVRO,JSON,PROTO
+        A list of host/port pairs to use for establishing the initial 
connection to the Kafka cluster. The client will make use of all servers 
irrespective of which servers are specified here for bootstrapping—this list 
only impacts the initial hosts used to discover the full set of servers. | 
Format: host1:port1,host2:port2,...
       </td>
     </tr>
     <tr>
       <td>
-        message_name
+        <strong>format</strong>
       </td>
       <td>
         <code style="color: green">str</code>
       </td>
       <td>
-        The name of the Protocol Buffer message to be used for schema 
extraction and data conversion.
+        The encoding format for the data stored in Kafka. Valid options are: 
RAW,JSON,AVRO,PROTO
       </td>
     </tr>
     <tr>
       <td>
-        offset_deduplication
+        <strong>topic</strong>
       </td>
       <td>
-        <code style="color: orange">boolean</code>
+        <code style="color: green">str</code>
       </td>
       <td>
-        If the redistribute is using offset deduplication mode.
+        n/a

Review Comment:
   ![medium](https://www.gstatic.com/codereviewagent/medium-priority.svg)
   
   The description for the required field `topic` is "n/a". This should be 
updated with a meaningful description.
   
   ```suggestion
           The Kafka topic to write to.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

