[GitHub] [beam] TheNeuralBit commented on a diff in pull request #23224: adding examples in schema transforms section of programming guide for python (changes for issue #21022)

GitBox Wed, 26 Oct 2022 15:50:27 -0700


TheNeuralBit commented on code in PR #23224:
URL: https://github.com/apache/beam/pull/23224#discussion_r1006249234



##########
website/www/site/content/en/documentation/programming-guide.md:
##########
@@ -4092,13 +4271,22 @@ that are likely associated with that transaction (both 
the user and product matc
 "natural join" - one in which the same field names are used on both the 
left-hand and right-hand sides of the join -
 and is specified with the `using` keyword:

Review Comment:
   Should this paragraph be in a language-java block? Similarly for the other 
Java-only schema transforms - it seems there's a lot of language that needs to 
be hidden when Python/Go are selected.



##########
website/www/site/content/en/documentation/programming-guide.md:
##########
@@ -3815,6 +3863,14 @@ The following
 purchasesByType.apply(Select.fieldNames("purchases{}.userId"));
 {{< /highlight >}}
 
+{{< paragraph class="language-py" >}}
+Support for Nested fields hasn't been developed for python SDK yet
+{{< /paragraph >}}
+
+{{< paragraph class="language-go" >}}
+Support for Nested fields hasn't been developed for python SDK yet

Review Comment:
   nit: in general, can we make these paragraphs read "Support for X hasn't been 
developed for the {Python,Go} SDK yet."?



##########
website/www/site/content/en/documentation/programming-guide.md:
##########
@@ -3726,62 +3726,99 @@ a SQL expression.
 Beam does not yet support Schema transforms natively in Go. However, it will 
be implemented with the following behavior.
 {{< /paragraph >}}
 
+{{< paragraph class="language-java" >}}
 #### 6.6.1. Field selection syntax
+{{< /paragraph >}}
 
+{{< paragraph class="language-java" >}}
 The advantage of schemas is that they allow referencing of element fields by 
name. Beam provides a selection syntax for
 referencing fields, including nested and repeated fields. This syntax is used 
by all of the schema transforms when
 referencing the fields they operate on. The syntax can also be used inside of 
a DoFn to specify which schema fields to
 process.
+{{< /paragraph >}}
 
+{{< paragraph class="language-java" >}}
 Addressing fields by name still retains type safety as Beam will check that 
schemas match at the time the pipeline graph
 is constructed. If a field is specified that does not exist in the schema, the 
pipeline will fail to launch. In addition,
 if a field is specified with a type that does not match the type of that field 
in the schema, the pipeline will fail to
 launch.
+{{< /paragraph >}}
 
+{{< paragraph class="language-java" >}}
 The following characters are not allowed in field names: . *  [ ] { }
+{{< /paragraph >}}
 
+{{< paragraph class="language-java" >}}
 ##### **Top-level fields**
+{{< /paragraph >}}
 
+{{< paragraph class="language-java" >}}
 In order to select a field at the top level of a schema, the name of the field 
is specified. For example, to select just
 the user ids from a `PCollection` of purchases one would write (using the 
`Select` transform)
+{{< /paragraph >}}
 
 {{< highlight java >}}
 purchases.apply(Select.fieldNames("userId"));
 {{< /highlight >}}
 
+{{< paragraph class="language-java" >}}
 ##### **Nested fields**
+{{< /paragraph >}}
 
+{{< paragraph class="language-java" >}}
 Individual nested fields can be specified using the dot operator. For example, 
to select just the postal code from the
  shipping address one would write
+{{< /paragraph >}}
 
 {{< highlight java >}}
 purchases.apply(Select.fieldNames("shippingAddress.postCode"));
 {{< /highlight >}}
-
+       
+<!-- {{< highlight py >}} 
+input_pc = ... # {"user_id": ..., "shipping_address": "post_code": ..., 
"bank": ..., "purchase_amount": ...}
+output_pc = input_pc | beam.Select(post_code=lambda item: 
str(item["shipping_address.post_code"]))                                        
       
+{{< /highlight >}} -->
+{{< paragraph class="language-java" >}}
 ##### **Wildcards**
+{{< /paragraph >}}
 
+{{< paragraph class="language-java" >}}
 The * operator can be specified at any nesting level to represent all fields 
at that level. For example, to select all
 shipping-address fields one would write
+{{< /paragraph >}}
 
 {{< highlight java >}}
 purchases.apply(Select.fieldNames("shippingAddress.*"));
 {{< /highlight >}}
 
+<!-- {{< highlight py >}} 
+input_pc = ... # {"user_id": ..., "shipping_address": "post_code": ..., 
"bank": ..., "purchase_amount": ...}
+output_pc = input_pc | beam.Select(shipping_address=lambda item: 
str(item["shipping_address.*"]))                                               
+{{< /highlight >}} -->

Review Comment:
   ```suggestion
   <!-- 
   {{< highlight py >}} 
   # TODO(https://github.com/apache/beam/issues/23275): Add support for projecting nested fields
   input_pc = ... # {"user_id": ..., "shipping_address": {"post_code": ...}, "bank": ..., "purchase_amount": ...}
   output_pc = input_pc | beam.Select("shipping_address.*")
   {{< /highlight >}} -->
   ```
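
For context, here is a rough pure-Python sketch of what the Java selection syntax (dot operator plus a trailing `*` wildcard) amounts to on a nested record. It only illustrates the semantics; `select_fields` and the sample `purchase` dict are hypothetical and are not part of the Beam Python SDK, which does not project nested fields yet.

```python
from typing import Any, Dict


def select_fields(row: Dict[str, Any], path: str) -> Any:
    """Resolve a dotted field path against a nested dict.

    A trailing ``*`` returns every field at that nesting level.
    Hypothetical helper for illustration only; not a Beam API.
    """
    current: Any = row
    for part in path.split("."):
        if part == "*":
            # Wildcard: return all fields at the current level.
            return dict(current)
        current = current[part]
    return current


# Sample element shaped like the one in the commented-out example.
purchase = {
    "user_id": "u1",
    "shipping_address": {"post_code": "98101", "city": "Seattle"},
    "bank": "b1",
    "purchase_amount": 42,
}

post_code = select_fields(purchase, "shipping_address.post_code")
address_fields = select_fields(purchase, "shipping_address.*")
```

A missing field raises `KeyError` here, which loosely mirrors the guide's statement that selecting a nonexistent field makes the pipeline fail to launch.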



##########
website/www/site/content/en/documentation/programming-guide.md:
##########
@@ -4081,6 +4236,30 @@ The result of this aggregation will have the following 
schema:
   </tbody>
 </table>
 <br/>
+{{< /paragraph >}}
+
+{{< paragraph class="language-py" >}}
+The result of this aggregation will have the following schema:
+<table>
+  <thead>
+    <tr class="header">
+      <th><b>Field Name</b></th>
+      <th><b>Field Type</b></th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>key</td>
+      <td>ROW{userId:STRING}</td>
+    </tr>
+    <tr>
+      <td>value</td>
+      <td>ROW{num_purchases: INT64, total_spendcents: INT64, top_pdghurchases: 
ARRAY[INT64]}</td>

Review Comment:
   ```suggestion
         <td>ROW{num_purchases: INT64, total_spendcents: INT64, top_purchases: 
ARRAY[INT64]}</td>
   ```



##########
website/www/site/content/en/documentation/programming-guide.md:
##########
@@ -4092,13 +4271,22 @@ that are likely associated with that transaction (both 
the user and product matc
 "natural join" - one in which the same field names are used on both the 
left-hand and right-hand sides of the join -
 and is specified with the `using` keyword:

Review Comment:
   If you'd rather not open that can of worms (there are a lot of schema 
transforms...), we could also just edit the Select/GroupBy-relevant sections in 
this PR, and leave the other ones as-is for now (and address in a follow-up PR).



##########
website/www/site/content/en/documentation/programming-guide.md:
##########
@@ -3749,39 +3749,87 @@ the user ids from a `PCollection` of purchases one 
would write (using the `Selec
 purchases.apply(Select.fieldNames("userId"));
 {{< /highlight >}}
 
+{{< highlight py >}}
+input_pc = ... # {"user_id": ...,"bank": ..., "purchase_amount": ...}
+output_pc = input_pc | beam.Select("user_id")
+{{< /highlight >}}
+
 ##### **Nested fields**
 
+{{< paragraph class="language-py" >}}
+Support for Nested fields hasn't been developed for python SDK yet
+{{< /paragraph >}}
+
+{{< paragraph class="language-go" >}}
+Support for Nested fields hasn't been developed for GO SDK yet
+{{< /paragraph >}}
+
+{{< paragraph class="language-java" >}}
 Individual nested fields can be specified using the dot operator. For example, 
to select just the postal code from the
  shipping address one would write
+{{< /paragraph >}}
 
 {{< highlight java >}}
 purchases.apply(Select.fieldNames("shippingAddress.postCode"));
 {{< /highlight >}}
-
+       
+<!-- {{< highlight py >}} 
+input_pc = ... # {"user_id": ..., "shipping_address": "post_code": ..., 
"bank": ..., "purchase_amount": ...}
+output_pc = input_pc | beam.Select(post_code=lambda item: 
str(item["shipping_address.post_code"]))                                        
       
+{{< /highlight >}} -->
 ##### **Wildcards**
 
+{{< paragraph class="language-py" >}}
+Support for wildcards hasn't been developed for python SDK yet
+{{< /paragraph >}}
+
+{{< paragraph class="language-go" >}}
+Support for wildcards hasn't been developed for GO SDK yet
+{{< /paragraph >}}
+
+{{< paragraph class="language-java" >}}
 The * operator can be specified at any nesting level to represent all fields 
at that level. For example, to select all
 shipping-address fields one would write
+{{< /paragraph >}}
 
 {{< highlight java >}}
 purchases.apply(Select.fieldNames("shippingAddress.*"));
 {{< /highlight >}}
 
+<!-- {{< highlight py >}} 
+input_pc = ... # {"user_id": ..., "shipping_address": "post_code": ..., 
"bank": ..., "purchase_amount": ...}
+output_pc = input_pc | beam.Select(shipping_address=lambda item: 
str(item["shipping_address.*"]))                                               
+{{< /highlight >}} -->
 ##### **Arrays**
 
+{{< paragraph class="language-java" >}}
 An array field, where the array element type is a row, can also have subfields 
of the element type addressed. When
 selected, the result is an array of the selected subfield type. For example
+{{< /paragraph >}}
+
+{{< paragraph class="language-py" >}}
+Support for Nested fields hasn't been developed for python SDK yet
+{{< /paragraph >}}
+
+{{< paragraph class="language-go" >}}
+Support for Nested fields hasn't been developed for GO SDK yet
+{{< /paragraph >}}

Review Comment:
   ```suggestion
   {{< paragraph class="language-py" >}}
   Support for Array fields hasn't been developed for the Python SDK yet.
   {{< /paragraph >}}
   
   {{< paragraph class="language-go" >}}
   Support for Array fields hasn't been developed for the Go SDK yet.
   {{< /paragraph >}}
   ```



##########
website/www/site/content/en/documentation/programming-guide.md:
##########
@@ -3815,6 +3863,14 @@ The following
 purchasesByType.apply(Select.fieldNames("purchases{}.userId"));
 {{< /highlight >}}
 
+{{< paragraph class="language-py" >}}
+Support for Nested fields hasn't been developed for python SDK yet
+{{< /paragraph >}}
+
+{{< paragraph class="language-go" >}}
+Support for Nested fields hasn't been developed for python SDK yet

Review Comment:
   ```suggestion
   {{< paragraph class="language-py" >}}
   Support for Map fields hasn't been developed for the Python SDK yet.
   {{< /paragraph >}}
   
   {{< paragraph class="language-go" >}}
   Support for Map fields hasn't been developed for the Go SDK yet.
   ```



##########
website/www/site/content/en/documentation/programming-guide.md:
##########
@@ -4037,6 +4156,29 @@ The output schema of this is:
   </tbody>
 </table>
 <br/>
+{{< /paragraph >}}
+
+{{< paragraph class="language-py" >}}
+<table>
+  <thead>
+    <tr class="header">
+      <th><b>Field Name</b></th>
+      <th><b>Field Type</b></th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>key</td>
+      <td>ROW{userId:STRING, bank:STRING}</td>
+    </tr>
+    <tr>
+      <td>values</td>
+      <td>ITERABLE[ROW[Purchase]]</td>
+    </tr>
+  </tbody>
+</table>
+<br/>
+{{< /paragraph >}}

Review Comment:
   It looks like our paragraph shortcode doesn't work when it has a table in 
it; both output schemas are shown:
   
![image](https://user-images.githubusercontent.com/675055/198152209-9dc9785e-7767-4f41-98d5-b9992b91b12d.png)
   
   How about we just update the Java example to use "bank" instead of the 
nested field? Then we can use the same output schema for both.
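
To make the shared output schema concrete: grouping on two top-level fields gives a `key` row holding those fields and a `values` collection of the original rows. Below is a minimal pure-Python sketch of that shape. The helper `group_by_fields` and the sample data are hypothetical, and this is not how Beam's `GroupBy` is implemented.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence


def group_by_fields(
    rows: Sequence[Dict[str, Any]], fields: Sequence[str]
) -> List[Dict[str, Any]]:
    """Group rows by the given top-level fields.

    Each result has a ``key`` dict (the grouping fields) and a
    ``values`` list (the full rows sharing that key).
    Hypothetical helper for illustration only; not a Beam API.
    """
    groups: Dict[tuple, List[Dict[str, Any]]] = defaultdict(list)
    for row in rows:
        # Build a hashable key from the selected top-level fields.
        groups[tuple(row[f] for f in fields)].append(row)
    return [
        {"key": dict(zip(fields, key)), "values": values}
        for key, values in groups.items()
    ]


purchases = [
    {"user_id": "u1", "bank": "b1", "purchase_amount": 10},
    {"user_id": "u1", "bank": "b1", "purchase_amount": 5},
    {"user_id": "u2", "bank": "b2", "purchase_amount": 7},
]

grouped = group_by_fields(purchases, ["user_id", "bank"])
```

Because only top-level fields are used for the key, the same `key`/`values` table would describe both the Java and Python examples.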
   


