Re: Beam Website Feedback: Apache Beam WordCount Examples

2022-10-25 Thread Ahmet Altay via dev
Thank you for reporting William.

Adding @Valentyn Tymofieiev . Perhaps we could start
with creating an issue?

On Mon, Oct 24, 2022 at 9:25 AM William Pietri 
wrote:

> Hi! I have inherited code using Apache Beam and I'm trying to figure out
> the right way to use it.
>
> This seemed like it would be a good page for me:
> https://beam.apache.org/get-started/wordcount-example/
>
> I jumped in with the MinimalWordCount example
> ,
> which says it's explaining this python code
> .
> However, the explanation doesn't match the code.
>
> The code from the text goes like this:
>
>
> pipeline = beam.Pipeline(options=beam_options)
>
> pipeline| beam.io.ReadFromText(input_file)
>
> | 'ExtractWords' >> beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x))
>
> | beam.combiners.Count.PerElement()
>
> | beam.MapTuple(lambda word, count: '%s: %s' % (word, count))
>
> | beam.io.WriteToText(output_path)
>
> Fair enough, I'd say. However, the code in the example file is quite
> different:
>
>   with beam.Pipeline(options=pipeline_options) as p:
> # Read the text file[pattern] into a PCollection.
> lines = p | ReadFromText(known_args.input)
>
> # Count the occurrences of each word.
> counts = (
> lines
> | 'Split' >> (
> beam.FlatMap(
> lambda x: re.findall(r'[A-Za-z\']+', 
> x)).with_output_types(str))
> | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
> | 'GroupAndSum' >> beam.CombinePerKey(sum))
>
> # Format the counts into a PCollection of strings.
> def format_result(word_count):
>   (word, count) = word_count
>   return '%s: %s' % (word, count)
>
> output = counts | 'Format' >> beam.Map(format_result)
>
> # Write the output using a "Write" transform that has side effects.
> # pylint: disable=expression-not-assigned
> output | WriteToText(known_args.output)
>
>
> I assume this is also probably good? But there are a number of differences
> here in both structure and content. This confusion is exactly what I don't
> need in intro documentation, so I'd love it if somebody made this
> consistent.
>
> Thanks,
>
> William
>


Beam Website Feedback: Apache Beam WordCount Examples

2022-10-24 Thread William Pietri
Hi! I have inherited code using Apache Beam and I'm trying to figure out 
the right way to use it.


This seemed like it would be a good page for me: 
https://beam.apache.org/get-started/wordcount-example/


I jumped in with the MinimalWordCount example 
, 
which says it's explaining this python code 
. 
However, the explanation doesn't match the code.


The code from the text goes like this:


   |pipeline = beam.Pipeline(options=beam_options)|

   |pipeline | beam.io.ReadFromText(input_file)|

   || 'ExtractWords' >> beam.FlatMap(lambda x:
   re.findall(r'[A-Za-z\']+', x)) |

   || beam.combiners.Count.PerElement() |

   || beam.MapTuple(lambda word, count: '%s: %s' % (word, count)) |

   || beam.io.WriteToText(output_path)|

||

||

||

Fair enough, I'd say. However, the code in the example file is quite 
different:


   |with beam.Pipeline(options=pipeline_options) as p: # Read the text
   file[pattern] into a PCollection. lines = p |
   ReadFromText(known_args.input) # Count the occurrences of each word.
   counts = ( lines | 'Split' >> ( beam.FlatMap( lambda x:
   re.findall(r'[A-Za-z\']+', x)).with_output_types(str)) |
   'PairWithOne' >> beam.Map(lambda x: (x, 1)) | 'GroupAndSum' >>
   beam.CombinePerKey(sum)) # Format the counts into a PCollection of
   strings. def format_result(word_count): (word, count) = word_count
   return '%s: %s' % (word, count) output = counts | 'Format' >>
   beam.Map(format_result) # Write the output using a "Write" transform
   that has side effects. # pylint: disable=expression-not-assigned
   output | WriteToText(known_args.output)|


I assume this is also probably good? But there are a number of 
differences here in both structure and content. This confusion is 
exactly what I don't need in intro documentation, so I'd love it if 
somebody made this consistent.


Thanks,

William