olehborysevych commented on code in PR #29507: URL: https://github.com/apache/beam/pull/29507#discussion_r1411070125
##########
learning/prompts/documentation-lookup/04_basic_pcollections.md:
##########
@@ -0,0 +1,22 @@
+Prompt:
+What is a PCollection in Apache Beam?
+Response:
+A `PCollection` in Apache Beam is a core abstraction representing a distributed, multi-element data set or data stream. It's the primary data structure used in Apache Beam pipelines to handle large-scale data processing, both in batch and streaming modes.
+
+```python
+import apache_beam as beam
+
+with beam.Pipeline() as pipeline:
+    pcollection = pipeline | beam.Create([...])  # Create a PCollection
+```
+
+A `PCollection` can be either bounded or unbounded, making it versatile for different types of [data sources](https://beam.apache.org/documentation/basics/#pcollection). Bounded `PCollection`s represent a finite data set, such as files or databases, ideal for batch processing. Unbounded `PCollection`s, on the other hand, represent data streams that continuously grow over time, such as real-time event logs, suitable for stream processing.
+
+Beam’s computational patterns and transforms are focused on situations where distributed data-parallel computation is required. Therefore, a `PCollection` has the following key characteristics:
+ - All elements must be of the same type (with support for structured types).
+ - Every `PCollection` has a coder, which is a specification of the binary format of the elements.
+ - Elements cannot be altered after creation (immutability).
+ - There is no random access to individual elements of the collection.

Review Comment:
   Done
