[ 
https://issues.apache.org/jira/browse/BEAM-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865968#comment-16865968
 ] 

Ahmet Altay commented on BEAM-3000:
-----------------------------------

Offline comments from Robert:
"""
First can be implemented by a DoFn that processes a singleton PCollection and 
takes the "sampled" PCollection as a side input. 

SampleN, with N > number of elements that fit into memory, requires either 
skewing the results or inter-worker communication. This is especially true in 
the face of liquid sharding, where one does not know how many shards there 
might be, and each shard can vary in size by orders of magnitude. 

For very large N I would recommend the (more approximate) algorithm

count = pcoll | CountPerElement()
sampled = pcoll | beam.Filter(lambda x, count: random.random() < float(N) / 
count, beam.pvalue.singleton(count))
"""

This transform could also be named Any.
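
For illustration, the approximate algorithm quoted above for very large N can be modelled in plain Python. This is only a sketch of the idea, not Beam code and not the SDK's implementation: the function name `approximate_sample` and its `rng` parameter are hypothetical. In an actual pipeline the count would presumably be computed with something like `beam.combiners.Count.Globally()` and passed to `beam.Filter` as a singleton side input via `beam.pvalue.AsSingleton`, which the snippet above abbreviates.

```python
import random


def approximate_sample(elements, n, rng=None):
    """Keep each element independently with probability n / count.

    Plain-Python model of the two-pass idea (hypothetical helper):
    pass 1 counts the elements, pass 2 filters them. The result has
    roughly n elements, not exactly n.
    """
    if rng is None:
        rng = random.Random()
    elements = list(elements)
    # Pass 1: global count (this would be the singleton side input in Beam).
    count = len(elements)
    if count == 0:
        return []
    keep_probability = min(1.0, float(n) / count)
    # Pass 2: every element is decided independently, so no worker has to
    # hold the whole sample in memory or coordinate with other workers.
    return [x for x in elements if rng.random() < keep_probability]


if __name__ == "__main__":
    sample = approximate_sample(range(100000), 1000, random.Random(42))
    print(len(sample))  # close to 1000, but generally not exactly 1000
```

The trade-off is the one named in the comment: the output size is only approximately N, in exchange for a filter that needs no inter-worker communication.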

> No python equivalent of org.apache.beam.sdk.transforms.Sample.any(100)?
> -----------------------------------------------------------------------
>
>                 Key: BEAM-3000
>                 URL: https://issues.apache.org/jira/browse/BEAM-3000
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core
>            Reporter: Rodrigo Benenson
>            Priority: Critical
>              Labels: starter
>
> Java's org.apache.beam.sdk.transforms.Sample.any will return a PCollection 
> with bounded size (filtering strategy).
> The closest Python equivalent is beam.Sample.FixedSizeGlobally(n); however, 
> this version uses a combiner strategy, returning a list with n elements, 
> which does not scale if n is "bigger than what fits in memory".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
