[
https://issues.apache.org/jira/browse/BEAM-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865968#comment-16865968
]
Ahmet Altay commented on BEAM-3000:
-----------------------------------
Offline comments from Robert:
"""
First can be implemented by a DoFn that processes a singleton PCollection and
takes the "sampled" PCollection as a side input.
SampleN, with N > number of elements that fit into memory, requires either
skewing the results or inter-worker communication. This is especially true in
the face of liquid sharding, where one does not know how many shards there
might be, and each shard can vary in size by orders of magnitude.
For very large N I would recommend the (more approximate) algorithm
import random
import apache_beam as beam
count = pcoll | beam.combiners.Count.Globally()
sampled = pcoll | beam.Filter(
    lambda x, count: random.random() < float(N) / count,
    beam.pvalue.AsSingleton(count))
"""
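A slightly fuller sketch of the approximate algorithm Robert describes, for anyone who wants to try it. The helper names keep_element and approximate_sample are made up for illustration and are not part of the Beam SDK; the sketch assumes the total comes from beam.combiners.Count.Globally() and is passed as a singleton side input. The result has about N elements on average, not exactly N.

```python
import random


def keep_element(_element, total_count, n):
    """Bernoulli filter: keep each element with probability n / total_count."""
    if total_count <= 0:
        return False
    # random.random() is in [0, 1), so n >= total_count keeps every element.
    return random.random() < float(n) / total_count


def approximate_sample(pcoll, n):
    """Return a PCollection with roughly n elements of pcoll (hypothetical helper)."""
    # Imported lazily so keep_element can be exercised without Beam installed.
    import apache_beam as beam

    # Global element count, later consumed as a singleton side input.
    total = pcoll | "CountAll" >> beam.combiners.Count.Globally()
    # Each element is kept or dropped independently; nothing is gathered into a list.
    return pcoll | "KeepSome" >> beam.Filter(
        keep_element, beam.pvalue.AsSingleton(total), n)
```

Because every element is decided independently, no worker has to hold the sample in memory, which is the point of preferring this over a combiner for very large N.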
This transform could also be named Any.
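For the "First" idea in the quoted comment (a DoFn over a singleton PCollection that reads the sampled PCollection as a side input), a minimal sketch could look like the following. take_first and first_n are hypothetical names, and using beam.Create([None]) as the singleton and beam.pvalue.AsIter for the side input is an assumption about how one might wire it up, not a confirmed design for this issue.

```python
import itertools


def take_first(_seed, elements, n):
    """Return at most n items from the side-input iterable, in iteration order."""
    return itertools.islice(elements, n)


def first_n(pcoll, n):
    """Return a PCollection holding at most n elements of pcoll (hypothetical helper)."""
    # Imported lazily so take_first can be exercised without Beam installed.
    import apache_beam as beam

    # A one-element PCollection, so take_first is invoked exactly once.
    seed = pcoll.pipeline | "Seed" >> beam.Create([None])
    # The original PCollection is read lazily as an iterable side input and
    # iteration stops after n elements.
    return seed | "TakeFirst" >> beam.FlatMap(
        take_first, beam.pvalue.AsIter(pcoll), n)
```

The output is a PCollection rather than a single list, so downstream steps see the elements individually; which elements come first depends on the runner.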
> No python equivalent of org.apache.beam.sdk.transforms.Sample.any(100)?
> -----------------------------------------------------------------------
>
> Key: BEAM-3000
> URL: https://issues.apache.org/jira/browse/BEAM-3000
> Project: Beam
> Issue Type: Improvement
> Components: sdk-py-core
> Reporter: Rodrigo Benenson
> Priority: Critical
> Labels: starter
>
> Java's org.apache.beam.sdk.transforms.Sample.any will return a PCollection
> with bounded size (filtering strategy).
> The closest python equivalent is beam.Sample.FixedSizeGlobally(n); however,
> this version uses a combiner strategy, returning a list with n elements,
> which does not scale if n is "bigger than what fits in memory".