[
https://issues.apache.org/jira/browse/BEAM-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267646#comment-16267646
]
ASF GitHub Bot commented on BEAM-3247:
--------------------------------------
nevillelyh commented on a change in pull request #4175: [BEAM-3247] fix Sample.any performance
URL: https://github.com/apache/beam/pull/4175#discussion_r153334683
##########
File path: sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Sample.java
##########
@@ -209,29 +202,67 @@ public void populateDisplayData(DisplayData.Builder builder) {
    }
   /**
-   * A {@link DoFn} that returns up to limit elements from the side input PCollection.
+   * A {@link DoFn} that outputs up to limit elements.
    */
-  private static class SampleAnyDoFn<T> extends DoFn<Void, T> {
-    long limit;
-    final PCollectionView<Iterable<T>> iterableView;
+  private static class SampleAnyDoFn<T> extends DoFn<T, T> {
Review comment:
With the current implementation, I got `java.lang.IllegalArgumentException:
Attempted to get side input window for GlobalWindow from non-global WindowFn`
with the following snippet. Am I doing something wrong?
```java
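import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Sample;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class Test {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.create();
    Pipeline p = Pipeline.create(options);
    p.apply(Create.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12))
        // Assign each element x a timestamp of x * 10 seconds.
        .apply(ParDo.of(new DoFn<Integer, Integer>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            Integer x = c.element();
            c.outputWithTimestamp(x, new Instant(x * 10000));
          }
        }))
        // One-minute fixed windows, so the input is no longer in the global window.
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply(Sample.any(3));
    p.run();
  }
}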
```
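For reference, here is a rough plain-Java sketch of how the twelve elements in the snippet above would fall into one-minute windows. It assumes fixed windows are aligned to the epoch with no offset. The class and method names (`FixedWindowSketch`, `windowStart`, `groupByWindow`) are made up for illustration and are not part of Beam.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration only; no Beam classes are used. */
class FixedWindowSketch {
  /** Start (in millis) of the epoch-aligned window of the given size containing the timestamp. */
  static long windowStart(long timestampMillis, long sizeMillis) {
    return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
  }

  /** Groups 1..12 by window start, using the snippet's timestamp rule (x * 10000 ms). */
  static Map<Long, List<Integer>> groupByWindow(long sizeMillis) {
    Map<Long, List<Integer>> windows = new TreeMap<>();
    for (int x = 1; x <= 12; x++) {
      long start = windowStart(x * 10_000L, sizeMillis);
      windows.computeIfAbsent(start, k -> new ArrayList<>()).add(x);
    }
    return windows;
  }

  public static void main(String[] args) {
    groupByWindow(60_000L).forEach((start, elems) ->
        System.out.println("window starting at " + start + " ms: " + elems));
  }
}
```

Under that assumption the input spans three windows (elements 1 to 5, 6 to 11, and 12 alone). If the sample is taken per window, `Sample.any(3)` would be expected to emit up to 3 + 3 + 1 = 7 elements rather than 3 overall.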
> Sample.any memory constraint
> ----------------------------
>
> Key: BEAM-3247
> URL: https://issues.apache.org/jira/browse/BEAM-3247
> Project: Beam
> Issue Type: Improvement
> Components: sdk-java-core
> Affects Versions: 2.1.0
> Reporter: Neville Li
> Assignee: Neville Li
> Priority: Minor
>
> Right now {{Sample.any}} converts the collection to an iterable view and takes
> the first n elements from it in a side input. This may require materializing
> the entire collection to disk, which is potentially inefficient.
> https://github.com/apache/beam/blob/v2.1.0/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Sample.java#L74
> It can be fixed by applying a truncating `DoFn` first, then a combine into
> `List<T>` which limits the list size, and finally flattening the list.
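
The three steps proposed in the issue description (truncate, combine into a bounded list, flatten) can be sketched in plain Java roughly as follows. This is not the code from the pull request: bundles are modeled as ordinary lists, and `SampleAnySketch`, `truncate`, `combine`, and `sampleAny` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java stand-in for the three steps; each inner list models one bundle. */
class SampleAnySketch {
  /** Step 1: the truncating step keeps at most limit elements of a single bundle. */
  static <T> List<T> truncate(List<T> bundle, int limit) {
    return new ArrayList<>(bundle.subList(0, Math.min(limit, bundle.size())));
  }

  /** Step 2: merge the partial lists, never letting the accumulator exceed limit. */
  static <T> List<T> combine(List<List<T>> partials, int limit) {
    List<T> acc = new ArrayList<>();
    for (List<T> partial : partials) {
      for (T element : partial) {
        if (acc.size() >= limit) {
          return acc;
        }
        acc.add(element);
      }
    }
    return acc;
  }

  /** Steps 1 to 3: the elements of the returned list are what a flatten step would emit. */
  static <T> List<T> sampleAny(List<List<T>> bundles, int limit) {
    List<List<T>> truncated = new ArrayList<>();
    for (List<T> bundle : bundles) {
      truncated.add(truncate(bundle, limit));
    }
    return combine(truncated, limit);
  }

  public static void main(String[] args) {
    List<List<Integer>> bundles = Arrays.asList(
        Arrays.asList(1, 2, 3, 4, 5), Arrays.asList(6, 7), Arrays.asList(8, 9, 10, 11, 12));
    System.out.println(sampleAny(bundles, 3));
  }
}
```

The point of truncating first is that no stage ever holds more than `limit` elements per bundle or per accumulator, so the whole collection does not need to be materialized.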