psolomin commented on issue #25975: URL: https://github.com/apache/beam/issues/25975#issuecomment-1774933051
@je-ik Thanks for the tip. I've enabled the logs you mentioned, and I've seen logs like this after reducing parallelism 3 -> 2: > LateDataFilter: Dropping element at 2023-10-23T09:47:44.305Z for key:key: 0 shard: 0; window:[2023-10-23T09:47:00.000Z..2023-10-23T09:48:00.000Z) since too far behind inputWatermark:2023-10-23T09:50:03.009Z; outputWatermark:2023-10-23T09:50:03.009Z The other logs tell > WindowTracing - describePane: ON_TIME and they all are `ON_TIME` which contradicts `Dropping element ...` logs. > It might be the case that watermarks are not reconstructed correctly in this case in KinesisIO. Is it possible this issue is related to https://github.com/apache/beam/issues/28760? I also tried to assign timestamps to records: ``` public class AssignTimestampsDoFn extends DoFn<KinesisRecord, KinesisRecord> { @ProcessElement public void processElement(@Element KinesisRecord input, OutputReceiver<KinesisRecord> out) { out.outputWithTimestamp(input, Instant.now()); } @Override public Duration getAllowedTimestampSkew() { return Duration.standardSeconds(5); } } ``` ``` PCollection<KinesisRecord> windowedRecords = p.apply("Source", reader) .apply("Assign ts", ParDo.of(new AssignTimestampsDoFn())) .apply("Fixed windows", Window.<KinesisRecord>into(FixedWindows.of(Duration.standardSeconds(60)))); ``` After this, reducing parallelism was not causing data loss anymore. What might be the next steps, in your opinion? Moving forward with fixing https://github.com/apache/beam/issues/28760 in the way @sjmittal proposed? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
