[jira] [Created] (FLINK-35528) Skip execution of interruptible mails when yielding
Piotr Nowojski created FLINK-35528: -- Summary: Skip execution of interruptible mails when yielding Key: FLINK-35528 URL: https://issues.apache.org/jira/browse/FLINK-35528 Project: Flink Issue Type: Sub-task Components: Runtime / Checkpointing, Runtime / Task Affects Versions: 1.20.0 Reporter: Piotr Nowojski Assignee: Piotr Nowojski When operators are yielding, for example waiting for async state access to complete before a checkpoint, it would be beneficial to not execute interruptible mails. Otherwise continuation mail for firing timers would be continuously re-enqeueed. To achieve that MailboxExecutor must be aware which mails are interruptible. The easiest way to achieve this is to set MIN_PRIORITY for interruptible mails. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-35518) CI Bot doesn't run on PRs
Piotr Nowojski created FLINK-35518: -- Summary: CI Bot doesn't run on PRs Key: FLINK-35518 URL: https://issues.apache.org/jira/browse/FLINK-35518 Project: Flink Issue Type: Bug Components: Build System / CI Affects Versions: 1.20.0 Reporter: Piotr Nowojski Doesn't want to run on my PR/branch. I was doing force-pushes, rebases, asking flink bot to run, closed and opened new PR, but nothing helped https://github.com/apache/flink/pull/24868 https://github.com/apache/flink/pull/24883 I've heard others were having similar problems recently. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-35420) WordCountMapredITCase fails to compile in IntelliJ
Piotr Nowojski created FLINK-35420: -- Summary: WordCountMapredITCase fails to compile in IntelliJ Key: FLINK-35420 URL: https://issues.apache.org/jira/browse/FLINK-35420 Project: Flink Issue Type: Bug Components: Connectors / Hadoop Compatibility Affects Versions: 1.20.0 Reporter: Piotr Nowojski {noformat} flink-connectors/flink-hadoop-compatibility/src/test/scala/org/apache/flink/api/hadoopcompatibility/scala/WordCountMapredITCase.scala:42:8 value isFalse is not a member of ?0 possible cause: maybe a semicolon is missing before `value isFalse'? .isFalse() {noformat} Might be caused by: https://youtrack.jetbrains.com/issue/SCL-20679 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-35065) Add numFiredTimers and numFiredTimersPerSecond metrics
Piotr Nowojski created FLINK-35065: -- Summary: Add numFiredTimers and numFiredTimersPerSecond metrics Key: FLINK-35065 URL: https://issues.apache.org/jira/browse/FLINK-35065 Project: Flink Issue Type: Improvement Components: Runtime / Metrics, Runtime / Task Affects Versions: 1.19.0 Reporter: Piotr Nowojski Assignee: Piotr Nowojski Fix For: 1.20.0 Currently there is now way of knowing how many timers are being fired by Flink, so it's impossible to distinguish, even using code profiling, if operator is firing only a couple of heavy timers per second using ~100% of the CPU time, vs firing thousands of timer per seconds. We could add the following metrics to address this issue: * numFiredTimers - total number of fired timers per operator * numFiredTimersPerSecond - per second rate of firing timers per operator -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-35051) Weird priorities when processing unaligned checkpoints
Piotr Nowojski created FLINK-35051: -- Summary: Weird priorities when processing unaligned checkpoints Key: FLINK-35051 URL: https://issues.apache.org/jira/browse/FLINK-35051 Project: Flink Issue Type: Improvement Components: Runtime / Checkpointing, Runtime / Network, Runtime / Task Affects Versions: 1.18.1, 1.19.0, 1.17.2 Reporter: Piotr Nowojski While looking through the code I noticed that `StreamTask` is processing unaligned checkpoints in strange order/priority. The end result is that unaligned checkpoint `Start Delay` time can be increased, and triggering checkpoints in `StreamTask` can be unnecessary delayed by other mailbox actions in the system, like for example: * processing time timers * `AsyncWaitOperator` results * ... Incoming UC barrier is treated as a priority event by the network stack (it will be polled from the input before anything else). This is what we want, but polling elements from network stack has lower priority then processing enqueued mailbox actions. Secondly, if AC barrier timeout to UC, that's done via a mailbox action, but this mailbox action is also not prioritised in any way, so other mailbox actions could be unnecessarily executed first. On top of that there is a clash of two separate concepts here: # Mailbox priority. yieldToDownstream - so in a sense reverse to what we would like to have for triggering checkpoint, but that only kicks in #yield() calls, where it's actually correct, that operator in a middle of execution can not yield to checkpoint - it should only yield to downstream. # Control mails in mailbox executor - cancellation is done via that, it bypasses whole mailbox queue. # Priority events in the network stack. It's unfortunate that 1. vs 3. has a naming clash, as priority name is used in both things, and highest network priority event containing UC barrier, when executed via mailbox has actually the lowest mailbox priority. Control mails mechanism is a kind of priority mails executed out of order, but doesn't generalise well for use in checkpointing. This whole thing should be re-worked at some point. Ideally what we would like have is that: * mail to convert AC barriers to UC * polling UC barrier from the network input * checkpoint trigger via RPC for source tasks should be processed first, with an exception of yieldToDownstream, where current mailbox priorities should be adhered. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-34913) ConcurrentModificationException SubTaskInitializationMetricsBuilder.addDurationMetric
Piotr Nowojski created FLINK-34913: -- Summary: ConcurrentModificationException SubTaskInitializationMetricsBuilder.addDurationMetric Key: FLINK-34913 URL: https://issues.apache.org/jira/browse/FLINK-34913 Project: Flink Issue Type: Bug Components: Runtime / Metrics, Runtime / State Backends Affects Versions: 1.19.0 Reporter: Piotr Nowojski Fix For: 1.19.1 The following failures can occur during job's recovery when using clip & ingest {noformat} java.util.ConcurrentModificationException at java.base/java.util.HashMap.compute(HashMap.java:1230) at org.apache.flink.runtime.checkpoint.SubTaskInitializationMetricsBuilder.addDurationMetric(SubTaskInitializationMetricsBuilder.java:49) at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.runAndReportDuration(RocksDBIncrementalRestoreOperation.java:988) at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.lambda$createAsyncCompactionTask$2(RocksDBIncrementalRestoreOperation.java:291) at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) {noformat} {noformat} java.util.ConcurrentModificationException at java.base/java.util.HashMap.compute(HashMap.java:1230) at org.apache.flink.runtime.checkpoint.SubTaskInitializationMetricsBuilder.addDurationMetric(SubTaskInitializationMetricsBuilder.java:49) at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreStateAndGates(StreamTask.java:794) at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$restoreInternal$3(StreamTask.java:744) at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55) at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:744) at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:704) at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:953) at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:922) at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:746) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562) at java.base/java.lang.Thread.run(Thread.java:829) {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-34430) Akka frame size exceeded with many ByteStreamStateHandle being used
Piotr Nowojski created FLINK-34430: -- Summary: Akka frame size exceeded with many ByteStreamStateHandle being used Key: FLINK-34430 URL: https://issues.apache.org/jira/browse/FLINK-34430 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.18.1, 1.17.2, 1.16.3, 1.19.0 Reporter: Piotr Nowojski -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33775) Report JobInitialization traces
Piotr Nowojski created FLINK-33775: -- Summary: Report JobInitialization traces Key: FLINK-33775 URL: https://issues.apache.org/jira/browse/FLINK-33775 Project: Flink Issue Type: Sub-task Reporter: Piotr Nowojski Assignee: Piotr Nowojski -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33709) Report CheckpointStats as Spans
Piotr Nowojski created FLINK-33709: -- Summary: Report CheckpointStats as Spans Key: FLINK-33709 URL: https://issues.apache.org/jira/browse/FLINK-33709 Project: Flink Issue Type: Sub-task Reporter: Piotr Nowojski -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33708) Add Span and TraceReporter concepts
Piotr Nowojski created FLINK-33708: -- Summary: Add Span and TraceReporter concepts Key: FLINK-33708 URL: https://issues.apache.org/jira/browse/FLINK-33708 Project: Flink Issue Type: Sub-task Reporter: Piotr Nowojski -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33696) FLIP-385: Add OpenTelemetryTraceReporter and OpenTelemetryMetricReporter
Piotr Nowojski created FLINK-33696: -- Summary: FLIP-385: Add OpenTelemetryTraceReporter and OpenTelemetryMetricReporter Key: FLINK-33696 URL: https://issues.apache.org/jira/browse/FLINK-33696 Project: Flink Issue Type: New Feature Components: Runtime / Metrics Reporter: Piotr Nowojski Assignee: Piotr Nowojski Fix For: 1.19.0 h1. Motivation [FLIP-384|https://cwiki.apache.org/confluence/display/FLINK/FLIP-384%3A+Introduce+TraceReporter+and+use+it+to+create+checkpointing+and+recovery+traces] is adding TraceReporter interface. However with [FLIP-384|https://cwiki.apache.org/confluence/display/FLINK/FLIP-384%3A+Introduce+TraceReporter+and+use+it+to+create+checkpointing+and+recovery+traces] alone, Log4jTraceReporter would be the only available implementation of TraceReporter interface, which is not very helpful. In this FLIP I’m proposing to contribute both MetricExporter and TraceReporter implementation using OpenTelemetry. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33697) FLIP-386: Support adding custom metrics in Recovery Spans
Piotr Nowojski created FLINK-33697: -- Summary: FLIP-386: Support adding custom metrics in Recovery Spans Key: FLINK-33697 URL: https://issues.apache.org/jira/browse/FLINK-33697 Project: Flink Issue Type: New Feature Components: Runtime / Metrics, Runtime / State Backends Reporter: Piotr Nowojski Assignee: Piotr Nowojski Fix For: 1.19.0 h1. Motivation FLIP-386 is building on top of [FLIP-384|https://cwiki.apache.org/confluence/display/FLINK/FLIP-384%3A+Introduce+TraceReporter+and+use+it+to+create+checkpointing+and+recovery+traces]. The intention here is to add a capability for state backends to attach custom attributes during recovery to recovery spans. For example RocksDBIncrementalRestoreOperation could report both remote download time and time to actually clip/ingest the RocksDB instances after rescaling. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33695) FLIP-384: Introduce TraceReporter and use it to create checkpointing and recovery traces
Piotr Nowojski created FLINK-33695: -- Summary: FLIP-384: Introduce TraceReporter and use it to create checkpointing and recovery traces Key: FLINK-33695 URL: https://issues.apache.org/jira/browse/FLINK-33695 Project: Flink Issue Type: New Feature Components: Runtime / Checkpointing, Runtime / Metrics Reporter: Piotr Nowojski Assignee: Piotr Nowojski Fix For: 1.19.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33338) Bump up RocksDB version to 7.x
Piotr Nowojski created FLINK-8: -- Summary: Bump up RocksDB version to 7.x Key: FLINK-8 URL: https://issues.apache.org/jira/browse/FLINK-8 Project: Flink Issue Type: Sub-task Components: Runtime / State Backends Reporter: Piotr Nowojski We need to bump RocksDB in order to be able to use new IngestDB and ClipDB commands. If some of the required changes haven't been merged to Facebook/RocksDB, we should cherry-pick and include them in our FRocksDB fork. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33337) Expose IngestDB and ClipDB in the official RocksDB API
Piotr Nowojski created FLINK-7: -- Summary: Expose IngestDB and ClipDB in the official RocksDB API Key: FLINK-7 URL: https://issues.apache.org/jira/browse/FLINK-7 Project: Flink Issue Type: Sub-task Components: Runtime / State Backends Reporter: Piotr Nowojski Assignee: Yue Ma Remaining open PR: https://github.com/facebook/rocksdb/pull/11646 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33071) Log checkpoint statistics
Piotr Nowojski created FLINK-33071: -- Summary: Log checkpoint statistics Key: FLINK-33071 URL: https://issues.apache.org/jira/browse/FLINK-33071 Project: Flink Issue Type: New Feature Components: Runtime / Checkpointing, Runtime / Metrics Affects Versions: 1.18.0 Reporter: Piotr Nowojski Assignee: Piotr Nowojski This is a stop gap solution until we have a proper way of solving FLINK-23411. The plan is to dump JSON serialised checkpoint statistics into Flink JM's log, with a {{DEBUG}} level. This could be used to analyse what has happened with a certain checkpoint in the past. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-32303) Incorrect error message in KafkaSource
Piotr Nowojski created FLINK-32303: -- Summary: Incorrect error message in KafkaSource Key: FLINK-32303 URL: https://issues.apache.org/jira/browse/FLINK-32303 Project: Flink Issue Type: Bug Components: Connectors / Kafka Affects Versions: 1.17.1, 1.18.0 Reporter: Piotr Nowojski When exception is thrown from an operator chained with a KafkaSource, KafkaSource is returning a misleading error, like shown below: {noformat} java.io.IOException: Failed to deserialize consumer record due to at org.apache.flink.connector.kafka.source.reader.KafkaRecordEmitter.emitRecord(KafkaRecordEmitter.java:56) ~[classes/:?] at org.apache.flink.connector.kafka.source.reader.KafkaRecordEmitter.emitRecord(KafkaRecordEmitter.java:33) ~[classes/:?] at org.apache.flink.connector.base.source.reader.SourceReaderBase.pollNext(SourceReaderBase.java:160) ~[classes/:?] at org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:417) ~[classes/:?] at org.apache.flink.streaming.runtime.io.StreamTaskSourceInput.emitNext(StreamTaskSourceInput.java:68) ~[classes/:?] at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:562) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:852) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:801) ~[classes/:?] at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952) ~[classes/:?] at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:931) [classes/:?] at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745) [classes/:?] at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562) [classes/:?] at java.lang.Thread.run(Thread.java:829) [?:?] Caused by: org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: Could not forward element to next operator at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:92) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.SourceOperatorStreamTask$AsyncDataOutputToOutput.emitRecord(SourceOperatorStreamTask.java:309) ~[classes/:?] at org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:110) ~[classes/:?] at org.apache.flink.connector.kafka.source.reader.KafkaRecordEmitter$SourceOutputWrapper.collect(KafkaRecordEmitter.java:67) ~[classes/:?] at org.apache.flink.api.common.serialization.DeserializationSchema.deserialize(DeserializationSchema.java:84) ~[classes/:?] at org.apache.flink.connector.kafka.source.reader.deserializer.KafkaValueOnlyDeserializationSchemaWrapper.deserialize(KafkaValueOnlyDeserializationSchemaWrapper.java:51) ~[classes/:?] at org.apache.flink.connector.kafka.source.reader.KafkaRecordEmitter.emitRecord(KafkaRecordEmitter.java:53) ~[classes/:?] ... 14 more Caused by: org.apache.flink.runtime.operators.testutils.ExpectedTestException: Failover! at org.apache.flink.test.ManySmallJobsBenchmarkITCase$ThrottlingAndFailingIdentityMap.map(ManySmallJobsBenchmarkITCase.java:263) ~[test-classes/:?] at org.apache.flink.test.ManySmallJobsBenchmarkITCase$ThrottlingAndFailingIdentityMap.map(ManySmallJobsBenchmarkITCase.java:243) ~[test-classes/:?] at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:38) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.SourceOperatorStreamTask$AsyncDataOutputToOutput.emitRecord(SourceOperatorStreamTask.java:309) ~[classes/:?] at org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:110) ~[classes/:?]
[jira] [Created] (FLINK-31045) In IntelliJ: flink-clients cannot find symbol symbol: class TestUserClassLoaderJobLib
Piotr Nowojski created FLINK-31045: -- Summary: In IntelliJ: flink-clients cannot find symbol symbol: class TestUserClassLoaderJobLib Key: FLINK-31045 URL: https://issues.apache.org/jira/browse/FLINK-31045 Project: Flink Issue Type: Bug Components: Build System, Client / Job Submission Affects Versions: 1.18.0 Reporter: Piotr Nowojski When trying to build/run some tests in the IDE, IntelliJ is reporting the following compilation failure: {noformat} /XXX/flink/flink-clients/src/test/java/org/apache/flink/client/testjar/TestUserClassLoaderJob.java:33:38 java: cannot find symbol symbol: class TestUserClassLoaderJobLib location: class org.apache.flink.client.testjar.TestUserClassLoaderJob {noformat} A workaround seems to be to: # right click on {{flink-clients}} # Rebuild module (flink-clients) The issue is probably related to the comment from the flink-clients/pom.xml file: {noformat} {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-30791) Codespeed machine is not responding
Piotr Nowojski created FLINK-30791: -- Summary: Codespeed machine is not responding Key: FLINK-30791 URL: https://issues.apache.org/jira/browse/FLINK-30791 Project: Flink Issue Type: Bug Components: Benchmarks Affects Versions: 1.16.0, 1.17.0 Reporter: Piotr Nowojski Neither speedcenter: [http://codespeed.dak8s.net:8000/] nor jenkins: [http://codespeed.dak8s.net:8080|http://codespeed.dak8s.net:8080/] are responding -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-30155) Pretty print MutatedConfigurationException
Piotr Nowojski created FLINK-30155: -- Summary: Pretty print MutatedConfigurationException Key: FLINK-30155 URL: https://issues.apache.org/jira/browse/FLINK-30155 Project: Flink Issue Type: New Feature Components: API / DataStream Affects Versions: 1.17.0 Reporter: Piotr Nowojski Assignee: Piotr Nowojski Fix For: 1.17.0 Currently MutatedConfigurationException is printed as: {noformat} org.apache.flink.client.program.MutatedConfigurationException: Configuration execution.sorted-inputs.enabled:true not allowed. Configuration execution.runtime-mode was changed from STREAMING to BATCH. Configuration execution.checkpointing.interval:500 ms not allowed in the configuration object CheckpointConfig. Configuration execution.checkpointing.mode:EXACTLY_ONCE not allowed in the configuration object CheckpointConfig. Configuration pipeline.max-parallelism:1024 not allowed in the configuration object ExecutionConfig. Configuration parallelism.default:25 not allowed in the configuration object ExecutionConfig. at org.apache.flink.client.program.StreamContextEnvironment.checkNotAllowedConfigurations(StreamContextEnvironment.java:235) at org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:175) at org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:115) at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2049) at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2027) {noformat} Which is slightly confusing. First not allowed configuration is listed in the same line as the exception name, which (especially if wrapped) can make it more difficult than necessary for user to understand that this is a list of violations. I'm proposing to change it to: {noformat} org.apache.flink.client.program.MutatedConfigurationException: Not allowed configuration change(s) were detected: - Configuration execution.sorted-inputs.enabled:true not allowed. - Configuration execution.runtime-mode was changed from STREAMING to BATCH. - Configuration execution.checkpointing.interval:500 ms not allowed in the configuration object CheckpointConfig. - Configuration execution.checkpointing.mode:EXACTLY_ONCE not allowed in the configuration object CheckpointConfig. - Configuration pipeline.max-parallelism:1024 not allowed in the configuration object ExecutionConfig. - Configuration parallelism.default:25 not allowed in the configuration object ExecutionConfig. at org.apache.flink.client.program.StreamContextEnvironment.checkNotAllowedConfigurations(StreamContextEnvironment.java:235) at org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:175) at org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:115) at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2049) at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2027) {noformat} To make it more clear. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-30070) Create savepoints without side effects
Piotr Nowojski created FLINK-30070: -- Summary: Create savepoints without side effects Key: FLINK-30070 URL: https://issues.apache.org/jira/browse/FLINK-30070 Project: Flink Issue Type: New Feature Components: API / DataStream, Runtime / Checkpointing Affects Versions: 1.14.6, 1.15.2, 1.16.0 Reporter: Piotr Nowojski Side effects are any external state - a state that is stored not in Flink, but in an external system, like for example connectors transactions (KafkaSink, ...). We shouldn't be relaying on the external systems for storing part of the job's state, especially for any long period of time. The most prominent issue is that Kafka transactions can time out, leading to a data loss if transaction hasn't been committed. Stop-with-savepoint, currently guarantee that {{notifyCheckpointCompleted}} call will be issued, so properly implemented operators are guaranteed to committed it's state. However this information is currently not stored in the checkpoint in any way ( FLINK-16419 ). Larger issue is how to deal with savepoints, since there we currently do not have any guarantees that transactions have been committed. Some potential solution might be to expand API (like {{CheckpointedFunction}} ), to let the operators/functions know, that they should close/commit/clear/deal with external state differently and use that API during stop-with-savepoint + rework how regular savepoints are handled. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-30068) Allow users to configure what to do with errors while committing transactions during recovery in KafkaSink
Piotr Nowojski created FLINK-30068: -- Summary: Allow users to configure what to do with errors while committing transactions during recovery in KafkaSink Key: FLINK-30068 URL: https://issues.apache.org/jira/browse/FLINK-30068 Project: Flink Issue Type: Sub-task Components: Connectors / Kafka Affects Versions: 1.15.2, 1.16.0, 1.17.0 Reporter: Piotr Nowojski Fix For: 1.17.0 Currently it looks like {{KafkaSink}} fails the job on any failures to commit transactions. As [reported by the user|https://lists.apache.org/thread/4f6bb8j6qtvgp888y4dxgj86x3kw2b11], this makes impossible for jobs to recover from older Savepoints. {noformat} 2022-11-16 10:01:07.168 [flink-akka.actor.default-dispatcher-13] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Balances aggreagation ETH -> (Filter -> Map -> Save to Kafka realtime ETH: Writer -> Save to Kafka realtime ETH: Committer, Filter -> Map -> Save to Kafka daily ETH: Writer -> Save to Kafka daily ETH: Committer) (4/5) (6d4d91ab8657bba830695b9a011f7db6) switched from INITIALIZING to RUNNING. 2022-11-16 10:01:37.222 [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 65436 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1668592897201 for job . 2022-11-16 10:01:39.082 [flink-akka.actor.default-dispatcher-13] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Balances aggreagation ETH -> (Filter -> Map -> Save to Kafka realtime ETH: Writer -> Save to Kafka realtime ETH: Committer, Filter -> Map -> Save to Kafka daily ETH: Writer -> Save to Kafka daily ETH: Committer) (1/5) (cfaca46e7f4dc89629cdcaed5b48c059) switched from RUNNING to FAILED on 10.42.145.181:33297-efc328 @ eth-top-holders-v2-flink-taskmanager-0.eth-top-holders-v2-flink-taskmanager.flink.svc.cluster.local (dataPort=43125). java.io.IOException: Could not perform checkpoint 65436 for operator Balances aggreagation ETH -> (Filter -> Map -> Save to Kafka realtime ETH: Writer -> Save to Kafka realtime ETH: Committer, Filter -> Map -> Save to Kafka daily ETH: Writer -> Save to Kafka daily ETH: Committer) (1/5)#0. at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:1210) at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierHandler.notifyCheckpoint(CheckpointBarrierHandler.java:147) at org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.triggerCheckpoint(SingleCheckpointBarrierHandler.java:287) at org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.access$100(SingleCheckpointBarrierHandler.java:64) at org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$ControllerImpl.triggerGlobalCheckpoint(SingleCheckpointBarrierHandler.java:493) at org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.triggerGlobalCheckpoint(AbstractAlignedBarrierHandlerState.java:74) at org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.barrierReceived(AbstractAlignedBarrierHandlerState.java:66) at org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.lambda$processBarrier$2(SingleCheckpointBarrierHandler.java:234) at org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.markCheckpointAlignedAndTransformState(SingleCheckpointBarrierHandler.java:262) at org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.processBarrier(SingleCheckpointBarrierHandler.java:231) at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.handleEvent(CheckpointedInputGate.java:181) at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:159) at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:110) at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65) at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:519) at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:203) at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:804) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:753) at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:948) at
[jira] [Created] (FLINK-29888) Improve MutatedConfigurationException for disallowed changes in CheckpointConfig and ExecutionConfig
Piotr Nowojski created FLINK-29888: -- Summary: Improve MutatedConfigurationException for disallowed changes in CheckpointConfig and ExecutionConfig Key: FLINK-29888 URL: https://issues.apache.org/jira/browse/FLINK-29888 Project: Flink Issue Type: Improvement Components: API / DataStream Affects Versions: 1.17.0 Reporter: Piotr Nowojski Assignee: Piotr Nowojski Fix For: 1.17.0 Currently if {{CheckpointConfig}} or {{ExecutionConfig}} are modified in a non-allowed way, user gets a generic error "Configuration object ExecutionConfig changed", without a hint of what has been modified. With FLINK-29379 we can improve this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-29886) networkThroughput 1000,100ms,OpenSSL Benchmark is failing
Piotr Nowojski created FLINK-29886: -- Summary: networkThroughput 1000,100ms,OpenSSL Benchmark is failing Key: FLINK-29886 URL: https://issues.apache.org/jira/browse/FLINK-29886 Project: Flink Issue Type: Bug Components: Benchmarks, Runtime / Network Affects Versions: 1.17.0 Reporter: Piotr Nowojski http://codespeed.dak8s.net:8080/job/flink-master-benchmarks-java8/837/console {noformat} 04:09:40 java.lang.NoClassDefFoundError: org/apache/flink/shaded/netty4/io/netty/internal/tcnative/CertificateCompressionAlgo 04:09:40at org.apache.flink.shaded.netty4.io.netty.handler.ssl.OpenSslX509KeyManagerFactory$OpenSslKeyManagerFactorySpi.engineInit(OpenSslX509KeyManagerFactory.java:129) 04:09:40at javax.net.ssl.KeyManagerFactory.init(KeyManagerFactory.java:256) 04:09:40at org.apache.flink.runtime.net.SSLUtils.getKeyManagerFactory(SSLUtils.java:279) 04:09:40at org.apache.flink.runtime.net.SSLUtils.createInternalNettySSLContext(SSLUtils.java:324) 04:09:40at org.apache.flink.runtime.net.SSLUtils.createInternalNettySSLContext(SSLUtils.java:303) 04:09:40at org.apache.flink.runtime.net.SSLUtils.createInternalClientSSLEngineFactory(SSLUtils.java:119) 04:09:40at org.apache.flink.runtime.io.network.netty.NettyConfig.createClientSSLEngineFactory(NettyConfig.java:147) 04:09:40at org.apache.flink.runtime.io.network.netty.NettyClient.init(NettyClient.java:115) 04:09:40at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.start(NettyConnectionManager.java:87) 04:09:40at org.apache.flink.runtime.io.network.NettyShuffleEnvironment.start(NettyShuffleEnvironment.java:349) 04:09:40at org.apache.flink.streaming.runtime.io.benchmark.StreamNetworkBenchmarkEnvironment.setUp(StreamNetworkBenchmarkEnvironment.java:133) 04:09:40at org.apache.flink.streaming.runtime.io.benchmark.StreamNetworkThroughputBenchmark.setUp(StreamNetworkThroughputBenchmark.java:108) 04:09:40at org.apache.flink.benchmark.StreamNetworkThroughputBenchmarkExecutor$MultiEnvironment.setUp(StreamNetworkThroughputBenchmarkExecutor.java:117) {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-29807) Drop TypeSerializerConfigSnapshot and savepoint support from Flink versions < 1.8.0
Piotr Nowojski created FLINK-29807: -- Summary: Drop TypeSerializerConfigSnapshot and savepoint support from Flink versions < 1.8.0 Key: FLINK-29807 URL: https://issues.apache.org/jira/browse/FLINK-29807 Project: Flink Issue Type: Improvement Components: Runtime / Checkpointing Affects Versions: 1.17 Reporter: Piotr Nowojski Assignee: Piotr Nowojski Fix For: 1.17 The motivation behind this move is two fold. One reason is that it complicates our code base unnecessarily and creates confusion on how to actually implement custom serializers. The immediate reason is that I wanted to clean up Flink's configuration stack a bit and refactor the ExecutionConfig class [2]. This refactor would keep the API compatibility of the ExecutionConfig, but it would break savepoint compatibility with snapshots written with some of the old serializers, which had ExecutionConfig as a field and were serialized in the snapshot. This issue has been resolved by the introduction of TypeSerializerSnapshot in Flink 1.7 [3], where serializers are no longer part of the snapshot. TypeSerializerConfigSnapshot has been deprecated and no longer used by built-in serializers since Flink 1.8 [4] and [5]. Users were encouraged to migrate to TypeSerializerSnapshot since then with their own custom serializers. That has been plenty of time for the migration. This proposal would have the following impact for the users: 1. we would drop support for recovery from savepoints taken with Flink < 1.7.0 for all built in types serializers 2. we would drop support for recovery from savepoints taken with Flink < 1.8.0 for built in kryo serializers 3. we would drop support for recovery from savepoints taken with Flink < 1.17 for custom serializers using deprecated TypeSerializerConfigSnapshot 1. and 2. would have a simple migration path. Users migrating from those old savepoints would have to first start his job using a Flink version from the [1.8, 1.16] range, and take a new savepoint that would be compatible with Flink 1.17. 3. This is a bit more problematic, because users would have to first migrate their own custom serializers to use TypeSerializerSnapshot (using a Flink version from the [1.8, 1.16]), take a savepoint, and only then migrate to Flink 1.17. However users had already 4 years to migrate, which in my opinion has been plenty of time to do so. *As discussed and vote is currently in progress*: https://lists.apache.org/thread/x5d0p08pf2wx47njogsgqct0k5rpfrl4 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-29662) PojoSerializerSnapshot is using incorrect ExecutionConfig when restoring serializer
Piotr Nowojski created FLINK-29662: -- Summary: PojoSerializerSnapshot is using incorrect ExecutionConfig when restoring serializer Key: FLINK-29662 URL: https://issues.apache.org/jira/browse/FLINK-29662 Project: Flink Issue Type: Bug Components: API / Type Serialization System Affects Versions: 1.17.0 Reporter: Piotr Nowojski {{org.apache.flink.api.java.typeutils.runtime.PojoSerializerSnapshot#restoreSerializer}} is using freshly created {{new ExecutionConfig()}} execution config for the restored serializer, which overrides any configuration choices made by the user. Most likely this is a dormant bug, since restored serializer shouldn't be used for serializing any new data, and the {{ExecutionConfig}} is only used for subclasses serializations that haven't been cached. If this is indeed the case, I would propose to change it's value to {{null}} and safeguard accesses to that field with an {{IllegalStateException}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-29645) BatchExecutionKeyedStateBackend is using incorrect ExecutionConfig when creating serializer
Piotr Nowojski created FLINK-29645: -- Summary: BatchExecutionKeyedStateBackend is using incorrect ExecutionConfig when creating serializer Key: FLINK-29645 URL: https://issues.apache.org/jira/browse/FLINK-29645 Project: Flink Issue Type: Bug Components: Runtime / State Backends Affects Versions: 1.14.6, 1.15.2, 1.13.6, 1.12.7, 1.16.0, 1.17.0 Reporter: Piotr Nowojski {{org.apache.flink.streaming.api.operators.sorted.state.BatchExecutionKeyedStateBackend#getOrCreateKeyedState}} is using freshly constructed {{ExecutionConfig}}, instead of the one configured by the user from the environment. {code:java} public S getOrCreateKeyedState( TypeSerializer namespaceSerializer, StateDescriptor stateDescriptor) throws Exception { checkNotNull(namespaceSerializer, "Namespace serializer"); checkNotNull( keySerializer, "State key serializer has not been configured in the config. " + "This operation cannot use partitioned state."); if (!stateDescriptor.isSerializerInitialized()) { stateDescriptor.initializeSerializerUnlessSet(new ExecutionConfig()); } {code} The correct one could be obtained from {{env.getExecutionConfig()}} in {{org.apache.flink.streaming.api.operators.sorted.state.BatchExecutionStateBackend#createKeyedStateBackend}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-29584) Upgrade java 11 version on the microbenchmark worker
Piotr Nowojski created FLINK-29584: -- Summary: Upgrade java 11 version on the microbenchmark worker Key: FLINK-29584 URL: https://issues.apache.org/jira/browse/FLINK-29584 Project: Flink Issue Type: Improvement Components: Benchmarks Affects Versions: 1.17.0 Reporter: Piotr Nowojski Assignee: Piotr Nowojski Fix For: 1.17.0 It looks like the older JDK 11 builds have problems with JIT in the microbenchmarks, as for example visible [here|http://codespeed.dak8s.net:8000/timeline/?ben=globalWindow=2]. Locally I was able to reproduce this problem and the issue goes away after upgrading to a newer JDK 11 build. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-28835) Savepoint and checkpoint capabilities and limitations table is incorrect
Piotr Nowojski created FLINK-28835: -- Summary: Savepoint and checkpoint capabilities and limitations table is incorrect Key: FLINK-28835 URL: https://issues.apache.org/jira/browse/FLINK-28835 Project: Flink Issue Type: Bug Components: Documentation Affects Versions: 1.15.1, 1.16.0 Reporter: Piotr Nowojski Fix For: 1.16.0, 1.15.2 https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpoints_vs_savepoints/ is inconsistent with https://cwiki.apache.org/confluence/display/FLINK/FLIP-203%3A+Incremental+savepoints#FLIP203:Incrementalsavepoints-Proposal. "Non-arbitrary job upgrade" for unaligned checkpoints should be officially supported. It looks like a typo in the original PR FLINK-26134 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-27640) Flink not compiling, flink-connector-hive_2.12 is missing pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde
Piotr Nowojski created FLINK-27640: -- Summary: Flink not compiling, flink-connector-hive_2.12 is missing pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde Key: FLINK-27640 URL: https://issues.apache.org/jira/browse/FLINK-27640 Project: Flink Issue Type: Bug Components: Build System, Connectors / Hive Affects Versions: 1.16.0 Reporter: Piotr Nowojski When clean installing whole project after cleaning local {{.m2}} directory I encountered the following error when compiling flink-connector-hive_2.12: {noformat} [ERROR] Failed to execute goal on project flink-connector-hive_2.12: Could not resolve dependencies for project org.apache.flink:flink-connector-hive_2.12:jar:1.16-SNAPSHOT: Failed to collect dependencies at org.apache.hive:hive-exec:jar:2.3.9 -> org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde: Failed to read artifact descriptor for org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde: Could not transfer artifact org.pentaho:pentaho-aggdesigner-algorithm:pom:5.1.5-jhyde from/to maven-default-http-blocker (http://0.0.0.0/): Blocked mirror for repositories: [conjars (http://conjars.org/repo, default, releases+snapshots), apache.snapshots (http://repository.apache.org/snapshots, default, snapshots)] -> [Help 1] {noformat} I've solved this by adding {noformat} spring-repo-plugins https://repo.spring.io/ui/native/plugins-release/ {noformat} to ~/.m2/settings.xml file. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (FLINK-27209) Half the Xmx and double the forkCount for unit tests
Piotr Nowojski created FLINK-27209: -- Summary: Half the Xmx and double the forkCount for unit tests Key: FLINK-27209 URL: https://issues.apache.org/jira/browse/FLINK-27209 Project: Flink Issue Type: Improvement Components: Build System Affects Versions: 1.16.0 Reporter: Piotr Nowojski Assignee: Piotr Nowojski Fix For: 1.16.0 As per recent "Speeding up the test builds" discussion on the dev mailing. I'm proposing to half the memory allocated for running unit tests (as defined per {{**/*Test.*}} property) but at the same time double the forkCounts for those tests. The premise is that they shouldn't need as much memory as ITCases to be stable, while increasing number of forks, should provide us with a couple of minutes improved build times. CC [~chesnay] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-27202) NullPointerException on stop-with-savepoint with AsyncWaitOperator followed by FlinkKafkaProducer
Piotr Nowojski created FLINK-27202: -- Summary: NullPointerException on stop-with-savepoint with AsyncWaitOperator followed by FlinkKafkaProducer Key: FLINK-27202 URL: https://issues.apache.org/jira/browse/FLINK-27202 Project: Flink Issue Type: Bug Components: Connectors / Kafka, Runtime / Task Affects Versions: 1.13.6, 1.12.7 Reporter: Piotr Nowojski Some lingering mails from {{AsyncWaitOperator}} (or other operators using mailbox, or maybe even processing time timers), that are chained with {{FlinkKafkaProducer}} can cause the following exceptions when using stop-with-savepoint: {noformat} 2022-04-11 15:46:19,781 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - static enrichment -> Map -> Sink: enriched events sink (179/256) (3fefa588ad05fa8d2a10a6ad4a740cc6) switched from RUNNING to FAILED on 10.239.104.67:38149-12df6c @ 10.239.104.67 (dataPort=35745). java.lang.NullPointerException: null at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction$TransactionHolder.access$000(TwoPhaseCommitSinkFunction.java:591) ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2] at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.invoke(TwoPhaseCommitSinkFunction.java:223) ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2] at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:54) ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:71) ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:46) ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:26) ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2] at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:50) ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2] at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:28) ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2] at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:38) ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:71) ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:46) ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:26) ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2] at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:50) ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2] at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:28) ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2] at org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:50) ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2] at org.apache.flink.streaming.api.operators.async.queue.StreamRecordQueueEntry.emitResult(StreamRecordQueueEntry.java:64) ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2] at org.apache.flink.streaming.api.operators.async.queue.UnorderedStreamElementQueue$Segment.emitCompleted(UnorderedStreamElementQueue.java:272) ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2] at org.apache.flink.streaming.api.operators.async.queue.UnorderedStreamElementQueue.emitCompletedElement(UnorderedStreamElementQueue.java:159) ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2] at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.outputCompletedElement(AsyncWaitOperator.java:287) ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2] at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.access$100(AsyncWaitOperator.java:78) ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2] at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.processResults(AsyncWaitOperator.java:356) ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2] at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.lambda$processInMailbox$0(AsyncWaitOperator.java:337) ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2] at
[jira] [Created] (FLINK-27133) Performance regression in serializerHeavyString
Piotr Nowojski created FLINK-27133: -- Summary: Performance regression in serializerHeavyString Key: FLINK-27133 URL: https://issues.apache.org/jira/browse/FLINK-27133 Project: Flink Issue Type: Bug Components: API / Type Serialization System, Benchmarks Affects Versions: 1.15.0, 1.16.0 Reporter: Piotr Nowojski http://codespeed.dak8s.net:8000/timeline/#/?exe=1=serializerHeavyString=on=on=off=2=200 Suspected range: 5f21d15a09..caa296b813 I've run a benchmark request before FLINK-26957 and it suggests that it is indeed the cause for this regression: http://codespeed.dak8s.net:8080/job/flink-benchmark-request/77/artifact/jmh-result.csv/*view*/ CC [~mapohl] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-27117) FLIP-217: Support watermark alignment of source splits
Piotr Nowojski created FLINK-27117: -- Summary: FLIP-217: Support watermark alignment of source splits Key: FLINK-27117 URL: https://issues.apache.org/jira/browse/FLINK-27117 Project: Flink Issue Type: New Feature Components: API / DataStream, Runtime / Task Affects Versions: 1.14.4, 1.13.6, 1.15.0 Reporter: Piotr Nowojski https://cwiki.apache.org/confluence/display/FLINK/FLIP-217+Support+watermark+alignment+of+source+splits -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26864) Performance regression on 25.03.2022
Piotr Nowojski created FLINK-26864: -- Summary: Performance regression on 25.03.2022 Key: FLINK-26864 URL: https://issues.apache.org/jira/browse/FLINK-26864 Project: Flink Issue Type: Bug Components: Benchmarks Affects Versions: 1.16.0 Reporter: Piotr Nowojski http://codespeed.dak8s.net:8000/timeline/#/?exe=1=arrayKeyBy=on=on=off=2=200 http://codespeed.dak8s.net:8000/timeline/#/?exe=1=remoteFilePartition=on=on=off=2=200 http://codespeed.dak8s.net:8000/timeline/#/?exe=1=remoteSortPartition=on=on=off=2=200 http://codespeed.dak8s.net:8000/timeline/#/?exe=1=tupleKeyBy=on=on=off=2=200 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26682) Migrate regression check script to python and to main repository
Piotr Nowojski created FLINK-26682: -- Summary: Migrate regression check script to python and to main repository Key: FLINK-26682 URL: https://issues.apache.org/jira/browse/FLINK-26682 Project: Flink Issue Type: New Feature Components: Benchmarks Affects Versions: 1.15.0 Reporter: Piotr Nowojski Script provided by Roman https://github.com/rkhachatryan/flink-benchmarks/blob/regression-alerts/regression-alert.sh should be merge to the main repo and migrated to python. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26147) Deleting incremental savepoints directory in CLAIM mode
Piotr Nowojski created FLINK-26147: -- Summary: Deleting incremental savepoints directory in CLAIM mode Key: FLINK-26147 URL: https://issues.apache.org/jira/browse/FLINK-26147 Project: Flink Issue Type: Sub-task Components: Runtime / Checkpointing Affects Versions: 1.15.0 Reporter: Piotr Nowojski -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26146) Add test coverage for native format flink upgrades (minor versions)
Piotr Nowojski created FLINK-26146: -- Summary: Add test coverage for native format flink upgrades (minor versions) Key: FLINK-26146 URL: https://issues.apache.org/jira/browse/FLINK-26146 Project: Flink Issue Type: Sub-task Components: Runtime / Checkpointing, Tests Reporter: Piotr Nowojski Check test coverage for: Flink minor (1.x → 1.y) version upgrade Parametrised SavepointMigrationTestBase to work on: canonical savepoints native savepoints aligned (retained) checkpoints Assignee should contact release managers and decided whether to already create 1.15 (snapshot) artefacts for native savepoints and aligned checkpoints. If we do so, we might make release manager work more complicated. However if we don’t, the test will be a dead code until release manager creates those artefacts, which can also make the release manager work more difficult if test explodes during RC creation. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26064) KinesisFirehoseSinkITCase IllegalStateException: Trying to access closed classloader
Piotr Nowojski created FLINK-26064: -- Summary: KinesisFirehoseSinkITCase IllegalStateException: Trying to access closed classloader Key: FLINK-26064 URL: https://issues.apache.org/jira/browse/FLINK-26064 Project: Flink Issue Type: Bug Components: Connectors / Kinesis Affects Versions: 1.15.0 Reporter: Piotr Nowojski https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=31044=logs=d44f43ce-542c-597d-bf94-b0718c71e5e8=ed165f3f-d0f6-524b-5279-86f8ee7d0e2d (shortened stack trace, as full is too large) {noformat} Feb 09 20:05:04 java.util.concurrent.ExecutionException: software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Trying to access closed classloader. Please check if you store classloaders directly or indirectly in static fields. If the stacktrace suggests that the leak occurs in a third party library and cannot be fixed immediately, you can disable this check with the configuration 'classloader.check-leaked-classloader'. Feb 09 20:05:04 at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) Feb 09 20:05:04 at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) (...) Feb 09 20:05:04 Caused by: software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Trying to access closed classloader. Please check if you store classloaders directly or indirectly in static fields. If the stacktrace suggests that the leak occurs in a third party library and cannot be fixed immediately, you can disable this check with the configuration 'classloader.check-leaked-classloader'. Feb 09 20:05:04 at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:98) Feb 09 20:05:04 at software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:43) Feb 09 20:05:04 at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:204) Feb 09 20:05:04 at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:200) Feb 09 20:05:04 at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.maybeRetryExecute(AsyncRetryableStage.java:179) Feb 09 20:05:04 at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.lambda$attemptExecute$1(AsyncRetryableStage.java:159) (...) Feb 09 20:05:04 Caused by: java.lang.IllegalStateException: Trying to access closed classloader. Please check if you store classloaders directly or indirectly in static fields. If the stacktrace suggests that the leak occurs in a third party library and cannot be fixed immediately, you can disable this check with the configuration 'classloader.check-leaked-classloader'. Feb 09 20:05:04 at org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$SafetyNetWrapperClassLoader.ensureInner(FlinkUserCodeClassLoaders.java:164) Feb 09 20:05:04 at org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$SafetyNetWrapperClassLoader.getResources(FlinkUserCodeClassLoaders.java:188) Feb 09 20:05:04 at java.util.ServiceLoader$LazyIterator.hasNextService(ServiceLoader.java:348) Feb 09 20:05:04 at java.util.ServiceLoader$LazyIterator.hasNext(ServiceLoader.java:393) Feb 09 20:05:04 at java.util.ServiceLoader$1.hasNext(ServiceLoader.java:474) Feb 09 20:05:04 at javax.xml.stream.FactoryFinder$1.run(FactoryFinder.java:352) Feb 09 20:05:04 at java.security.AccessController.doPrivileged(Native Method) Feb 09 20:05:04 at javax.xml.stream.FactoryFinder.findServiceProvider(FactoryFinder.java:341) Feb 09 20:05:04 at javax.xml.stream.FactoryFinder.find(FactoryFinder.java:313) Feb 09 20:05:04 at javax.xml.stream.FactoryFinder.find(FactoryFinder.java:227) Feb 09 20:05:04 at javax.xml.stream.XMLInputFactory.newInstance(XMLInputFactory.java:154) Feb 09 20:05:04 at software.amazon.awssdk.protocols.query.unmarshall.XmlDomParser.createXmlInputFactory(XmlDomParser.java:124) Feb 09 20:05:04 at java.lang.ThreadLocal$SuppliedThreadLocal.initialValue(ThreadLocal.java:284) Feb 09 20:05:04 at java.lang.ThreadLocal.setInitialValue(ThreadLocal.java:180) Feb 09 20:05:04 at java.lang.ThreadLocal.get(ThreadLocal.java:170) (...) {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25982) Support idleness with watermark alignment
Piotr Nowojski created FLINK-25982: -- Summary: Support idleness with watermark alignment Key: FLINK-25982 URL: https://issues.apache.org/jira/browse/FLINK-25982 Project: Flink Issue Type: Sub-task Reporter: Piotr Nowojski -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25914) Performance regression in checkpointSingleInput.UNALIGNED on 31.01.2022
Piotr Nowojski created FLINK-25914: -- Summary: Performance regression in checkpointSingleInput.UNALIGNED on 31.01.2022 Key: FLINK-25914 URL: https://issues.apache.org/jira/browse/FLINK-25914 Project: Flink Issue Type: Bug Components: Benchmarks, Runtime / Checkpointing Affects Versions: 1.15.0 Reporter: Piotr Nowojski Assignee: Piotr Nowojski http://codespeed.dak8s.net:8000/timeline/?ben=checkpointSingleInput.UNALIGNED=2 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25912) HashMapStateBackend performance regression on 31.01.2022
Piotr Nowojski created FLINK-25912: -- Summary: HashMapStateBackend performance regression on 31.01.2022 Key: FLINK-25912 URL: https://issues.apache.org/jira/browse/FLINK-25912 Project: Flink Issue Type: Bug Components: Benchmarks, Runtime / State Backends Reporter: Piotr Nowojski Fix For: 1.15.0 http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3=mapEntries.HEAP=2=200=off=on=on http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3=mapIterator.HEAP=2=200=off=on=on http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3=mapKeys.HEAP=2=200=off=on=on http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3=mapValues.HEAP=2=200=off=on=on -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25896) Buffer debloating issues with high parallelism
Piotr Nowojski created FLINK-25896: -- Summary: Buffer debloating issues with high parallelism Key: FLINK-25896 URL: https://issues.apache.org/jira/browse/FLINK-25896 Project: Flink Issue Type: Sub-task Components: Runtime / Network Affects Versions: 1.14.3, 1.15.0 Reporter: Piotr Nowojski -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25869) Create a stress test cron build
Piotr Nowojski created FLINK-25869: -- Summary: Create a stress test cron build Key: FLINK-25869 URL: https://issues.apache.org/jira/browse/FLINK-25869 Project: Flink Issue Type: Improvement Components: Test Infrastructure, Tests Affects Versions: 1.15.0 Reporter: Piotr Nowojski Fix For: 1.15.0 We propose creating some kind of stress test cron job dedicated for periodically running longer lasting stress tests. For example trying to expose memory leaks. First idea to test could be to provide test coverage for FLINK-25728. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25745) Support RocksDB incremental native savepoints
Piotr Nowojski created FLINK-25745: -- Summary: Support RocksDB incremental native savepoints Key: FLINK-25745 URL: https://issues.apache.org/jira/browse/FLINK-25745 Project: Flink Issue Type: Sub-task Components: Runtime / State Backends Reporter: Piotr Nowojski Fix For: 1.15.0 Respect CheckpointType.SharingFilesStrategy#NO_SHARING flag in RocksIncrementalSnapshotStrategy. We also need to make sure that RocksDBIncrementalSnapshotStrategy is creating self contained/relocatable snapshots (using CheckpointedStateScope#EXCLUSIVE for native savepoints) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25744) Support native savepoints (w/o modifying the statebackend specific snapshot strategies)
Piotr Nowojski created FLINK-25744: -- Summary: Support native savepoints (w/o modifying the statebackend specific snapshot strategies) Key: FLINK-25744 URL: https://issues.apache.org/jira/browse/FLINK-25744 Project: Flink Issue Type: Sub-task Components: Runtime / Checkpointing Reporter: Piotr Nowojski Assignee: Dawid Wysakowicz For example w/o incremental RocksDB support. But HashMap and Full RocksDB should be working out of the box w/o extra changes. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25704) Performance regression on 18.01.2022 in batch network benchmarks
Piotr Nowojski created FLINK-25704: -- Summary: Performance regression on 18.01.2022 in batch network benchmarks Key: FLINK-25704 URL: https://issues.apache.org/jira/browse/FLINK-25704 Project: Flink Issue Type: Improvement Components: Runtime / Network Affects Versions: 1.15.0 Reporter: Piotr Nowojski http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3=compressedFilePartition=2=200=off=on=on http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3=uncompressedFilePartition=2=200=off=on=on http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3=uncompressedMmapPartition=2=200=off=on=on Suspected range: {code} git ls eeec246677..f5c99c6f26 f5c99c6f26 [5 weeks ago] [FLINK-17321][table] Add support casting of map to map and multiset to multiset [Sergey Nuyanzin] 745cfec705 [24 hours ago] [hotfix][table-common] Fix InternalDataUtils for MapData tests [Timo Walther] ed699b6ee6 [6 days ago] [FLINK-25637][network] Make sort-shuffle the default shuffle implementation for batch jobs [kevin.cyj] 4275525fed [6 days ago] [FLINK-25638][network] Increase the default write buffer size of sort-shuffle to 16M [kevin.cyj] e1878fb899 [6 days ago] [FLINK-25639][network] Increase the default read buffer size of sort-shuffle to 64M [kevin.cyj] {code} It looks [~kevin.cyj], that most likely your change has caused that? -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25688) Resolve performance degradation with high parallelism when using buffer debloating
Piotr Nowojski created FLINK-25688: -- Summary: Resolve performance degradation with high parallelism when using buffer debloating Key: FLINK-25688 URL: https://issues.apache.org/jira/browse/FLINK-25688 Project: Flink Issue Type: Improvement Components: Runtime / Network Affects Versions: 1.14.3, 1.15.0 Reporter: Piotr Nowojski As documented in FLINK-25646, currently buffer debloating in Flink, at least in the default configuration, has quite noticeable performance degradation at larger scale. For example throughput can drop by a factor of 4, or even checkpointing times can be increased. Currently it's not clear why is this happening. It looks like increasing the number of buffers per channel from the default ~2 to above 3 (for example via bumping number of floating buffers to value equal or higher then parallelism), seems to be solving this problem, at least on one cluster where buffer debloating has been tested at large scale. Further investigation is required. CC [~akalashnikov] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25414) Provide metrics to measure how long task has been blocked
Piotr Nowojski created FLINK-25414: -- Summary: Provide metrics to measure how long task has been blocked Key: FLINK-25414 URL: https://issues.apache.org/jira/browse/FLINK-25414 Project: Flink Issue Type: New Feature Components: Runtime / Metrics, Runtime / Task Affects Versions: 1.14.2 Reporter: Piotr Nowojski Currently back pressured/busy metrics tell the user whether task is blocked/busy and how much % of the time it is blocked/busy. But they do not tell how for how long single block event is lasting. It can be 1ms or 1h and back pressure/busy would be still reporting 100%. In order to improve this, we could provide two new metrics: # maxSoftBackPressureDuration # maxHardBackPressureDuration The max would be reset to 0 periodically or on every access to the metric (via metric reporter). Soft back pressure would be if task is back pressured in a non blocking fashion (StreamTask detected in availability of the output). Hard back pressure would measure the time task is actually blocked. Unfortunately I don't know how to efficiently provide similar metric for busy time, without impacting max throughput. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25407) Network stack deadlock when cancellation happens during initialisation
Piotr Nowojski created FLINK-25407: -- Summary: Network stack deadlock when cancellation happens during initialisation Key: FLINK-25407 URL: https://issues.apache.org/jira/browse/FLINK-25407 Project: Flink Issue Type: Bug Components: Runtime / Network Affects Versions: 1.14.0, 1.15.0 Reporter: Piotr Nowojski {noformat} Java stack information for the threads listed above: === "Canceler for Source: Custom Source -> Filter (7/12)#14176 (0fbb8a89616ca7a40e473adad51f236f).": at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.destroyBufferPool(NetworkBufferPool.java:420) - waiting to lock <0x82937f28> (a java.lang.Object) at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.lazyDestroy(LocalBufferPool.java:567) at org.apache.flink.runtime.io.network.partition.ResultPartition.closeBufferPool(ResultPartition.java:264) at org.apache.flink.runtime.io.network.partition.ResultPartition.fail(ResultPartition.java:276) at org.apache.flink.runtime.taskmanager.Task.failAllResultPartitions(Task.java:999) at org.apache.flink.runtime.taskmanager.Task.access$100(Task.java:138) at org.apache.flink.runtime.taskmanager.Task$TaskCanceler.run(Task.java:1669) at java.lang.Thread.run(Thread.java:748) "Canceler for Map -> Map (6/12)#14176 (6195862d199aa4d52c12f25b39904725).": at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.setNumBuffers(LocalBufferPool.java:585) - waiting to lock <0x97108898> (a java.util.ArrayDeque) at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.redistributeBuffers(NetworkBufferPool.java:544) at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.destroyBufferPool(NetworkBufferPool.java:424) - locked <0x82937f28> (a java.lang.Object) at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.lazyDestroy(LocalBufferPool.java:567) at org.apache.flink.runtime.io.network.partition.ResultPartition.closeBufferPool(ResultPartition.java:264) at org.apache.flink.runtime.io.network.partition.ResultPartition.fail(ResultPartition.java:276) at org.apache.flink.runtime.taskmanager.Task.failAllResultPartitions(Task.java:999) at org.apache.flink.runtime.taskmanager.Task.access$100(Task.java:138) at org.apache.flink.runtime.taskmanager.Task$TaskCanceler.run(Task.java:1669) at java.lang.Thread.run(Thread.java:748) "Map -> Sink: Unnamed (7/12)#14176": at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.recycleMemorySegments(NetworkBufferPool.java:256) - waiting to lock <0x82937f28> (a java.lang.Object) at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.internalRequestMemorySegments(NetworkBufferPool.ja at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestMemorySegmentsBlocking(NetworkBufferPool.ja at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.reserveSegments(LocalBufferPool.java:247) - locked <0x97108898> (a java.util.ArrayDeque) at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setupChannels(SingleInputGate.java:497) at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:276) at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:105) at org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:965) at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:652) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563) at java.lang.Thread.run(Thread.java:748) Found 1 deadlock. {noformat} https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28297=logs=a57e0635-3fad-5b08-57c7-a4142d7d6fa9=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7=19003 https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28306=logs=5c8e7682-d68f-54d1-16a2-a09310218a49=86f654fa-ab48-5c1a-25f4-7e7f6afb9bba=19832 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25382) Failure in "Upload Logs" task
Piotr Nowojski created FLINK-25382: -- Summary: Failure in "Upload Logs" task Key: FLINK-25382 URL: https://issues.apache.org/jira/browse/FLINK-25382 Project: Flink Issue Type: Bug Components: Test Infrastructure Affects Versions: 1.15.0 Reporter: Piotr Nowojski I don't see any error message, but it seems like uploading the logs has failed: https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=27568=logs=2c3cbe13-dee0-5837-cf47-3053da9a8a78=bb16d35c-fdfe-5139-f244-9492cbd2050b for the following build: https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=27568=logs=2c3cbe13-dee0-5837-cf47-3053da9a8a78=2c7d57b9-7341-5a87-c9af-2cf7cc1a37dc -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25276) Support native and incremental savepoints
Piotr Nowojski created FLINK-25276: -- Summary: Support native and incremental savepoints Key: FLINK-25276 URL: https://issues.apache.org/jira/browse/FLINK-25276 Project: Flink Issue Type: New Feature Reporter: Piotr Nowojski Motivation. Currently with non incremental canonical format savepoints, with very large state, both taking and recovery from savepoints can take very long time. Providing options to take native format and incremental savepoint would alleviate this problem. In the past the main challenge lied in the ownership semantic and files clean up of such incremental savepoints. However with FLINK-25154 implemented some of those concerns can be solved. Incremental savepoint could leverage "force full snapshot" mode provided by FLINK-25192, to duplicate/copy all of the savepoint files out of the Flink's ownership scope. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25275) Weighted KeyGroup assignment
Piotr Nowojski created FLINK-25275: -- Summary: Weighted KeyGroup assignment Key: FLINK-25275 URL: https://issues.apache.org/jira/browse/FLINK-25275 Project: Flink Issue Type: New Feature Components: Runtime / Network Affects Versions: 1.14.0 Reporter: Piotr Nowojski Currently key groups are split into key group ranges naively in the simplest way. Key groups are split into equally sized continuous ranges (number of ranges = parallelism = number of keygroups / size of single keygroup). Flink could avoid data skew between key groups, by assigning them to tasks based on their "weight". "Weight" could be defined as frequency of an access for the given key group. Arbitrary, non-continuous, key group assignment (for example TM1 is processing kg1 and kg3 while TM2 is processing only kg2) would require extensive changes to the state backends for example. However the data skew could be mitigated to some extent by creating key group ranges in a more clever way, while keeping the key group range continuity. For example TM1 processes range [kg1, kg9], while TM2 just [kg10, kg11]. [This branch shows a PoC of such approach.|https://github.com/pnowojski/flink/commits/antiskew] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25255) Consider/design implementing State Processor API (FC)
Piotr Nowojski created FLINK-25255: -- Summary: Consider/design implementing State Processor API (FC) Key: FLINK-25255 URL: https://issues.apache.org/jira/browse/FLINK-25255 Project: Flink Issue Type: Sub-task Reporter: Piotr Nowojski -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-24919) UnalignedCheckpointITCase hangs on Azure
Piotr Nowojski created FLINK-24919: -- Summary: UnalignedCheckpointITCase hangs on Azure Key: FLINK-24919 URL: https://issues.apache.org/jira/browse/FLINK-24919 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing Affects Versions: 1.15.0 Reporter: Piotr Nowojski Extracted from FLINK-23466 https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=26304=logs=a57e0635-3fad-5b08-57c7-a4142d7d6fa9=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7=13067 Nov 10 16:13:03 Starting org.apache.flink.test.checkpointing.UnalignedCheckpointITCase#execute[pipeline with mixed channels, p = 20, timeout = 0, buffersPerChannel = 1]. >From the log, we can see this case hangs. I guess this seems a new issue which >is different from the one reported in this ticket. From the stack, it seems >there is something wrong with the checkpoint coordinator, the following thread >locked 0x87db4fb8: {code:java} 2021-11-10T17:14:21.0899474Z Nov 10 17:14:21 "jobmanager-io-thread-2" #12984 daemon prio=5 os_prio=0 tid=0x7f12e000b800 nid=0x3fb6 runnable [0x7f0fcd6d4000] 2021-11-10T17:14:21.0899924Z Nov 10 17:14:21java.lang.Thread.State: RUNNABLE 2021-11-10T17:14:21.0900300Z Nov 10 17:14:21at java.util.HashMap$TreeNode.balanceDeletion(HashMap.java:2338) 2021-11-10T17:14:21.0900745Z Nov 10 17:14:21at java.util.HashMap$TreeNode.removeTreeNode(HashMap.java:2112) 2021-11-10T17:14:21.0901146Z Nov 10 17:14:21at java.util.HashMap.removeNode(HashMap.java:840) 2021-11-10T17:14:21.0901577Z Nov 10 17:14:21at java.util.LinkedHashMap.afterNodeInsertion(LinkedHashMap.java:301) 2021-11-10T17:14:21.0902002Z Nov 10 17:14:21at java.util.HashMap.putVal(HashMap.java:664) 2021-11-10T17:14:21.0902531Z Nov 10 17:14:21at java.util.HashMap.putMapEntries(HashMap.java:515) 2021-11-10T17:14:21.0902931Z Nov 10 17:14:21at java.util.HashMap.putAll(HashMap.java:785) 2021-11-10T17:14:21.0903429Z Nov 10 17:14:21at org.apache.flink.runtime.checkpoint.ExecutionAttemptMappingProvider.getVertex(ExecutionAttemptMappingProvider.java:60) 2021-11-10T17:14:21.0904060Z Nov 10 17:14:21at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.reportStats(CheckpointCoordinator.java:1867) 2021-11-10T17:14:21.0904686Z Nov 10 17:14:21at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1152) 2021-11-10T17:14:21.0905372Z Nov 10 17:14:21- locked <0x87db4fb8> (a java.lang.Object) 2021-11-10T17:14:21.0905895Z Nov 10 17:14:21at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89) 2021-11-10T17:14:21.0906493Z Nov 10 17:14:21at org.apache.flink.runtime.scheduler.ExecutionGraphHandler$$Lambda$1368/705813936.accept(Unknown Source) 2021-11-10T17:14:21.0907086Z Nov 10 17:14:21at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119) 2021-11-10T17:14:21.0907698Z Nov 10 17:14:21at org.apache.flink.runtime.scheduler.ExecutionGraphHandler$$Lambda$1369/1447418658.run(Unknown Source) 2021-11-10T17:14:21.0908210Z Nov 10 17:14:21at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 2021-11-10T17:14:21.0908735Z Nov 10 17:14:21at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 2021-11-10T17:14:21.0909333Z Nov 10 17:14:21at java.lang.Thread.run(Thread.java:748) {code} But other thread is waiting for the lock. I am not familiar with these logics and not sure if this is in the right state. Could anyone who is familiar with these logics take a look? BTW, concurrent access of HashMap may cause infinite loop,I see in the stack that there are multiple threads are accessing HashMap, though I am not sure if they are the same instance. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-24846) AsyncWaitOperator fails during stop-with-savepoint
Piotr Nowojski created FLINK-24846: -- Summary: AsyncWaitOperator fails during stop-with-savepoint Key: FLINK-24846 URL: https://issues.apache.org/jira/browse/FLINK-24846 Project: Flink Issue Type: Bug Components: Runtime / Task Reporter: Piotr Nowojski Attachments: log-jm.txt {noformat} Caused by: org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailbox$MailboxClosedException: Mailbox is in state QUIESCED, but is required to be in state OPEN for put operations. at org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.checkPutStateConditions(TaskMailboxImpl.java:269) ~[flink-dist_2.11-1.14.0.jar:1.14.0] at org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.put(TaskMailboxImpl.java:197) ~[flink-dist_2.11-1.14.0.jar:1.14.0] at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.execute(MailboxExecutorImpl.java:74) ~[flink-dist_2.11-1.14.0.jar:1.14.0] at org.apache.flink.api.common.operators.MailboxExecutor.execute(MailboxExecutor.java:103) ~[flink-dist_2.11-1.14.0.jar:1.14.0] at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.outputCompletedElement(AsyncWaitOperator.java:304) ~[flink-dist_2.11-1.14.0.jar:1.14.0] at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.access$100(AsyncWaitOperator.java:78) ~[flink-dist_2.11-1.14.0.jar:1.14.0] at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.processResults(AsyncWaitOperator.java:370) ~[flink-dist_2.11-1.14.0.jar:1.14.0] at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.lambda$processInMailbox$0(AsyncWaitOperator.java:351) ~[flink-dist_2.11-1.14.0.jar:1.14.0] at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50) ~[flink-dist_2.11-1.14.0.jar:1.14.0] at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) ~[flink-dist_2.11-1.14.0.jar:1.14.0] at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.drain(MailboxProcessor.java:177) ~[flink-dist_2.11-1.14.0.jar:1.14.0] at org.apache.flink.streaming.runtime.tasks.StreamTask.afterInvoke(StreamTask.java:854) ~[flink-dist_2.11-1.14.0.jar:1.14.0] at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:767) ~[flink-dist_2.11-1.14.0.jar:1.14.0] at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958) ~[flink-dist_2.11-1.14.0.jar:1.14.0] at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937) ~[flink-dist_2.11-1.14.0.jar:1.14.0] at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766) ~[flink-dist_2.11-1.14.0.jar:1.14.0] at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575) ~[flink-dist_2.11-1.14.0.jar:1.14.0] at java.lang.Thread.run(Thread.java:829) ~[?:?] {noformat} As reported by a user on [the mailing list:|https://mail-archives.apache.org/mod_mbox/flink-user/202111.mbox/%3CCAO6dnLwtLNxkr9qXG202ysrnse18Wgvph4hqHZe3ar8cuXAfDw%40mail.gmail.com%3E] {quote} I failed to stop a job with savepoint with the following message: Inconsistent execution state after stopping with savepoint. At least one execution is still in one of the following states: FAILED, CANCELED. A global fail-over is triggered to recover the job 452594f3ec5797f399e07f95c884a44b. The job manager said A savepoint was created at hdfs://mobdata-flink-hdfs/driving-habits/svpts/savepoint-452594-f60305755d0e but the corresponding job 452594f3ec5797f399e07f95c884a44b didn't terminate successfully. while complaining about Mailbox is in state QUIESCED, but is required to be in state OPEN for put operations. Is it okay to ignore this kind of error? Please see the attached files for the detailed context. FYI, - I used the latest 1.14.0 - I started the job with "$FLINK_HOME"/bin/flink run --target yarn-per-job - I couldn't reproduce the exception using the same jar so I might not able to provide DUBUG messages {quote} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-24696) Translate how to configure unaligned checkpoints into Chinese
Piotr Nowojski created FLINK-24696: -- Summary: Translate how to configure unaligned checkpoints into Chinese Key: FLINK-24696 URL: https://issues.apache.org/jira/browse/FLINK-24696 Project: Flink Issue Type: Improvement Components: chinese-translation, Documentation Affects Versions: 1.15.0, 1.14.1 Reporter: Piotr Nowojski Fix For: 1.15.0 Page {{content.zh/docs/ops/state/checkpointing_under_backpressure.md}} needs to be translated. Reference english version ticket: FLINK-24670 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-24695) Update how to configure unaligned checkpoints in the documentation
Piotr Nowojski created FLINK-24695: -- Summary: Update how to configure unaligned checkpoints in the documentation Key: FLINK-24695 URL: https://issues.apache.org/jira/browse/FLINK-24695 Project: Flink Issue Type: Improvement Components: Documentation Affects Versions: 1.15.0, 1.14.1 Reporter: Piotr Nowojski Assignee: Piotr Nowojski It looks like we don't have a code example how to enabled unaligned checkpoints anywhere in the docs. That should be fixed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-24694) Translate "Checkpointing under backpressure" page into Chinese
Piotr Nowojski created FLINK-24694: -- Summary: Translate "Checkpointing under backpressure" page into Chinese Key: FLINK-24694 URL: https://issues.apache.org/jira/browse/FLINK-24694 Project: Flink Issue Type: Improvement Components: chinese-translation, Documentation Affects Versions: 1.15.0, 1.14.1 Reporter: Piotr Nowojski Page {{content.zh/docs/ops/state/checkpoints.md}} is very much out of date and out of sync of it's english counterpart. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-24693) Update Chinese version of "Checkpoints" page
Piotr Nowojski created FLINK-24693: -- Summary: Update Chinese version of "Checkpoints" page Key: FLINK-24693 URL: https://issues.apache.org/jira/browse/FLINK-24693 Project: Flink Issue Type: Improvement Components: chinese-translation, Documentation Affects Versions: 1.15.0, 1.14.1 Reporter: Piotr Nowojski Page {{content.zh/docs/ops/state/checkpoints.md}} is very much out of date and out of sync of it's english counterpart. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-24670) Restructure unaligned checkpoints documentation page to "Checkpointing under back pressure"
Piotr Nowojski created FLINK-24670: -- Summary: Restructure unaligned checkpoints documentation page to "Checkpointing under back pressure" Key: FLINK-24670 URL: https://issues.apache.org/jira/browse/FLINK-24670 Project: Flink Issue Type: Improvement Components: Documentation Affects Versions: 1.14.1 Reporter: Piotr Nowojski Assignee: Piotr Nowojski Fix For: 1.15.0, 1.14.1 https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/state/unaligned_checkpoints/ * Should be renamed and restructured to "Checkpointing under back pressure" * Should have a short section that cross references network mem tuning guide * It looks like we don't have a code example how to enabled unaligned checkpoints anywhere in the docs? That should be fixed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-24531) CheckpointingTimeBenchmark can fail with connection timed out
Piotr Nowojski created FLINK-24531: -- Summary: CheckpointingTimeBenchmark can fail with connection timed out Key: FLINK-24531 URL: https://issues.apache.org/jira/browse/FLINK-24531 Project: Flink Issue Type: Bug Components: Benchmarks Affects Versions: 1.15.0 Reporter: Piotr Nowojski Assignee: Piotr Nowojski There is a race condition between obtaining REST API port and actually starting up the REST API server. If REST API server start is delayed, the {{CheckpointingTimeBenchmark}} can fail with {noformat} java.util.concurrent.ExecutionException: org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: connection timed out: localhost/127.0.0.1:60503 at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) at org.apache.flink.benchmark.CheckpointingTimeBenchmark$CheckpointEnvironmentContext.waitForBackpressure(CheckpointingTimeBenchmark.java:251) at org.apache.flink.benchmark.CheckpointingTimeBenchmark$CheckpointEnvironmentContext.setUp(CheckpointingTimeBenchmark.java:216) at org.apache.flink.benchmark.generated.CheckpointingTimeBenchmark_checkpointSingleInput_jmhTest._jmh_tryInit_f_checkpointenvironmentcont {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-24483) Document what is Public API and what compatibility guarantees Flink is providing
Piotr Nowojski created FLINK-24483: -- Summary: Document what is Public API and what compatibility guarantees Flink is providing Key: FLINK-24483 URL: https://issues.apache.org/jira/browse/FLINK-24483 Project: Flink Issue Type: Improvement Components: API / DataStream, Documentation, Table SQL / API Affects Versions: 1.13.2, 1.12.5, 1.14.0 Reporter: Piotr Nowojski Fix For: 1.15.0 We should document: * What constitute of the Public API, what do Public/PublicEvolving/Experimental/Internal annotations mean. * What compatibility guarantees we are providing (forward functional, compile and binary compatibility for {{@Public}} interfaces) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-24475) Remove no longer used NestedMap* classes
Piotr Nowojski created FLINK-24475: -- Summary: Remove no longer used NestedMap* classes Key: FLINK-24475 URL: https://issues.apache.org/jira/browse/FLINK-24475 Project: Flink Issue Type: Bug Components: Runtime / State Backends Affects Versions: 1.13.2, 1.14.0, 1.15.0 Reporter: Piotr Nowojski Fix For: 1.15.0 After FLINK-21935 all of the {{NestedMapsStateTable}} classes are no longer used in the production code. They are still however being used in benchmarks in some tests. Benchmarks/tests should be migrated to {{CopyOnWrite}} versions while the {{NestedMaps}} classes should be removed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-24472) DispatcherTest is unstable
Piotr Nowojski created FLINK-24472: -- Summary: DispatcherTest is unstable Key: FLINK-24472 URL: https://issues.apache.org/jira/browse/FLINK-24472 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.15.0 Reporter: Piotr Nowojski Fix For: 1.15.0 testCancellationOfNonCanceledTerminalJobFailsWithAppropriateException from DispatcherTest can fail with: {noformat} Oct 07 10:31:18 at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) Oct 07 10:31:18 at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) Oct 07 10:31:18 at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) Oct 07 10:31:18 at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) Oct 07 10:31:18 at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) Oct 07 10:31:18 at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) Oct 07 10:31:18 at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) Oct 07 10:31:18 at org.junit.runners.ParentRunner.run(ParentRunner.java:413) Oct 07 10:31:18 at org.junit.runner.JUnitCore.run(JUnitCore.java:137) Oct 07 10:31:18 at org.junit.runner.JUnitCore.run(JUnitCore.java:115) Oct 07 10:31:18 at org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:43) Oct 07 10:31:18 at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) Oct 07 10:31:18 at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) Oct 07 10:31:18 at java.util.Iterator.forEachRemaining(Iterator.java:116) Oct 07 10:31:18 at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) Oct 07 10:31:18 at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) Oct 07 10:31:18 at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) Oct 07 10:31:18 at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) Oct 07 10:31:18 at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) Oct 07 10:31:18 at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) Oct 07 10:31:18 at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485) Oct 07 10:31:18 at org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:82) Oct 07 10:31:18 at org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:73) Oct 07 10:31:18 at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:220) Oct 07 10:31:18 at org.junit.platform.launcher.core.DefaultLauncher.lambda$execute$6(DefaultLauncher.java:188) Oct 07 10:31:18 at org.junit.platform.launcher.core.DefaultLauncher.withInterceptedStreams(DefaultLauncher.java:202) Oct 07 10:31:18 at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:181) Oct 07 10:31:18 at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:128) Oct 07 10:31:18 at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invokeAllTests(JUnitPlatformProvider.java:150) Oct 07 10:31:18 at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:120) Oct 07 10:31:18 at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) Oct 07 10:31:18 at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) Oct 07 10:31:18 at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) Oct 07 10:31:18 at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-24471) Add microbenchmark for throughput with debloating enabled
Piotr Nowojski created FLINK-24471: -- Summary: Add microbenchmark for throughput with debloating enabled Key: FLINK-24471 URL: https://issues.apache.org/jira/browse/FLINK-24471 Project: Flink Issue Type: Sub-task Components: Benchmarks, Runtime / Network Reporter: Piotr Nowojski Assignee: Piotr Nowojski Fix For: 1.15.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-24439) Introduce a CoordinatorStore
Piotr Nowojski created FLINK-24439: -- Summary: Introduce a CoordinatorStore Key: FLINK-24439 URL: https://issues.apache.org/jira/browse/FLINK-24439 Project: Flink Issue Type: Sub-task Components: Connectors / Common Reporter: Piotr Nowojski Assignee: Piotr Nowojski Fix For: 1.15.0 In order to allow {{SourceCoordinator}}s from different {{Source}}s (for example two different Kafka sources, or Kafka and Kinesis) to align watermarks, they have to be able to exchange information/aggregate watermarks from those different Sources. To enable this, we need to provide some {{CoordinatorStore}} concept, that would be a thread safe singleton. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-24441) Block SourceReader when watermarks are out of alignment
Piotr Nowojski created FLINK-24441: -- Summary: Block SourceReader when watermarks are out of alignment Key: FLINK-24441 URL: https://issues.apache.org/jira/browse/FLINK-24441 Project: Flink Issue Type: Sub-task Components: Connectors / Common Reporter: Piotr Nowojski Assignee: Piotr Nowojski Fix For: 1.15.0 SourceReader should become unavailable once it's latest watermark is too far into the future -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-24440) Announce and combine latest watermarks across SourceReaders
Piotr Nowojski created FLINK-24440: -- Summary: Announce and combine latest watermarks across SourceReaders Key: FLINK-24440 URL: https://issues.apache.org/jira/browse/FLINK-24440 Project: Flink Issue Type: Sub-task Components: Connectors / Common Reporter: Piotr Nowojski Assignee: Piotr Nowojski Fix For: 1.15.0 # Each SourceReader should inform it's SourceCoordinator about the latest watermark that it has emitted so far # SourceCoordinators should combine those watermarks and broadcast the aggregated result back to all SourceReaders -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-24357) ZooKeeperLeaderElectionConnectionHandlingTest#testLoseLeadershipOnLostConnectionIfTolerateSuspendedConnectionsIsEnabled fails with an Unhandled error
Piotr Nowojski created FLINK-24357: -- Summary: ZooKeeperLeaderElectionConnectionHandlingTest#testLoseLeadershipOnLostConnectionIfTolerateSuspendedConnectionsIsEnabled fails with an Unhandled error Key: FLINK-24357 URL: https://issues.apache.org/jira/browse/FLINK-24357 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.15.0 Reporter: Piotr Nowojski In a [private azure|https://dev.azure.com/pnowojski/637f6470-2732-4605-a304-caebd40e284b/_apis/build/builds/517/logs/155] build when testing my own PR I've noticed the following error that looks unrelated to any of my changes (modifications to {{Task}} class error/cancellation handling logic): {noformat} 2021-09-22T08:09:16.6244936Z Sep 22 08:09:16 [ERROR] testLoseLeadershipOnLostConnectionIfTolerateSuspendedConnectionsIsEnabled Time elapsed: 28.753 s <<< FAILURE! 2021-09-22T08:09:16.6245821Z Sep 22 08:09:16 java.lang.AssertionError: The TestingFatalErrorHandler caught an exception. 2021-09-22T08:09:16.6246513Z Sep 22 08:09:16at org.apache.flink.runtime.util.TestingFatalErrorHandlerResource.after(TestingFatalErrorHandlerResource.java:78) 2021-09-22T08:09:16.6247281Z Sep 22 08:09:16at org.apache.flink.runtime.util.TestingFatalErrorHandlerResource.access$300(TestingFatalErrorHandlerResource.java:33) 2021-09-22T08:09:16.6248167Z Sep 22 08:09:16at org.apache.flink.runtime.util.TestingFatalErrorHandlerResource$1.evaluate(TestingFatalErrorHandlerResource.java:57) 2021-09-22T08:09:16.6248862Z Sep 22 08:09:16at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54) 2021-09-22T08:09:16.6249620Z Sep 22 08:09:16at org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45) 2021-09-22T08:09:16.6250210Z Sep 22 08:09:16at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61) 2021-09-22T08:09:16.6250773Z Sep 22 08:09:16at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) 2021-09-22T08:09:16.6251375Z Sep 22 08:09:16at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) 2021-09-22T08:09:16.6251951Z Sep 22 08:09:16at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) 2021-09-22T08:09:16.6252562Z Sep 22 08:09:16at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) 2021-09-22T08:09:16.6253415Z Sep 22 08:09:16at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) 2021-09-22T08:09:16.6254469Z Sep 22 08:09:16at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) 2021-09-22T08:09:16.6255039Z Sep 22 08:09:16at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) 2021-09-22T08:09:16.6256238Z Sep 22 08:09:16at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) 2021-09-22T08:09:16.6257109Z Sep 22 08:09:16at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) 2021-09-22T08:09:16.6257766Z Sep 22 08:09:16at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) 2021-09-22T08:09:16.6258406Z Sep 22 08:09:16at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) 2021-09-22T08:09:16.6259050Z Sep 22 08:09:16at org.junit.runners.ParentRunner.run(ParentRunner.java:413) 2021-09-22T08:09:16.6259827Z Sep 22 08:09:16at org.junit.runner.JUnitCore.run(JUnitCore.java:137) 2021-09-22T08:09:16.6260963Z Sep 22 08:09:16at org.junit.runner.JUnitCore.run(JUnitCore.java:115) 2021-09-22T08:09:16.6261796Z Sep 22 08:09:16at org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:43) 2021-09-22T08:09:16.6262428Z Sep 22 08:09:16at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) 2021-09-22T08:09:16.6263268Z Sep 22 08:09:16at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) 2021-09-22T08:09:16.6263875Z Sep 22 08:09:16at java.util.Iterator.forEachRemaining(Iterator.java:116) 2021-09-22T08:09:16.6265025Z Sep 22 08:09:16at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) 2021-09-22T08:09:16.6265940Z Sep 22 08:09:16at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) 2021-09-22T08:09:16.6266767Z Sep 22 08:09:16at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) 2021-09-22T08:09:16.6267470Z Sep 22 08:09:16at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) 2021-09-22T08:09:16.6268165Z Sep 22 08:09:16at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) 2021-09-22T08:09:16.6269341Z Sep 22 08:09:16at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) 2021-09-22T08:09:16.6269928Z Sep 22 08:09:16at
[jira] [Created] (FLINK-24344) Handling of IOExceptions when triggering checkpoints doesn't cause job failover
Piotr Nowojski created FLINK-24344: -- Summary: Handling of IOExceptions when triggering checkpoints doesn't cause job failover Key: FLINK-24344 URL: https://issues.apache.org/jira/browse/FLINK-24344 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing Affects Versions: 1.14.0 Reporter: Piotr Nowojski Fix For: 1.14.0, 1.15.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-24328) Long term fix for receiving new buffer size before network reader configured
Piotr Nowojski created FLINK-24328: -- Summary: Long term fix for receiving new buffer size before network reader configured Key: FLINK-24328 URL: https://issues.apache.org/jira/browse/FLINK-24328 Project: Flink Issue Type: Bug Components: Runtime / Network Affects Versions: 1.14.0 Reporter: Anton Kalashnikov Assignee: Anton Kalashnikov Fix For: 1.14.0 It happened on the big cluster(parallelism = 75, TM=5, task=8) just on the initialization moment. {noformat} 2021-09-09 14:36:42,383 WARN org.apache.flink.runtime.taskmanager.Task [] - Map -> Flat Map (71/75)#0 (7a5b971e0cd57aa5d057a114e2679b03) switched from RUNNING to FAILED with failure c ause: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Fatal error at remote task manager 'ip-172-31-22-183.eu-central-1.compute.internal/172.31.22.183:42085'. at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.decodeMsg(CreditBasedPartitionRequestClientHandler.java:339) at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelRead(CreditBasedPartitionRequestClientHandler.java:240) at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) at org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelRead(NettyMessageClientDecoderDelegate.java:112) at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795) at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480) at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.IllegalStateException: No reader for receiverId = 296559f497c54a82534945f4549b9e2d exists. at org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.obtainReader(PartitionRequestQueue.java:194) at org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.notifyNewBufferSize(PartitionRequestQueue.java:188) at org.apache.flink.runtime.io.network.netty.PartitionRequestServerHandler.channelRead0(PartitionRequestServerHandler.java:134) at org.apache.flink.runtime.io.network.netty.PartitionRequestServerHandler.channelRead0(PartitionRequestServerHandler.java:42) at org.apache.flink.shaded.netty4.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324) at
[jira] [Created] (FLINK-24184) Potential race condition leading to incorrectly issued interruptions
Piotr Nowojski created FLINK-24184: -- Summary: Potential race condition leading to incorrectly issued interruptions Key: FLINK-24184 URL: https://issues.apache.org/jira/browse/FLINK-24184 Project: Flink Issue Type: Bug Components: Runtime / Task Affects Versions: 1.13.2, 1.12.5, 1.11.4, 1.10.3, 1.9.3, 1.8.3, 1.14.0 Reporter: Piotr Nowojski Fix For: 1.15.0 There is a race condition in disabling interrupts while closing resources. Currently this is guarded by a volatile variable, but there might be a race condition when: 1. interrupter thread first checked the shouldInterruptOnCancel flag 2. shouldInterruptOnCancel flag switched to false as Task/StreamTask entered cleaning up phase 3. interrupter issued an interrupt while Task/StreamTask are closing/releasing resources, potentially causing a memory leak -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-24155) Translate documentation for how to configure the CheckpointFailureManager
Piotr Nowojski created FLINK-24155: -- Summary: Translate documentation for how to configure the CheckpointFailureManager Key: FLINK-24155 URL: https://issues.apache.org/jira/browse/FLINK-24155 Project: Flink Issue Type: Improvement Components: chinese-translation, Documentation Affects Versions: 1.13.2, 1.14.0, 1.15.0 Reporter: Piotr Nowojski Fix For: 1.14.0 Documentation added in FLINK-23916 should be translated to it's Chinese counterpart. Note that this applies to three separate commits: merged to master as cd01d4c0279 merged to release-1.14 as 2e769746bf2 merged to release-1.13 as e1a71219454 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-24154) Translate documentation for how to configure the CheckpointFailureManager
Piotr Nowojski created FLINK-24154: -- Summary: Translate documentation for how to configure the CheckpointFailureManager Key: FLINK-24154 URL: https://issues.apache.org/jira/browse/FLINK-24154 Project: Flink Issue Type: Sub-task Components: chinese-translation, Documentation Reporter: Armstrong Nova Assignee: wangzhao Translate "[https://ci.apache.org/projects/flink/flink-docs-master/monitoring/historyserver.html]; page into Chinese. This doc located in "flink/docs/monitoring/historyserver.zh.md" -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-24030) PulsarSourceITCase>SourceTestSuiteBase.testMultipleSplits failed
Piotr Nowojski created FLINK-24030: -- Summary: PulsarSourceITCase>SourceTestSuiteBase.testMultipleSplits failed Key: FLINK-24030 URL: https://issues.apache.org/jira/browse/FLINK-24030 Project: Flink Issue Type: Bug Components: Connectors / Pulsar Affects Versions: 1.14.0 Reporter: Piotr Nowojski https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=22936=logs=fc5181b0-e452-5c8f-68de-1097947f6483=995c650b-6573-581c-9ce6-7ad4cc038461 root cause: {noformat} Aug 27 09:41:42 Caused by: org.apache.pulsar.client.api.PulsarClientException$BrokerMetadataException: Consumer not found Aug 27 09:41:42 at org.apache.pulsar.client.api.PulsarClientException.unwrap(PulsarClientException.java:987) Aug 27 09:41:42 at org.apache.pulsar.client.impl.PulsarClientImpl.close(PulsarClientImpl.java:658) Aug 27 09:41:42 at org.apache.flink.connector.pulsar.source.reader.source.PulsarSourceReaderBase.close(PulsarSourceReaderBase.java:83) Aug 27 09:41:42 at org.apache.flink.connector.pulsar.source.reader.source.PulsarOrderedSourceReader.close(PulsarOrderedSourceReader.java:170) Aug 27 09:41:42 at org.apache.flink.streaming.api.operators.SourceOperator.close(SourceOperator.java:308) Aug 27 09:41:42 at org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.close(StreamOperatorWrapper.java:141) Aug 27 09:41:42 at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.closeAllOperators(RegularOperatorChain.java:127) Aug 27 09:41:42 at org.apache.flink.streaming.runtime.tasks.StreamTask.closeAllOperators(StreamTask.java:1015) Aug 27 09:41:42 at org.apache.flink.streaming.runtime.tasks.StreamTask.afterInvoke(StreamTask.java:859) Aug 27 09:41:42 at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:747) Aug 27 09:41:42 at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958) Aug 27 09:41:42 at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937) Aug 27 09:41:42 at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766) Aug 27 09:41:42 at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575) {noformat} Top level error: {noformat} WARNING: The following warnings have been detected: WARNING: Return type, java.util.Map, of method, public java.util.Map org.apache.pulsar.broker.admin.impl.ClustersBase.getNamespaceIsolationPolicies(java.lang.String) throws java.lang.Exception, is not resolvable to a concrete type. Aug 27 09:41:42 [ERROR] Tests run: 8, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 357.849 s <<< FAILURE! - in org.apache.flink.connector.pulsar.source.PulsarSourceITCase Aug 27 09:41:42 [ERROR] testMultipleSplits{TestEnvironment, ExternalContext}[1] Time elapsed: 5.391 s <<< ERROR! Aug 27 09:41:42 java.lang.RuntimeException: Failed to fetch next result Aug 27 09:41:42 at org.apache.flink.streaming.api.operators.collect.CollectResultIterator.nextResultFromFetcher(CollectResultIterator.java:109) Aug 27 09:41:42 at org.apache.flink.streaming.api.operators.collect.CollectResultIterator.hasNext(CollectResultIterator.java:80) Aug 27 09:41:42 at org.apache.flink.connectors.test.common.utils.TestDataMatchers$MultipleSplitDataMatcher.matchesSafely(TestDataMatchers.java:151) Aug 27 09:41:42 at org.apache.flink.connectors.test.common.utils.TestDataMatchers$MultipleSplitDataMatcher.matchesSafely(TestDataMatchers.java:133) Aug 27 09:41:42 at org.hamcrest.TypeSafeDiagnosingMatcher.matches(TypeSafeDiagnosingMatcher.java:55) Aug 27 09:41:42 at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:12) Aug 27 09:41:42 at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:8) Aug 27 09:41:42 at org.apache.flink.connectors.test.common.testsuites.SourceTestSuiteBase.testMultipleSplits(SourceTestSuiteBase.java:156) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23988) Test corner cases
Piotr Nowojski created FLINK-23988: -- Summary: Test corner cases Key: FLINK-23988 URL: https://issues.apache.org/jira/browse/FLINK-23988 Project: Flink Issue Type: Sub-task Components: Runtime / Network Reporter: Piotr Nowojski Check how debloating behaves in case of: * data skew * multiple/two/union inputs -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23776) Performance regression on 14.08.2021 in FLIP-27
Piotr Nowojski created FLINK-23776: -- Summary: Performance regression on 14.08.2021 in FLIP-27 Key: FLINK-23776 URL: https://issues.apache.org/jira/browse/FLINK-23776 Project: Flink Issue Type: Bug Components: API / DataStream, Benchmarks Affects Versions: 1.14.0 Reporter: Piotr Nowojski Assignee: Arvid Heise Fix For: 1.14.0 http://codespeed.dak8s.net:8000/timeline/?ben=mapSink.F27_UNBOUNDED=2 http://codespeed.dak8s.net:8000/timeline/?ben=mapRebalanceMapSink.F27_UNBOUNDED=2 {noformat} git ls 7b60a964b1..7f3636f6b4 7f3636f6b4f [2 days ago] [FLINK-23652][connectors] Adding common source metrics. [Arvid Heise] 97c8f72b813 [3 months ago] [FLINK-23652][connectors] Adding common sink metrics. [Arvid Heise] 48da20e8f88 [3 months ago] [FLINK-23652][test] Adding InMemoryMetricReporter and using it by default in MiniClusterResource. [Arvid Heise] 63ee60859ca [3 months ago] [FLINK-23652][core/metrics] Extract Operator(IO)MetricGroup interfaces and expose them in RuntimeContext [Arvid Heise] 5d5e39b614b [2 days ago] [refactor][connectors] Only use MockSplitReader.Builder for instantiation. [Arvid Heise] b927035610c [3 months ago] [refactor][core] Extract common context creation in CollectionExecutor [Arvid Heise] {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23772) Double check if non-keyed FullyFinishedOperatorState can be mixed up with non finished OperatorState on recovery
Piotr Nowojski created FLINK-23772: -- Summary: Double check if non-keyed FullyFinishedOperatorState can be mixed up with non finished OperatorState on recovery Key: FLINK-23772 URL: https://issues.apache.org/jira/browse/FLINK-23772 Project: Flink Issue Type: Sub-task Components: Runtime / Checkpointing Affects Versions: 1.14.0 Reporter: Piotr Nowojski Fix For: 1.14.0 I'm not sure if with non-keyed state we have an issue that it can be reshuffled to different operators during recovery. Are there any guarantees that if subtask 1 has state A, while subtask 2 has B, that after recovery it won’t be rotated? # is this an issue? # if so, if we have partially finished tasks with some operators having, {{FullyFinishedOperatorState}}, what prevents {{VerticesFinishedCache.calculateIfFinished}} from failing if the {{FullyFinishedOperatorState}} gets assigned to an operator chain with non finished operator? For example an operator chain with parallelism of two, non-keyed, before recovery: {noformat} src1 (finished state) -> foo1 (finished state) src2 -> foo2 {noformat} Can we end up after recovery with: {noformat} src1 (finished state) -> foo2 src2 -> foo1 (finished state) {noformat} ? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23771) FLIP-147 Checkpoint N has not been started at all
Piotr Nowojski created FLINK-23771: -- Summary: FLIP-147 Checkpoint N has not been started at all Key: FLINK-23771 URL: https://issues.apache.org/jira/browse/FLINK-23771 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing Affects Versions: 1.14.0 Reporter: Piotr Nowojski Fix For: 1.14.0 Observed when running WIP version of https://github.com/apache/flink/pull/16773/ {noformat} 6311 [flink-akka.actor.default-dispatcher-6] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - transform-2-keyed (4/4) (7f9e36a73792074afd7803e39f563fac) switched from RUNNING to FAILED on dac2cc5f-41f8-4067-99a9-ae56341e3735 @ localhost (dataPort=-1). java.lang.Exception: Could not perform checkpoint 4 for operator transform-2-keyed (4/4)#1. at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointAsyncInMailbox(StreamTask.java:1170) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$11(StreamTask.java:1117) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:338) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:324) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:201) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.StreamTask.afterInvoke(StreamTask.java:851) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:754) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:787) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:730) ~[classes/:?] at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:786) ~[classes/:?] at org.apache.flink.runtime.taskmanager.Task.run(Task.java:572) ~[classes/:?] at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_131] Caused by: java.lang.IllegalStateException: Checkpoint 4 has not been started at all at org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.getAllBarriersReceivedFuture(SingleCheckpointBarrierHandler.java:464) ~[classes/:?] at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.getAllBarriersReceivedFuture(CheckpointedInputGate.java:209) ~[classes/:?] at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.prepareSnapshot(StreamTaskNetworkInput.java:112) ~[classes/:?] at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.prepareSnapshot(StreamOneInputProcessor.java:83) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.StreamTask.prepareInputSnapshot(StreamTask.java:458) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.prepareInflightDataSnapshot(SubtaskCheckpointCoordinatorImpl.java:554) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.initInputsCheckpoint(SubtaskCheckpointCoordinatorImpl.java:423) ~[classes/:?] at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointAsyncInMailbox(StreamTask.java:1154) ~[classes/:?] ... 13 more {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23741) Waiting for final checkpoint can deadlock job
Piotr Nowojski created FLINK-23741: -- Summary: Waiting for final checkpoint can deadlock job Key: FLINK-23741 URL: https://issues.apache.org/jira/browse/FLINK-23741 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing, Runtime / Task Affects Versions: 1.14.0 Reporter: Piotr Nowojski With {{ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH}} enabled, final checkpoint can deadlock (or timeout after very long time) if there is a race condition between selecting tasks to trigger checkpoint on and finishing tasks. FLINK-21246 was supposed to handle it, but it doesn't work as expected, because futures from: org.apache.flink.runtime.taskexecutor.TaskExecutor#triggerCheckpoint and org.apache.flink.streaming.runtime.tasks.StreamTask#triggerCheckpointAsync are not linked together. TaskExecutor#triggerCheckpoint reports that checkpoint has been successfully triggered, while {{StreamTask}} might have actually finished. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23708) When task is finishedOnRestore Operators shouldn't be used
Piotr Nowojski created FLINK-23708: -- Summary: When task is finishedOnRestore Operators shouldn't be used Key: FLINK-23708 URL: https://issues.apache.org/jira/browse/FLINK-23708 Project: Flink Issue Type: Sub-task Components: Runtime / Task Affects Versions: 1.14.0 Reporter: Piotr Nowojski Assignee: Piotr Nowojski Fix For: 1.14.0 When task is finishedOnRestore Operators shouldn't be used, invoked or even initialised. Currently at least a couple of methods are being invoked like: * endInput() * getMetricGroup() -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23704) FLIP-27 sources are not generating LatencyMarkers
Piotr Nowojski created FLINK-23704: -- Summary: FLIP-27 sources are not generating LatencyMarkers Key: FLINK-23704 URL: https://issues.apache.org/jira/browse/FLINK-23704 Project: Flink Issue Type: Bug Components: API / DataStream, Runtime / Task Affects Versions: 1.13.2, 1.12.5, 1.14.0 Reporter: Piotr Nowojski Currently {{LatencyMarker}} is created only in {{StreamSource.LatencyMarksEmitter#LatencyMarksEmitter}}. FLIP-27 sources are never emitting it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23698) Pass watermarks in finished on restore operators
Piotr Nowojski created FLINK-23698: -- Summary: Pass watermarks in finished on restore operators Key: FLINK-23698 URL: https://issues.apache.org/jira/browse/FLINK-23698 Project: Flink Issue Type: Sub-task Components: Runtime / Task Affects Versions: 1.14.0 Reporter: Piotr Nowojski Assignee: Piotr Nowojski Fix For: 1.14.0 After merging FLINK-23541 there is a bug on restore finished tasks that we are loosing an information that max watermark has been already emitted. As task is finished, it means no new watermark will be ever emitted, while downstream tasks (for example two/multiple input tasks) would be deadlocked waiting for a watermark from an already finished input. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23648) Drop StreamTwoInputProcessor in favour of StreamMultipleInputProcessor
Piotr Nowojski created FLINK-23648: -- Summary: Drop StreamTwoInputProcessor in favour of StreamMultipleInputProcessor Key: FLINK-23648 URL: https://issues.apache.org/jira/browse/FLINK-23648 Project: Flink Issue Type: Bug Components: Runtime / Task Affects Versions: 1.14.0 Reporter: Piotr Nowojski Assignee: Piotr Nowojski Fix For: 1.14.0 StreamTwoInputProcessor is less efficient and can be simply replaced with StreamMultipleInputProcessor. This also solves a performance problem for FLINK-23541, where initial version was causing a performance regression in two input processor, while multiple input was working fine. Hence instead of optimising StreamTwoInputProcessor, it's just simpler to drop it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23593) Performance regression on 15.07.2021
Piotr Nowojski created FLINK-23593: -- Summary: Performance regression on 15.07.2021 Key: FLINK-23593 URL: https://issues.apache.org/jira/browse/FLINK-23593 Project: Flink Issue Type: Bug Components: Benchmarks Affects Versions: 1.14.0 Reporter: Piotr Nowojski http://codespeed.dak8s.net:8000/timeline/?ben=sortedMultiInput=2 http://codespeed.dak8s.net:8000/timeline/?ben=sortedTwoInput=2 {noformat} pnowojski@piotr-mbp: [~/flink - ((no branch, bisect started on pr/16589))] $ git ls f4afbf3e7de..eb8100f7afe eb8100f7afe [4 weeks ago] (pn/bad, bad, refs/bisect/bad) [FLINK-22017][coordination] Allow BLOCKING result partition to be individually consumable [Thesharing] d2005268b1e [4 weeks ago] (HEAD, pn/bisect-4, bisect-4) [FLINK-22017][coordination] Get the ConsumedPartitionGroup that IntermediateResultPartition and DefaultResultPartition belong to [Thesharing] d8b1a6fd368 [3 weeks ago] [FLINK-23372][streaming-java] Disable AllVerticesInSameSlotSharingGroupByDefault in batch mode [Timo Walther] 4a78097d038 [3 weeks ago] (pn/bisect-3, bisect-3, refs/bisect/good-4a78097d0385749daceafd8326930c8cc5f26f1a) [FLINK-21928][clients][runtime] Introduce static method constructors of DuplicateJobSubmissionException for better readability. [David Moravek] 172b9e32215 [3 weeks ago] [FLINK-21928][clients] JobManager failover should succeed, when trying to resubmit already terminated job in application mode. [David Moravek] f483008db86 [3 weeks ago] [FLINK-21928][core] Introduce org.apache.flink.util.concurrent.FutureUtils#handleException method, that allows future to recover from the specied exception. [David Moravek] d7ac08c2ac0 [3 weeks ago] (pn/bisect-2, bisect-2, refs/bisect/good-d7ac08c2ac06b9ff31707f3b8f43c07817814d4f) [FLINK-22843][docs-zh] Document and code are inconsistent [ZhiJie Yang] 16c3ea427df [3 weeks ago] [hotfix] Split the final checkpoint related tests to a separate test class. [Yun Gao] 31b3d37a22c [7 weeks ago] [FLINK-21089][runtime] Skip the execution of new sources if finished on restore [Yun Gao] 20fe062e1b5 [3 weeks ago] [FLINK-21089][runtime] Skip execution for the legacy source task if finished on restore [Yun Gao] 874c627114b [3 weeks ago] [FLINK-21089][runtime] Skip the lifecycle method of operators if finished on restore [Yun Gao] ceaf24b1d88 [3 weeks ago] (pn/bisect-1, bisect-1, refs/bisect/good-ceaf24b1d881c2345a43f305d40435519a09cec9) [hotfix] Fix isClosed() for operator wrapper and proxy operator close to the operator chain [Yun Gao] 41ea591a6db [3 weeks ago] [FLINK-22627][runtime] Remove unused slot request protocol [Yangze Guo] 489346b60f8 [3 months ago] [FLINK-22627][runtime] Remove PendingSlotRequest [Yangze Guo] 8ffb4d2af36 [3 months ago] [FLINK-22627][runtime] Remove TaskManagerSlot [Yangze Guo] 72073741588 [3 months ago] [FLINK-22627][runtime] Remove SlotManagerImpl and its related tests [Yangze Guo] bdb3b7541b3 [3 months ago] [hotfix][yarn] Remove unused internal options in YarnConfigOptionsInternal [Yangze Guo] a6a9b192eac [3 weeks ago] [FLINK-23201][streaming] Reset alignment only for the currently processed checkpoint [Anton Kalashnikov] b35701a35c7 [3 weeks ago] [FLINK-23201][streaming] Calculate checkpoint alignment time only for last started checkpoint [Anton Kalashnikov] 3abec22c536 [3 weeks ago] [FLINK-23107][table-runtime] Separate implementation of deduplicate rank from other rank functions [Shuo Cheng] 1a195f5cc59 [3 weeks ago] [FLINK-16093][docs-zh] Translate "System Functions" page of "Functions" into Chinese (#16348) [ZhiJie Yang] {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23560) Performance regression on 29.07.2021
Piotr Nowojski created FLINK-23560: -- Summary: Performance regression on 29.07.2021 Key: FLINK-23560 URL: https://issues.apache.org/jira/browse/FLINK-23560 Project: Flink Issue Type: Bug Components: Benchmarks Affects Versions: 1.14.0 Reporter: Piotr Nowojski Assignee: Piotr Nowojski Fix For: 1.14.0 http://codespeed.dak8s.net:8000/timeline/?ben=remoteFilePartition=2 http://codespeed.dak8s.net:8000/timeline/?ben=uncompressedMmapPartition=2 http://codespeed.dak8s.net:8000/timeline/?ben=compressedFilePartition=2 http://codespeed.dak8s.net:8000/timeline/?ben=tupleKeyBy=2 http://codespeed.dak8s.net:8000/timeline/?ben=arrayKeyBy=2 http://codespeed.dak8s.net:8000/timeline/?ben=uncompressedFilePartition=2 http://codespeed.dak8s.net:8000/timeline/?ben=sortedTwoInput=2 http://codespeed.dak8s.net:8000/timeline/?ben=sortedMultiInput=2 http://codespeed.dak8s.net:8000/timeline/?ben=globalWindow=2 (And potentially other benchmarks) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23559) Enable periodic materialisation in tests
Piotr Nowojski created FLINK-23559: -- Summary: Enable periodic materialisation in tests Key: FLINK-23559 URL: https://issues.apache.org/jira/browse/FLINK-23559 Project: Flink Issue Type: Sub-task Components: Runtime / State Backends Reporter: Roman Khachatryan Assignee: Roman Khachatryan Fix For: 1.14.0 FLINK-21448 adds the capability (test randomization), but it can't be turned on as there are some test failures: FLINK-23276, FLINK-23277, FLINK-23278 (should be enabled after those bugs fixed).. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23528) stop-with-savepoint can fail with FlinkKinesisConsumer
Piotr Nowojski created FLINK-23528: -- Summary: stop-with-savepoint can fail with FlinkKinesisConsumer Key: FLINK-23528 URL: https://issues.apache.org/jira/browse/FLINK-23528 Project: Flink Issue Type: Bug Components: Connectors / Kinesis Affects Versions: 1.12.4, 1.13.1 Reporter: Piotr Nowojski Fix For: 1.14.0 {{FlinkKinesisConsumer#cancel()}} (inside {{KinesisDataFetcher#shutdownFetcher()}}) shouldn't be interrupting source thread. Otherwise, as described in FLINK-23527, network stack can be left in an invalid state and downstream tasks can encounter deserialisation errors. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23527) Clarify `SourceFunction#cancel()` contract about interrupting
Piotr Nowojski created FLINK-23527: -- Summary: Clarify `SourceFunction#cancel()` contract about interrupting Key: FLINK-23527 URL: https://issues.apache.org/jira/browse/FLINK-23527 Project: Flink Issue Type: Bug Components: API / DataStream Affects Versions: 1.13.1 Reporter: Piotr Nowojski Fix For: 1.14.0 We should clarify the contract of {{SourceFunction#cancel()}} # source itself shouldn’t be interrupting the source thread # interrupt shouldn’t be expected in the clean cancellation case Interrupting the code on the clean shutdown path can cause failures when doing `stop-with-savepoint`. If source thread is interrupted during backpressure, this leaves network stack in invalid state, making it impossible to send {{EndOfPartitionEvent}} to complete the shutdown. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23398) State backend benchmarks are failing
Piotr Nowojski created FLINK-23398: -- Summary: State backend benchmarks are failing Key: FLINK-23398 URL: https://issues.apache.org/jira/browse/FLINK-23398 Project: Flink Issue Type: Bug Components: Benchmarks, Runtime / State Backends Affects Versions: 1.14.0 Reporter: Piotr Nowojski Assignee: Yun Tang Fix For: 1.14.0 http://codespeed.dak8s.net:8080/job/flink-statebackend-benchmark/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23382) Performance regression on 13.07.2021
Piotr Nowojski created FLINK-23382: -- Summary: Performance regression on 13.07.2021 Key: FLINK-23382 URL: https://issues.apache.org/jira/browse/FLINK-23382 Project: Flink Issue Type: Bug Components: Benchmarks, Runtime / Network Affects Versions: 1.14.0 Reporter: Piotr Nowojski Fix For: 1.14.0 It looks like there was a performance regression after merging FLINK-16641 http://codespeed.dak8s.net:8000/timeline/?ben=readFileSplit=2 http://codespeed.dak8s.net:8000/timeline/?ben=networkLatency1to1=2 http://codespeed.dak8s.net:8000/timeline/?ben=networkSkewedThroughput=2 (and a couple of other benchmarks) {code} $ git ls 4fddcb27c5..657df14677 657df14677a [2 days ago] [FLINK-23356][hbase] Do not use delegation token in case of keytab [Gabor Somogyi] 5aba6167343 [3 days ago] [FLINK-21088][runtime][checkpoint] Pass the finish on restore status to operator chain [Yun Gao] df99b49e3f4 [2 days ago] [FLINK-21088][runtime][checkpoint] Pass the finished on restore status to TaskStateManager [Yun Gao] 1251ab8566f [2 days ago] [FLINK-21088][runtime][checkpoint] Assign a finished snapshot to task if operators are all finished on restore [Yun Gao] 83c5e710673 [33 hours ago] [FLINK-23150][table-planner] Remove the old code split implementation [tsreaper] 2e374954b9b [2 days ago] [FLINK-23359][test] Fix the number of available slots in testResourceCanBeAllocatedForDifferentJobAfterFree [Yangze Guo] 2268baf211f [5 days ago] [FLINK-23235][connector] Fix SinkITCase instability [GuoWei Ma] 60d015cfc65 [6 days ago] [FLINK-16641][network] (Part#6) Enable to set network buffers per channel to 0 [kevin.cyj] 7d1bb5f5e0b [6 days ago] [FLINK-16641][network] (Part#5) Send empty buffers to the downstream tasks to release the allocated credits if the exclusive credit is 0 [kevin.cyj] 985382020b1 [6 days ago] [FLINK-16641][network] (Part#4) Release all allocated floating buffers of RemoteInputChannel on receiving any channel blocking event if the exclusive credit is 0 [kevin.cyj] 1bf36cbcbbf [6 days ago] [FLINK-16641][network] (Part#3) Support to announce the upstream backlog to the downstream tasks [kevin.cyj] 31e1cd174d6 [6 days ago] [FLINK-16641][network] (Part#2) Distinguish data buffer and event buffer for BoundedBlockingSubpartitionDirectTransferReader [kevin.cyj] 3a46604c068 [7 days ago] [FLINK-16641][network] (Part#1) Introduce a new network message BacklogAnnouncement which can bring the upstream buffer backlog to the downstream [kevin.cyj] 1746146ab9e [5 days ago] [hotfix] Fix typos in NettyShuffleUtils [kevin.cyj] 73fcb75352b [4 months ago] [hotfix] Simplify RemoteInputChannel#onSenderBacklog and call the existing method directly [kevin.cyj] dc80bb07fb9 [4 months ago] [hotfix] Remove outdated comments in UnionInputGate [kevin.cyj] 790576b544e [4 months ago] [hotfix] Remove redundant if condition in BufferManager [kevin.cyj] cf7d7258c5e [2 days ago] [hotfix][tests] Remove unnecessary sleep from RemoteAkkaRpcActorTest.failsRpcResultImmediatelyIfRemoteRpcServiceIsNotAvailable [Till Rohrmann] 4a3e6d6a769 [2 months ago] [FLINK-11103][runtime] Set a configurable uncaught exception handler for all entrypoints [Ashwin Kolhatkar] {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23078) Scheduler Benchmarks not compiling
Piotr Nowojski created FLINK-23078: -- Summary: Scheduler Benchmarks not compiling Key: FLINK-23078 URL: https://issues.apache.org/jira/browse/FLINK-23078 Project: Flink Issue Type: Bug Components: Benchmarks, Runtime / Coordination Affects Versions: 1.14.0 Reporter: Piotr Nowojski Fix For: 1.14.0 {code:java} 07:46:50 [ERROR] /home/jenkins/workspace/flink-master-benchmarks/flink-benchmarks/src/main/java/org/apache/flink/scheduler/benchmark/SchedulerBenchmarkBase.java:21:44: error: cannot find symbol {code} CC [~chesnay] [~Thesharing] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23011) FLIP-27 sources are generating non-deterministic results when using event time
Piotr Nowojski created FLINK-23011: -- Summary: FLIP-27 sources are generating non-deterministic results when using event time Key: FLINK-23011 URL: https://issues.apache.org/jira/browse/FLINK-23011 Project: Flink Issue Type: New Feature Components: API / DataStream Affects Versions: 1.12.4, 1.13.1, 1.14.0 Environment: Reporter: Piotr Nowojski FLIP-27 sources currently start in the {{StreamStatus.IDLE}} state and they switch to {{ACTIVE}} only after emitting first {{Watermark}}. Until this happens, downstream operators are ignoring {{IDLE}} inputs from calculating the input (min) watermark. An extreme example to what problem this leads to, are completely bogus results if for example one FLIP-27 source subtask is slower than others for some reason: {code:java} env.getConfig().setAutoWatermarkInterval(2000); env.setParallelism(2); env.setRestartStrategy(RestartStrategies.fixedDelayRestart(Integer.MAX_VALUE, 10)); DataStream eventStream = env.fromSource( new NumberSequenceSource(0, Long.MAX_VALUE), WatermarkStrategy.forMonotonousTimestamps() .withTimestampAssigner(new LongTimestampAssigner()), "NumberSequenceSource") .map( new RichMapFunction() { @Override public Long map(Long value) throws Exception { if (getRuntimeContext().getIndexOfThisSubtask() == 0) { Thread.sleep(1); } return 1L; } }); eventStream.windowAll(TumblingEventTimeWindows.of(Time.seconds(1))).sum(0).print(); (...) private static class LongTimestampAssigner implements SerializableTimestampAssigner { private long counter = 0; @Override public long extractTimestamp(Long record, long recordTimeStamp) { return counter++; } } {code} In such case, after 2 seconds ({{setAutoWatermarkInterval}}) the not throttled subtask (subTaskId == 1) generates very high watermarks. The other source subtask (subTaskId == 0) emits very low watermarks. If the non throttled watermark reaches the downstream {{WindowOperator}} first, while the other input channel is still idle, it will take those high watermarks as combined input watermark for the the whole {{WindowOperator}}. When the input channel from the throttled source subtask finally receives it's {{ACTIVE}} status and a much lower watermark, that's already too late. Actual output of the example program: {noformat} 1596 2000 1000 1000 1000 1000 1000 1000 (...) {noformat} while the expected output should be always "2000" (2000 records fitting in every 1 second global window) {noformat} 2000 2000 2000 2000 (...) {noformat}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22973) Provide benchmark for unaligned checkpoints performance
Piotr Nowojski created FLINK-22973: -- Summary: Provide benchmark for unaligned checkpoints performance Key: FLINK-22973 URL: https://issues.apache.org/jira/browse/FLINK-22973 Project: Flink Issue Type: New Feature Components: Benchmarks, Runtime / Network Affects Versions: 1.14.0 Reporter: Piotr Nowojski Assignee: Piotr Nowojski Fix For: 1.14.0 Provide some simple benchmark to test speed of the unaligned checkpoints on our continuous benchmarking infrastructure. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22904) Performance regression on 25.05.2020 in mapRebalanceMapSink
Piotr Nowojski created FLINK-22904: -- Summary: Performance regression on 25.05.2020 in mapRebalanceMapSink Key: FLINK-22904 URL: https://issues.apache.org/jira/browse/FLINK-22904 Project: Flink Issue Type: Bug Components: Benchmarks Affects Versions: 1.14.0 Reporter: Piotr Nowojski http://codespeed.dak8s.net:8000/timeline/?ben=mapRebalanceMapSink=2 Suspected range in which this regression happened 21c44688e98..80ad5b3b51 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22882) Tasks are blocked while emitting records
Piotr Nowojski created FLINK-22882: -- Summary: Tasks are blocked while emitting records Key: FLINK-22882 URL: https://issues.apache.org/jira/browse/FLINK-22882 Project: Flink Issue Type: Bug Components: Runtime / Network, Runtime / Task Affects Versions: 1.14.0 Reporter: Piotr Nowojski On a cluster I observed symptoms of tasks being blocked for long time, causing long delays with unaligned checkpointing. 99% of those cases were caused by `broadcastEmit` of the stream status {noformat} 2021-06-04 14:41:44,049 ERROR org.apache.flink.runtime.io.network.buffer.LocalBufferPool [] - Blocking wait [11059 ms] for an available buffer. java.lang.Exception: Stracktracegenerator at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:323) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:290) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.requestNewBufferBuilderFromPool(BufferWritingResultPartition.java:338) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.requestNewUnicastBufferBuilder(BufferWritingResultPartition.java:314) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.appendUnicastDataForNewRecord(BufferWritingResultPartition.java:246) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.emitRecord(BufferWritingResultPartition.java:142) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:104) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.broadcastEmit(ChannelSelectorRecordWriter.java:67) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.runtime.io.RecordWriterOutput.writeStreamStatus(RecordWriterOutput.java:136) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.runtime.streamstatus.AnnouncedStatus.ensureActive(AnnouncedStatus.java:65) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:103) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:90) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:44) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:56) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:29) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:38) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.runtime.tasks.ChainingOutput.pushToOperator(ChainingOutput.java:101) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.runtime.tasks.ChainingOutput.collect(ChainingOutput.java:82) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.runtime.tasks.ChainingOutput.collect(ChainingOutput.java:39) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.runtime.tasks.SourceOperatorStreamTask$AsyncDataOutputToOutput.emitRecord(SourceOperatorStreamTask.java:182) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:110) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:101) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.api.connector.source.lib.util.IteratorSourceReader.pollNext(IteratorSourceReader.java:98) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:294) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at
[jira] [Created] (FLINK-22881) Tasks are blocked while emitting stream status
Piotr Nowojski created FLINK-22881: -- Summary: Tasks are blocked while emitting stream status Key: FLINK-22881 URL: https://issues.apache.org/jira/browse/FLINK-22881 Project: Flink Issue Type: Bug Components: Runtime / Network, Runtime / Task Affects Versions: 1.14.0 Reporter: Piotr Nowojski On a cluster I observed symptoms of tasks being blocked for long time, causing long delays with unaligned checkpointing. 99% of those cases were caused by `broadcastEmit` of the stream status ``` 2021-06-04 14:41:44,049 ERROR org.apache.flink.runtime.io.network.buffer.LocalBufferPool [] - Blocking wait [11059 ms] for an available buffer. java.lang.Exception: Stracktracegenerator at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:323) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:290) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.requestNewBufferBuilderFromPool(BufferWritingResultPartition.java:338) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.requestNewUnicastBufferBuilder(BufferWritingResultPartition.java:314) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.appendUnicastDataForNewRecord(BufferWritingResultPartition.java:246) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.emitRecord(BufferWritingResultPartition.java:142) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:104) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.broadcastEmit(ChannelSelectorRecordWriter.java:67) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.runtime.io.RecordWriterOutput.writeStreamStatus(RecordWriterOutput.java:136) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.runtime.streamstatus.AnnouncedStatus.ensureActive(AnnouncedStatus.java:65) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:103) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:90) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:44) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:56) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:29) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:38) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.runtime.tasks.ChainingOutput.pushToOperator(ChainingOutput.java:101) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.runtime.tasks.ChainingOutput.collect(ChainingOutput.java:82) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.runtime.tasks.ChainingOutput.collect(ChainingOutput.java:39) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.runtime.tasks.SourceOperatorStreamTask$AsyncDataOutputToOutput.emitRecord(SourceOperatorStreamTask.java:182) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:110) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:101) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.api.connector.source.lib.util.IteratorSourceReader.pollNext(IteratorSourceReader.java:98) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:294) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at
[jira] [Created] (FLINK-22833) Source tasks (both old and new) are not reporting checkpointStartDelay via CheckpointMetrics
Piotr Nowojski created FLINK-22833: -- Summary: Source tasks (both old and new) are not reporting checkpointStartDelay via CheckpointMetrics Key: FLINK-22833 URL: https://issues.apache.org/jira/browse/FLINK-22833 Project: Flink Issue Type: Bug Components: Runtime / Metrics Affects Versions: 1.12.4, 1.13.1 Reporter: Piotr Nowojski Assignee: Piotr Nowojski -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22814) New sources are not defining/exposing checkpointStartDelayNanos metric
Piotr Nowojski created FLINK-22814: -- Summary: New sources are not defining/exposing checkpointStartDelayNanos metric Key: FLINK-22814 URL: https://issues.apache.org/jira/browse/FLINK-22814 Project: Flink Issue Type: Bug Components: API / DataStream, Runtime / Metrics Affects Versions: 1.12.4, 1.13.1 Reporter: Piotr Nowojski Assignee: Piotr Nowojski checkpointStartDelayNanos metric for new (FLIP-27) sources is always 0ms in the WebUI, regardless how long it took to finally start triggering the checkpoint. -- This message was sent by Atlassian Jira (v8.3.4#803005)