[jira] [Created] (FLINK-35528) Skip execution of interruptible mails when yielding

2024-06-05 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-35528:
--

 Summary: Skip execution of interruptible mails when yielding
 Key: FLINK-35528
 URL: https://issues.apache.org/jira/browse/FLINK-35528
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Checkpointing, Runtime / Task
Affects Versions: 1.20.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski


When operators are yielding, for example while waiting for async state access to 
complete before a checkpoint, it would be beneficial not to execute 
interruptible mails. Otherwise the continuation mail for firing timers would be 
continuously re-enqueued. To achieve that, the MailboxExecutor must be aware of 
which mails are interruptible.

The easiest way to achieve this is to set MIN_PRIORITY for interruptible mails.
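
As a rough illustration of the idea (a simplified sketch, not Flink's actual MailboxExecutor/TaskMailbox code; the class, method and constant names are made up for this example), interruptible mails could be tagged with the lowest priority and simply skipped while yielding:

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

/** Simplified mailbox sketch: tryYield() skips interruptible (MIN_PRIORITY) mails. */
public class PrioritizedMailboxSketch {

    static final int MIN_PRIORITY = Integer.MIN_VALUE;

    static final class Mail {
        final Runnable action;
        final int priority;

        Mail(Runnable action, int priority) {
            this.action = action;
            this.priority = priority;
        }
    }

    private final Deque<Mail> queue = new ArrayDeque<>();

    /** Interruptible mails (e.g. timer-firing continuations) are submitted with MIN_PRIORITY. */
    public void submit(Runnable action, int priority) {
        queue.addLast(new Mail(action, priority));
    }

    /** Executes at most one non-interruptible mail; interruptible ones stay enqueued untouched. */
    public boolean tryYield() {
        Iterator<Mail> it = queue.iterator();
        while (it.hasNext()) {
            Mail mail = it.next();
            if (mail.priority > MIN_PRIORITY) {
                it.remove();
                mail.action.run();
                return true;
            }
        }
        return false; // only interruptible mails left, nothing executed while yielding
    }
}
{code}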



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35518) CI Bot doesn't run on PRs

2024-06-04 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-35518:
--

 Summary: CI Bot doesn't run on PRs
 Key: FLINK-35518
 URL: https://issues.apache.org/jira/browse/FLINK-35518
 Project: Flink
  Issue Type: Bug
  Components: Build System / CI
Affects Versions: 1.20.0
Reporter: Piotr Nowojski


The CI bot doesn't run on my PRs/branches. I tried force-pushes, rebases, asking 
the Flink bot to run, and closing and opening a new PR, but nothing helped:
https://github.com/apache/flink/pull/24868
https://github.com/apache/flink/pull/24883

I've heard others were having similar problems recently.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35420) WordCountMapredITCase fails to compile in IntelliJ

2024-05-22 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-35420:
--

 Summary: WordCountMapredITCase fails to compile in IntelliJ
 Key: FLINK-35420
 URL: https://issues.apache.org/jira/browse/FLINK-35420
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Hadoop Compatibility
Affects Versions: 1.20.0
Reporter: Piotr Nowojski


{noformat}
flink-connectors/flink-hadoop-compatibility/src/test/scala/org/apache/flink/api/hadoopcompatibility/scala/WordCountMapredITCase.scala:42:8
value isFalse is not a member of ?0
possible cause: maybe a semicolon is missing before `value isFalse'?
  .isFalse()
{noformat}

Might be caused by:
https://youtrack.jetbrains.com/issue/SCL-20679




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35065) Add numFiredTimers and numFiredTimersPerSecond metrics

2024-04-09 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-35065:
--

 Summary: Add numFiredTimers and numFiredTimersPerSecond metrics
 Key: FLINK-35065
 URL: https://issues.apache.org/jira/browse/FLINK-35065
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Metrics, Runtime / Task
Affects Versions: 1.19.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.20.0


Currently there is no way of knowing how many timers are being fired by Flink, 
so it's impossible to distinguish, even using code profiling, whether an operator 
is firing only a couple of heavy timers per second using ~100% of the CPU time, 
or firing thousands of timers per second.

We could add the following metrics to address this issue (a registration sketch follows below):
* numFiredTimers - total number of fired timers per operator
* numFiredTimersPerSecond - per-second rate of firing timers per operator
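
A minimal sketch of how these metrics could be registered from an operator, assuming the standard {{MetricGroup}}, {{Counter}} and {{MeterView}} APIs (the wrapper class and field names are illustrative only):

{code:java}
import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.Meter;
import org.apache.flink.metrics.MeterView;
import org.apache.flink.metrics.MetricGroup;

/** Illustrative wrapper that counts fired timers and exposes a per-second rate. */
public class TimerMetricsSketch {

    private final Counter numFiredTimers;
    private final Meter numFiredTimersPerSecond;

    public TimerMetricsSketch(MetricGroup operatorMetricGroup) {
        // Total number of fired timers for this operator.
        numFiredTimers = operatorMetricGroup.counter("numFiredTimers");
        // Per-second rate derived from the same counter.
        numFiredTimersPerSecond =
                operatorMetricGroup.meter("numFiredTimersPerSecond", new MeterView(numFiredTimers));
    }

    /** To be called every time a timer fires. */
    public void onTimerFired() {
        numFiredTimers.inc();
    }
}
{code}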



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35051) Weird priorities when processing unaligned checkpoints

2024-04-08 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-35051:
--

 Summary: Weird priorities when processing unaligned checkpoints
 Key: FLINK-35051
 URL: https://issues.apache.org/jira/browse/FLINK-35051
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Checkpointing, Runtime / Network, Runtime / Task
Affects Versions: 1.18.1, 1.19.0, 1.17.2
Reporter: Piotr Nowojski


While looking through the code I noticed that `StreamTask` processes 
unaligned checkpoints in a strange order/priority. The end result is that the 
unaligned checkpoint `Start Delay` time can increase, and triggering 
checkpoints in `StreamTask` can be unnecessarily delayed by other mailbox actions 
in the system, for example:
* processing time timers
* `AsyncWaitOperator` results
* ...

An incoming UC barrier is treated as a priority event by the network stack (it 
will be polled from the input before anything else). This is what we want, but 
polling elements from the network stack has a lower priority than processing 
enqueued mailbox actions.

Secondly, if an AC barrier times out to UC, that's done via a mailbox action, but 
this mailbox action is also not prioritised in any way, so other mailbox 
actions could be unnecessarily executed first.

On top of that, there is a clash of three separate concepts here:
# Mailbox priority. yieldToDownstream is, in a sense, the reverse of what we would 
like to have for triggering a checkpoint, but it only kicks in for #yield() calls, 
where it's actually correct that an operator in the middle of execution can not 
yield to a checkpoint - it should only yield to downstream.
# Control mails in the mailbox executor - cancellation is done via this mechanism, 
which bypasses the whole mailbox queue.
# Priority events in the network stack.

It's unfortunate that 1. and 3. have a naming clash, as the word "priority" is used 
for both, and the highest-priority network event containing a UC barrier, when 
executed via the mailbox, actually has the lowest mailbox priority.

The control mails mechanism is a kind of priority mail executed out of order, but 
it doesn't generalise well for use in checkpointing.

This whole thing should be re-worked at some point. Ideally, we would like the 
following:
* the mail converting AC barriers to UC
* polling the UC barrier from the network input
* the checkpoint trigger via RPC for source tasks
to be processed first, with the exception of yieldToDownstream, where the 
current mailbox priorities should be adhered to.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34913) ConcurrentModificationException SubTaskInitializationMetricsBuilder.addDurationMetric

2024-03-22 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-34913:
--

 Summary: ConcurrentModificationException 
SubTaskInitializationMetricsBuilder.addDurationMetric
 Key: FLINK-34913
 URL: https://issues.apache.org/jira/browse/FLINK-34913
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Metrics, Runtime / State Backends
Affects Versions: 1.19.0
Reporter: Piotr Nowojski
 Fix For: 1.19.1


The following failures can occur during a job's recovery when using clip & ingest:

{noformat}
java.util.ConcurrentModificationException
at java.base/java.util.HashMap.compute(HashMap.java:1230)
at 
org.apache.flink.runtime.checkpoint.SubTaskInitializationMetricsBuilder.addDurationMetric(SubTaskInitializationMetricsBuilder.java:49)
at 
org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.runAndReportDuration(RocksDBIncrementalRestoreOperation.java:988)
at 
org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.lambda$createAsyncCompactionTask$2(RocksDBIncrementalRestoreOperation.java:291)
at 
java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
{noformat}

{noformat}
java.util.ConcurrentModificationException
at java.base/java.util.HashMap.compute(HashMap.java:1230)
at 
org.apache.flink.runtime.checkpoint.SubTaskInitializationMetricsBuilder.addDurationMetric(SubTaskInitializationMetricsBuilder.java:49)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restoreStateAndGates(StreamTask.java:794)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$restoreInternal$3(StreamTask.java:744)
at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:744)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:704)
at 
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:953)
at 
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:922)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:746)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
at java.base/java.lang.Thread.run(Thread.java:829)
{noformat}
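
For context, {{HashMap#compute}} is not safe to call concurrently, and the two traces above suggest that is exactly what happens here (the async restore thread pool and the task's restore path both add duration metrics). A minimal, self-contained sketch of the failure mode and one obvious guard (not the actual Flink fix):

{code:java}
import java.util.HashMap;
import java.util.Map;

/** Demonstrates why concurrent HashMap.compute calls need synchronization. */
public class DurationMetricsSketch {

    private final Map<String, Long> durationMetrics = new HashMap<>();
    private final Object lock = new Object();

    /**
     * Unsafe: two threads calling this concurrently may corrupt the map or throw
     * ConcurrentModificationException, as in the stack traces above.
     */
    public void addDurationMetricUnsafe(String name, long value) {
        durationMetrics.compute(name, (k, v) -> v == null ? value : v + value);
    }

    /** One possible guard: serialize access to the shared map. */
    public void addDurationMetricSafe(String name, long value) {
        synchronized (lock) {
            durationMetrics.compute(name, (k, v) -> v == null ? value : v + value);
        }
    }
}
{code}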




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34430) Akka frame size exceeded with many ByteStreamStateHandle being used

2024-02-12 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-34430:
--

 Summary: Akka frame size exceeded with many ByteStreamStateHandle 
being used
 Key: FLINK-34430
 URL: https://issues.apache.org/jira/browse/FLINK-34430
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.18.1, 1.17.2, 1.16.3, 1.19.0
Reporter: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33775) Report JobInitialization traces

2023-12-07 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-33775:
--

 Summary: Report JobInitialization traces
 Key: FLINK-33775
 URL: https://issues.apache.org/jira/browse/FLINK-33775
 Project: Flink
  Issue Type: Sub-task
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33709) Report CheckpointStats as Spans

2023-11-30 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-33709:
--

 Summary: Report CheckpointStats as Spans
 Key: FLINK-33709
 URL: https://issues.apache.org/jira/browse/FLINK-33709
 Project: Flink
  Issue Type: Sub-task
Reporter: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33708) Add Span and TraceReporter concepts

2023-11-30 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-33708:
--

 Summary: Add Span and TraceReporter concepts
 Key: FLINK-33708
 URL: https://issues.apache.org/jira/browse/FLINK-33708
 Project: Flink
  Issue Type: Sub-task
Reporter: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33696) FLIP-385: Add OpenTelemetryTraceReporter and OpenTelemetryMetricReporter

2023-11-29 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-33696:
--

 Summary: FLIP-385: Add OpenTelemetryTraceReporter and 
OpenTelemetryMetricReporter
 Key: FLINK-33696
 URL: https://issues.apache.org/jira/browse/FLINK-33696
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Metrics
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.19.0


h1. Motivation

[FLIP-384|https://cwiki.apache.org/confluence/display/FLINK/FLIP-384%3A+Introduce+TraceReporter+and+use+it+to+create+checkpointing+and+recovery+traces]
 adds the TraceReporter interface. However, with 
[FLIP-384|https://cwiki.apache.org/confluence/display/FLINK/FLIP-384%3A+Introduce+TraceReporter+and+use+it+to+create+checkpointing+and+recovery+traces]
 alone, Log4jTraceReporter would be the only available implementation of the 
TraceReporter interface, which is not very helpful.

In this FLIP I'm proposing to contribute both a MetricExporter and a TraceReporter 
implementation using OpenTelemetry.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33697) FLIP-386: Support adding custom metrics in Recovery Spans

2023-11-29 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-33697:
--

 Summary: FLIP-386: Support adding custom metrics in Recovery Spans
 Key: FLINK-33697
 URL: https://issues.apache.org/jira/browse/FLINK-33697
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Metrics, Runtime / State Backends
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.19.0


h1. Motivation

FLIP-386 builds on top of 
[FLIP-384|https://cwiki.apache.org/confluence/display/FLINK/FLIP-384%3A+Introduce+TraceReporter+and+use+it+to+create+checkpointing+and+recovery+traces].
 The intention is to add a capability for state backends to attach custom 
attributes to recovery spans during recovery. For example, 
RocksDBIncrementalRestoreOperation could report both the remote download time and 
the time to actually clip/ingest the RocksDB instances after rescaling.
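
To make the idea concrete, one could imagine a collector along the following lines; the names ({{RecoverySpanAttributes}}, the attribute keys) are hypothetical illustrations and not the API proposed in the FLIP:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical collector of custom attributes for a recovery span. */
public class RecoverySpanAttributes {

    private final Map<String, Long> attributes = new ConcurrentHashMap<>();

    /** Called by a state backend while it restores, e.g. from an incremental restore operation. */
    public void reportDuration(String name, long durationMillis) {
        attributes.merge(name, durationMillis, Long::sum);
    }

    /** Snapshot to be attached to the recovery span when it is reported. */
    public Map<String, Long> snapshot() {
        return Map.copyOf(attributes);
    }

    public static void main(String[] args) {
        RecoverySpanAttributes attrs = new RecoverySpanAttributes();
        // A restore operation could report the phases it spent time in (attribute names are made up):
        attrs.reportDuration("DownloadStateDurationMs", 1_234L);
        attrs.reportDuration("RestoreStateDurationMs", 567L);
        System.out.println(attrs.snapshot());
    }
}
{code}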



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33695) FLIP-384: Introduce TraceReporter and use it to create checkpointing and recovery traces

2023-11-29 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-33695:
--

 Summary: FLIP-384: Introduce TraceReporter and use it to create 
checkpointing and recovery traces
 Key: FLINK-33695
 URL: https://issues.apache.org/jira/browse/FLINK-33695
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Checkpointing, Runtime / Metrics
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.19.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33338) Bump up RocksDB version to 7.x

2023-10-23 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-33338:
--

 Summary: Bump up RocksDB version to 7.x
 Key: FLINK-33338
 URL: https://issues.apache.org/jira/browse/FLINK-33338
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / State Backends
Reporter: Piotr Nowojski


We need to bump RocksDB in order to be able to use the new IngestDB and ClipDB 
commands.

If some of the required changes haven't been merged into Facebook/RocksDB, we 
should cherry-pick and include them in our FRocksDB fork.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33337) Expose IngestDB and ClipDB in the official RocksDB API

2023-10-23 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-33337:
--

 Summary: Expose IngestDB and ClipDB in the official RocksDB API
 Key: FLINK-33337
 URL: https://issues.apache.org/jira/browse/FLINK-33337
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / State Backends
Reporter: Piotr Nowojski
Assignee: Yue Ma


Remaining open PR: https://github.com/facebook/rocksdb/pull/11646



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33071) Log checkpoint statistics

2023-09-11 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-33071:
--

 Summary: Log checkpoint statistics 
 Key: FLINK-33071
 URL: https://issues.apache.org/jira/browse/FLINK-33071
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Checkpointing, Runtime / Metrics
Affects Versions: 1.18.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski


This is a stop-gap solution until we have a proper way of solving FLINK-23411.

The plan is to dump JSON-serialised checkpoint statistics into the Flink JM's log 
at {{DEBUG}} level. This could be used to analyse what happened with a 
certain checkpoint in the past.
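
A minimal sketch of the intended logging pattern, assuming Jackson for the JSON serialisation and SLF4J for logging (the statistics object passed in is just a stand-in for whatever actually gets dumped):

{code:java}
import com.fasterxml.jackson.databind.ObjectMapper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CheckpointStatsLoggerSketch {

    private static final Logger LOG = LoggerFactory.getLogger(CheckpointStatsLoggerSketch.class);
    private static final ObjectMapper MAPPER = new ObjectMapper();

    /** Dumps the given checkpoint statistics object as JSON at DEBUG level. */
    public static void logCheckpointStats(Object checkpointStatsSnapshot) {
        if (!LOG.isDebugEnabled()) {
            return; // avoid the serialisation cost unless DEBUG is enabled
        }
        try {
            LOG.debug("Checkpoint statistics: {}", MAPPER.writeValueAsString(checkpointStatsSnapshot));
        } catch (Exception e) {
            LOG.debug("Failed to serialise checkpoint statistics", e);
        }
    }
}
{code}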



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32303) Incorrect error message in KafkaSource

2023-06-09 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-32303:
--

 Summary: Incorrect error message in KafkaSource 
 Key: FLINK-32303
 URL: https://issues.apache.org/jira/browse/FLINK-32303
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka
Affects Versions: 1.17.1, 1.18.0
Reporter: Piotr Nowojski


When an exception is thrown from an operator chained with a KafkaSource, the 
KafkaSource returns a misleading error, as shown below:

{noformat}
java.io.IOException: Failed to deserialize consumer record due to
at 
org.apache.flink.connector.kafka.source.reader.KafkaRecordEmitter.emitRecord(KafkaRecordEmitter.java:56)
 ~[classes/:?]
at 
org.apache.flink.connector.kafka.source.reader.KafkaRecordEmitter.emitRecord(KafkaRecordEmitter.java:33)
 ~[classes/:?]
at 
org.apache.flink.connector.base.source.reader.SourceReaderBase.pollNext(SourceReaderBase.java:160)
 ~[classes/:?]
at 
org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:417)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.io.StreamTaskSourceInput.emitNext(StreamTaskSourceInput.java:68)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:562)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:852)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:801) 
~[classes/:?]
at 
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
 ~[classes/:?]
at 
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:931) 
[classes/:?]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745) 
[classes/:?]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562) 
[classes/:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: 
org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: 
Could not forward element to next operator
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:92)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.SourceOperatorStreamTask$AsyncDataOutputToOutput.emitRecord(SourceOperatorStreamTask.java:309)
 ~[classes/:?]
at 
org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:110)
 ~[classes/:?]
at 
org.apache.flink.connector.kafka.source.reader.KafkaRecordEmitter$SourceOutputWrapper.collect(KafkaRecordEmitter.java:67)
 ~[classes/:?]
at 
org.apache.flink.api.common.serialization.DeserializationSchema.deserialize(DeserializationSchema.java:84)
 ~[classes/:?]
at 
org.apache.flink.connector.kafka.source.reader.deserializer.KafkaValueOnlyDeserializationSchemaWrapper.deserialize(KafkaValueOnlyDeserializationSchemaWrapper.java:51)
 ~[classes/:?]
at 
org.apache.flink.connector.kafka.source.reader.KafkaRecordEmitter.emitRecord(KafkaRecordEmitter.java:53)
 ~[classes/:?]
... 14 more
Caused by: org.apache.flink.runtime.operators.testutils.ExpectedTestException: 
Failover!
at 
org.apache.flink.test.ManySmallJobsBenchmarkITCase$ThrottlingAndFailingIdentityMap.map(ManySmallJobsBenchmarkITCase.java:263)
 ~[test-classes/:?]
at 
org.apache.flink.test.ManySmallJobsBenchmarkITCase$ThrottlingAndFailingIdentityMap.map(ManySmallJobsBenchmarkITCase.java:243)
 ~[test-classes/:?]
at 
org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:38)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.SourceOperatorStreamTask$AsyncDataOutputToOutput.emitRecord(SourceOperatorStreamTask.java:309)
 ~[classes/:?]
at 
org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:110)
 ~[classes/:?]
   

[jira] [Created] (FLINK-31045) In IntelliJ: flink-clients cannot find symbol symbol: class TestUserClassLoaderJobLib

2023-02-13 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-31045:
--

 Summary: In IntelliJ: flink-clients cannot find symbol symbol: 
class TestUserClassLoaderJobLib
 Key: FLINK-31045
 URL: https://issues.apache.org/jira/browse/FLINK-31045
 Project: Flink
  Issue Type: Bug
  Components: Build System, Client / Job Submission
Affects Versions: 1.18.0
Reporter: Piotr Nowojski


When trying to build/run some tests in the IDE, IntelliJ is reporting the 
following compilation failure:

{noformat}
/XXX/flink/flink-clients/src/test/java/org/apache/flink/client/testjar/TestUserClassLoaderJob.java:33:38
java: cannot find symbol
  symbol:   class TestUserClassLoaderJobLib
  location: class org.apache.flink.client.testjar.TestUserClassLoaderJob
{noformat}

A workaround seems to be to:
# right click on {{flink-clients}}
# Rebuild module (flink-clients)

The issue is probably related to the comment from the flink-clients/pom.xml 
file:
{noformat}

{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-30791) Codespeed machine is not responding

2023-01-25 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-30791:
--

 Summary: Codespeed machine is not responding
 Key: FLINK-30791
 URL: https://issues.apache.org/jira/browse/FLINK-30791
 Project: Flink
  Issue Type: Bug
  Components: Benchmarks
Affects Versions: 1.16.0, 1.17.0
Reporter: Piotr Nowojski


Neither the speedcenter ([http://codespeed.dak8s.net:8000/]) nor Jenkins 
([http://codespeed.dak8s.net:8080|http://codespeed.dak8s.net:8080/]) is responding.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-30155) Pretty print MutatedConfigurationException

2022-11-22 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-30155:
--

 Summary: Pretty print MutatedConfigurationException
 Key: FLINK-30155
 URL: https://issues.apache.org/jira/browse/FLINK-30155
 Project: Flink
  Issue Type: New Feature
  Components: API / DataStream
Affects Versions: 1.17.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.17.0


Currently MutatedConfigurationException is printed as:

{noformat}
org.apache.flink.client.program.MutatedConfigurationException: 
Configuration execution.sorted-inputs.enabled:true not allowed.
Configuration execution.runtime-mode was changed from STREAMING to BATCH.
Configuration execution.checkpointing.interval:500 ms not allowed in the 
configuration object CheckpointConfig.
Configuration execution.checkpointing.mode:EXACTLY_ONCE not allowed in the 
configuration object CheckpointConfig.
Configuration pipeline.max-parallelism:1024 not allowed in the 
configuration object ExecutionConfig.
Configuration parallelism.default:25 not allowed in the configuration 
object ExecutionConfig.
at 
org.apache.flink.client.program.StreamContextEnvironment.checkNotAllowedConfigurations(StreamContextEnvironment.java:235)
at 
org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:175)
at 
org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:115)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2049)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2027)
{noformat}
This is slightly confusing. The first disallowed configuration is listed on the 
same line as the exception name, which (especially if wrapped) can make it harder 
than necessary for the user to understand that this is a list of violations. 
I'm proposing to change it to:
{noformat}
org.apache.flink.client.program.MutatedConfigurationException: Not allowed 
configuration change(s) were detected:
 - Configuration execution.sorted-inputs.enabled:true not allowed.
 - Configuration execution.runtime-mode was changed from STREAMING to BATCH.
 - Configuration execution.checkpointing.interval:500 ms not allowed in the 
configuration object CheckpointConfig.
 - Configuration execution.checkpointing.mode:EXACTLY_ONCE not allowed in 
the configuration object CheckpointConfig.
 - Configuration pipeline.max-parallelism:1024 not allowed in the 
configuration object ExecutionConfig.
 - Configuration parallelism.default:25 not allowed in the configuration 
object ExecutionConfig.
at 
org.apache.flink.client.program.StreamContextEnvironment.checkNotAllowedConfigurations(StreamContextEnvironment.java:235)
at 
org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:175)
at 
org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:115)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2049)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2027)
{noformat} 
This should make it clearer.
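
A minimal sketch of how such a message could be assembled from a list of violations (illustrative only; the actual change would live in the existing exception/environment classes):

{code:java}
import java.util.List;
import java.util.stream.Collectors;

public class MutatedConfigurationMessageSketch {

    /** Formats each violation on its own " - " prefixed line under a common header. */
    public static String buildMessage(List<String> violations) {
        return violations.stream()
                .map(violation -> " - " + violation)
                .collect(Collectors.joining(
                        System.lineSeparator(),
                        "Not allowed configuration change(s) were detected:" + System.lineSeparator(),
                        ""));
    }

    public static void main(String[] args) {
        System.out.println(
                buildMessage(
                        List.of(
                                "Configuration execution.sorted-inputs.enabled:true not allowed.",
                                "Configuration execution.runtime-mode was changed from STREAMING to BATCH.")));
    }
}
{code}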



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-30070) Create savepoints without side effects

2022-11-17 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-30070:
--

 Summary: Create savepoints without side effects
 Key: FLINK-30070
 URL: https://issues.apache.org/jira/browse/FLINK-30070
 Project: Flink
  Issue Type: New Feature
  Components: API / DataStream, Runtime / Checkpointing
Affects Versions: 1.14.6, 1.15.2, 1.16.0
Reporter: Piotr Nowojski


Side effects are any external state - state that is stored not in Flink, but 
in an external system, for example connector transactions (KafkaSink, ...).

We shouldn't be relying on external systems for storing part of the job's 
state, especially for any longer period of time. The most prominent issue is that 
Kafka transactions can time out, leading to data loss if a transaction hasn't 
been committed.

Stop-with-savepoint currently guarantees that the {{notifyCheckpointCompleted}} 
call will be issued, so properly implemented operators are guaranteed to 
commit their state. However, this information is currently not stored in the 
checkpoint in any way (FLINK-16419). The larger issue is how to deal with 
savepoints, since there we currently do not have any guarantee that 
transactions have been committed.

A potential solution might be to expand the API (like {{CheckpointedFunction}}) 
to let operators/functions know that they should close/commit/clear/deal with 
external state differently, use that API during stop-with-savepoint, and rework 
how regular savepoints are handled. A hypothetical sketch of such an API 
extension follows below.
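
Purely to illustrate the direction (this is a hypothetical interface, not an agreed-upon or existing Flink API), such an extension could look roughly like this:

{code:java}
/**
 * Hypothetical extension of a checkpointed function/operator contract: before a
 * savepoint that should have no side effects, the runtime asks the function to
 * finalize any external state (e.g. commit or abort pending transactions) so
 * that nothing is left pending in external systems.
 */
public interface SideEffectFreeSnapshotListener {

    /**
     * Invoked before taking a savepoint that must not leave side effects behind,
     * e.g. during stop-with-savepoint. Implementations should commit or otherwise
     * resolve any pending external transactions.
     */
    void finalizeExternalState() throws Exception;
}
{code}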



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-30068) Allow users to configure what to do with errors while committing transactions during recovery in KafkaSink

2022-11-17 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-30068:
--

 Summary: Allow users to configure what to do with errors while 
committing transactions during recovery in KafkaSink
 Key: FLINK-30068
 URL: https://issues.apache.org/jira/browse/FLINK-30068
 Project: Flink
  Issue Type: Sub-task
  Components: Connectors / Kafka
Affects Versions: 1.15.2, 1.16.0, 1.17.0
Reporter: Piotr Nowojski
 Fix For: 1.17.0


Currently it looks like {{KafkaSink}} fails the job on any failure to commit 
transactions. As [reported by the 
user|https://lists.apache.org/thread/4f6bb8j6qtvgp888y4dxgj86x3kw2b11], this 
makes it impossible for jobs to recover from older savepoints.

{noformat}
2022-11-16 10:01:07.168 [flink-akka.actor.default-dispatcher-13] INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph  - Balances aggreagation 
ETH -> (Filter -> Map -> Save to Kafka realtime ETH: Writer -> Save to Kafka 
realtime ETH: Committer, Filter -> Map -> Save to Kafka daily ETH: Writer -> 
Save to Kafka daily ETH: Committer) (4/5) (6d4d91ab8657bba830695b9a011f7db6) 
switched from INITIALIZING to RUNNING.
2022-11-16 10:01:37.222 [Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Triggering 
checkpoint 65436 (type=CheckpointType{name='Checkpoint', 
sharingFilesStrategy=FORWARD_BACKWARD}) @ 1668592897201 for job 
.
2022-11-16 10:01:39.082 [flink-akka.actor.default-dispatcher-13] INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph  - Balances aggreagation 
ETH -> (Filter -> Map -> Save to Kafka realtime ETH: Writer -> Save to Kafka 
realtime ETH: Committer, Filter -> Map -> Save to Kafka daily ETH: Writer -> 
Save to Kafka daily ETH: Committer) (1/5) (cfaca46e7f4dc89629cdcaed5b48c059) 
switched from RUNNING to FAILED on 10.42.145.181:33297-efc328 @ 
eth-top-holders-v2-flink-taskmanager-0.eth-top-holders-v2-flink-taskmanager.flink.svc.cluster.local
 (dataPort=43125).
java.io.IOException: Could not perform checkpoint 65436 for operator Balances 
aggreagation ETH -> (Filter -> Map -> Save to Kafka realtime ETH: Writer -> 
Save to Kafka realtime ETH: Committer, Filter -> Map -> Save to Kafka daily 
ETH: Writer -> Save to Kafka daily ETH: Committer) (1/5)#0.
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:1210)
at 
org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierHandler.notifyCheckpoint(CheckpointBarrierHandler.java:147)
at 
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.triggerCheckpoint(SingleCheckpointBarrierHandler.java:287)
at 
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.access$100(SingleCheckpointBarrierHandler.java:64)
at 
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$ControllerImpl.triggerGlobalCheckpoint(SingleCheckpointBarrierHandler.java:493)
at 
org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.triggerGlobalCheckpoint(AbstractAlignedBarrierHandlerState.java:74)
at 
org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.barrierReceived(AbstractAlignedBarrierHandlerState.java:66)
at 
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.lambda$processBarrier$2(SingleCheckpointBarrierHandler.java:234)
at 
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.markCheckpointAlignedAndTransformState(SingleCheckpointBarrierHandler.java:262)
at 
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.processBarrier(SingleCheckpointBarrierHandler.java:231)
at 
org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.handleEvent(CheckpointedInputGate.java:181)
at 
org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:159)
at 
org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:110)
at 
org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:519)
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:203)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:804)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:753)
at 
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:948)
at 

[jira] [Created] (FLINK-29888) Improve MutatedConfigurationException for disallowed changes in CheckpointConfig and ExecutionConfig

2022-11-04 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-29888:
--

 Summary: Improve MutatedConfigurationException for disallowed 
changes in CheckpointConfig and ExecutionConfig
 Key: FLINK-29888
 URL: https://issues.apache.org/jira/browse/FLINK-29888
 Project: Flink
  Issue Type: Improvement
  Components: API / DataStream
Affects Versions: 1.17.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.17.0


Currently, if {{CheckpointConfig}} or {{ExecutionConfig}} is modified in a 
disallowed way, the user gets a generic error "Configuration object 
ExecutionConfig changed", without a hint of what has been modified. With 
FLINK-29379 we can improve this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29886) networkThroughput 1000,100ms,OpenSSL Benchmark is failing

2022-11-04 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-29886:
--

 Summary: networkThroughput 1000,100ms,OpenSSL Benchmark is failing
 Key: FLINK-29886
 URL: https://issues.apache.org/jira/browse/FLINK-29886
 Project: Flink
  Issue Type: Bug
  Components: Benchmarks, Runtime / Network
Affects Versions: 1.17.0
Reporter: Piotr Nowojski


http://codespeed.dak8s.net:8080/job/flink-master-benchmarks-java8/837/console

{noformat}
04:09:40  java.lang.NoClassDefFoundError: 
org/apache/flink/shaded/netty4/io/netty/internal/tcnative/CertificateCompressionAlgo
04:09:40at 
org.apache.flink.shaded.netty4.io.netty.handler.ssl.OpenSslX509KeyManagerFactory$OpenSslKeyManagerFactorySpi.engineInit(OpenSslX509KeyManagerFactory.java:129)
04:09:40at 
javax.net.ssl.KeyManagerFactory.init(KeyManagerFactory.java:256)
04:09:40at 
org.apache.flink.runtime.net.SSLUtils.getKeyManagerFactory(SSLUtils.java:279)
04:09:40at 
org.apache.flink.runtime.net.SSLUtils.createInternalNettySSLContext(SSLUtils.java:324)
04:09:40at 
org.apache.flink.runtime.net.SSLUtils.createInternalNettySSLContext(SSLUtils.java:303)
04:09:40at 
org.apache.flink.runtime.net.SSLUtils.createInternalClientSSLEngineFactory(SSLUtils.java:119)
04:09:40at 
org.apache.flink.runtime.io.network.netty.NettyConfig.createClientSSLEngineFactory(NettyConfig.java:147)
04:09:40at 
org.apache.flink.runtime.io.network.netty.NettyClient.init(NettyClient.java:115)
04:09:40at 
org.apache.flink.runtime.io.network.netty.NettyConnectionManager.start(NettyConnectionManager.java:87)
04:09:40at 
org.apache.flink.runtime.io.network.NettyShuffleEnvironment.start(NettyShuffleEnvironment.java:349)
04:09:40at 
org.apache.flink.streaming.runtime.io.benchmark.StreamNetworkBenchmarkEnvironment.setUp(StreamNetworkBenchmarkEnvironment.java:133)
04:09:40at 
org.apache.flink.streaming.runtime.io.benchmark.StreamNetworkThroughputBenchmark.setUp(StreamNetworkThroughputBenchmark.java:108)
04:09:40at 
org.apache.flink.benchmark.StreamNetworkThroughputBenchmarkExecutor$MultiEnvironment.setUp(StreamNetworkThroughputBenchmarkExecutor.java:117)
{noformat}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29807) Drop TypeSerializerConfigSnapshot and savepoint support from Flink versions < 1.8.0

2022-10-31 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-29807:
--

 Summary: Drop TypeSerializerConfigSnapshot and savepoint support 
from Flink versions < 1.8.0
 Key: FLINK-29807
 URL: https://issues.apache.org/jira/browse/FLINK-29807
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Checkpointing
Affects Versions: 1.17
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.17


The motivation behind this move is twofold. One reason is that it complicates 
our code base unnecessarily and creates confusion about how to actually implement 
custom serializers. The immediate reason is that I wanted to clean up Flink's 
configuration stack a bit and refactor the ExecutionConfig class [2]. This 
refactor would keep the API compatibility of the ExecutionConfig, but it would 
break savepoint compatibility with snapshots written with some of the old 
serializers, which had ExecutionConfig as a field and were serialized in the 
snapshot. This issue has been resolved by the introduction of 
TypeSerializerSnapshot in Flink 1.7 [3], where serializers are no longer part 
of the snapshot.

TypeSerializerConfigSnapshot has been deprecated and no longer used by built-in 
serializers since Flink 1.8 [4] and [5]. Users have been encouraged to migrate 
their own custom serializers to TypeSerializerSnapshot since then. That has 
been plenty of time for the migration.

This proposal would have the following impact for users:
1. we would drop support for recovery from savepoints taken with Flink < 1.7.0 
for all built-in type serializers
2. we would drop support for recovery from savepoints taken with Flink < 1.8.0 
for built-in Kryo serializers
3. we would drop support for recovery from savepoints taken with Flink < 1.17 
for custom serializers using the deprecated TypeSerializerConfigSnapshot

1. and 2. would have a simple migration path. Users migrating from those old 
savepoints would have to first start their job using a Flink version from the 
[1.8, 1.16] range and take a new savepoint, which would be compatible with Flink 
1.17.
3. This is a bit more problematic, because users would have to first migrate 
their own custom serializers to TypeSerializerSnapshot (using a Flink 
version from the [1.8, 1.16] range), take a savepoint, and only then migrate to 
Flink 1.17. However, users have already had 4 years to migrate, which in my 
opinion has been plenty of time to do so.

*As discussed, and the vote is currently in progress*: 
https://lists.apache.org/thread/x5d0p08pf2wx47njogsgqct0k5rpfrl4



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29662) PojoSerializerSnapshot is using incorrect ExecutionConfig when restoring serializer

2022-10-17 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-29662:
--

 Summary: PojoSerializerSnapshot is using incorrect ExecutionConfig 
when restoring serializer
 Key: FLINK-29662
 URL: https://issues.apache.org/jira/browse/FLINK-29662
 Project: Flink
  Issue Type: Bug
  Components: API / Type Serialization System
Affects Versions: 1.17.0
Reporter: Piotr Nowojski


{{org.apache.flink.api.java.typeutils.runtime.PojoSerializerSnapshot#restoreSerializer}}
 is using a freshly created {{new ExecutionConfig()}} for the 
restored serializer, which overrides any configuration choices made by the 
user. Most likely this is a dormant bug, since a restored serializer shouldn't be 
used for serializing any new data, and the {{ExecutionConfig}} is only used for 
subclass serializations that haven't been cached. If this is indeed the case, 
I would propose to change its value to {{null}} and safeguard accesses to that 
field with an {{IllegalStateException}}.
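
A minimal sketch of the proposed guard pattern (illustrative class, field and method names, not the actual PojoSerializerSnapshot code):

{code:java}
import org.apache.flink.api.common.ExecutionConfig;

/** Illustrates replacing a dummy ExecutionConfig with null plus a guarded accessor. */
public class RestoredSerializerContextSketch {

    // null instead of "new ExecutionConfig()", so any accidental use fails loudly.
    private final ExecutionConfig executionConfig = null;

    ExecutionConfig getExecutionConfig() {
        if (executionConfig == null) {
            throw new IllegalStateException(
                    "The restored serializer must not access the ExecutionConfig; "
                            + "it should not be used to serialize new data.");
        }
        return executionConfig;
    }
}
{code}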



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29645) BatchExecutionKeyedStateBackend is using incorrect ExecutionConfig when creating serializer

2022-10-14 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-29645:
--

 Summary: BatchExecutionKeyedStateBackend is using incorrect 
ExecutionConfig when creating serializer
 Key: FLINK-29645
 URL: https://issues.apache.org/jira/browse/FLINK-29645
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Affects Versions: 1.14.6, 1.15.2, 1.13.6, 1.12.7, 1.16.0, 1.17.0
Reporter: Piotr Nowojski


{{org.apache.flink.streaming.api.operators.sorted.state.BatchExecutionKeyedStateBackend#getOrCreateKeyedState}}
 is using a freshly constructed {{ExecutionConfig}} instead of the one 
configured by the user in the environment.


{code:java}
public <N, S extends State, T> S getOrCreateKeyedState(
        TypeSerializer<N> namespaceSerializer, StateDescriptor<S, T> stateDescriptor)
        throws Exception {
    checkNotNull(namespaceSerializer, "Namespace serializer");
    checkNotNull(
            keySerializer,
            "State key serializer has not been configured in the config. "
                    + "This operation cannot use partitioned state.");

    if (!stateDescriptor.isSerializerInitialized()) {
        stateDescriptor.initializeSerializerUnlessSet(new ExecutionConfig());
    }
{code}

The correct one could be obtained from {{env.getExecutionConfig()}} in 
{{org.apache.flink.streaming.api.operators.sorted.state.BatchExecutionStateBackend#createKeyedStateBackend}}
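
A sketch of the suggested fix, assuming the properly configured {{ExecutionConfig}} is passed in when the backend is created (the class shape below is illustrative, not the actual backend's layout):

{code:java}
import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.state.State;
import org.apache.flink.api.common.state.StateDescriptor;
import org.apache.flink.api.common.typeutils.TypeSerializer;

import static org.apache.flink.util.Preconditions.checkNotNull;

/** Illustrative backend fragment that keeps the ExecutionConfig obtained from env.getExecutionConfig(). */
public class ConfiguredBatchBackendSketch {

    private final ExecutionConfig executionConfig;

    public ConfiguredBatchBackendSketch(ExecutionConfig executionConfig) {
        // Passed in from createKeyedStateBackend(...), i.e. env.getExecutionConfig().
        this.executionConfig = checkNotNull(executionConfig);
    }

    public <N, S extends State, T> void initializeStateDescriptor(
            TypeSerializer<N> namespaceSerializer, StateDescriptor<S, T> stateDescriptor) {
        checkNotNull(namespaceSerializer, "Namespace serializer");
        if (!stateDescriptor.isSerializerInitialized()) {
            // Use the user-configured ExecutionConfig rather than "new ExecutionConfig()".
            stateDescriptor.initializeSerializerUnlessSet(executionConfig);
        }
    }
}
{code}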
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29584) Upgrade java 11 version on the microbenchmark worker

2022-10-11 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-29584:
--

 Summary: Upgrade java 11 version on the microbenchmark worker
 Key: FLINK-29584
 URL: https://issues.apache.org/jira/browse/FLINK-29584
 Project: Flink
  Issue Type: Improvement
  Components: Benchmarks
Affects Versions: 1.17.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.17.0


It looks like older JDK 11 builds have problems with the JIT in the 
microbenchmarks, as is for example visible 
[here|http://codespeed.dak8s.net:8000/timeline/?ben=globalWindow=2]. 
Locally I was able to reproduce this problem, and the issue goes away after 
upgrading to a newer JDK 11 build.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28835) Savepoint and checkpoint capabilities and limitations table is incorrect

2022-08-05 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-28835:
--

 Summary: Savepoint and checkpoint capabilities and limitations 
table is incorrect
 Key: FLINK-28835
 URL: https://issues.apache.org/jira/browse/FLINK-28835
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.15.1, 1.16.0
Reporter: Piotr Nowojski
 Fix For: 1.16.0, 1.15.2


https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpoints_vs_savepoints/

is inconsistent with 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-203%3A+Incremental+savepoints#FLIP203:Incrementalsavepoints-Proposal.
 "Non-arbitrary job upgrade" for unaligned checkpoints should be officially 
supported. 

It looks like a typo in the original PR FLINK-26134



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-27640) Flink not compiling, flink-connector-hive_2.12 is missing pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde

2022-05-16 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-27640:
--

 Summary: Flink not compiling, flink-connector-hive_2.12 is missing 
pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde 
 Key: FLINK-27640
 URL: https://issues.apache.org/jira/browse/FLINK-27640
 Project: Flink
  Issue Type: Bug
  Components: Build System, Connectors / Hive
Affects Versions: 1.16.0
Reporter: Piotr Nowojski


When doing a clean install of the whole project after cleaning the local {{.m2}} 
directory, I encountered the following error when compiling flink-connector-hive_2.12:
{noformat}
[ERROR] Failed to execute goal on project flink-connector-hive_2.12: Could not 
resolve dependencies for project 
org.apache.flink:flink-connector-hive_2.12:jar:1.16-SNAPSHOT: Failed to collect 
dependencies at org.apache.hive:hive-exec:jar:2.3.9 -> 
org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde: Failed to read 
artifact descriptor for 
org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde: Could not transfer 
artifact org.pentaho:pentaho-aggdesigner-algorithm:pom:5.1.5-jhyde from/to 
maven-default-http-blocker (http://0.0.0.0/): Blocked mirror for repositories: 
[conjars (http://conjars.org/repo, default, releases+snapshots), 
apache.snapshots (http://repository.apache.org/snapshots, default, snapshots)] 
-> [Help 1]
{noformat}
I've solved this by adding
{noformat}
<repository>
    <id>spring-repo-plugins</id>
    <url>https://repo.spring.io/ui/native/plugins-release/</url>
</repository>
{noformat}
to the ~/.m2/settings.xml file.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-27209) Half the Xmx and double the forkCount for unit tests

2022-04-12 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-27209:
--

 Summary: Half the Xmx and double the forkCount for unit tests
 Key: FLINK-27209
 URL: https://issues.apache.org/jira/browse/FLINK-27209
 Project: Flink
  Issue Type: Improvement
  Components: Build System
Affects Versions: 1.16.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.16.0


As per the recent "Speeding up the test builds" discussion on the dev mailing list.

I'm proposing to halve the memory allocated for running unit tests (as defined 
by the {{**/*Test.*}} pattern) but at the same time double the forkCount for 
those tests. The premise is that they shouldn't need as much memory as ITCases 
to be stable, while increasing the number of forks should shave a couple of 
minutes off the build times.

CC [~chesnay]




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-27202) NullPointerException on stop-with-savepoint with AsyncWaitOperator followed by FlinkKafkaProducer

2022-04-12 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-27202:
--

 Summary: NullPointerException on stop-with-savepoint with 
AsyncWaitOperator followed by FlinkKafkaProducer 
 Key: FLINK-27202
 URL: https://issues.apache.org/jira/browse/FLINK-27202
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka, Runtime / Task
Affects Versions: 1.13.6, 1.12.7
Reporter: Piotr Nowojski


Some lingering mails from an {{AsyncWaitOperator}} (or other operators using 
the mailbox, or maybe even processing-time timers) that is chained with a 
{{FlinkKafkaProducer}} can cause the following exception when using 
stop-with-savepoint:

{noformat}
2022-04-11 15:46:19,781 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - static 
enrichment -> Map -> Sink: enriched events sink (179/256) 
(3fefa588ad05fa8d2a10a6ad4a740cc6) switched from RUNNING to FAILED on 
10.239.104.67:38149-12df6c @ 10.239.104.67 (dataPort=35745).
java.lang.NullPointerException: null
at 
org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction$TransactionHolder.access$000(TwoPhaseCommitSinkFunction.java:591)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.invoke(TwoPhaseCommitSinkFunction.java:223)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:54)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:71)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:46)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:26)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:50)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:28)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:38)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:71)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:46)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:26)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:50)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:28)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:50)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.async.queue.StreamRecordQueueEntry.emitResult(StreamRecordQueueEntry.java:64)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.async.queue.UnorderedStreamElementQueue$Segment.emitCompleted(UnorderedStreamElementQueue.java:272)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.async.queue.UnorderedStreamElementQueue.emitCompletedElement(UnorderedStreamElementQueue.java:159)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.outputCompletedElement(AsyncWaitOperator.java:287)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.access$100(AsyncWaitOperator.java:78)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.processResults(AsyncWaitOperator.java:356)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.lambda$processInMailbox$0(AsyncWaitOperator.java:337)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 

[jira] [Created] (FLINK-27133) Performance regression in serializerHeavyString

2022-04-08 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-27133:
--

 Summary: Performance regression in serializerHeavyString
 Key: FLINK-27133
 URL: https://issues.apache.org/jira/browse/FLINK-27133
 Project: Flink
  Issue Type: Bug
  Components: API / Type Serialization System, Benchmarks
Affects Versions: 1.15.0, 1.16.0
Reporter: Piotr Nowojski


http://codespeed.dak8s.net:8000/timeline/#/?exe=1=serializerHeavyString=on=on=off=2=200

Suspected range:
5f21d15a09..caa296b813

I've run a benchmark request for the commit before FLINK-26957, and it suggests 
that FLINK-26957 is indeed the cause of this regression:
http://codespeed.dak8s.net:8080/job/flink-benchmark-request/77/artifact/jmh-result.csv/*view*/

CC [~mapohl]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-27117) FLIP-217: Support watermark alignment of source splits

2022-04-07 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-27117:
--

 Summary: FLIP-217: Support watermark alignment of source splits
 Key: FLINK-27117
 URL: https://issues.apache.org/jira/browse/FLINK-27117
 Project: Flink
  Issue Type: New Feature
  Components: API / DataStream, Runtime / Task
Affects Versions: 1.14.4, 1.13.6, 1.15.0
Reporter: Piotr Nowojski


https://cwiki.apache.org/confluence/display/FLINK/FLIP-217+Support+watermark+alignment+of+source+splits



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26864) Performance regression on 25.03.2022

2022-03-25 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-26864:
--

 Summary: Performance regression on 25.03.2022
 Key: FLINK-26864
 URL: https://issues.apache.org/jira/browse/FLINK-26864
 Project: Flink
  Issue Type: Bug
  Components: Benchmarks
Affects Versions: 1.16.0
Reporter: Piotr Nowojski


http://codespeed.dak8s.net:8000/timeline/#/?exe=1=arrayKeyBy=on=on=off=2=200
http://codespeed.dak8s.net:8000/timeline/#/?exe=1=remoteFilePartition=on=on=off=2=200
http://codespeed.dak8s.net:8000/timeline/#/?exe=1=remoteSortPartition=on=on=off=2=200
http://codespeed.dak8s.net:8000/timeline/#/?exe=1=tupleKeyBy=on=on=off=2=200



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26682) Migrate regression check script to python and to main repository

2022-03-16 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-26682:
--

 Summary: Migrate regression check script to python and to main 
repository
 Key: FLINK-26682
 URL: https://issues.apache.org/jira/browse/FLINK-26682
 Project: Flink
  Issue Type: New Feature
  Components: Benchmarks
Affects Versions: 1.15.0
Reporter: Piotr Nowojski


The script provided by Roman 
(https://github.com/rkhachatryan/flink-benchmarks/blob/regression-alerts/regression-alert.sh)
 should be merged into the main repo and migrated to Python.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26147) Deleting incremental savepoints directory in CLAIM mode

2022-02-15 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-26147:
--

 Summary: Deleting incremental savepoints directory in CLAIM mode
 Key: FLINK-26147
 URL: https://issues.apache.org/jira/browse/FLINK-26147
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Checkpointing
Affects Versions: 1.15.0
Reporter: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26146) Add test coverage for native format flink upgrades (minor versions)

2022-02-15 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-26146:
--

 Summary: Add test coverage for native format flink upgrades (minor 
versions)
 Key: FLINK-26146
 URL: https://issues.apache.org/jira/browse/FLINK-26146
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Checkpointing, Tests
Reporter: Piotr Nowojski


Check test coverage for:
* Flink minor (1.x → 1.y) version upgrade
* Parametrised SavepointMigrationTestBase to work on:
** canonical savepoints
** native savepoints
** aligned (retained) checkpoints

The assignee should contact the release managers and decide whether to already 
create 1.15 (snapshot) artefacts for native savepoints and aligned checkpoints. 
If we do so, we might make the release managers' work more complicated. However, 
if we don't, the test will be dead code until the release managers create those 
artefacts, which can also make their work more difficult if the test explodes 
during RC creation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26064) KinesisFirehoseSinkITCase IllegalStateException: Trying to access closed classloader

2022-02-09 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-26064:
--

 Summary: KinesisFirehoseSinkITCase IllegalStateException: Trying 
to access closed classloader
 Key: FLINK-26064
 URL: https://issues.apache.org/jira/browse/FLINK-26064
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kinesis
Affects Versions: 1.15.0
Reporter: Piotr Nowojski


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=31044=logs=d44f43ce-542c-597d-bf94-b0718c71e5e8=ed165f3f-d0f6-524b-5279-86f8ee7d0e2d

(shortened stack trace, as the full one is too large)
{noformat}
Feb 09 20:05:04 java.util.concurrent.ExecutionException: 
software.amazon.awssdk.core.exception.SdkClientException: Unable to execute 
HTTP request: Trying to access closed classloader. Please check if you store 
classloaders directly or indirectly in static fields. If the stacktrace 
suggests that the leak occurs in a third party library and cannot be fixed 
immediately, you can disable this check with the configuration 
'classloader.check-leaked-classloader'.
Feb 09 20:05:04 at 
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
Feb 09 20:05:04 at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
(...)
Feb 09 20:05:04 Caused by: 
software.amazon.awssdk.core.exception.SdkClientException: Unable to execute 
HTTP request: Trying to access closed classloader. Please check if you store 
classloaders directly or indirectly in static fields. If the stacktrace 
suggests that the leak occurs in a third party library and cannot be fixed 
immediately, you can disable this check with the configuration 
'classloader.check-leaked-classloader'.
Feb 09 20:05:04 at 
software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:98)
Feb 09 20:05:04 at 
software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:43)
Feb 09 20:05:04 at 
software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:204)
Feb 09 20:05:04 at 
software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:200)
Feb 09 20:05:04 at 
software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.maybeRetryExecute(AsyncRetryableStage.java:179)
Feb 09 20:05:04 at 
software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.lambda$attemptExecute$1(AsyncRetryableStage.java:159)
(...)
Feb 09 20:05:04 Caused by: java.lang.IllegalStateException: Trying to access 
closed classloader. Please check if you store classloaders directly or 
indirectly in static fields. If the stacktrace suggests that the leak occurs in 
a third party library and cannot be fixed immediately, you can disable this 
check with the configuration 'classloader.check-leaked-classloader'.
Feb 09 20:05:04 at 
org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$SafetyNetWrapperClassLoader.ensureInner(FlinkUserCodeClassLoaders.java:164)
Feb 09 20:05:04 at 
org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$SafetyNetWrapperClassLoader.getResources(FlinkUserCodeClassLoaders.java:188)
Feb 09 20:05:04 at 
java.util.ServiceLoader$LazyIterator.hasNextService(ServiceLoader.java:348)
Feb 09 20:05:04 at 
java.util.ServiceLoader$LazyIterator.hasNext(ServiceLoader.java:393)
Feb 09 20:05:04 at 
java.util.ServiceLoader$1.hasNext(ServiceLoader.java:474)
Feb 09 20:05:04 at 
javax.xml.stream.FactoryFinder$1.run(FactoryFinder.java:352)
Feb 09 20:05:04 at java.security.AccessController.doPrivileged(Native 
Method)
Feb 09 20:05:04 at 
javax.xml.stream.FactoryFinder.findServiceProvider(FactoryFinder.java:341)
Feb 09 20:05:04 at 
javax.xml.stream.FactoryFinder.find(FactoryFinder.java:313)
Feb 09 20:05:04 at 
javax.xml.stream.FactoryFinder.find(FactoryFinder.java:227)
Feb 09 20:05:04 at 
javax.xml.stream.XMLInputFactory.newInstance(XMLInputFactory.java:154)
Feb 09 20:05:04 at 
software.amazon.awssdk.protocols.query.unmarshall.XmlDomParser.createXmlInputFactory(XmlDomParser.java:124)
Feb 09 20:05:04 at 
java.lang.ThreadLocal$SuppliedThreadLocal.initialValue(ThreadLocal.java:284)
Feb 09 20:05:04 at 
java.lang.ThreadLocal.setInitialValue(ThreadLocal.java:180)
Feb 09 20:05:04 at java.lang.ThreadLocal.get(ThreadLocal.java:170)
(...)
{noformat}




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25982) Support idleness with watermark alignment

2022-02-07 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25982:
--

 Summary: Support idleness with watermark alignment
 Key: FLINK-25982
 URL: https://issues.apache.org/jira/browse/FLINK-25982
 Project: Flink
  Issue Type: Sub-task
Reporter: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25914) Performance regression in checkpointSingleInput.UNALIGNED on 31.01.2022

2022-02-01 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25914:
--

 Summary: Performance regression in checkpointSingleInput.UNALIGNED 
on 31.01.2022
 Key: FLINK-25914
 URL: https://issues.apache.org/jira/browse/FLINK-25914
 Project: Flink
  Issue Type: Bug
  Components: Benchmarks, Runtime / Checkpointing
Affects Versions: 1.15.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski


http://codespeed.dak8s.net:8000/timeline/?ben=checkpointSingleInput.UNALIGNED=2






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25912) HashMapStateBackend performance regression on 31.01.2022

2022-02-01 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25912:
--

 Summary: HashMapStateBackend performance regression on 31.01.2022
 Key: FLINK-25912
 URL: https://issues.apache.org/jira/browse/FLINK-25912
 Project: Flink
  Issue Type: Bug
  Components: Benchmarks, Runtime / State Backends
Reporter: Piotr Nowojski
 Fix For: 1.15.0


http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3=mapEntries.HEAP=2=200=off=on=on
http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3=mapIterator.HEAP=2=200=off=on=on
http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3=mapKeys.HEAP=2=200=off=on=on
http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3=mapValues.HEAP=2=200=off=on=on



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25896) Buffer debloating issues with high parallelism

2022-01-31 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25896:
--

 Summary: Buffer debloating issues with high parallelism
 Key: FLINK-25896
 URL: https://issues.apache.org/jira/browse/FLINK-25896
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Network
Affects Versions: 1.14.3, 1.15.0
Reporter: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25869) Create a stress test cron build

2022-01-28 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25869:
--

 Summary: Create a stress test cron build
 Key: FLINK-25869
 URL: https://issues.apache.org/jira/browse/FLINK-25869
 Project: Flink
  Issue Type: Improvement
  Components: Test Infrastructure, Tests
Affects Versions: 1.15.0
Reporter: Piotr Nowojski
 Fix For: 1.15.0


We propose creating some kind of stress test cron job dedicated to periodically 
running longer-lasting stress tests, for example ones trying to expose memory leaks.

A first idea could be to provide test coverage for FLINK-25728.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25745) Support RocksDB incremental native savepoints

2022-01-21 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25745:
--

 Summary: Support RocksDB incremental native savepoints
 Key: FLINK-25745
 URL: https://issues.apache.org/jira/browse/FLINK-25745
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / State Backends
Reporter: Piotr Nowojski
 Fix For: 1.15.0


Respect the CheckpointType.SharingFilesStrategy#NO_SHARING flag in 
RocksIncrementalSnapshotStrategy. We also need to make sure that 
RocksDBIncrementalSnapshotStrategy is creating self-contained/relocatable 
snapshots (using CheckpointedStateScope#EXCLUSIVE for native savepoints).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25744) Support native savepoints (w/o modifying the statebackend specific snapshot strategies)

2022-01-21 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25744:
--

 Summary: Support native savepoints (w/o modifying the statebackend 
specific snapshot strategies)
 Key: FLINK-25744
 URL: https://issues.apache.org/jira/browse/FLINK-25744
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Checkpointing
Reporter: Piotr Nowojski
Assignee: Dawid Wysakowicz


For example w/o incremental RocksDB support. But HashMap and Full RocksDB 
should be working out of the box w/o extra changes.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25704) Performance regression on 18.01.2022 in batch network benchmarks

2022-01-19 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25704:
--

 Summary: Performance regression on 18.01.2022 in batch network 
benchmarks
 Key: FLINK-25704
 URL: https://issues.apache.org/jira/browse/FLINK-25704
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Network
Affects Versions: 1.15.0
Reporter: Piotr Nowojski


http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3=compressedFilePartition=2=200=off=on=on
http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3=uncompressedFilePartition=2=200=off=on=on
http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3=uncompressedMmapPartition=2=200=off=on=on

Suspected range:
{code}
git ls eeec246677..f5c99c6f26
f5c99c6f26 [5 weeks ago] [FLINK-17321][table] Add support casting of map to map 
and multiset to multiset [Sergey Nuyanzin]
745cfec705 [24 hours ago] [hotfix][table-common] Fix InternalDataUtils for 
MapData tests [Timo Walther]
ed699b6ee6 [6 days ago] [FLINK-25637][network] Make sort-shuffle the default 
shuffle implementation for batch jobs [kevin.cyj]
4275525fed [6 days ago] [FLINK-25638][network] Increase the default write 
buffer size of sort-shuffle to 16M [kevin.cyj]
e1878fb899 [6 days ago] [FLINK-25639][network] Increase the default read buffer 
size of sort-shuffle to 64M [kevin.cyj]
{code}

[~kevin.cyj], it looks like your change has most likely caused this?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25688) Resolve performance degradation with high parallelism when using buffer debloating

2022-01-18 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25688:
--

 Summary: Resolve performance degradation with high parallelism 
when using buffer debloating
 Key: FLINK-25688
 URL: https://issues.apache.org/jira/browse/FLINK-25688
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Network
Affects Versions: 1.14.3, 1.15.0
Reporter: Piotr Nowojski


As documented in FLINK-25646, buffer debloating in Flink, at least in the default 
configuration, currently causes quite noticeable performance degradation at larger 
scale. For example, throughput can drop by a factor of 4, or checkpointing times can 
even increase. It's currently not clear why this is happening. Increasing the number 
of buffers per channel from the default ~2 to above 3 (for example by bumping the 
number of floating buffers to a value equal to or higher than the parallelism) seems 
to solve this problem, at least on one cluster where buffer debloating has been 
tested at large scale.
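
For reference, a minimal sketch of that workaround in a local setup (the configuration 
keys are written here from memory based on the 1.14 docs and should be double checked; 
in a real deployment they would rather go into flink-conf.yaml):

{code:java}
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DebloatingWorkaroundSketch {
    public static void main(String[] args) throws Exception {
        int parallelism = 128;

        Configuration conf = new Configuration();
        // Keep buffer debloating enabled...
        conf.setString("taskmanager.network.memory.buffer-debloat.enabled", "true");
        // ...but bump the floating buffers per gate to at least the parallelism, so
        // that the effective number of buffers per channel stays above ~3.
        conf.setString(
                "taskmanager.network.memory.floating-buffers-per-gate",
                String.valueOf(parallelism));

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(parallelism, conf);
        // ... build and execute the job on `env` ...
    }
}
{code}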

Further investigation is required.

CC [~akalashnikov]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25414) Provide metrics to measure how long task has been blocked

2021-12-22 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25414:
--

 Summary: Provide metrics to measure how long task has been blocked
 Key: FLINK-25414
 URL: https://issues.apache.org/jira/browse/FLINK-25414
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Metrics, Runtime / Task
Affects Versions: 1.14.2
Reporter: Piotr Nowojski


Currently the back pressured/busy metrics tell the user whether a task is 
blocked/busy and what percentage of the time it is blocked/busy. But they do not 
tell for how long a single blocking event lasts. It can be 1ms or 1h, and back 
pressure/busy would still be reporting 100%.

In order to improve this, we could provide two new metrics:
# maxSoftBackPressureDuration
# maxHardBackPressureDuration

The max would be reset to 0 periodically or on every access to the metric (via the 
metric reporter). Soft back pressure would be when the task is back pressured in a 
non-blocking fashion (the StreamTask detected unavailability of the output). Hard 
back pressure would measure the time the task is actually blocked.

Unfortunately I don't know how to efficiently provide a similar metric for busy 
time without impacting the max throughput.
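
A rough sketch of what such a reset-on-access max gauge could look like (names are made 
up; this is only an illustration, not a proposed implementation):

{code:java}
import java.util.concurrent.atomic.AtomicLong;

import org.apache.flink.metrics.Gauge;

/**
 * Illustrative sketch of a "max duration, reset on access" gauge, e.g. for
 * maxHardBackPressureDuration. Not the actual Flink implementation.
 */
public class MaxDurationGauge implements Gauge<Long> {

    private final AtomicLong maxDurationMs = new AtomicLong();

    /** Called by the task thread whenever a blocking period of durationMs ends. */
    public void reportBlockedDuration(long durationMs) {
        maxDurationMs.accumulateAndGet(durationMs, Math::max);
    }

    /** Called by the metric reporter: returns the max seen so far and resets it to 0. */
    @Override
    public Long getValue() {
        return maxDurationMs.getAndSet(0L);
    }
}
{code}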



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25407) Network stack deadlock when cancellation happens during initialisation

2021-12-21 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25407:
--

 Summary: Network stack deadlock when cancellation happens during 
initialisation
 Key: FLINK-25407
 URL: https://issues.apache.org/jira/browse/FLINK-25407
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network
Affects Versions: 1.14.0, 1.15.0
Reporter: Piotr Nowojski


{noformat}
Java stack information for the threads listed above:
===
"Canceler for Source: Custom Source -> Filter (7/12)#14176 
(0fbb8a89616ca7a40e473adad51f236f).":
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.destroyBufferPool(NetworkBufferPool.java:420)
   - waiting to lock <0x82937f28> (a java.lang.Object)
   at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.lazyDestroy(LocalBufferPool.java:567)
   at 
org.apache.flink.runtime.io.network.partition.ResultPartition.closeBufferPool(ResultPartition.java:264)
   at 
org.apache.flink.runtime.io.network.partition.ResultPartition.fail(ResultPartition.java:276)
   at 
org.apache.flink.runtime.taskmanager.Task.failAllResultPartitions(Task.java:999)
   at org.apache.flink.runtime.taskmanager.Task.access$100(Task.java:138)
   at org.apache.flink.runtime.taskmanager.Task$TaskCanceler.run(Task.java:1669)
   at java.lang.Thread.run(Thread.java:748)
"Canceler for Map -> Map (6/12)#14176 (6195862d199aa4d52c12f25b39904725).":
   at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.setNumBuffers(LocalBufferPool.java:585)
   - waiting to lock <0x97108898> (a java.util.ArrayDeque)
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.redistributeBuffers(NetworkBufferPool.java:544)
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.destroyBufferPool(NetworkBufferPool.java:424)
   - locked <0x82937f28> (a java.lang.Object)
   at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.lazyDestroy(LocalBufferPool.java:567)
   at 
org.apache.flink.runtime.io.network.partition.ResultPartition.closeBufferPool(ResultPartition.java:264)
   at 
org.apache.flink.runtime.io.network.partition.ResultPartition.fail(ResultPartition.java:276)
   at 
org.apache.flink.runtime.taskmanager.Task.failAllResultPartitions(Task.java:999)
   at org.apache.flink.runtime.taskmanager.Task.access$100(Task.java:138)
   at org.apache.flink.runtime.taskmanager.Task$TaskCanceler.run(Task.java:1669)
   at java.lang.Thread.run(Thread.java:748)
"Map -> Sink: Unnamed (7/12)#14176":
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.recycleMemorySegments(NetworkBufferPool.java:256)
   - waiting to lock <0x82937f28> (a java.lang.Object)
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.internalRequestMemorySegments(NetworkBufferPool.ja
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestMemorySegmentsBlocking(NetworkBufferPool.ja
   at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.reserveSegments(LocalBufferPool.java:247)
   - locked <0x97108898> (a java.util.ArrayDeque)
   at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setupChannels(SingleInputGate.java:497)
   at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:276)
   at 
org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:105)
   at 
org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:965)
   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:652)
   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
   at java.lang.Thread.run(Thread.java:748)

Found 1 deadlock.
{noformat}
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28297=logs=a57e0635-3fad-5b08-57c7-a4142d7d6fa9=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7=19003
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28306=logs=5c8e7682-d68f-54d1-16a2-a09310218a49=86f654fa-ab48-5c1a-25f4-7e7f6afb9bba=19832



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25382) Failure in "Upload Logs" task

2021-12-20 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25382:
--

 Summary: Failure in "Upload Logs" task
 Key: FLINK-25382
 URL: https://issues.apache.org/jira/browse/FLINK-25382
 Project: Flink
  Issue Type: Bug
  Components: Test Infrastructure
Affects Versions: 1.15.0
Reporter: Piotr Nowojski


I don't see any error message, but it seems like uploading the logs has failed:

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=27568=logs=2c3cbe13-dee0-5837-cf47-3053da9a8a78=bb16d35c-fdfe-5139-f244-9492cbd2050b

for the following build:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=27568=logs=2c3cbe13-dee0-5837-cf47-3053da9a8a78=2c7d57b9-7341-5a87-c9af-2cf7cc1a37dc



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25276) Support native and incremental savepoints

2021-12-13 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25276:
--

 Summary: Support native and incremental savepoints
 Key: FLINK-25276
 URL: https://issues.apache.org/jira/browse/FLINK-25276
 Project: Flink
  Issue Type: New Feature
Reporter: Piotr Nowojski


Motivation: currently, with non-incremental canonical-format savepoints and very 
large state, both taking a savepoint and recovering from it can take a very long 
time. Providing options to take native-format and incremental savepoints would 
alleviate this problem.

In the past the main challenge lay in the ownership semantics and file clean-up of 
such incremental savepoints. However, with FLINK-25154 implemented, some of those 
concerns can be solved. Incremental savepoints could leverage the "force full 
snapshot" mode provided by FLINK-25192 to duplicate/copy all of the savepoint 
files out of Flink's ownership scope.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25275) Weighted KeyGroup assignment

2021-12-13 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25275:
--

 Summary: Weighted KeyGroup assignment
 Key: FLINK-25275
 URL: https://issues.apache.org/jira/browse/FLINK-25275
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Network
Affects Versions: 1.14.0
Reporter: Piotr Nowojski


Currently key groups are split into key group ranges naively, in the simplest way: 
key groups are split into equally sized continuous ranges (number of ranges = 
parallelism = number of key groups / size of a single key group range). Flink 
could avoid data skew between key groups by assigning them to tasks based on 
their "weight". "Weight" could be defined as the access frequency of the given 
key group.

Arbitrary, non-continuous key group assignment (for example TM1 processing kg1 
and kg3 while TM2 processes only kg2) would require extensive changes, for example 
to the state backends. However, the data skew could be mitigated to some extent by 
creating key group ranges in a more clever way, while keeping the key group range 
continuity. For example, TM1 processes the range [kg1, kg9], while TM2 processes 
just [kg10, kg11].

[This branch shows a PoC of such 
approach.|https://github.com/pnowojski/flink/commits/antiskew]
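
A minimal sketch of the range construction idea (hypothetical helper, not the PoC code): 
split the key groups into contiguous ranges so that each range carries roughly 
1/parallelism of the total weight.

{code:java}
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: build contiguous key group ranges of roughly equal total weight. */
public class WeightedKeyGroupRanges {

    /**
     * Returns, for each subtask, the inclusive end key group index of its range.
     * Assumes keyGroupWeights.length >= parallelism.
     */
    public static List<Integer> rangeEnds(double[] keyGroupWeights, int parallelism) {
        double totalWeight = 0;
        for (double w : keyGroupWeights) {
            totalWeight += w;
        }
        double targetPerRange = totalWeight / parallelism;

        List<Integer> ends = new ArrayList<>();
        double accumulated = 0;
        for (int kg = 0; kg < keyGroupWeights.length && ends.size() < parallelism - 1; kg++) {
            accumulated += keyGroupWeights[kg];
            int rangesStillToClose = parallelism - 1 - ends.size();
            int keyGroupsLeftAfterThis = keyGroupWeights.length - kg - 1;
            // Close the current range once it has gathered its share of the weight, or
            // when we must close it to leave at least one key group per remaining range.
            if (accumulated >= targetPerRange || keyGroupsLeftAfterThis <= rangesStillToClose) {
                ends.add(kg);
                accumulated = 0;
            }
        }
        ends.add(keyGroupWeights.length - 1); // the last range always ends at the last key group
        return ends;
    }
}
{code}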



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25255) Consider/design implementing State Processor API (FC)

2021-12-10 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25255:
--

 Summary: Consider/design implementing State Processor API (FC)
 Key: FLINK-25255
 URL: https://issues.apache.org/jira/browse/FLINK-25255
 Project: Flink
  Issue Type: Sub-task
Reporter: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-24919) UnalignedCheckpointITCase hangs on Azure

2021-11-16 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24919:
--

 Summary: UnalignedCheckpointITCase hangs on Azure
 Key: FLINK-24919
 URL: https://issues.apache.org/jira/browse/FLINK-24919
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.15.0
Reporter: Piotr Nowojski


Extracted from FLINK-23466

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=26304=logs=a57e0635-3fad-5b08-57c7-a4142d7d6fa9=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7=13067

Nov 10 16:13:03 Starting 
org.apache.flink.test.checkpointing.UnalignedCheckpointITCase#execute[pipeline 
with mixed channels, p = 20, timeout = 0, buffersPerChannel = 1].

From the log, we can see this case hangs. I guess this is a new issue which 
is different from the one reported in this ticket. From the stack, it seems 
there is something wrong with the checkpoint coordinator; the following thread 
locked <0x87db4fb8>:
{code:java}
2021-11-10T17:14:21.0899474Z Nov 10 17:14:21 "jobmanager-io-thread-2" #12984 
daemon prio=5 os_prio=0 tid=0x7f12e000b800 nid=0x3fb6 runnable 
[0x7f0fcd6d4000]
2021-11-10T17:14:21.0899924Z Nov 10 17:14:21java.lang.Thread.State: RUNNABLE
2021-11-10T17:14:21.0900300Z Nov 10 17:14:21at 
java.util.HashMap$TreeNode.balanceDeletion(HashMap.java:2338)
2021-11-10T17:14:21.0900745Z Nov 10 17:14:21at 
java.util.HashMap$TreeNode.removeTreeNode(HashMap.java:2112)
2021-11-10T17:14:21.0901146Z Nov 10 17:14:21at 
java.util.HashMap.removeNode(HashMap.java:840)
2021-11-10T17:14:21.0901577Z Nov 10 17:14:21at 
java.util.LinkedHashMap.afterNodeInsertion(LinkedHashMap.java:301)
2021-11-10T17:14:21.0902002Z Nov 10 17:14:21at 
java.util.HashMap.putVal(HashMap.java:664)
2021-11-10T17:14:21.0902531Z Nov 10 17:14:21at 
java.util.HashMap.putMapEntries(HashMap.java:515)
2021-11-10T17:14:21.0902931Z Nov 10 17:14:21at 
java.util.HashMap.putAll(HashMap.java:785)
2021-11-10T17:14:21.0903429Z Nov 10 17:14:21at 
org.apache.flink.runtime.checkpoint.ExecutionAttemptMappingProvider.getVertex(ExecutionAttemptMappingProvider.java:60)
2021-11-10T17:14:21.0904060Z Nov 10 17:14:21at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.reportStats(CheckpointCoordinator.java:1867)
2021-11-10T17:14:21.0904686Z Nov 10 17:14:21at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1152)
2021-11-10T17:14:21.0905372Z Nov 10 17:14:21- locked <0x87db4fb8> 
(a java.lang.Object)
2021-11-10T17:14:21.0905895Z Nov 10 17:14:21at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89)
2021-11-10T17:14:21.0906493Z Nov 10 17:14:21at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler$$Lambda$1368/705813936.accept(Unknown
 Source)
2021-11-10T17:14:21.0907086Z Nov 10 17:14:21at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
2021-11-10T17:14:21.0907698Z Nov 10 17:14:21at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler$$Lambda$1369/1447418658.run(Unknown
 Source)
2021-11-10T17:14:21.0908210Z Nov 10 17:14:21at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2021-11-10T17:14:21.0908735Z Nov 10 17:14:21at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2021-11-10T17:14:21.0909333Z Nov 10 17:14:21at 
java.lang.Thread.run(Thread.java:748) {code}
But another thread is waiting for the lock. I am not familiar with this logic 
and not sure if this is in the right state. Could anyone who is familiar with 
it take a look?

BTW, concurrent access to a HashMap may cause an infinite loop. I see in the stack 
that there are multiple threads accessing a HashMap, though I am not sure if 
they are accessing the same instance.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-24846) AsyncWaitOperator fails during stop-with-savepoint

2021-11-09 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24846:
--

 Summary: AsyncWaitOperator fails during stop-with-savepoint
 Key: FLINK-24846
 URL: https://issues.apache.org/jira/browse/FLINK-24846
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Task
Reporter: Piotr Nowojski
 Attachments: log-jm.txt

{noformat}
Caused by: 
org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailbox$MailboxClosedException:
 Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
operations.
at 
org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.checkPutStateConditions(TaskMailboxImpl.java:269)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.put(TaskMailboxImpl.java:197)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.execute(MailboxExecutorImpl.java:74)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.api.common.operators.MailboxExecutor.execute(MailboxExecutor.java:103)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.outputCompletedElement(AsyncWaitOperator.java:304)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.access$100(AsyncWaitOperator.java:78)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.processResults(AsyncWaitOperator.java:370)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.lambda$processInMailbox$0(AsyncWaitOperator.java:351)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) 
~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.drain(MailboxProcessor.java:177)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.afterInvoke(StreamTask.java:854)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:767) 
~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937) 
~[flink-dist_2.11-1.14.0.jar:1.14.0]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766) 
~[flink-dist_2.11-1.14.0.jar:1.14.0]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575) 
~[flink-dist_2.11-1.14.0.jar:1.14.0]
at java.lang.Thread.run(Thread.java:829) ~[?:?]

{noformat}

As reported by a user on [the mailing 
list:|https://mail-archives.apache.org/mod_mbox/flink-user/202111.mbox/%3CCAO6dnLwtLNxkr9qXG202ysrnse18Wgvph4hqHZe3ar8cuXAfDw%40mail.gmail.com%3E]
{quote}
I failed to stop a job with savepoint with the following message:
Inconsistent execution state after stopping with savepoint. At least one 
execution is still in one of the following states: FAILED, CANCELED. A global 
fail-over is triggered to recover the job 452594f3ec5797f399e07f95c884a44b.

The job manager said
 A savepoint was created at 
hdfs://mobdata-flink-hdfs/driving-habits/svpts/savepoint-452594-f60305755d0e 
but the corresponding job 452594f3ec5797f399e07f95c884a44b didn't terminate 
successfully.
while complaining about
Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
operations.

Is it okay to ignore this kind of error?

Please see the attached files for the detailed context.

FYI, 
- I used the latest 1.14.0
- I started the job with "$FLINK_HOME"/bin/flink run --target yarn-per-job
- I couldn't reproduce the exception using the same jar so I might not able to 
provide DUBUG messages
{quote}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-24696) Translate how to configure unaligned checkpoints into Chinese

2021-10-29 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24696:
--

 Summary: Translate how to configure unaligned checkpoints into 
Chinese
 Key: FLINK-24696
 URL: https://issues.apache.org/jira/browse/FLINK-24696
 Project: Flink
  Issue Type: Improvement
  Components: chinese-translation, Documentation
Affects Versions: 1.15.0, 1.14.1
Reporter: Piotr Nowojski
 Fix For: 1.15.0


Page {{content.zh/docs/ops/state/checkpointing_under_backpressure.md}} needs to 
be translated.

Reference English version ticket: FLINK-24670



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24695) Update how to configure unaligned checkpoints in the documentation

2021-10-29 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24695:
--

 Summary: Update how to configure unaligned checkpoints in the 
documentation
 Key: FLINK-24695
 URL: https://issues.apache.org/jira/browse/FLINK-24695
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.15.0, 1.14.1
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski


It looks like we don't have a code example of how to enable unaligned checkpoints 
anywhere in the docs. That should be fixed.
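
For example, something along these lines could serve as the missing snippet (API calls 
as available in recent Flink releases):

{code:java}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnalignedCheckpointsExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 10 seconds.
        env.enableCheckpointing(10_000L);
        // Allow checkpoint barriers to overtake in-flight data under back pressure.
        env.getCheckpointConfig().enableUnalignedCheckpoints();

        // ... define sources, transformations and sinks, then call env.execute(...);
    }
}
{code}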



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24694) Translate "Checkpointing under backpressure" page into Chinese

2021-10-29 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24694:
--

 Summary: Translate "Checkpointing under backpressure" page into 
Chinese
 Key: FLINK-24694
 URL: https://issues.apache.org/jira/browse/FLINK-24694
 Project: Flink
  Issue Type: Improvement
  Components: chinese-translation, Documentation
Affects Versions: 1.15.0, 1.14.1
Reporter: Piotr Nowojski


Page {{content.zh/docs/ops/state/checkpoints.md}} is very much out of date and 
out of sync with its English counterpart.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24693) Update Chinese version of "Checkpoints" page

2021-10-29 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24693:
--

 Summary: Update Chinese version of "Checkpoints" page
 Key: FLINK-24693
 URL: https://issues.apache.org/jira/browse/FLINK-24693
 Project: Flink
  Issue Type: Improvement
  Components: chinese-translation, Documentation
Affects Versions: 1.15.0, 1.14.1
Reporter: Piotr Nowojski


Page {{content.zh/docs/ops/state/checkpoints.md}} is very much out of date and 
out of sync with its English counterpart.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24670) Restructure unaligned checkpoints documentation page to "Checkpointing under back pressure"

2021-10-27 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24670:
--

 Summary: Restructure unaligned checkpoints documentation page to 
"Checkpointing under back pressure"
 Key: FLINK-24670
 URL: https://issues.apache.org/jira/browse/FLINK-24670
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.14.1
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.15.0, 1.14.1


https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/state/unaligned_checkpoints/

* Should be renamed and restructured to "Checkpointing under back pressure"
* Should have a short section that cross-references the network memory tuning guide
* It looks like we don't have a code example of how to enable unaligned 
checkpoints anywhere in the docs? That should be fixed.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24531) CheckpointingTimeBenchmark can fail with connection timed out

2021-10-13 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24531:
--

 Summary: CheckpointingTimeBenchmark can fail with connection timed 
out
 Key: FLINK-24531
 URL: https://issues.apache.org/jira/browse/FLINK-24531
 Project: Flink
  Issue Type: Bug
  Components: Benchmarks
Affects Versions: 1.15.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski


There is a race condition between obtaining the REST API port and actually starting 
up the REST API server. If the REST API server start is delayed, the 
{{CheckpointingTimeBenchmark}} can fail with:

{noformat}
java.util.concurrent.ExecutionException: 
org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: 
connection timed out: localhost/127.0.0.1:60503
at 
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
at 
org.apache.flink.benchmark.CheckpointingTimeBenchmark$CheckpointEnvironmentContext.waitForBackpressure(CheckpointingTimeBenchmark.java:251)
at 
org.apache.flink.benchmark.CheckpointingTimeBenchmark$CheckpointEnvironmentContext.setUp(CheckpointingTimeBenchmark.java:216)
at 
org.apache.flink.benchmark.generated.CheckpointingTimeBenchmark_checkpointSingleInput_jmhTest._jmh_tryInit_f_checkpointenvironmentcont
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24483) Document what is Public API and what compatibility guarantees Flink is providing

2021-10-08 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24483:
--

 Summary: Document what is Public API and what compatibility 
guarantees Flink is providing
 Key: FLINK-24483
 URL: https://issues.apache.org/jira/browse/FLINK-24483
 Project: Flink
  Issue Type: Improvement
  Components: API / DataStream, Documentation, Table SQL / API
Affects Versions: 1.13.2, 1.12.5, 1.14.0
Reporter: Piotr Nowojski
 Fix For: 1.15.0


We should document:
* What constitutes the Public API and what the 
Public/PublicEvolving/Experimental/Internal annotations mean (see the small 
illustration below).
* What compatibility guarantees we are providing (forward functional, compile 
and binary compatibility for {{@Public}} interfaces).
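
As a small illustration of what such a section could show (the class below is 
hypothetical; the annotations come from the flink-annotations module):

{code:java}
import org.apache.flink.annotation.Public;
import org.apache.flink.annotation.PublicEvolving;

/** Hypothetical example: the whole interface is part of the stable, public surface. */
@Public
public interface ExampleStableApi {

    /** Covered by the @Public compatibility guarantees. */
    void process(String record);

    /** Newer addition that may still change between minor releases. */
    @PublicEvolving
    void processWithTimeout(String record, long timeoutMs);
}
{code}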



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24475) Remove no longer used NestedMap* classes

2021-10-07 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24475:
--

 Summary: Remove no longer used NestedMap* classes
 Key: FLINK-24475
 URL: https://issues.apache.org/jira/browse/FLINK-24475
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Affects Versions: 1.13.2, 1.14.0, 1.15.0
Reporter: Piotr Nowojski
 Fix For: 1.15.0


After FLINK-21935 all of the {{NestedMapsStateTable}} classes are no longer 
used in the production code. They are, however, still being used in benchmarks and 
in some tests. Benchmarks/tests should be migrated to the {{CopyOnWrite}} versions, 
while the {{NestedMaps}} classes should be removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24472) DispatcherTest is unstable

2021-10-07 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24472:
--

 Summary: DispatcherTest  is unstable
 Key: FLINK-24472
 URL: https://issues.apache.org/jira/browse/FLINK-24472
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.15.0
Reporter: Piotr Nowojski
 Fix For: 1.15.0


testCancellationOfNonCanceledTerminalJobFailsWithAppropriateException from 
DispatcherTest can fail with:

{noformat}
Oct 07 10:31:18 at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
Oct 07 10:31:18 at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
Oct 07 10:31:18 at 
org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
Oct 07 10:31:18 at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
Oct 07 10:31:18 at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
Oct 07 10:31:18 at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
Oct 07 10:31:18 at 
org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
Oct 07 10:31:18 at 
org.junit.runners.ParentRunner.run(ParentRunner.java:413)
Oct 07 10:31:18 at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
Oct 07 10:31:18 at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
Oct 07 10:31:18 at 
org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:43)
Oct 07 10:31:18 at 
java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
Oct 07 10:31:18 at 
java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
Oct 07 10:31:18 at 
java.util.Iterator.forEachRemaining(Iterator.java:116)
Oct 07 10:31:18 at 
java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
Oct 07 10:31:18 at 
java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
Oct 07 10:31:18 at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
Oct 07 10:31:18 at 
java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
Oct 07 10:31:18 at 
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
Oct 07 10:31:18 at 
java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
Oct 07 10:31:18 at 
java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
Oct 07 10:31:18 at 
org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:82)
Oct 07 10:31:18 at 
org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:73)
Oct 07 10:31:18 at 
org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:220)
Oct 07 10:31:18 at 
org.junit.platform.launcher.core.DefaultLauncher.lambda$execute$6(DefaultLauncher.java:188)
Oct 07 10:31:18 at 
org.junit.platform.launcher.core.DefaultLauncher.withInterceptedStreams(DefaultLauncher.java:202)
Oct 07 10:31:18 at 
org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:181)
Oct 07 10:31:18 at 
org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:128)
Oct 07 10:31:18 at 
org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invokeAllTests(JUnitPlatformProvider.java:150)
Oct 07 10:31:18 at 
org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:120)
Oct 07 10:31:18 at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
Oct 07 10:31:18 at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
Oct 07 10:31:18 at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
Oct 07 10:31:18 at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)

{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24471) Add microbenchmark for throughput with debloating enabled

2021-10-07 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24471:
--

 Summary: Add microbenchmark for throughput with debloating enabled
 Key: FLINK-24471
 URL: https://issues.apache.org/jira/browse/FLINK-24471
 Project: Flink
  Issue Type: Sub-task
  Components: Benchmarks, Runtime / Network
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.15.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24439) Introduce a CoordinatorStore

2021-10-01 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24439:
--

 Summary: Introduce a CoordinatorStore
 Key: FLINK-24439
 URL: https://issues.apache.org/jira/browse/FLINK-24439
 Project: Flink
  Issue Type: Sub-task
  Components: Connectors / Common
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.15.0


In order to allow {{SourceCoordinator}}s from different {{Source}}s (for 
example two different Kafka sources, or Kafka and Kinesis) to align watermarks, 
they have to be able to exchange information/aggregate watermarks from those 
different Sources. To enable this, we need to provide some {{CoordinatorStore}} 
concept that would be a thread-safe singleton.
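
A minimal sketch of what such a store could look like (illustrative only, not the 
final interface):

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.BiFunction;

/** Illustrative sketch: a process-wide, thread-safe key/value store shared by coordinators. */
public class CoordinatorStoreSketch {

    private static final CoordinatorStoreSketch INSTANCE = new CoordinatorStoreSketch();

    private final ConcurrentMap<String, Object> store = new ConcurrentHashMap<>();

    public static CoordinatorStoreSketch getInstance() {
        return INSTANCE;
    }

    /** Atomic read-modify-write, e.g. to aggregate watermarks under a shared key. */
    public Object compute(String key, BiFunction<String, Object, Object> remapping) {
        return store.compute(key, remapping);
    }

    public Object get(String key) {
        return store.get(key);
    }
}
{code}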



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24441) Block SourceReader when watermarks are out of alignment

2021-10-01 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24441:
--

 Summary: Block SourceReader when watermarks are out of alignment
 Key: FLINK-24441
 URL: https://issues.apache.org/jira/browse/FLINK-24441
 Project: Flink
  Issue Type: Sub-task
  Components: Connectors / Common
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.15.0


The SourceReader should become unavailable once its latest watermark is too far 
into the future.
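
A rough sketch of the intended behaviour (illustrative only, not the actual 
SourceReader/SourceOperator code; the drift threshold is an assumed parameter):

{code:java}
import java.util.concurrent.CompletableFuture;

/** Illustrative only: gate emission on how far ahead of the aggregated watermark we are. */
public class WatermarkAlignmentGate {

    private final long maxAllowedDriftMs;
    private volatile long aggregatedWatermark = Long.MIN_VALUE;
    private volatile long lastEmittedWatermark = Long.MIN_VALUE;

    public WatermarkAlignmentGate(long maxAllowedDriftMs) {
        this.maxAllowedDriftMs = maxAllowedDriftMs;
    }

    /** Updated when the coordinator broadcasts the combined watermark. */
    public void onAggregatedWatermark(long watermark) {
        this.aggregatedWatermark = watermark;
    }

    /** Updated whenever this reader emits a new local watermark. */
    public void onLocalWatermarkEmitted(long watermark) {
        this.lastEmittedWatermark = watermark;
    }

    /** The reader should report itself as unavailable while this returns false. */
    public boolean isAvailable() {
        return lastEmittedWatermark <= aggregatedWatermark + maxAllowedDriftMs;
    }

    /** In the real API availability is signalled via a future; sketched trivially here. */
    public CompletableFuture<Void> availabilityFuture() {
        return isAvailable()
                ? CompletableFuture.completedFuture(null)
                : new CompletableFuture<>(); // to be completed once alignment catches up
    }
}
{code}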



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24440) Announce and combine latest watermarks across SourceReaders

2021-10-01 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24440:
--

 Summary: Announce and combine latest watermarks across 
SourceReaders
 Key: FLINK-24440
 URL: https://issues.apache.org/jira/browse/FLINK-24440
 Project: Flink
  Issue Type: Sub-task
  Components: Connectors / Common
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.15.0


# Each SourceReader should inform its SourceCoordinator about the latest 
watermark that it has emitted so far
# SourceCoordinators should combine those watermarks and broadcast the 
aggregated result back to all SourceReaders (see the sketch below)
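
A minimal sketch of the aggregation step (illustrative; class and method names are made up):

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative only: combine the latest watermarks reported by all SourceReaders. */
public class WatermarkAggregator {

    private final Map<Integer, Long> latestWatermarkPerSubtask = new ConcurrentHashMap<>();

    /** Step 1: a SourceReader (subtask) announces the latest watermark it has emitted. */
    public void onReportedWatermark(int subtaskId, long watermark) {
        latestWatermarkPerSubtask.merge(subtaskId, watermark, Math::max);
    }

    /** Step 2: the combined (minimum) watermark to broadcast back to all SourceReaders. */
    public long combinedWatermark() {
        return latestWatermarkPerSubtask.values().stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(Long.MIN_VALUE);
    }
}
{code}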



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24357) ZooKeeperLeaderElectionConnectionHandlingTest#testLoseLeadershipOnLostConnectionIfTolerateSuspendedConnectionsIsEnabled fails with an Unhandled error

2021-09-22 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24357:
--

 Summary: 
ZooKeeperLeaderElectionConnectionHandlingTest#testLoseLeadershipOnLostConnectionIfTolerateSuspendedConnectionsIsEnabled
 fails with an Unhandled error
 Key: FLINK-24357
 URL: https://issues.apache.org/jira/browse/FLINK-24357
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.15.0
Reporter: Piotr Nowojski


In a [private 
azure|https://dev.azure.com/pnowojski/637f6470-2732-4605-a304-caebd40e284b/_apis/build/builds/517/logs/155]
 build when testing my own PR I've noticed the following error that looks 
unrelated to any of my changes (modifications to {{Task}} class 
error/cancellation handling logic):

{noformat}

2021-09-22T08:09:16.6244936Z Sep 22 08:09:16 [ERROR] 
testLoseLeadershipOnLostConnectionIfTolerateSuspendedConnectionsIsEnabled  Time 
elapsed: 28.753 s  <<< FAILURE!
2021-09-22T08:09:16.6245821Z Sep 22 08:09:16 java.lang.AssertionError: The 
TestingFatalErrorHandler caught an exception.
2021-09-22T08:09:16.6246513Z Sep 22 08:09:16at 
org.apache.flink.runtime.util.TestingFatalErrorHandlerResource.after(TestingFatalErrorHandlerResource.java:78)
2021-09-22T08:09:16.6247281Z Sep 22 08:09:16at 
org.apache.flink.runtime.util.TestingFatalErrorHandlerResource.access$300(TestingFatalErrorHandlerResource.java:33)
2021-09-22T08:09:16.6248167Z Sep 22 08:09:16at 
org.apache.flink.runtime.util.TestingFatalErrorHandlerResource$1.evaluate(TestingFatalErrorHandlerResource.java:57)
2021-09-22T08:09:16.6248862Z Sep 22 08:09:16at 
org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
2021-09-22T08:09:16.6249620Z Sep 22 08:09:16at 
org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
2021-09-22T08:09:16.6250210Z Sep 22 08:09:16at 
org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
2021-09-22T08:09:16.6250773Z Sep 22 08:09:16at 
org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
2021-09-22T08:09:16.6251375Z Sep 22 08:09:16at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
2021-09-22T08:09:16.6251951Z Sep 22 08:09:16at 
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
2021-09-22T08:09:16.6252562Z Sep 22 08:09:16at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
2021-09-22T08:09:16.6253415Z Sep 22 08:09:16at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
2021-09-22T08:09:16.6254469Z Sep 22 08:09:16at 
org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
2021-09-22T08:09:16.6255039Z Sep 22 08:09:16at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
2021-09-22T08:09:16.6256238Z Sep 22 08:09:16at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
2021-09-22T08:09:16.6257109Z Sep 22 08:09:16at 
org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
2021-09-22T08:09:16.6257766Z Sep 22 08:09:16at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
2021-09-22T08:09:16.6258406Z Sep 22 08:09:16at 
org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
2021-09-22T08:09:16.6259050Z Sep 22 08:09:16at 
org.junit.runners.ParentRunner.run(ParentRunner.java:413)
2021-09-22T08:09:16.6259827Z Sep 22 08:09:16at 
org.junit.runner.JUnitCore.run(JUnitCore.java:137)
2021-09-22T08:09:16.6260963Z Sep 22 08:09:16at 
org.junit.runner.JUnitCore.run(JUnitCore.java:115)
2021-09-22T08:09:16.6261796Z Sep 22 08:09:16at 
org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:43)
2021-09-22T08:09:16.6262428Z Sep 22 08:09:16at 
java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
2021-09-22T08:09:16.6263268Z Sep 22 08:09:16at 
java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
2021-09-22T08:09:16.6263875Z Sep 22 08:09:16at 
java.util.Iterator.forEachRemaining(Iterator.java:116)
2021-09-22T08:09:16.6265025Z Sep 22 08:09:16at 
java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
2021-09-22T08:09:16.6265940Z Sep 22 08:09:16at 
java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
2021-09-22T08:09:16.6266767Z Sep 22 08:09:16at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
2021-09-22T08:09:16.6267470Z Sep 22 08:09:16at 
java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
2021-09-22T08:09:16.6268165Z Sep 22 08:09:16at 
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
2021-09-22T08:09:16.6269341Z Sep 22 08:09:16at 
java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
2021-09-22T08:09:16.6269928Z Sep 22 08:09:16at 

[jira] [Created] (FLINK-24344) Handling of IOExceptions when triggering checkpoints doesn't cause job failover

2021-09-21 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24344:
--

 Summary: Handling of IOExceptions when triggering checkpoints 
doesn't cause job failover
 Key: FLINK-24344
 URL: https://issues.apache.org/jira/browse/FLINK-24344
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.14.0
Reporter: Piotr Nowojski
 Fix For: 1.14.0, 1.15.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24328) Long term fix for receiving new buffer size before network reader configured

2021-09-17 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24328:
--

 Summary: Long term fix for receiving new buffer size before 
network reader configured
 Key: FLINK-24328
 URL: https://issues.apache.org/jira/browse/FLINK-24328
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network
Affects Versions: 1.14.0
Reporter: Anton Kalashnikov
Assignee: Anton Kalashnikov
 Fix For: 1.14.0


It happened on a big cluster (parallelism = 75, TM = 5, tasks = 8) right at the 
initialization moment.
{noformat}
2021-09-09 14:36:42,383 WARN  org.apache.flink.runtime.taskmanager.Task 
   [] - Map -> Flat Map (71/75)#0 (7a5b971e0cd57aa5d057a114e2679b03) 
switched from RUNNING to FAILED with failure c
ause: 
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: 
Fatal error at remote task manager 
'ip-172-31-22-183.eu-central-1.compute.internal/172.31.22.183:42085'.
at 
org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.decodeMsg(CreditBasedPartitionRequestClientHandler.java:339)
at 
org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelRead(CreditBasedPartitionRequestClientHandler.java:240)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at 
org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelRead(NettyMessageClientDecoderDelegate.java:112)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at 
org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at 
org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at 
org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
at 
org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
at 
org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at 
org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: No reader for receiverId = 
296559f497c54a82534945f4549b9e2d exists.
at 
org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.obtainReader(PartitionRequestQueue.java:194)
at 
org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.notifyNewBufferSize(PartitionRequestQueue.java:188)
at 
org.apache.flink.runtime.io.network.netty.PartitionRequestServerHandler.channelRead0(PartitionRequestServerHandler.java:134)
at 
org.apache.flink.runtime.io.network.netty.PartitionRequestServerHandler.channelRead0(PartitionRequestServerHandler.java:42)
at 
org.apache.flink.shaded.netty4.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324)
at 

[jira] [Created] (FLINK-24184) Potential race condition leading to incorrectly issued interruptions

2021-09-07 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24184:
--

 Summary: Potential race condition leading to incorrectly issued 
interruptions
 Key: FLINK-24184
 URL: https://issues.apache.org/jira/browse/FLINK-24184
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Task
Affects Versions: 1.13.2, 1.12.5, 1.11.4, 1.10.3, 1.9.3, 1.8.3, 1.14.0
Reporter: Piotr Nowojski
 Fix For: 1.15.0


There is a race condition in disabling interrupts while closing resources. 
Currently this is guarded by a volatile variable, but there might be a race 
condition when (see the sketch below):
1. the interrupter thread first checked the shouldInterruptOnCancel flag
2. the shouldInterruptOnCancel flag switched to false as Task/StreamTask entered 
the cleaning-up phase
3. the interrupter issued an interrupt while Task/StreamTask was closing/releasing 
resources, potentially causing a memory leak
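
For illustration, a sketch of the kind of guarding that would close the race (not the 
actual Task code): the flag check and the interrupt happen under the same lock that the 
cleanup phase takes before flipping the flag.

{code:java}
/** Illustrative only: make "check flag + interrupt" atomic w.r.t. entering cleanup. */
public class GuardedInterrupter {

    private final Object interruptLock = new Object();
    private boolean shouldInterruptOnCancel = true;
    private volatile Thread taskThread;

    public void setTaskThread(Thread taskThread) {
        this.taskThread = taskThread;
    }

    /** Called by the interrupter/canceler thread. */
    public void maybeInterrupt() {
        synchronized (interruptLock) {
            // The flag cannot flip between the check and the interrupt, because
            // disableInterruptsForCleanup() takes the same lock.
            if (shouldInterruptOnCancel && taskThread != null) {
                taskThread.interrupt();
            }
        }
    }

    /** Called by the task thread right before closing/releasing resources. */
    public void disableInterruptsForCleanup() {
        synchronized (interruptLock) {
            shouldInterruptOnCancel = false;
        }
        // Any interrupt issued earlier has already happened; clear a possibly pending
        // interrupt flag so that resource cleanup is not disturbed by it.
        Thread.interrupted();
    }
}
{code}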



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24155) Translate documentation for how to configure the CheckpointFailureManager

2021-09-03 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24155:
--

 Summary: Translate documentation for how to configure the 
CheckpointFailureManager
 Key: FLINK-24155
 URL: https://issues.apache.org/jira/browse/FLINK-24155
 Project: Flink
  Issue Type: Improvement
  Components: chinese-translation, Documentation
Affects Versions: 1.13.2, 1.14.0, 1.15.0
Reporter: Piotr Nowojski
 Fix For: 1.14.0


Documentation added in FLINK-23916 should be translated into its Chinese 
counterpart. Note that this applies to three separate commits:
merged to master as cd01d4c0279
merged to release-1.14 as 2e769746bf2
merged to release-1.13 as e1a71219454



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24154) Translate documentation for how to configure the CheckpointFailureManager

2021-09-03 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24154:
--

 Summary: Translate documentation for how to configure the 
CheckpointFailureManager
 Key: FLINK-24154
 URL: https://issues.apache.org/jira/browse/FLINK-24154
 Project: Flink
  Issue Type: Sub-task
  Components: chinese-translation, Documentation
Reporter: Armstrong Nova
Assignee: wangzhao


Translate the 
[https://ci.apache.org/projects/flink/flink-docs-master/monitoring/historyserver.html] 
page into Chinese.

This doc is located in "flink/docs/monitoring/historyserver.zh.md".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24030) PulsarSourceITCase>SourceTestSuiteBase.testMultipleSplits failed

2021-08-27 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24030:
--

 Summary: PulsarSourceITCase>SourceTestSuiteBase.testMultipleSplits 
failed
 Key: FLINK-24030
 URL: https://issues.apache.org/jira/browse/FLINK-24030
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Pulsar
Affects Versions: 1.14.0
Reporter: Piotr Nowojski


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=22936=logs=fc5181b0-e452-5c8f-68de-1097947f6483=995c650b-6573-581c-9ce6-7ad4cc038461
root cause:

{noformat}
Aug 27 09:41:42 Caused by: 
org.apache.pulsar.client.api.PulsarClientException$BrokerMetadataException: 
Consumer not found
Aug 27 09:41:42 at 
org.apache.pulsar.client.api.PulsarClientException.unwrap(PulsarClientException.java:987)
Aug 27 09:41:42 at 
org.apache.pulsar.client.impl.PulsarClientImpl.close(PulsarClientImpl.java:658)
Aug 27 09:41:42 at 
org.apache.flink.connector.pulsar.source.reader.source.PulsarSourceReaderBase.close(PulsarSourceReaderBase.java:83)
Aug 27 09:41:42 at 
org.apache.flink.connector.pulsar.source.reader.source.PulsarOrderedSourceReader.close(PulsarOrderedSourceReader.java:170)
Aug 27 09:41:42 at 
org.apache.flink.streaming.api.operators.SourceOperator.close(SourceOperator.java:308)
Aug 27 09:41:42 at 
org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.close(StreamOperatorWrapper.java:141)
Aug 27 09:41:42 at 
org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.closeAllOperators(RegularOperatorChain.java:127)
Aug 27 09:41:42 at 
org.apache.flink.streaming.runtime.tasks.StreamTask.closeAllOperators(StreamTask.java:1015)
Aug 27 09:41:42 at 
org.apache.flink.streaming.runtime.tasks.StreamTask.afterInvoke(StreamTask.java:859)
Aug 27 09:41:42 at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:747)
Aug 27 09:41:42 at 
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
Aug 27 09:41:42 at 
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937)
Aug 27 09:41:42 at 
org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766)
Aug 27 09:41:42 at 
org.apache.flink.runtime.taskmanager.Task.run(Task.java:575)
{noformat}
Top level error:
{noformat}

WARNING: The following warnings have been detected: WARNING: Return type, 
java.util.Map, of method, 
public java.util.Map 
org.apache.pulsar.broker.admin.impl.ClustersBase.getNamespaceIsolationPolicies(java.lang.String)
 throws java.lang.Exception, is not resolvable to a concrete type.

Aug 27 09:41:42 [ERROR] Tests run: 8, Failures: 0, Errors: 1, Skipped: 0, Time 
elapsed: 357.849 s <<< FAILURE! - in 
org.apache.flink.connector.pulsar.source.PulsarSourceITCase
Aug 27 09:41:42 [ERROR] testMultipleSplits{TestEnvironment, ExternalContext}[1] 
 Time elapsed: 5.391 s  <<< ERROR!
Aug 27 09:41:42 java.lang.RuntimeException: Failed to fetch next result
Aug 27 09:41:42 at 
org.apache.flink.streaming.api.operators.collect.CollectResultIterator.nextResultFromFetcher(CollectResultIterator.java:109)
Aug 27 09:41:42 at 
org.apache.flink.streaming.api.operators.collect.CollectResultIterator.hasNext(CollectResultIterator.java:80)
Aug 27 09:41:42 at 
org.apache.flink.connectors.test.common.utils.TestDataMatchers$MultipleSplitDataMatcher.matchesSafely(TestDataMatchers.java:151)
Aug 27 09:41:42 at 
org.apache.flink.connectors.test.common.utils.TestDataMatchers$MultipleSplitDataMatcher.matchesSafely(TestDataMatchers.java:133)
Aug 27 09:41:42 at 
org.hamcrest.TypeSafeDiagnosingMatcher.matches(TypeSafeDiagnosingMatcher.java:55)
Aug 27 09:41:42 at 
org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:12)
Aug 27 09:41:42 at 
org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:8)
Aug 27 09:41:42 at 
org.apache.flink.connectors.test.common.testsuites.SourceTestSuiteBase.testMultipleSplits(SourceTestSuiteBase.java:156)
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23988) Test corner cases

2021-08-26 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23988:
--

 Summary: Test corner cases
 Key: FLINK-23988
 URL: https://issues.apache.org/jira/browse/FLINK-23988
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Network
Reporter: Piotr Nowojski


Check how debloating behaves in case of:
* data skew
* multiple/two/union inputs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23776) Performance regression on 14.08.2021 in FLIP-27

2021-08-14 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23776:
--

 Summary: Performance regression on 14.08.2021 in FLIP-27
 Key: FLINK-23776
 URL: https://issues.apache.org/jira/browse/FLINK-23776
 Project: Flink
  Issue Type: Bug
  Components: API / DataStream, Benchmarks
Affects Versions: 1.14.0
Reporter: Piotr Nowojski
Assignee: Arvid Heise
 Fix For: 1.14.0


http://codespeed.dak8s.net:8000/timeline/?ben=mapSink.F27_UNBOUNDED=2
http://codespeed.dak8s.net:8000/timeline/?ben=mapRebalanceMapSink.F27_UNBOUNDED=2


{noformat}
git ls 7b60a964b1..7f3636f6b4
7f3636f6b4f [2 days ago] [FLINK-23652][connectors] Adding common source 
metrics. [Arvid Heise]
97c8f72b813 [3 months ago] [FLINK-23652][connectors] Adding common sink 
metrics. [Arvid Heise]
48da20e8f88 [3 months ago] [FLINK-23652][test] Adding InMemoryMetricReporter 
and using it by default in MiniClusterResource. [Arvid Heise]
63ee60859ca [3 months ago] [FLINK-23652][core/metrics] Extract 
Operator(IO)MetricGroup interfaces and expose them in RuntimeContext [Arvid 
Heise]
5d5e39b614b [2 days ago] [refactor][connectors] Only use 
MockSplitReader.Builder for instantiation. [Arvid Heise]
b927035610c [3 months ago] [refactor][core] Extract common context creation in 
CollectionExecutor [Arvid Heise]
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23772) Double check if non-keyed FullyFinishedOperatorState can be mixed up with non finished OperatorState on recovery

2021-08-13 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23772:
--

 Summary: Double check if non-keyed FullyFinishedOperatorState can 
be mixed up with non finished OperatorState on recovery
 Key: FLINK-23772
 URL: https://issues.apache.org/jira/browse/FLINK-23772
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Checkpointing
Affects Versions: 1.14.0
Reporter: Piotr Nowojski
 Fix For: 1.14.0


I'm not sure whether non-keyed state can be reshuffled to different operators during 
recovery. Are there any guarantees that if subtask 1 has state A and subtask 2 has 
state B, the assignment won't be rotated after recovery?
# Is this an issue?
# If so, given partially finished tasks with some operators having 
{{FullyFinishedOperatorState}}, what prevents 
{{VerticesFinishedCache.calculateIfFinished}} from failing if the 
{{FullyFinishedOperatorState}} gets assigned to an operator chain with a non-finished 
operator?

For example, a non-keyed operator chain with a parallelism of two, before recovery:
{noformat}
src1 (finished state) -> foo1 (finished state)
src2 -> foo2
{noformat}
Can we end up after recovery with:
{noformat}
src1 (finished state) -> foo2
src2 -> foo1 (finished state)
{noformat}
?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23771) FLIP-147 Checkpoint N has not been started at all

2021-08-13 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23771:
--

 Summary: FLIP-147 Checkpoint N has not been started at all
 Key: FLINK-23771
 URL: https://issues.apache.org/jira/browse/FLINK-23771
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.14.0
Reporter: Piotr Nowojski
 Fix For: 1.14.0


Observed when running WIP version of https://github.com/apache/flink/pull/16773/
{noformat}
6311 [flink-akka.actor.default-dispatcher-6] INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - transform-2-keyed 
(4/4) (7f9e36a73792074afd7803e39f563fac) switched from RUNNING to FAILED on 
dac2cc5f-41f8-4067-99a9-ae56341e3735 @ localhost (dataPort=-1).
java.lang.Exception: Could not perform checkpoint 4 for operator 
transform-2-keyed (4/4)#1.
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointAsyncInMailbox(StreamTask.java:1170)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$11(StreamTask.java:1117)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) 
~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:338)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:324)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:201)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.afterInvoke(StreamTask.java:851)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:754)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:787)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:730) 
~[classes/:?]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:786) 
~[classes/:?]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:572) 
~[classes/:?]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_131]
Caused by: java.lang.IllegalStateException: Checkpoint 4 has not been started 
at all
at 
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.getAllBarriersReceivedFuture(SingleCheckpointBarrierHandler.java:464)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.getAllBarriersReceivedFuture(CheckpointedInputGate.java:209)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.prepareSnapshot(StreamTaskNetworkInput.java:112)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.prepareSnapshot(StreamOneInputProcessor.java:83)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.prepareInputSnapshot(StreamTask.java:458)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.prepareInflightDataSnapshot(SubtaskCheckpointCoordinatorImpl.java:554)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.initInputsCheckpoint(SubtaskCheckpointCoordinatorImpl.java:423)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointAsyncInMailbox(StreamTask.java:1154)
 ~[classes/:?]
... 13 more
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23741) Waiting for final checkpoint can deadlock job

2021-08-12 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23741:
--

 Summary: Waiting for final checkpoint can deadlock job
 Key: FLINK-23741
 URL: https://issues.apache.org/jira/browse/FLINK-23741
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing, Runtime / Task
Affects Versions: 1.14.0
Reporter: Piotr Nowojski


With {{ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH}} enabled, the final checkpoint can 
deadlock (or time out after a very long time) if there is a race between selecting 
the tasks on which to trigger the checkpoint and those tasks finishing. FLINK-21246 
was supposed to handle this, but it doesn't work as expected, because the futures from
org.apache.flink.runtime.taskexecutor.TaskExecutor#triggerCheckpoint
and
org.apache.flink.streaming.runtime.tasks.StreamTask#triggerCheckpointAsync
are not linked together: TaskExecutor#triggerCheckpoint reports that the checkpoint 
has been successfully triggered, while the {{StreamTask}} might have actually 
finished already.
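
For illustration only (not Flink's actual code; the class and method names below are 
made up), a minimal sketch of the race: the executor-level acknowledgment completes as 
soon as triggering is dispatched, regardless of whether the task-level future ever 
reports success.

{code:java}
import java.util.concurrent.CompletableFuture;

public class UnlinkedTriggerSketch {

    // Hypothetical stand-in for StreamTask#triggerCheckpointAsync: returns false
    // when the task has already finished and the trigger was ignored.
    static CompletableFuture<Boolean> taskTriggerCheckpointAsync(boolean taskFinished) {
        return CompletableFuture.completedFuture(!taskFinished);
    }

    // Broken variant: acknowledge success without waiting for the task-level result.
    static CompletableFuture<Void> triggerUnlinked(boolean taskFinished) {
        taskTriggerCheckpointAsync(taskFinished); // result is dropped
        return CompletableFuture.completedFuture(null); // reported as triggered either way
    }

    // Linked variant: the acknowledgment fails if the task-level trigger did not
    // happen, so the coordinator can re-plan instead of waiting forever.
    static CompletableFuture<Void> triggerLinked(boolean taskFinished) {
        return taskTriggerCheckpointAsync(taskFinished)
                .thenAccept(
                        triggered -> {
                            if (!triggered) {
                                throw new IllegalStateException("Task already finished");
                            }
                        });
    }

    public static void main(String[] args) {
        System.out.println(triggerUnlinked(true).isCompletedExceptionally()); // false
        System.out.println(triggerLinked(true).isCompletedExceptionally()); // true
    }
}
{code}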



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23708) When task is finishedOnRestore Operators shouldn't be used

2021-08-10 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23708:
--

 Summary: When task is finishedOnRestore Operators shouldn't be used
 Key: FLINK-23708
 URL: https://issues.apache.org/jira/browse/FLINK-23708
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Task
Affects Versions: 1.14.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.14.0


When a task is finishedOnRestore, operators shouldn't be used, invoked, or even 
initialised. Currently at least a couple of methods are still being invoked, for example:
* endInput()
* getMetricGroup()



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23704) FLIP-27 sources are not generating LatencyMarkers

2021-08-10 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23704:
--

 Summary: FLIP-27 sources are not generating LatencyMarkers
 Key: FLINK-23704
 URL: https://issues.apache.org/jira/browse/FLINK-23704
 Project: Flink
  Issue Type: Bug
  Components: API / DataStream, Runtime / Task
Affects Versions: 1.13.2, 1.12.5, 1.14.0
Reporter: Piotr Nowojski


Currently a {{LatencyMarker}} is created only in 
{{StreamSource.LatencyMarksEmitter#LatencyMarksEmitter}}, so FLIP-27 sources never 
emit it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23698) Pass watermarks in finished on restore operators

2021-08-10 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23698:
--

 Summary: Pass watermarks in finished on restore operators
 Key: FLINK-23698
 URL: https://issues.apache.org/jira/browse/FLINK-23698
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Task
Affects Versions: 1.14.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.14.0


After merging FLINK-23541 there is a bug when restoring finished tasks: we are 
losing the information that the max watermark has already been emitted. Since the task 
is finished, no new watermark will ever be emitted, so downstream tasks (for example 
two/multiple input tasks) would deadlock waiting for a watermark from an already 
finished input.
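
As a tiny self-contained illustration (not Flink code): a multi-input task combines 
input watermarks by taking the minimum, so an input that restored as finished but 
without the max-watermark information keeps the combined watermark at Long.MIN_VALUE 
forever.

{code:java}
public class MinWatermarkSketch {
    public static void main(String[] args) {
        // Input 0 is a finished task whose "max watermark already emitted" info was
        // lost on restore, so it still contributes Long.MIN_VALUE instead of
        // Long.MAX_VALUE; input 1 keeps making progress.
        long[] inputWatermarks = {Long.MIN_VALUE, 42L};

        long combined = Long.MAX_VALUE;
        for (long watermark : inputWatermarks) {
            combined = Math.min(combined, watermark); // combined watermark = min over inputs
        }

        // Prints Long.MIN_VALUE: downstream event-time progress is stuck forever.
        System.out.println(combined);
    }
}
{code}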



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23648) Drop StreamTwoInputProcessor in favour of StreamMultipleInputProcessor

2021-08-05 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23648:
--

 Summary: Drop StreamTwoInputProcessor in favour of 
StreamMultipleInputProcessor
 Key: FLINK-23648
 URL: https://issues.apache.org/jira/browse/FLINK-23648
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Task
Affects Versions: 1.14.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.14.0


StreamTwoInputProcessor is less efficient and can simply be replaced with 
StreamMultipleInputProcessor. This also solves a performance problem for 
FLINK-23541, where the initial version was causing a performance regression in the 
two-input processor while the multiple-input one was working fine. Hence, instead of 
optimising StreamTwoInputProcessor, it's simpler to just drop it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23593) Performance regression on 15.07.2021

2021-08-03 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23593:
--

 Summary: Performance regression on 15.07.2021
 Key: FLINK-23593
 URL: https://issues.apache.org/jira/browse/FLINK-23593
 Project: Flink
  Issue Type: Bug
  Components: Benchmarks
Affects Versions: 1.14.0
Reporter: Piotr Nowojski


http://codespeed.dak8s.net:8000/timeline/?ben=sortedMultiInput=2
http://codespeed.dak8s.net:8000/timeline/?ben=sortedTwoInput=2


{noformat}
pnowojski@piotr-mbp: [~/flink -  ((no branch, bisect started on pr/16589))] $ 
git ls f4afbf3e7de..eb8100f7afe
eb8100f7afe [4 weeks ago] (pn/bad, bad, refs/bisect/bad) 
[FLINK-22017][coordination] Allow BLOCKING result partition to be individually 
consumable [Thesharing]
d2005268b1e [4 weeks ago] (HEAD, pn/bisect-4, bisect-4) 
[FLINK-22017][coordination] Get the ConsumedPartitionGroup that 
IntermediateResultPartition and DefaultResultPartition belong to [Thesharing]
d8b1a6fd368 [3 weeks ago] [FLINK-23372][streaming-java] Disable 
AllVerticesInSameSlotSharingGroupByDefault in batch mode [Timo Walther]
4a78097d038 [3 weeks ago] (pn/bisect-3, bisect-3, 
refs/bisect/good-4a78097d0385749daceafd8326930c8cc5f26f1a) 
[FLINK-21928][clients][runtime] Introduce static method constructors of 
DuplicateJobSubmissionException for better readability. [David Moravek]
172b9e32215 [3 weeks ago] [FLINK-21928][clients] JobManager failover should 
succeed, when trying to resubmit already terminated job in application mode. 
[David Moravek]
f483008db86 [3 weeks ago] [FLINK-21928][core] Introduce 
org.apache.flink.util.concurrent.FutureUtils#handleException method, that 
allows future to recover from the specied exception. [David Moravek]
d7ac08c2ac0 [3 weeks ago] (pn/bisect-2, bisect-2, 
refs/bisect/good-d7ac08c2ac06b9ff31707f3b8f43c07817814d4f) 
[FLINK-22843][docs-zh] Document and code are inconsistent [ZhiJie Yang]
16c3ea427df [3 weeks ago] [hotfix] Split the final checkpoint related tests to 
a separate test class. [Yun Gao]
31b3d37a22c [7 weeks ago] [FLINK-21089][runtime] Skip the execution of new 
sources if finished on restore [Yun Gao]
20fe062e1b5 [3 weeks ago] [FLINK-21089][runtime] Skip execution for the legacy 
source task if finished on restore [Yun Gao]
874c627114b [3 weeks ago] [FLINK-21089][runtime] Skip the lifecycle method of 
operators if finished on restore [Yun Gao]
ceaf24b1d88 [3 weeks ago] (pn/bisect-1, bisect-1, 
refs/bisect/good-ceaf24b1d881c2345a43f305d40435519a09cec9) [hotfix] Fix 
isClosed() for operator wrapper and proxy operator close to the operator chain 
[Yun Gao]
41ea591a6db [3 weeks ago] [FLINK-22627][runtime] Remove unused slot request 
protocol [Yangze Guo]
489346b60f8 [3 months ago] [FLINK-22627][runtime] Remove PendingSlotRequest 
[Yangze Guo]
8ffb4d2af36 [3 months ago] [FLINK-22627][runtime] Remove TaskManagerSlot 
[Yangze Guo]
72073741588 [3 months ago] [FLINK-22627][runtime] Remove SlotManagerImpl and 
its related tests [Yangze Guo]
bdb3b7541b3 [3 months ago] [hotfix][yarn] Remove unused internal options in 
YarnConfigOptionsInternal [Yangze Guo]
a6a9b192eac [3 weeks ago] [FLINK-23201][streaming] Reset alignment only for the 
currently processed checkpoint [Anton Kalashnikov]
b35701a35c7 [3 weeks ago] [FLINK-23201][streaming] Calculate checkpoint 
alignment time only for last started checkpoint [Anton Kalashnikov]
3abec22c536 [3 weeks ago] [FLINK-23107][table-runtime] Separate implementation 
of deduplicate rank from other rank functions [Shuo Cheng]
1a195f5cc59 [3 weeks ago] [FLINK-16093][docs-zh] Translate "System Functions" 
page of "Functions" into Chinese (#16348) [ZhiJie Yang]
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23560) Performance regression on 29.07.2021

2021-07-30 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23560:
--

 Summary: Performance regression on 29.07.2021
 Key: FLINK-23560
 URL: https://issues.apache.org/jira/browse/FLINK-23560
 Project: Flink
  Issue Type: Bug
  Components: Benchmarks
Affects Versions: 1.14.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.14.0


http://codespeed.dak8s.net:8000/timeline/?ben=remoteFilePartition=2
http://codespeed.dak8s.net:8000/timeline/?ben=uncompressedMmapPartition=2
http://codespeed.dak8s.net:8000/timeline/?ben=compressedFilePartition=2
http://codespeed.dak8s.net:8000/timeline/?ben=tupleKeyBy=2
http://codespeed.dak8s.net:8000/timeline/?ben=arrayKeyBy=2
http://codespeed.dak8s.net:8000/timeline/?ben=uncompressedFilePartition=2
http://codespeed.dak8s.net:8000/timeline/?ben=sortedTwoInput=2
http://codespeed.dak8s.net:8000/timeline/?ben=sortedMultiInput=2
http://codespeed.dak8s.net:8000/timeline/?ben=globalWindow=2
(And potentially other benchmarks)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23559) Enable periodic materialisation in tests

2021-07-30 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23559:
--

 Summary: Enable periodic materialisation in tests
 Key: FLINK-23559
 URL: https://issues.apache.org/jira/browse/FLINK-23559
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / State Backends
Reporter: Roman Khachatryan
Assignee: Roman Khachatryan
 Fix For: 1.14.0


FLINK-21448 adds the capability (test randomization), but it can't be turned on yet 
as there are some test failures: FLINK-23276, FLINK-23277, FLINK-23278. It should be 
enabled after those bugs are fixed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23528) stop-with-savepoint can fail with FlinkKinesisConsumer

2021-07-28 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23528:
--

 Summary: stop-with-savepoint can fail with FlinkKinesisConsumer
 Key: FLINK-23528
 URL: https://issues.apache.org/jira/browse/FLINK-23528
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kinesis
Affects Versions: 1.12.4, 1.13.1
Reporter: Piotr Nowojski
 Fix For: 1.14.0


{{FlinkKinesisConsumer#cancel()}} (inside 
{{KinesisDataFetcher#shutdownFetcher()}}) shouldn't interrupt the source 
thread. Otherwise, as described in FLINK-23527, the network stack can be left in an 
invalid state and downstream tasks can encounter deserialisation errors.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23527) Clarify `SourceFunction#cancel()` contract about interrupting

2021-07-28 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23527:
--

 Summary: Clarify `SourceFunction#cancel()` contract about 
interrupting
 Key: FLINK-23527
 URL: https://issues.apache.org/jira/browse/FLINK-23527
 Project: Flink
  Issue Type: Bug
  Components: API / DataStream
Affects Versions: 1.13.1
Reporter: Piotr Nowojski
 Fix For: 1.14.0


We should clarify the contract of {{SourceFunction#cancel()}}:

# the source itself shouldn't interrupt the source thread
# an interrupt shouldn't be expected in the clean cancellation case

Interrupting the code on the clean shutdown path can cause failures when doing 
`stop-with-savepoint`. If the source thread is interrupted while backpressured, the 
network stack is left in an invalid state, making it impossible to send the 
{{EndOfPartitionEvent}} that completes the shutdown.
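
A minimal sketch of a source that follows the clarified contract (hypothetical class, 
for illustration only): cancel() just flips a volatile flag observed by the run() loop 
and never interrupts the source thread, so the clean shutdown path can complete 
normally.

{code:java}
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class CooperativeCancelSource implements SourceFunction<Long> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        long counter = 0;
        while (running) {
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(counter++);
            }
        }
        // Returning normally allows the task to finish the clean shutdown path
        // (e.g. sending EndOfPartitionEvent) without any interrupts.
    }

    @Override
    public void cancel() {
        // No Thread#interrupt() here; interruption is left to the framework's
        // separate hard-cancellation path.
        running = false;
    }
}
{code}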



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23398) State backend benchmarks are failing

2021-07-15 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23398:
--

 Summary: State backend benchmarks are failing
 Key: FLINK-23398
 URL: https://issues.apache.org/jira/browse/FLINK-23398
 Project: Flink
  Issue Type: Bug
  Components: Benchmarks, Runtime / State Backends
Affects Versions: 1.14.0
Reporter: Piotr Nowojski
Assignee: Yun Tang
 Fix For: 1.14.0


http://codespeed.dak8s.net:8080/job/flink-statebackend-benchmark/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23382) Performance regression on 13.07.2021

2021-07-14 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23382:
--

 Summary: Performance regression on 13.07.2021
 Key: FLINK-23382
 URL: https://issues.apache.org/jira/browse/FLINK-23382
 Project: Flink
  Issue Type: Bug
  Components: Benchmarks, Runtime / Network
Affects Versions: 1.14.0
Reporter: Piotr Nowojski
 Fix For: 1.14.0


It looks like there was a performance regression after merging FLINK-16641

http://codespeed.dak8s.net:8000/timeline/?ben=readFileSplit=2
http://codespeed.dak8s.net:8000/timeline/?ben=networkLatency1to1=2
http://codespeed.dak8s.net:8000/timeline/?ben=networkSkewedThroughput=2
(and a couple of other benchmarks)

{code}
$ git ls 4fddcb27c5..657df14677
657df14677a [2 days ago] [FLINK-23356][hbase] Do not use delegation token in 
case of keytab [Gabor Somogyi]
5aba6167343 [3 days ago] [FLINK-21088][runtime][checkpoint] Pass the finish on 
restore status to operator chain [Yun Gao]
df99b49e3f4 [2 days ago] [FLINK-21088][runtime][checkpoint] Pass the finished 
on restore status to TaskStateManager [Yun Gao]
1251ab8566f [2 days ago] [FLINK-21088][runtime][checkpoint] Assign a finished 
snapshot to task if operators are all finished on restore [Yun Gao]
83c5e710673 [33 hours ago] [FLINK-23150][table-planner] Remove the old code 
split implementation [tsreaper]
2e374954b9b [2 days ago] [FLINK-23359][test] Fix the number of available slots 
in testResourceCanBeAllocatedForDifferentJobAfterFree [Yangze Guo]
2268baf211f [5 days ago] [FLINK-23235][connector] Fix SinkITCase instability 
[GuoWei Ma]
60d015cfc65 [6 days ago] [FLINK-16641][network] (Part#6) Enable to set network 
buffers per channel to 0 [kevin.cyj]
7d1bb5f5e0b [6 days ago] [FLINK-16641][network] (Part#5) Send empty buffers to 
the downstream tasks to release the allocated credits if the exclusive credit 
is 0 [kevin.cyj]
985382020b1 [6 days ago] [FLINK-16641][network] (Part#4) Release all allocated 
floating buffers of RemoteInputChannel on receiving any channel blocking event 
if the exclusive credit is 0 [kevin.cyj]
1bf36cbcbbf [6 days ago] [FLINK-16641][network] (Part#3) Support to announce 
the upstream backlog to the downstream tasks [kevin.cyj]
31e1cd174d6 [6 days ago] [FLINK-16641][network] (Part#2) Distinguish data 
buffer and event buffer for BoundedBlockingSubpartitionDirectTransferReader 
[kevin.cyj]
3a46604c068 [7 days ago] [FLINK-16641][network] (Part#1) Introduce a new 
network message BacklogAnnouncement which can bring the upstream buffer backlog 
to the downstream [kevin.cyj]
1746146ab9e [5 days ago] [hotfix] Fix typos in NettyShuffleUtils [kevin.cyj]
73fcb75352b [4 months ago] [hotfix] Simplify RemoteInputChannel#onSenderBacklog 
and call the existing method directly [kevin.cyj]
dc80bb07fb9 [4 months ago] [hotfix] Remove outdated comments in UnionInputGate 
[kevin.cyj]
790576b544e [4 months ago] [hotfix] Remove redundant if condition in 
BufferManager [kevin.cyj]
cf7d7258c5e [2 days ago] [hotfix][tests] Remove unnecessary sleep from 
RemoteAkkaRpcActorTest.failsRpcResultImmediatelyIfRemoteRpcServiceIsNotAvailable
 [Till Rohrmann]
4a3e6d6a769 [2 months ago] [FLINK-11103][runtime] Set a configurable uncaught 
exception handler for all entrypoints [Ashwin Kolhatkar]
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23078) Scheduler Benchmarks not compiling

2021-06-22 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23078:
--

 Summary: Scheduler Benchmarks not compiling
 Key: FLINK-23078
 URL: https://issues.apache.org/jira/browse/FLINK-23078
 Project: Flink
  Issue Type: Bug
  Components: Benchmarks, Runtime / Coordination
Affects Versions: 1.14.0
Reporter: Piotr Nowojski
 Fix For: 1.14.0


{code:java}
07:46:50  [ERROR] 
/home/jenkins/workspace/flink-master-benchmarks/flink-benchmarks/src/main/java/org/apache/flink/scheduler/benchmark/SchedulerBenchmarkBase.java:21:44:
  error: cannot find symbol
{code}

CC [~chesnay] [~Thesharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23011) FLIP-27 sources are generating non-deterministic results when using event time

2021-06-16 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23011:
--

 Summary: FLIP-27 sources are generating non-deterministic results 
when using event time
 Key: FLINK-23011
 URL: https://issues.apache.org/jira/browse/FLINK-23011
 Project: Flink
  Issue Type: New Feature
  Components: API / DataStream
Affects Versions: 1.12.4, 1.13.1, 1.14.0
 Environment: 

Reporter: Piotr Nowojski


FLIP-27 sources currently start in the {{StreamStatus.IDLE}} state and switch to 
{{ACTIVE}} only after emitting the first {{Watermark}}. Until this happens, 
downstream operators exclude the {{IDLE}} inputs from the calculation of the 
input (min) watermark.

An extreme example of the problem this leads to are completely bogus results 
if, for example, one FLIP-27 source subtask is slower than the others for some reason:
{code:java}
env.getConfig().setAutoWatermarkInterval(2000);
env.setParallelism(2);
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(Integer.MAX_VALUE, 10));

DataStream<Long> eventStream =
        env.fromSource(
                        new NumberSequenceSource(0, Long.MAX_VALUE),
                        WatermarkStrategy.<Long>forMonotonousTimestamps()
                                .withTimestampAssigner(new LongTimestampAssigner()),
                        "NumberSequenceSource")
                .map(
                        new RichMapFunction<Long, Long>() {
                            @Override
                            public Long map(Long value) throws Exception {
                                if (getRuntimeContext().getIndexOfThisSubtask() == 0) {
                                    Thread.sleep(1);
                                }
                                return 1L;
                            }
                        });

eventStream.windowAll(TumblingEventTimeWindows.of(Time.seconds(1))).sum(0).print();

(...)

private static class LongTimestampAssigner
        implements SerializableTimestampAssigner<Long> {
    private long counter = 0;

    @Override
    public long extractTimestamp(Long record, long recordTimeStamp) {
        return counter++;
    }
}
{code}
In such a case, after 2 seconds ({{setAutoWatermarkInterval}}) the non-throttled 
subtask (subTaskId == 1) generates very high watermarks, while the other source 
subtask (subTaskId == 0) emits very low ones. If the non-throttled 
watermark reaches the downstream {{WindowOperator}} first, while the other 
input channel is still idle, the operator takes those high watermarks as the combined 
input watermark for the whole {{WindowOperator}}. When the input channel 
from the throttled source subtask finally receives its {{ACTIVE}} status and a 
much lower watermark, it's already too late.

Actual output of the example program:
{noformat}
1596
2000
1000
1000
1000
1000
1000
1000
(...)
{noformat}
while the expected output should always be "2000" (2000 records fitting into 
every 1-second global window)
{noformat}
2000
2000
2000
2000
(...)
{noformat}.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22973) Provide benchmark for unaligned checkpoints performance

2021-06-11 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-22973:
--

 Summary: Provide benchmark for unaligned checkpoints performance
 Key: FLINK-22973
 URL: https://issues.apache.org/jira/browse/FLINK-22973
 Project: Flink
  Issue Type: New Feature
  Components: Benchmarks, Runtime / Network
Affects Versions: 1.14.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.14.0


Provide a simple benchmark to test the speed of unaligned checkpoints on our 
continuous benchmarking infrastructure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22904) Performance regression on 25.05.2020 in mapRebalanceMapSink

2021-06-07 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-22904:
--

 Summary: Performance regression on 25.05.2020 in 
mapRebalanceMapSink
 Key: FLINK-22904
 URL: https://issues.apache.org/jira/browse/FLINK-22904
 Project: Flink
  Issue Type: Bug
  Components: Benchmarks
Affects Versions: 1.14.0
Reporter: Piotr Nowojski


http://codespeed.dak8s.net:8000/timeline/?ben=mapRebalanceMapSink=2

Suspected range in which this regression happened: 21c44688e98..80ad5b3b51



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22882) Tasks are blocked while emitting records

2021-06-04 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-22882:
--

 Summary: Tasks are blocked while emitting records
 Key: FLINK-22882
 URL: https://issues.apache.org/jira/browse/FLINK-22882
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network, Runtime / Task
Affects Versions: 1.14.0
Reporter: Piotr Nowojski


On a cluster I observed symptoms of tasks being blocked for a long time, causing 
long delays with unaligned checkpointing. 99% of those cases were caused by 
`broadcastEmit` of the stream status:

{noformat}
2021-06-04 14:41:44,049 ERROR 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool   [] - Blocking wait 
[11059 ms] for an available buffer.
java.lang.Exception: Stracktracegenerator
at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:323)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:290)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.requestNewBufferBuilderFromPool(BufferWritingResultPartition.java:338)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.requestNewUnicastBufferBuilder(BufferWritingResultPartition.java:314)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.appendUnicastDataForNewRecord(BufferWritingResultPartition.java:246)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.emitRecord(BufferWritingResultPartition.java:142)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:104)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.broadcastEmit(ChannelSelectorRecordWriter.java:67)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.io.RecordWriterOutput.writeStreamStatus(RecordWriterOutput.java:136)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.streamstatus.AnnouncedStatus.ensureActive(AnnouncedStatus.java:65)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:103)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:90)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:44)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:56)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:29)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:38)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.tasks.ChainingOutput.pushToOperator(ChainingOutput.java:101)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.tasks.ChainingOutput.collect(ChainingOutput.java:82)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.tasks.ChainingOutput.collect(ChainingOutput.java:39)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.tasks.SourceOperatorStreamTask$AsyncDataOutputToOutput.emitRecord(SourceOperatorStreamTask.java:182)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:110)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:101)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.api.connector.source.lib.util.IteratorSourceReader.pollNext(IteratorSourceReader.java:98)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:294)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 

[jira] [Created] (FLINK-22881) Tasks are blocked while emitting stream status

2021-06-04 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-22881:
--

 Summary: Tasks are blocked while emitting stream status
 Key: FLINK-22881
 URL: https://issues.apache.org/jira/browse/FLINK-22881
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network, Runtime / Task
Affects Versions: 1.14.0
Reporter: Piotr Nowojski


On a cluster I observed symptoms of tasks being blocked for a long time, causing 
long delays with unaligned checkpointing. 99% of those cases were caused by 
`broadcastEmit` of the stream status:
```
2021-06-04 14:41:44,049 ERROR 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool   [] - Blocking wait 
[11059 ms] for an available buffer.
java.lang.Exception: Stracktracegenerator
at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:323)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:290)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.requestNewBufferBuilderFromPool(BufferWritingResultPartition.java:338)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.requestNewUnicastBufferBuilder(BufferWritingResultPartition.java:314)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.appendUnicastDataForNewRecord(BufferWritingResultPartition.java:246)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.emitRecord(BufferWritingResultPartition.java:142)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:104)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.broadcastEmit(ChannelSelectorRecordWriter.java:67)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.io.RecordWriterOutput.writeStreamStatus(RecordWriterOutput.java:136)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.streamstatus.AnnouncedStatus.ensureActive(AnnouncedStatus.java:65)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:103)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:90)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:44)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:56)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:29)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:38)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.tasks.ChainingOutput.pushToOperator(ChainingOutput.java:101)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.tasks.ChainingOutput.collect(ChainingOutput.java:82)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.tasks.ChainingOutput.collect(ChainingOutput.java:39)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.tasks.SourceOperatorStreamTask$AsyncDataOutputToOutput.emitRecord(SourceOperatorStreamTask.java:182)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:110)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:101)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.api.connector.source.lib.util.IteratorSourceReader.pollNext(IteratorSourceReader.java:98)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:294)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 

[jira] [Created] (FLINK-22833) Source tasks (both old and new) are not reporting checkpointStartDelay via CheckpointMetrics

2021-06-01 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-22833:
--

 Summary: Source tasks (both old and new) are not reporting 
checkpointStartDelay via CheckpointMetrics
 Key: FLINK-22833
 URL: https://issues.apache.org/jira/browse/FLINK-22833
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Metrics
Affects Versions: 1.12.4, 1.13.1
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22814) New sources are not defining/exposing checkpointStartDelayNanos metric

2021-05-31 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-22814:
--

 Summary: New sources are not defining/exposing 
checkpointStartDelayNanos metric
 Key: FLINK-22814
 URL: https://issues.apache.org/jira/browse/FLINK-22814
 Project: Flink
  Issue Type: Bug
  Components: API / DataStream, Runtime / Metrics
Affects Versions: 1.12.4, 1.13.1
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski


The checkpointStartDelayNanos metric for new (FLIP-27) sources is always 0ms in the 
WebUI, regardless of how long it took to finally start triggering the checkpoint.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

