[jira] [Created] (FLINK-20469) Enable TaskManager start and terminate in MiniCluster
Andrey Zagrebin created FLINK-20469:
---

Summary: Enable TaskManager start and terminate in MiniCluster
Key: FLINK-20469
URL: https://issues.apache.org/jira/browse/FLINK-20469
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
Fix For: 1.12.0

Currently, startTaskManager/terminateTaskManager are exposed only in the internal TestingMiniCluster. Nonetheless, these methods are useful for implementing IT cases similar to E2E tests.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-20468) Enable leadership control in MiniCluster to test JM failover
Andrey Zagrebin created FLINK-20468:
---

Summary: Enable leadership control in MiniCluster to test JM failover
Key: FLINK-20468
URL: https://issues.apache.org/jira/browse/FLINK-20468
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
Fix For: 1.12.0
[jira] [Created] (FLINK-20290) Duplicated output in FileSource continuous ITCase with TM failover
Andrey Zagrebin created FLINK-20290:
---

Summary: Duplicated output in FileSource continuous ITCase with TM failover
Key: FLINK-20290
URL: https://issues.apache.org/jira/browse/FLINK-20290
Project: Flink
Issue Type: Bug
Components: Connectors / FileSystem
Affects Versions: 1.12.0
Reporter: Andrey Zagrebin

If FileSourceTextLinesITCase::testContinuousTextFileSource includes TM restarts (after failing a TM with TestingMiniCluster::terminateTaskExecutor, see testContinuousTextFileSourceWithTaskManagerFailover in [branch|https://github.com/azagrebin/flink/tree/FLINK-20118-it]), then I sometimes observe duplicated lines in the output after running the test suite 5-10 times in the IDE:

{code:java}
Test testContinuousTextFileSourceWithTaskManagerFailover(org.apache.flink.connector.file.src.FileSourceTextLinesITCase) failed with: java.lang.AssertionError: Expected: ["And by opposing end them?--To die,--to sleep,--", "And enterprises of great pith and moment,", "And lose the name of action.--Soft you now!", "And makes us rather bear those ills we have", "And thus the native hue of resolution", "Be all my sins remember'd.", "But that the dread of something after death,--", "Devoutly to be wish'd. 
To die,--to sleep;--", "For in that sleep of death what dreams may come,", "For who would bear the whips and scorns of time,", "Is sicklied o'er with the pale cast of thought;", "Must give us pause: there's the respect", "No more; and by a sleep to say we end", "No traveller returns,--puzzles the will,", "Or to take arms against a sea of troubles,", "Than fly to others that we know not of?", "That flesh is heir to,--'tis a consummation", "That makes calamity of so long life;", "That patient merit of the unworthy takes,", "The fair Ophelia!--Nymph, in thy orisons", "The heartache, and the thousand natural shocks", "The insolence of office, and the spurns", "The oppressor's wrong, the proud man's contumely,", "The pangs of despis'd love, the law's delay,", "The slings and arrows of outrageous fortune", "The undiscover'd country, from whose bourn", "Thus conscience does make cowards of us all;", "To be, or not to be,--that is the question:--", "To grunt and sweat under a weary life,", "To sleep! perchance to dream:--ay, there's the rub;", "When he himself might his quietus make", "When we have shuffled off this mortal coil,", "Whether 'tis nobler in the mind to suffer", "With a bare bodkin? who would these fardels bear,", "With this regard, their currents turn awry,"] but: was ["And by opposing end them?--To die,--to sleep,--", "And enterprises of great pith and moment,", "And lose the name of action.--Soft you now!", "And makes us rather bear those ills we have", "And thus the native hue of resolution", "Be all my sins remember'd.", "But that the dread of something after death,--", "Devoutly to be wish'd. To die,--to sleep;--", "Devoutly to be wish'd. 
To die,--to sleep;--", "For in that sleep of death what dreams may come,", "For who would bear the whips and scorns of time,", "Is sicklied o'er with the pale cast of thought;", "Must give us pause: there's the respect", "No more; and by a sleep to say we end", "No more; and by a sleep to say we end", "No traveller returns,--puzzles the will,", "Or to take arms against a sea of troubles,", "Than fly to others that we know not of?", "That flesh is heir to,--'tis a consummation", "That flesh is heir to,--'tis a consummation", "That makes calamity of so long life;", "The fair Ophelia!--Nymph, in thy orisons", "The heartache, and the thousand natural shocks", "The heartache, and the thousand natural shocks", "The slings and arrows of outrageous fortune", "The undiscover'd country, from whose bourn", "Thus conscience does make cowards of us all;", "To be, or not to be,--that is the question:--", "To grunt and sweat under a weary life,", "To sleep! perchance to dream:--ay, there's the rub;", "To sleep! perchance to dream:--ay, there's the rub;", "When we have shuffled off this mortal coil,", "Whether 'tis nobler in the mind to suffer", "With a bare bodkin? who would these fardels bear,", "With this regard, their currents turn awry,"] at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) at org.junit.Assert.assertThat(Assert.java:956) at org.junit.Assert.assertThat(Assert.java:923) at org.apache.flink.connector.file.src.FileSourceTextLinesITCase.verifyResult(FileSourceTextLinesITCase.java:198) at org.apache.flink.connector.file.src.FileSourceTextLinesITCase.testContinuousTextFileSource(FileSourceTextLinesITCase.java:151) at org.apache.flink.connector.file.src.FileSourceTextLinesITCase.testContinuousTextFileSourceWithTaskManagerFailover(FileSourceTextLinesITCase.java:109) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[jira] [Created] (FLINK-20261) Uncaught exception in ExecutorNotifier due to split assignment broken by failed task
Andrey Zagrebin created FLINK-20261:
---

Summary: Uncaught exception in ExecutorNotifier due to split assignment broken by failed task
Key: FLINK-20261
URL: https://issues.apache.org/jira/browse/FLINK-20261
Project: Flink
Issue Type: Bug
Components: Connectors / FileSystem
Affects Versions: 1.12.0
Reporter: Andrey Zagrebin

While trying to extend FileSourceTextLinesITCase::testContinuousTextFileSource with a recovery test after a TM failure (TestingMiniCluster::terminateTaskExecutor, [branch|https://github.com/azagrebin/flink/tree/FLINK-20118-it]), I encountered the following case:
* SourceCoordinatorContext::assignSplits schedules an async assignment (all reader tasks alive)
* TestingMiniCluster::terminateTaskExecutor is called while testContinuousTextFileSource does writeFile in a loop
* this causes a graceful TaskExecutor::onStop shutdown
* which causes a TM/RM disconnect and failing slot allocations in the JM by the RM
* and eventually causes SourceCoordinatorContext::unregisterSourceReader
* the actual assignment starts (SourceCoordinatorContext::assignSplits: callInCoordinatorThread)
* the registeredReaders.containsKey(subtaskId) check fails with an IllegalArgumentException, which is uncaught in the single-thread executor
* this forces the thread pool to recreate the single thread
* which calls CoordinatorExecutorThreadFactory::newThread
* which fails the expected condition of single thread creation with an IllegalStateException, which is also uncaught
* this calls FatalExitExceptionHandler and exits the JVM abruptly
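The core of the reported failure mode can be reproduced outside Flink. The sketch below (hypothetical names, not Flink code) builds a single-thread executor whose ThreadFactory counts thread creations, analogous to CoordinatorExecutorThreadFactory: when a submitted task throws an uncaught exception, the worker thread dies and the pool asks the factory for a replacement, which is exactly the point where a strict "only one thread may ever be created" invariant would trip.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SingleThreadFactoryDemo {

    // Returns how many threads the factory was asked to create after one
    // task died with an uncaught exception and a second task was executed.
    public static int threadsRequestedAfterFailure() {
        try {
            AtomicInteger created = new AtomicInteger();
            ThreadFactory factory = r -> {
                created.incrementAndGet();
                Thread t = new Thread(r);
                // silence the default stack-trace printing for the demo
                t.setUncaughtExceptionHandler((thread, err) -> { });
                return t;
            };
            ExecutorService executor = Executors.newSingleThreadExecutor(factory);
            // execute() (not submit()) lets the exception reach the worker uncaught
            executor.execute(() -> { throw new IllegalArgumentException("reader not registered"); });
            Thread.sleep(100); // let the worker die
            // the pool must recreate the worker to run this task
            executor.execute(() -> { });
            executor.shutdown();
            executor.awaitTermination(5, TimeUnit.SECONDS);
            return created.get();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
```

A factory that throws on the second newThread call, as in the report, would therefore fail inside the pool's worker-replacement path, where only an uncaught-exception handler such as FatalExitExceptionHandler sees it.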
[jira] [Created] (FLINK-20171) Improve error message for Flink process memory configuration
Andrey Zagrebin created FLINK-20171:
---

Summary: Improve error message for Flink process memory configuration
Key: FLINK-20171
URL: https://issues.apache.org/jira/browse/FLINK-20171
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination
Affects Versions: 1.11.0
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
Fix For: 1.12.0, 1.11.0

Currently, all configuration failures result in an IllegalConfigurationException from JobManagerProcessUtils and TaskExecutorProcessUtils. The error messages do not mention the process type (JM or TM); it only becomes clear from the stack trace. We can wrap the main configuration calls (TaskExecutorProcessUtils::processSpecFromConfig and JobManagerProcessUtils::processSpecFromConfigWithNewOptionToInterpretLegacyHeap) in an extra try/catch where the IllegalConfigurationException is wrapped into another one that states the type of the process (JM or TM).
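The proposed wrapping could look like the following minimal sketch. All names here (ProcessMemoryConfig, processSpecFromConfig stand-in) are hypothetical; the real change would live in the Flink utils classes named above.

```java
public class ProcessMemoryConfig {

    // Local stand-in for Flink's IllegalConfigurationException.
    public static class IllegalConfigurationException extends RuntimeException {
        public IllegalConfigurationException(String msg) { super(msg); }
        public IllegalConfigurationException(String msg, Throwable cause) { super(msg, cause); }
    }

    // Stand-in for e.g. TaskExecutorProcessUtils::processSpecFromConfig.
    public static String processSpecFromConfig(boolean valid) {
        if (!valid) {
            throw new IllegalConfigurationException("Configured memory components exceed total process memory");
        }
        return "spec";
    }

    // The proposed wrapper: re-throw with the process type in the message,
    // keeping the original exception as the cause.
    public static String processSpecWithProcessType(String processType, boolean valid) {
        try {
            return processSpecFromConfig(valid);
        } catch (IllegalConfigurationException e) {
            throw new IllegalConfigurationException(
                "Failed to derive " + processType + " memory configuration: " + e.getMessage(), e);
        }
    }
}
```

With this, a failure surfaces as e.g. "Failed to derive JobManager memory configuration: ..." without the reader needing the stack trace.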
[jira] [Created] (FLINK-20078) Factor out an ExecutionGraph factory method for DefaultExecutionTopology
Andrey Zagrebin created FLINK-20078:
---

Summary: Factor out an ExecutionGraph factory method for DefaultExecutionTopology
Key: FLINK-20078
URL: https://issues.apache.org/jira/browse/FLINK-20078
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination
Reporter: Andrey Zagrebin

Based on [this PR discussion|https://github.com/apache/flink/pull/13958#discussion_r519676104].
[jira] [Created] (FLINK-19954) Move execution deployment tracking logic from legacy EG code to SchedulerNG
Andrey Zagrebin created FLINK-19954:
---

Summary: Move execution deployment tracking logic from legacy EG code to SchedulerNG
Key: FLINK-19954
URL: https://issues.apache.org/jira/browse/FLINK-19954
Project: Flink
Issue Type: Task
Components: Runtime / Coordination
Affects Versions: 1.12.0
Reporter: Andrey Zagrebin

FLINK-17075 introduced the execution state reconciliation between TM and JM. The reconciliation requires tracking the execution deployment state. The tracking logic was added to the legacy code of EG state handling, which is partially inactive as discussed in FLINK-19927. The recent state handling logic resides in the new SchedulerNG, currently DefaultScheduler. We could reconsider how the execution tracking for reconciliation is integrated with the scheduling. I think the tracking logic could be moved from Execution#deploy and EG#notifyExecutionChange to either SchedulerNG#updateTaskExecutionState or DefaultScheduler#deployTaskSafe. The latter currently looks more natural to me. ExecutionVertexOperations.deploy could return the submission future for deployment completion in ExecutionDeploymentTracker, and Execution#getTerminalFuture could be used to stop the tracking. This would also be easier to unit test.
[jira] [Created] (FLINK-19923) Remove BulkSlotProvider, its implementation and tests
Andrey Zagrebin created FLINK-19923:
---

Summary: Remove BulkSlotProvider, its implementation and tests
Key: FLINK-19923
URL: https://issues.apache.org/jira/browse/FLINK-19923
Project: Flink
Issue Type: Task
Components: Runtime / Coordination
Affects Versions: 1.12.0
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
Fix For: 1.12.0

BulkSlotProvider is no longer used: it was introduced for the removed OneSlotAllocator, which was replaced by SlotSharingExecutionSlotAllocator for the PipelinedRegionSchedulingStrategy.
[jira] [Created] (FLINK-19918) RocksIncrementalCheckpointRescalingTest.testScalingDown fails on Windows
Andrey Zagrebin created FLINK-19918:
---

Summary: RocksIncrementalCheckpointRescalingTest.testScalingDown fails on Windows
Key: FLINK-19918
URL: https://issues.apache.org/jira/browse/FLINK-19918
Project: Flink
Issue Type: Bug
Components: Runtime / State Backends
Reporter: Andrey Zagrebin

{code:java}
java.lang.NullPointerException
 at org.apache.flink.streaming.util.AbstractStreamOperatorTestHarness.close(AbstractStreamOperatorTestHarness.java:656)
 at org.apache.flink.contrib.streaming.state.RocksIncrementalCheckpointRescalingTest.closeHarness(RocksIncrementalCheckpointRescalingTest.java:357)
 at org.apache.flink.contrib.streaming.state.RocksIncrementalCheckpointRescalingTest.testScalingDown(RocksIncrementalCheckpointRescalingTest.java:276)
{code}
[jira] [Created] (FLINK-19917) RocksDBInitTest.testTempLibFolderDeletedOnFail fails on Windows
Andrey Zagrebin created FLINK-19917:
---

Summary: RocksDBInitTest.testTempLibFolderDeletedOnFail fails on Windows
Key: FLINK-19917
URL: https://issues.apache.org/jira/browse/FLINK-19917
Project: Flink
Issue Type: Bug
Components: Runtime / State Backends
Reporter: Andrey Zagrebin

{code:java}
java.lang.AssertionError:
Expected :0
Actual   :2
{code}
[jira] [Created] (FLINK-19860) Consider skipping restart and traverse regions which are already being restarted in RestartPipelinedRegionFailoverStrategy
Andrey Zagrebin created FLINK-19860:
---

Summary: Consider skipping restart and traverse regions which are already being restarted in RestartPipelinedRegionFailoverStrategy
Key: FLINK-19860
URL: https://issues.apache.org/jira/browse/FLINK-19860
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination
Reporter: Andrey Zagrebin

Original [discussion|https://github.com/apache/flink/pull/13749#pullrequestreview-516385846]. Follow-up for FLINK-19712.
[jira] [Created] (FLINK-19832) Improve handling of immediately failed physical slot in SlotSharingExecutionSlotAllocator
Andrey Zagrebin created FLINK-19832:
---

Summary: Improve handling of immediately failed physical slot in SlotSharingExecutionSlotAllocator
Key: FLINK-19832
URL: https://issues.apache.org/jira/browse/FLINK-19832
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination
Affects Versions: 1.12.0
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin

If a physical slot future immediately fails for a new SharedSlot in SlotSharingExecutionSlotAllocator#getOrAllocateSharedSlot but we continue to add logical slots to this SharedSlot, then eventually each logical slot also fails and gets removed from the SharedSlot, which gets released (state RELEASED). The subsequent logical slot additions in the loop of allocateLogicalSlotsFromSharedSlots will fail the scheduling with the ALLOCATED state check, because the state will be RELEASED. The subsequent bulk timeout check will also not find the SharedSlot and will fail with an NPE. Hence, a SharedSlot with an immediately failed physical slot future should not be kept in the SlotSharingExecutionSlotAllocator, and the logical slot requests depending on it can be returned as immediately failed. The bulk timeout check does not need to be started, because if some physical (and hence logical) slot requests fail, the whole bulk will be canceled by the scheduler. If the last assumption does not hold for future scheduling, this bulk failure might need additional explicit cancellation of pending requests. We expect to refactor this for the declarative scheduling anyway.
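The proposed handling can be sketched with plain CompletableFutures. All names below (SharedSlotAllocationSketch, allocateLogicalSlots) are hypothetical stand-ins, not the Flink API: if the physical slot future is already failed when the shared slot would be created, nothing is registered and the dependent logical slot requests come back already failed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

public class SharedSlotAllocationSketch {

    // group id -> physical slot future backing the group's shared slot
    public final Map<String, CompletableFuture<String>> sharedSlots = new HashMap<>();

    public List<CompletableFuture<String>> allocateLogicalSlots(
            String groupId, CompletableFuture<String> physicalSlot, int numLogicalSlots) {
        List<CompletableFuture<String>> logical = new ArrayList<>();
        if (physicalSlot.isCompletedExceptionally()) {
            // do not keep the shared slot; fail all dependent logical requests right away
            for (int i = 0; i < numLogicalSlots; i++) {
                CompletableFuture<String> failed = new CompletableFuture<>();
                failed.completeExceptionally(
                    new IllegalStateException("physical slot allocation failed"));
                logical.add(failed);
            }
            return logical;
        }
        sharedSlots.put(groupId, physicalSlot);
        for (int i = 0; i < numLogicalSlots; i++) {
            final int idx = i;
            logical.add(physicalSlot.thenApply(p -> p + "/logical-" + idx));
        }
        return logical;
    }
}
```

This mirrors the ticket's conclusion: no RELEASED shared slot is left behind for later state checks or the bulk timeout check to stumble over.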
[jira] [Created] (FLINK-19142) Investigate slot hijacking from preceding pipelined regions after failover
Andrey Zagrebin created FLINK-19142:
---

Summary: Investigate slot hijacking from preceding pipelined regions after failover
Key: FLINK-19142
URL: https://issues.apache.org/jira/browse/FLINK-19142
Project: Flink
Issue Type: Improvement
Reporter: Andrey Zagrebin

The ticket originates from [this PR discussion|https://github.com/apache/flink/pull/13181#discussion_r481087221]. The previous AllocationIDs are used by PreviousAllocationSlotSelectionStrategy to schedule subtasks into the slots where they were executed before a failover. If the previous slot (AllocationID) is not available, we do not want subtasks to take the previous slots (AllocationIDs) of other subtasks. The MergingSharedSlotProfileRetriever gets all previous AllocationIDs from SlotSharingExecutionSlotAllocator, but only those of the current bulk. The previous AllocationIDs of other bulks stay unknown. Therefore, the current bulk can potentially hijack the previous slots of the preceding bulks. On the other hand, the previous AllocationIDs of other tasks may be taken if those tasks are not going to run at the same time, e.g. if there are not enough resources after a failover or the other bulks are done. One way to address this may be to give MergingSharedSlotProfileRetriever all previous AllocationIDs of the bulks which are going to run at the same time.
[jira] [Created] (FLINK-18957) Implement bulk fulfil-ability timeout tracking for shared slots
Andrey Zagrebin created FLINK-18957:
---

Summary: Implement bulk fulfil-ability timeout tracking for shared slots
Key: FLINK-18957
URL: https://issues.apache.org/jira/browse/FLINK-18957
Project: Flink
Issue Type: Sub-task
Components: Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
Fix For: 1.12.0

Track the fulfil-ability of the required physical slots for all SharedSlot(s) (no matter whether they were created for this bulk or not) with a timeout. This ensures we will not wait indefinitely if the required slots for this bulk cannot all be fulfilled at the same time.
1. Create a LogicalSlotRequestBulk to track all physical slot requests and logical slot requests (only the logical slot requests which belong to the bulk)
2. Mark a physical slot request as fulfilled in the LogicalSlotRequestBulk once its future is done
3. If any physical slot request fails, clear the LogicalSlotRequestBulk to stop the fulfil-ability check
4. Schedule a fulfil-ability check in LogicalSlotRequestBulkChecker for the LogicalSlotRequestBulk
5. In case of timeout:
   - cancel/fail the logical slot futures of the bulk in the SharedSlot(s)
   - remove
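The timeout idea above can be sketched with a scheduled check over a bulk of futures. The names (SlotRequestBulkSketch, checkBulk) are hypothetical, and the sketch simplifies the real design to its core: after the timeout, any still-pending physical request in the bulk is failed so the bulk cannot wait forever.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SlotRequestBulkSketch {

    // Schedules one fulfil-ability check after timeoutMillis; returns whether
    // the whole bulk was fulfilled in time, failing the pending futures if not.
    public static boolean checkBulk(List<CompletableFuture<String>> physicalRequests,
                                    long timeoutMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            ScheduledFuture<Boolean> check = scheduler.schedule(() -> {
                boolean allFulfilled = physicalRequests.stream()
                    .allMatch(f -> f.isDone() && !f.isCompletedExceptionally());
                if (!allFulfilled) {
                    // timeout: cancel/fail the outstanding requests of the bulk
                    physicalRequests.forEach(f -> f.completeExceptionally(
                        new TimeoutException("Slot request bulk is not fulfillable")));
                }
                return allFulfilled;
            }, timeoutMillis, TimeUnit.MILLISECONDS);
            return check.get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            scheduler.shutdown();
        }
    }
}
```

In the real implementation, steps 2 and 3 of the list above would update the bulk incrementally instead of inspecting the futures at check time, but the failure action on timeout is the same.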
[jira] [Created] (FLINK-18751) Implement SlotSharingExecutionSlotAllocator
Andrey Zagrebin created FLINK-18751:
---

Summary: Implement SlotSharingExecutionSlotAllocator
Key: FLINK-18751
URL: https://issues.apache.org/jira/browse/FLINK-18751
Project: Flink
Issue Type: Sub-task
Components: Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin

SlotSharingExecutionSlotAllocator maintains a SharedSlot for each ExecutionSlotSharingGroup. SlotSharingExecutionSlotAllocator allocates physical slots for SharedSlot(s) and then allocates logical slots from them for scheduled tasks. In this way, the slot sharing hints can be respected in the ExecutionSlotAllocator, and we no longer need to rely on the SlotProvider to do the slot sharing matching. Co-location constraints will be respected, since co-located subtasks will be in the same ExecutionSlotSharingGroup. The physical slot will be lazily allocated for a SharedSlot, upon any hosted subtask asking for the SharedSlot. Each subsequent sharing subtask allocates a logical slot from the SharedSlot. The SharedSlot/physical slot can be released only if all the requested logical slots are released or canceled.

h4. Slot Allocation Process

When SlotSharingExecutionSlotAllocator receives a set of tasks to allocate slots for, it should do the following:
1. Map the tasks to ExecutionSlotSharingGroup(s)
2. Check which ExecutionSlotSharingGroup(s) already have SharedSlot(s)
3. For all involved ExecutionSlotSharingGroup(s) which do not have a SharedSlot yet:
   - Create a SlotProfile future with MergingSharedSlotProfileRetriever
   - Allocate a physical slot from the PhysicalSlotProvider
   - Create a SharedSlot based on the returned physical slot future
   - If the physical slot future fails, remove the SharedSlot
4. Allocate logical slot futures for the tasks from all corresponding SharedSlot(s)
5. Generate SlotExecutionVertexAssignment(s) based on the logical slot futures and return the results.
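The allocation process above reduces to a map from sharing groups to shared slots, where a physical slot is only allocated for groups that do not have one yet. The following sketch uses hypothetical names (SlotSharingSketch, allocateSlotsFor) and stands in for the real allocator only to illustrate the group-to-slot mapping:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SlotSharingSketch {

    public static class SharedSlot {
        public final String physicalSlotId;
        public final List<String> logicalSlots = new ArrayList<>();
        public SharedSlot(String physicalSlotId) { this.physicalSlotId = physicalSlotId; }
    }

    // ExecutionSlotSharingGroup id -> its SharedSlot
    public final Map<String, SharedSlot> sharedSlots = new HashMap<>();
    public int physicalAllocations = 0;

    // taskToGroup maps a task name to its sharing group id.
    public List<String> allocateSlotsFor(Map<String, String> taskToGroup) {
        List<String> assignments = new ArrayList<>();
        for (Map.Entry<String, String> e : taskToGroup.entrySet()) {
            // reuse the group's SharedSlot if it exists, else allocate a physical slot
            SharedSlot slot = sharedSlots.computeIfAbsent(
                e.getValue(), g -> new SharedSlot("physical-" + (++physicalAllocations)));
            // each task gets a logical slot from its group's shared slot
            slot.logicalSlots.add(e.getKey());
            assignments.add(e.getKey() + " -> " + slot.physicalSlotId);
        }
        return assignments;
    }
}
```

Two tasks in the same group thus share one physical slot, which is the slot-sharing behavior the ticket moves into the ExecutionSlotAllocator.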
[jira] [Created] (FLINK-18739) Implement MergingSharedSlotProfileRetriever
Andrey Zagrebin created FLINK-18739:
---

Summary: Implement MergingSharedSlotProfileRetriever
Key: FLINK-18739
URL: https://issues.apache.org/jira/browse/FLINK-18739
Project: Flink
Issue Type: Sub-task
Components: Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin

Input location preferences will be considered for each SharedSlot when allocating a physical slot for it. Input location preferences of a SharedSlot will be the merge of the input location preferences of all the tasks to run in this SharedSlot. Inter-ExecutionSlotSharingGroup input location preferences can be respected in this way for ExecutionSlotSharingGroups belonging to different bulks. If ExecutionSlotSharingGroups belong to the same bulk, the input location preferences are ignored because of possible cyclic dependencies. Later, we can optimise this case when the declarative resource management for reactive mode is ready. Intra-ExecutionSlotSharingGroup input location preferences will also be respected when creating ExecutionSlotSharingGroup(s) in DefaultSlotSharingStrategy.
[jira] [Created] (FLINK-18709) Implement PhysicalSlotProvider
Andrey Zagrebin created FLINK-18709:
---

Summary: Implement PhysicalSlotProvider
Key: FLINK-18709
URL: https://issues.apache.org/jira/browse/FLINK-18709
Project: Flink
Issue Type: Sub-task
Components: Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin

PhysicalSlotProviderImpl tries to allocate a physical slot from the available idle cached slots in the SlotPool. If that is not possible, it requests a new slot from the SlotPool.
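The described cache-then-request behavior can be sketched as follows; PhysicalSlotProviderSketch and its members are hypothetical names, and the "new slot" path is fulfilled synchronously here only to keep the sketch self-contained (in Flink it would be an asynchronous request to the SlotPool):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.CompletableFuture;

public class PhysicalSlotProviderSketch {

    // idle cached slots available for immediate reuse
    public final Deque<String> idleSlots = new ArrayDeque<>();
    public int newSlotCounter = 0;

    public CompletableFuture<String> allocatePhysicalSlot() {
        String cached = idleSlots.poll();
        if (cached != null) {
            // fulfilled immediately from the available idle slots
            return CompletableFuture.completedFuture(cached);
        }
        // otherwise request a new slot (synchronous stand-in for the async request)
        return CompletableFuture.completedFuture("new-slot-" + (++newSlotCounter));
    }
}
```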
[jira] [Created] (FLINK-18689) Deterministic Slot Sharing
Andrey Zagrebin created FLINK-18689:
---

Summary: Deterministic Slot Sharing
Key: FLINK-18689
URL: https://issues.apache.org/jira/browse/FLINK-18689
Project: Flink
Issue Type: New Feature
Components: Runtime / Coordination
Reporter: Andrey Zagrebin

[Design doc|https://docs.google.com/document/d/10CbCVDBJWafaFOovIXrR8nAr2BnZ_RAGFtUeTHiJplw]
[jira] [Created] (FLINK-18690) Implement LocalInputPreferredSlotSharingStrategy
Andrey Zagrebin created FLINK-18690:
---

Summary: Implement LocalInputPreferredSlotSharingStrategy
Key: FLINK-18690
URL: https://issues.apache.org/jira/browse/FLINK-18690
Project: Flink
Issue Type: Sub-task
Reporter: Andrey Zagrebin
Assignee: Zhu Zhu

Implement ExecutionSlotSharingGroup, SlotSharingStrategy and the default LocalInputPreferredSlotSharingStrategy. The default strategy would be LocalInputPreferredSlotSharingStrategy. It will try to reduce remote data exchanges: subtasks which are connected and belong to the same SlotSharingGroup tend to be put into the same ExecutionSlotSharingGroup. See the [design doc|https://docs.google.com/document/d/10CbCVDBJWafaFOovIXrR8nAr2BnZ_RAGFtUeTHiJplw/edit#heading=h.t4vfmm4atqoy]
[jira] [Created] (FLINK-18646) Managed memory released check can block RPC thread
Andrey Zagrebin created FLINK-18646:
---

Summary: Managed memory released check can block RPC thread
Key: FLINK-18646
URL: https://issues.apache.org/jira/browse/FLINK-18646
Project: Flink
Issue Type: Bug
Components: Runtime / Task
Affects Versions: 1.11.0
Reporter: Andrey Zagrebin

UnsafeMemoryBudget#verifyEmpty, called on slot freeing, needs time to wait for GC of all allocated/released managed memory. If there are a lot of segments to GC, the check can take a while to finish. If slot freeing happens in the RPC thread, the GC waiting can block it and the TM risks missing its heartbeat.
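Why the check blocks can be illustrated with a simplified budget model. MemoryBudgetSketch below is a hypothetical stand-in, not Flink's UnsafeMemoryBudget: released segments only return their bytes once GC runs their cleaners, so a verify loop on the calling thread (here: the RPC thread in the report) nudges the GC and sleeps until the budget is whole again or a timeout elapses.

```java
import java.util.concurrent.atomic.AtomicLong;

public class MemoryBudgetSketch {

    public final long totalBytes;
    public final AtomicLong available;

    public MemoryBudgetSketch(long totalBytes) {
        this.totalBytes = totalBytes;
        this.available = new AtomicLong(totalBytes);
    }

    public void reserve(long bytes) { available.addAndGet(-bytes); }

    // In Flink this would happen from a GC Cleaner of a released segment.
    public void release(long bytes) { available.addAndGet(bytes); }

    // Blocks the calling thread until all memory is back or the timeout hits,
    // triggering GC between checks -- time the caller cannot spend on RPCs.
    public boolean verifyEmpty(long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        try {
            while (available.get() < totalBytes) {
                if (System.currentTimeMillis() > deadline) {
                    return false;
                }
                System.gc(); // hope the cleaners of released segments run
                Thread.sleep(10);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return true;
    }
}
```

The busy-wait-with-GC loop is the blocking cost the ticket describes; moving it off the RPC thread, or bounding it, would avoid the missed heartbeats.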
[jira] [Created] (FLINK-18467) Document what can be reconfigured for state with TTL between job restarts
Andrey Zagrebin created FLINK-18467:
---

Summary: Document what can be reconfigured for state with TTL between job restarts
Key: FLINK-18467
URL: https://issues.apache.org/jira/browse/FLINK-18467
Project: Flink
Issue Type: Task
Components: Runtime / State Backends
Affects Versions: 1.8.4
Reporter: Andrey Zagrebin

Changing whether the state has TTL or not is not easy, as it requires migration, but changing how the expiration timestamp is treated is possible, e.g. the value of the TTL or when to update/remove it.
[jira] [Created] (FLINK-18454) Add a code contribution section about how to look for what to contribute
Andrey Zagrebin created FLINK-18454:
---

Summary: Add a code contribution section about how to look for what to contribute
Key: FLINK-18454
URL: https://issues.apache.org/jira/browse/FLINK-18454
Project: Flink
Issue Type: Task
Components: Project Website
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin

This section is to give general advice about browsing open Jira issues and starter tasks.
[jira] [Created] (FLINK-18309) Recommend avoiding uppercase to emphasise statements in doc style
Andrey Zagrebin created FLINK-18309:
---

Summary: Recommend avoiding uppercase to emphasise statements in doc style
Key: FLINK-18309
URL: https://issues.apache.org/jira/browse/FLINK-18309
Project: Flink
Issue Type: Improvement
Components: Project Website
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin

Some contributions tend to use uppercase in user docs to highlight and/or emphasise statements, for example: "you MUST use the latest version". This style may appear somewhat aggressive to users. Therefore, I suggest adding a recommendation not to use uppercase in user docs. We could highlight these statements as note paragraphs or with a less 'shouting' style, e.g. italics, to draw user attention.
[jira] [Created] (FLINK-18308) KafkaProducerTestBase->Kafka011ProducerExactlyOnceITCase. testExactlyOnceCustomOperator hangs in Azure
Andrey Zagrebin created FLINK-18308:
---

Summary: KafkaProducerTestBase->Kafka011ProducerExactlyOnceITCase.testExactlyOnceCustomOperator hangs in Azure
Key: FLINK-18308
URL: https://issues.apache.org/jira/browse/FLINK-18308
Project: Flink
Issue Type: Bug
Components: Connectors / Kafka
Affects Versions: 1.11.0
Reporter: Andrey Zagrebin

[https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=3267=logs=d44f43ce-542c-597d-bf94-b0718c71e5e8=03dca39c-73e8-5aaf-601d-328ae5c35f20]

For the last 3.5 hours, the test log ends with about 5000 entries like this:
{code:java}
2020-06-11T11:04:54.3299945Z 11:04:54,328 [FailingIdentityMapper Status Printer] INFO org.apache.flink.streaming.connectors.kafka.testutils.FailingIdentityMapper [] - > Failing mapper 0: count=690, totalCount=1000{code}
The problem was observed not on master but in [this PR|https://github.com/apache/flink/pull/12596]. The PR is a simple fatal-error-handling refactoring in the TM, so it looks unrelated. Another run of this PR in my Azure CI [passes|https://dev.azure.com/azagrebin/azagrebin/_build/results?buildId=214=results].
[jira] [Created] (FLINK-18250) Enrich OOM error messages with more details in ClusterEntrypoint
Andrey Zagrebin created FLINK-18250:
---

Summary: Enrich OOM error messages with more details in ClusterEntrypoint
Key: FLINK-18250
URL: https://issues.apache.org/jira/browse/FLINK-18250
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
Fix For: 1.11.0

Similar to what we added for the TM in [https://github.com/apache/flink/pull/11408], we should add more information and hints about out-of-memory failures to the JM.
[jira] [Created] (FLINK-17811) Update docker hub Flink page
Andrey Zagrebin created FLINK-17811:
---

Summary: Update docker hub Flink page
Key: FLINK-17811
URL: https://issues.apache.org/jira/browse/FLINK-17811
Project: Flink
Issue Type: Task
Components: Deployment / Docker
Reporter: Andrey Zagrebin

In FLINK-17161, we refactored the Flink docker images docs. We should also update and possibly link the related Flink docs about docker integration in the [docker hub Flink image description|https://hub.docker.com/_/flink?tab=description].
[jira] [Created] (FLINK-17740) Remove flink-container/kubernetes
Andrey Zagrebin created FLINK-17740:
---

Summary: Remove flink-container/kubernetes
Key: FLINK-17740
URL: https://issues.apache.org/jira/browse/FLINK-17740
Project: Flink
Issue Type: Sub-task
Components: Deployment / Docker, Deployment / Kubernetes
Reporter: Andrey Zagrebin
Assignee: Chesnay Schepler
Fix For: 1.11.0

FLINK-17161 added Kubernetes integration examples for the Job Cluster. FLINK-17656 copies the job service yaml from flink-container/kubernetes to the e2e Kubernetes Job Cluster test, making them independent. Therefore, we no longer need flink-container/kubernetes and it can be removed.
[jira] [Created] (FLINK-17652) Legacy JM heap options should fallback to new JVM_HEAP_MEMORY in standalone
Andrey Zagrebin created FLINK-17652:
---

Summary: Legacy JM heap options should fallback to new JVM_HEAP_MEMORY in standalone
Key: FLINK-17652
URL: https://issues.apache.org/jira/browse/FLINK-17652
Project: Flink
Issue Type: Bug
Components: Runtime / Configuration, Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
Fix For: 1.11.0

FLINK-16742 states that the legacy JM heap options should fall back to JobManagerOptions.JVM_HEAP_MEMORY in standalone scripts. BashJavaUtils#getJmResourceParams has been implemented to fall back to JobManagerOptions.TOTAL_FLINK_MEMORY instead. This should be fixed.
[jira] [Created] (FLINK-17546) Consider setting the number of TM CPU cores to the actual number of cores
Andrey Zagrebin created FLINK-17546:
---

Summary: Consider setting the number of TM CPU cores to the actual number of cores
Key: FLINK-17546
URL: https://issues.apache.org/jira/browse/FLINK-17546
Project: Flink
Issue Type: Improvement
Components: Runtime / Configuration
Reporter: Andrey Zagrebin

So far, we do not use the CPU cores resource in TaskExecutorResourceSpec. It was a preparation for dynamic slot/resource allocation (FLINK-14187). It is not fully clear how Flink or users would define the number of cores. We could consider setting the number of TM CPU cores to the actual number of cores by default, e.g. obtained from the OS in a standalone setup or from the container configuration.
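A minimal sketch of such a default, assuming a hypothetical helper (CpuCoresDefault is not a Flink class): an explicitly configured value wins, otherwise the JVM's view of the available processors is used. Note that on recent container-aware JVMs, Runtime#availableProcessors already reflects container CPU limits.

```java
public class CpuCoresDefault {

    // A configured value (> 0) wins; otherwise fall back to what the
    // OS/container reports to the JVM.
    public static double cpuCores(double configuredCores) {
        if (configuredCores > 0) {
            return configuredCores;
        }
        return Runtime.getRuntime().availableProcessors();
    }
}
```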
[jira] [Created] (FLINK-17465) Update Chinese user documentation for job manager memory model
Andrey Zagrebin created FLINK-17465:
---

Summary: Update Chinese user documentation for job manager memory model
Key: FLINK-17465
URL: https://issues.apache.org/jira/browse/FLINK-17465
Project: Flink
Issue Type: Sub-task
Reporter: Andrey Zagrebin
Fix For: 1.11.0

This is a follow-up for FLINK-16946.
[jira] [Created] (FLINK-17344) RecordWriterTest.testIdleTime possibly deadlocks on Travis
Andrey Zagrebin created FLINK-17344:
---

Summary: RecordWriterTest.testIdleTime possibly deadlocks on Travis
Key: FLINK-17344
URL: https://issues.apache.org/jira/browse/FLINK-17344
Project: Flink
Issue Type: Bug
Components: Runtime / Network
Affects Versions: 1.11.0
Reporter: Andrey Zagrebin
Fix For: 1.11.0

https://travis-ci.org/github/apache/flink/jobs/678193214

The test was introduced in [FLINK-16864|https://jira.apache.org/jira/browse/FLINK-16864]. It may be an instability, as the test passed twice (core and core-scala) and failed in core-hadoop: https://travis-ci.org/github/apache/flink/builds/678193199
[jira] [Created] (FLINK-17167) Extend entry point script and docs with history server mode
Andrey Zagrebin created FLINK-17167: --- Summary: Extend entry point script and docs with history server mode Key: FLINK-17167 URL: https://issues.apache.org/jira/browse/FLINK-17167 Project: Flink Issue Type: Sub-task Components: Deployment / Docker Reporter: Andrey Zagrebin Assignee: Sebastian J. Fix For: 1.11.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17166) Modify the log4j-console.properties to also output logs into the files for WebUI
Andrey Zagrebin created FLINK-17166: --- Summary: Modify the log4j-console.properties to also output logs into the files for WebUI Key: FLINK-17166 URL: https://issues.apache.org/jira/browse/FLINK-17166 Project: Flink Issue Type: Sub-task Components: Runtime / Configuration Reporter: Andrey Zagrebin Fix For: 1.11.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17165) Remove flink-container/docker
Andrey Zagrebin created FLINK-17165: --- Summary: Remove flink-container/docker Key: FLINK-17165 URL: https://issues.apache.org/jira/browse/FLINK-17165 Project: Flink Issue Type: Sub-task Components: Deployment / Docker Reporter: Andrey Zagrebin Fix For: 1.11.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17164) Extend entry point script and docs with job cluster mode and user job artefacts
Andrey Zagrebin created FLINK-17164: --- Summary: Extend entry point script and docs with job cluster mode and user job artefacts Key: FLINK-17164 URL: https://issues.apache.org/jira/browse/FLINK-17164 Project: Flink Issue Type: Sub-task Components: Deployment / Docker Reporter: Andrey Zagrebin Fix For: 1.11.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17163) Remove flink-contrib/docker-flink
Andrey Zagrebin created FLINK-17163: --- Summary: Remove flink-contrib/docker-flink Key: FLINK-17163 URL: https://issues.apache.org/jira/browse/FLINK-17163 Project: Flink Issue Type: Sub-task Components: Deployment / Docker Reporter: Andrey Zagrebin Fix For: 1.11.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17162) Document examples of how to extend the official docker hub image
Andrey Zagrebin created FLINK-17162: --- Summary: Document examples of how to extend the official docker hub image Key: FLINK-17162 URL: https://issues.apache.org/jira/browse/FLINK-17162 Project: Flink Issue Type: Sub-task Components: Deployment / Docker, Documentation Reporter: Andrey Zagrebin Fix For: 1.11.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17161) Document the official docker hub image and examples of how to run
Andrey Zagrebin created FLINK-17161: --- Summary: Document the official docker hub image and examples of how to run Key: FLINK-17161 URL: https://issues.apache.org/jira/browse/FLINK-17161 Project: Flink Issue Type: Sub-task Components: Deployment / Docker, Documentation Reporter: Andrey Zagrebin Fix For: 1.11.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17160) FLIP-111: Docker image unification
Andrey Zagrebin created FLINK-17160: --- Summary: FLIP-111: Docker image unification Key: FLINK-17160 URL: https://issues.apache.org/jira/browse/FLINK-17160 Project: Flink Issue Type: Improvement Components: Deployment / Docker, Dockerfiles Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin Fix For: 1.11.0 This is an umbrella issue for [FLIP-111|https://cwiki.apache.org/confluence/display/FLINK/FLIP-111%3A+Docker+image+unification]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17048) Add memory related JVM args to Mesos JM startup scripts
Andrey Zagrebin created FLINK-17048: --- Summary: Add memory related JVM args to Mesos JM startup scripts Key: FLINK-17048 URL: https://issues.apache.org/jira/browse/FLINK-17048 Project: Flink Issue Type: Sub-task Components: Runtime / Configuration, Runtime / Coordination Reporter: Andrey Zagrebin It looks like we never respected the memory configuration in the Mesos JM startup scripts (mesos-appmaster.sh and mesos-appmaster-job.sh). Now we have a chance to adopt FLIP-116 here as well, similar to what we are doing with the standalone scripts in FLINK-16742. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-16946) Update user documentation for job manager memory model
Andrey Zagrebin created FLINK-16946: --- Summary: Update user documentation for job manager memory model Key: FLINK-16946 URL: https://issues.apache.org/jira/browse/FLINK-16946 Project: Flink Issue Type: Sub-task Components: Documentation, Runtime / Configuration, Runtime / Coordination Reporter: Andrey Zagrebin Fix For: 1.11.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-16754) Consider refactoring of ProcessMemoryUtilsTestBase to avoid inheritance
Andrey Zagrebin created FLINK-16754: --- Summary: Consider refactoring of ProcessMemoryUtilsTestBase to avoid inheritance Key: FLINK-16754 URL: https://issues.apache.org/jira/browse/FLINK-16754 Project: Flink Issue Type: Sub-task Components: Runtime / Configuration, Runtime / Coordination Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin Fix For: 1.11.0 After FLINK-16615 we have a class structure of memory utils with isolation of responsibilities, mostly through composition. We should consider refactoring the tests as well to get more abstraction-targeted tests with better isolation and without implicit test inheritance contracts. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-16742) Extend and use BashJavaUtils to start JM JVM process and pass JVM memory args
Andrey Zagrebin created FLINK-16742: --- Summary: Extend and use BashJavaUtils to start JM JVM process and pass JVM memory args Key: FLINK-16742 URL: https://issues.apache.org/jira/browse/FLINK-16742 Project: Flink Issue Type: Sub-task Components: Deployment / Scripts, Runtime / Configuration, Runtime / Coordination Reporter: Andrey Zagrebin Currently, the legacy option `_jobmanager.heap.size_` (or `_jobmanager.heap.mb_`) is used in the JM standalone bash scripts to pass the JVM heap size arg and start the JM JVM process. BashJavaUtils should be extended to get the JVM memory arg list string from the Flink configuration. BashJavaUtils can use JobManagerProcessUtils#processSpecFromConfig to obtain a JobManagerProcessSpec. The JobManagerProcessSpec can then be passed to ProcessMemoryUtils#generateJvmParametersStr to get the JVM memory arg list string. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-16746) Deprecate/remove legacy memory options for JM and expose the new ones
Andrey Zagrebin created FLINK-16746: --- Summary: Deprecate/remove legacy memory options for JM and expose the new ones Key: FLINK-16746 URL: https://issues.apache.org/jira/browse/FLINK-16746 Project: Flink Issue Type: Sub-task Components: Runtime / Configuration, Runtime / Coordination Reporter: Andrey Zagrebin Fix For: 1.11.0 Deprecate legacy heap options: `_jobmanager.heap.size_` (and update `_jobmanager.heap.mb_`) Remove container cut-off options: `_containerized.heap-cutoff-ratio_` and `_containerized.heap-cutoff-min_` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-16745) Use JobManagerProcessUtils to start JM container and pass JVM memory args
Andrey Zagrebin created FLINK-16745: --- Summary: Use JobManagerProcessUtils to start JM container and pass JVM memory args Key: FLINK-16745 URL: https://issues.apache.org/jira/browse/FLINK-16745 Project: Flink Issue Type: Sub-task Components: Deployment / Kubernetes, Deployment / Mesos, Deployment / YARN, Runtime / Configuration, Runtime / Coordination Reporter: Andrey Zagrebin Fix For: 1.11.0 JobManagerProcessUtils#processSpecFromConfig should be used to get a JobManagerProcessSpec. The JobManagerProcessSpec can then be passed to ProcessMemoryUtils#generateJvmParametersStr to get the JVM memory arg list string. The configuration should be fixed to fall back to JobManagerOptions.TOTAL_PROCESS_MEMORY if a legacy option is set (JobManagerProcessUtils#getConfigurationWithLegacyHeapSizeMappedToNewConfigOption) before passing it to JobManagerProcessUtils#processSpecFromConfig. The JVM memory arg list can then be used to start the JM container in the Yarn/Mesos/Kubernetes active RMs instead of using the existing legacy heap options: `_jobmanager.heap.size_` (or `_jobmanager.heap.mb_`). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-16686) [State TTL] Make user class loader available in native RocksDB compaction thread
Andrey Zagrebin created FLINK-16686: --- Summary: [State TTL] Make user class loader available in native RocksDB compaction thread Key: FLINK-16686 URL: https://issues.apache.org/jira/browse/FLINK-16686 Project: Flink Issue Type: Bug Components: Runtime / State Backends Affects Versions: 1.8.0 Reporter: Andrey Zagrebin The issue is initially reported [here|https://stackoverflow.com/questions/60745711/flink-kryo-serializer-because-chill-serializer-couldnt-be-found]. The problem is that the Java code of the Flink compaction filter is called from RocksDB's native C++ code. It is called in the context of the native compaction thread. RocksDB has utilities to create a Java thread context for the Flink Java callback. Presumably, the Java thread context class loader is not set at all, and if it is queried, it produces a NullPointerException. The reported setup uses list state with TTL. The compaction filter has to deserialise elements to check expiration. The deserialiser relies on Kryo, which queries the thread context class loader; it is expected to be the user class loader of the task but turns out to be null. We should investigate how to pass the user class loader to the compaction thread of the list state with TTL. -- This message was sent by Atlassian Jira (v8.3.4#803005)
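The direction of the fix can be sketched in plain Java: before invoking the Flink callback on the native compaction thread, install the user class loader as the thread context class loader and restore the previous one afterwards. The helper below is hypothetical and only illustrates the pattern, it is not Flink's or RocksDB's actual code:

```java
public class ContextClassLoaderGuard {

    // Runs the callback with the given class loader installed as the thread
    // context class loader, restoring the previous value afterwards so the
    // (possibly pooled) native thread is left unchanged.
    static void runWithUserClassLoader(ClassLoader userLoader, Runnable callback) {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        current.setContextClassLoader(userLoader);
        try {
            callback.run();
        } finally {
            current.setContextClassLoader(previous);
        }
    }

    public static void main(String[] args) {
        ClassLoader user = ContextClassLoaderGuard.class.getClassLoader();
        runWithUserClassLoader(user, () ->
            System.out.println("context class loader set: "
                + (Thread.currentThread().getContextClassLoader() == user)));
    }
}
```

With this in place, a Kryo-based deserialiser querying the thread context class loader inside the callback would see the user class loader instead of null.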
[jira] [Created] (FLINK-16615) Introduce data structures and utilities to calculate Job Manager memory components
Andrey Zagrebin created FLINK-16615: --- Summary: Introduce data structures and utilities to calculate Job Manager memory components Key: FLINK-16615 URL: https://issues.apache.org/jira/browse/FLINK-16615 Project: Flink Issue Type: Sub-task Components: Runtime / Configuration, Runtime / Coordination Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin Fix For: 1.11.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-16614) FLIP-116 Unified Memory Configuration for Job Manager
Andrey Zagrebin created FLINK-16614: --- Summary: FLIP-116 Unified Memory Configuration for Job Manager Key: FLINK-16614 URL: https://issues.apache.org/jira/browse/FLINK-16614 Project: Flink Issue Type: Improvement Components: Runtime / Configuration, Runtime / Coordination Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin Fix For: 1.11.0 This is the umbrella issue of [FLIP-116: Unified Memory Configuration for Job Managers|https://cwiki.apache.org/confluence/display/FLINK/FLIP+116%3A+Unified+Memory+Configuration+for+Job+Managers]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-16406) Increase default value for JVM Metaspace to minimise its OutOfMemoryError
Andrey Zagrebin created FLINK-16406: --- Summary: Increase default value for JVM Metaspace to minimise its OutOfMemoryError Key: FLINK-16406 URL: https://issues.apache.org/jira/browse/FLINK-16406 Project: Flink Issue Type: Improvement Components: Runtime / Configuration, Runtime / Task Affects Versions: 1.10.0 Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin Fix For: 1.10.1, 1.11.0 With FLIP-49 ([FLINK-13980|https://issues.apache.org/jira/browse/FLINK-13980]), we introduced a limit for JVM Metaspace ('taskmanager.memory.jvm-metaspace.size') when the TM JVM process is started. It caused '_OutOfMemoryError: Metaspace_' for some use cases after upgrading to the latest 1.10 version. In some cases, a real class loading leak has been discovered, like in [FLINK-16142|https://issues.apache.org/jira/browse/FLINK-16142]. Some users had to increase the default value to accommodate their use cases (mostly from 96Mb to 256Mb). While this limit was introduced to properly plan Flink resources, especially for container environments, and to detect class loading leaks, the user experience should be as smooth as possible. One way is to provide good documentation for this change ([FLINK-16278|https://issues.apache.org/jira/browse/FLINK-16278]). Another question is the sanity of the default value. It is still arguable what the default value should be (currently 96Mb). In general, the size depends on the use case (job user code, how many jobs are deployed in the cluster, etc.). This issue tries to tackle this problem by first increasing it to 256Mb. We also want to poll which Metaspace setting resolved the _OutOfMemoryError_. Please, if you encountered this problem, report any relevant specifics of your job here, and your Metaspace size if there was no class loading leak. -- This message was sent by Atlassian Jira (v8.3.4#803005)
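Until the default changes, affected users who have verified there is no class loading leak can raise the limit themselves in flink-conf.yaml, for example:

```yaml
# Raise the JVM Metaspace limit from the 1.10 default (96m) to the value
# this issue proposes as the new default.
taskmanager.memory.jvm-metaspace.size: 256m
```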
[jira] [Created] (FLINK-16198) FileUtilsTest fails on Mac OS
Andrey Zagrebin created FLINK-16198: --- Summary: FileUtilsTest fails on Mac OS Key: FLINK-16198 URL: https://issues.apache.org/jira/browse/FLINK-16198 Project: Flink Issue Type: Bug Components: FileSystems, Tests Reporter: Andrey Zagrebin The following tests fail if run on Mac OS (IDE/maven). FileUtilsTest.testCompressionOnRelativePath: {code:java} java.nio.file.NoSuchFileException: ../../../../../var/folders/67/v4yp_42d21j6_n8k1h556h0cgn/T/junit6496651678375117676/compressDir/rootDir at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384) at java.nio.file.Files.createDirectory(Files.java:674) at org.apache.flink.util.FileUtilsTest.verifyDirectoryCompression(FileUtilsTest.java:440) at org.apache.flink.util.FileUtilsTest.testCompressionOnRelativePath(FileUtilsTest.java:261) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) at org.junit.rules.RunRules.evaluate(RunRules.java:20) at 
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at org.junit.runners.ParentRunner.run(ParentRunner.java:363) at org.junit.runner.JUnitCore.run(JUnitCore.java:137) at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68) at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47) at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242) at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70) {code} FileUtilsTest.testDeleteDirectoryConcurrently {code:java} java.nio.file.FileSystemException: /var/folders/67/v4yp_42d21j6_n8k1h556h0cgn/T/junit7558825557740784886/junit3566161583262218465/ab1fa0bde8b22cad58b717508c7a7300/121fdf5f7b057183843ed2e1298f9b66/6598025f390d3084d69c98b36e542fe2/8db7cd9c063396a19a86f5b63ce53f66: Invalid argument at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:244) at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108) at java.nio.file.Files.deleteIfExists(Files.java:1165) at org.apache.flink.util.FileUtils.deleteFileOrDirectoryInternal(FileUtils.java:324) at org.apache.flink.util.FileUtils.guardIfWindows(FileUtils.java:391) at 
org.apache.flink.util.FileUtils.deleteFileOrDirectory(FileUtils.java:258) at org.apache.flink.util.FileUtils.cleanDirectoryInternal(FileUtils.java:376) at org.apache.flink.util.FileUtils.deleteDirectoryInternal(FileUtils.java:335) at org.apache.flink.util.FileUtils.deleteFileOrDirectoryInternal(FileUtils.java:320) at org.apache.flink.util.FileUtils.guardIfWindows(FileUtils.java:391) at org.apache.flink.util.FileUtils.deleteFileOrDirectory(FileUtils.java:258) at org.apache.flink.util.FileUtils.cleanDirectoryInternal(FileUtils.java:376) at org.apache.flink.util.FileUtils.deleteDirectoryInternal(FileUtils.java:335) at
[jira] [Created] (FLINK-15991) Create Chinese documentation for FLIP-49 TM memory model
Andrey Zagrebin created FLINK-15991: --- Summary: Create Chinese documentation for FLIP-49 TM memory model Key: FLINK-15991 URL: https://issues.apache.org/jira/browse/FLINK-15991 Project: Flink Issue Type: Task Components: Documentation Reporter: Andrey Zagrebin Assignee: Xintong Song Fix For: 1.10.0, 1.11.0 Chinese translation of FLINK-15143 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15989) Rewrap OutOfMemoryError in allocateUnpooledOffHeap with better message
Andrey Zagrebin created FLINK-15989: --- Summary: Rewrap OutOfMemoryError in allocateUnpooledOffHeap with better message Key: FLINK-15989 URL: https://issues.apache.org/jira/browse/FLINK-15989 Project: Flink Issue Type: Improvement Reporter: Andrey Zagrebin Fix For: 1.10.1, 1.11.0 Now, if Flink allocates direct memory in MemorySegmentFactory#allocateUnpooledOffHeapMemory and its limit is exceeded for any reason, e.g. user code over-allocated direct memory, ByteBuffer#allocateDirect will throw a generic "OutOfMemoryError: Direct buffer memory". We can catch it and add a message which provides more explanation and points to the option taskmanager.memory.task.off-heap.size, which can be increased as a possible solution. -- This message was sent by Atlassian Jira (v8.3.4#803005)
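The proposed rewrapping could look roughly like the sketch below; the exact wording and the surrounding factory code in Flink may differ:

```java
import java.nio.ByteBuffer;

public class DirectMemoryAllocator {

    // Allocates a direct buffer, enriching the generic JVM error with a
    // hint about which Flink option to increase.
    static ByteBuffer allocateUnpooledOffHeap(int size) {
        try {
            return ByteBuffer.allocateDirect(size);
        } catch (OutOfMemoryError e) {
            throw new OutOfMemoryError(
                "Direct buffer memory. Consider increasing the Flink option "
                    + "'taskmanager.memory.task.off-heap.size' if user code "
                    + "needs more direct memory. Original message: "
                    + e.getMessage());
        }
    }
}
```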
[jira] [Created] (FLINK-15946) Task manager Kubernetes pods take long time to terminate
Andrey Zagrebin created FLINK-15946: --- Summary: Task manager Kubernetes pods take long time to terminate Key: FLINK-15946 URL: https://issues.apache.org/jira/browse/FLINK-15946 Project: Flink Issue Type: Bug Components: Deployment / Kubernetes, Runtime / Coordination Affects Versions: 1.10.0 Reporter: Andrey Zagrebin The problem is initially described in this [ML thread|https://mail-archives.apache.org/mod_mbox/flink-user/202002.mbox/browser]. We should investigate whether, and if so why, the TM pod killing/shutdown is delayed by reconnection attempts to the terminated JM. cc [~fly_in_gis] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15942) Improve logging of infinite resource profile
Andrey Zagrebin created FLINK-15942: --- Summary: Improve logging of infinite resource profile Key: FLINK-15942 URL: https://issues.apache.org/jira/browse/FLINK-15942 Project: Flink Issue Type: Improvement Components: Runtime / Configuration, Runtime / Task Affects Versions: 1.10.0 Reporter: Andrey Zagrebin After we set task memory and CPU to infinity in FLINK-15763, it spoiled the logs: {code:java} 00:23:49,442 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl - Free slot TaskSlot(index:0, state:ACTIVE, resource profile: ResourceProfile{cpuCores=44942328371557892500., taskHeapMemory=2097152.000tb (2305843009213693951 bytes), taskOffHeapMemory=2097152.000tb (2305843009213693951 bytes), managedMemory=20.000mb (20971520 bytes), networkMemory=16.000mb (16777216 bytes)}, allocationId: 349dacfbf1ac4d0b44a2d11e1976d264, jobId: 689a0cf24b40f16b6f45157f78754c46). {code} We should treat infinity as a special case and print it accordingly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
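The special-casing could be sketched as follows; the formatting helper is hypothetical (Flink's ResourceProfile internals differ), and the sentinel constant matches the 2305843009213693951-byte value visible in the log above:

```java
public class ResourceValueFormatter {

    // Sentinel used here as the "infinite" memory value; it equals
    // Long.MAX_VALUE >> 2 = 2305843009213693951, the value from the log.
    static final long INFINITE_BYTES = Long.MAX_VALUE >> 2;

    // Prints a human-readable size, treating the sentinel as "infinite"
    // instead of spelling out an astronomic number of terabytes.
    static String formatBytes(long bytes) {
        if (bytes >= INFINITE_BYTES) {
            return "infinite";
        }
        return (bytes >> 20) + "mb (" + bytes + " bytes)";
    }
}
```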
[jira] [Created] (FLINK-15774) Consider adding an explicit network memory size config option
Andrey Zagrebin created FLINK-15774: --- Summary: Consider adding an explicit network memory size config option Key: FLINK-15774 URL: https://issues.apache.org/jira/browse/FLINK-15774 Project: Flink Issue Type: Improvement Components: Runtime / Configuration, Runtime / Task Reporter: Andrey Zagrebin See the [PR discussion|https://github.com/apache/flink/pull/10946#discussion_r371169251] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15763) Set necessary resource configuration options to defaults for local execution ignoring FLIP-49
Andrey Zagrebin created FLINK-15763: --- Summary: Set necessary resource configuration options to defaults for local execution ignoring FLIP-49 Key: FLINK-15763 URL: https://issues.apache.org/jira/browse/FLINK-15763 Project: Flink Issue Type: Improvement Components: Runtime / Task Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin Fix For: 1.10.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15758) Investigate potential out-of-memory problems due to managed unsafe memory allocation
Andrey Zagrebin created FLINK-15758: --- Summary: Investigate potential out-of-memory problems due to managed unsafe memory allocation Key: FLINK-15758 URL: https://issues.apache.org/jira/browse/FLINK-15758 Project: Flink Issue Type: Task Components: Runtime / Task Reporter: Andrey Zagrebin Fix For: 1.11.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15741) Fix TTL docs after enabling RocksDB compaction filter by default
Andrey Zagrebin created FLINK-15741: --- Summary: Fix TTL docs after enabling RocksDB compaction filter by default Key: FLINK-15741 URL: https://issues.apache.org/jira/browse/FLINK-15741 Project: Flink Issue Type: Task Components: Runtime / State Backends Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin Fix For: 1.10.0 The RocksDB compaction filter is always enabled by default after [FLINK-15506|https://issues.apache.org/jira/browse/FLINK-15506] and we deprecated its disabling. The docs should not refer to its enabling/disabling. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15621) State TTL: Remove deprecated option and method to disable TTL compaction filter
Andrey Zagrebin created FLINK-15621: --- Summary: State TTL: Remove deprecated option and method to disable TTL compaction filter Key: FLINK-15621 URL: https://issues.apache.org/jira/browse/FLINK-15621 Project: Flink Issue Type: Task Reporter: Andrey Zagrebin Fix For: 1.11.0 Follow-up for FLINK-15506. * Remove RocksDBOptions#TTL_COMPACT_FILTER_ENABLED * Remove RocksDBStateBackend#enableTtlCompactionFilter/isTtlCompactionFilterEnabled/disableTtlCompactionFilter, also in the Python API * Clean up the code of this flag and its tests, also in the Python API * Check for any related code in [frocksdb|https://github.com/dataArtisans/frocksdb] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15620) State TTL: Remove deprecated enable default background cleanup
Andrey Zagrebin created FLINK-15620: --- Summary: State TTL: Remove deprecated enable default background cleanup Key: FLINK-15620 URL: https://issues.apache.org/jira/browse/FLINK-15620 Project: Flink Issue Type: Task Components: Runtime / State Backends Reporter: Andrey Zagrebin Fix For: 1.11.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15606) Deprecate enable default background cleanup of state with TTL
Andrey Zagrebin created FLINK-15606: --- Summary: Deprecate enable default background cleanup of state with TTL Key: FLINK-15606 URL: https://issues.apache.org/jira/browse/FLINK-15606 Project: Flink Issue Type: Improvement Components: Runtime / State Backends Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin Fix For: 1.10.0 Follow-up for FLINK-14898. Enabling TTL without any background cleanup does not make much sense, so we can keep cleanup always enabled; only the cleanup settings can be tweaked for particular backends. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15605) Remove deprecated in 1.9 StateTtlConfig.TimeCharacteristic
Andrey Zagrebin created FLINK-15605: --- Summary: Remove deprecated in 1.9 StateTtlConfig.TimeCharacteristic Key: FLINK-15605 URL: https://issues.apache.org/jira/browse/FLINK-15605 Project: Flink Issue Type: Improvement Components: Runtime / State Backends Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin Fix For: 1.10.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15597) Relax sanity check of JVM memory overhead to be within its min/max
Andrey Zagrebin created FLINK-15597: --- Summary: Relax sanity check of JVM memory overhead to be within its min/max Key: FLINK-15597 URL: https://issues.apache.org/jira/browse/FLINK-15597 Project: Flink Issue Type: Improvement Components: Runtime / Configuration, Runtime / Task Reporter: Andrey Zagrebin Assignee: Xintong Song Fix For: 1.10.0 When the explicitly configured process and Flink memory sizes are verified with the JVM meta space and overhead, JVM overhead does not have to be the exact fraction. It can be just within its min/max range, similar to how it is now for network/shuffle memory check after FLINK-15300. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15519) Preserve logs from BashJavaUtils and make them part of TM logs
Andrey Zagrebin created FLINK-15519: --- Summary: Preserve logs from BashJavaUtils and make them part of TM logs Key: FLINK-15519 URL: https://issues.apache.org/jira/browse/FLINK-15519 Project: Flink Issue Type: Improvement Components: Runtime / Configuration Affects Versions: 1.10.0 Reporter: Andrey Zagrebin Fix For: 1.10.0 In FLINK-13983 we introduced the BashJavaUtils utility, called in taskmanager.sh before starting the TM to calculate the memory configuration for the TM JVM process. Ideally, it would be nice to preserve the BashJavaUtils logs and make them part of the TM logs. Currently, logging for BashJavaUtils is configured from the class path and can differ from TM logging. Moreover, TM logging can overwrite the BashJavaUtils logs even if we align their logging configurations (e.g. log4j.appender.file.append=false in the default log4j.properties for Flink). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15517) Use back 'network' in 'shuffle' memory config option names
Andrey Zagrebin created FLINK-15517: --- Summary: Use back 'network' in 'shuffle' memory config option names Key: FLINK-15517 URL: https://issues.apache.org/jira/browse/FLINK-15517 Project: Flink Issue Type: Improvement Components: Runtime / Configuration, Runtime / Task Affects Versions: 1.10.0 Reporter: Andrey Zagrebin Fix For: 1.10.0 The issue is based on the discussion outcome on the [Dev ML|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Some-feedback-after-trying-out-the-new-FLIP-49-memory-configurations-td36129.html]. As there is no strong consensus about using the new 'shuffle' naming for memory config options, we decided to fall back to the existing naming and use 'network' instead of 'shuffle'. We are still keeping the new naming convention with the prefix 'taskmanager.memory.*'. Basically, we just need to rename 'taskmanager.memory.shuffle.*' to 'taskmanager.memory.network.*', as the 'shuffle' naming has never been released. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15300) Shuffle memory fraction sanity check does not account for its min/max limit
Andrey Zagrebin created FLINK-15300: --- Summary: Shuffle memory fraction sanity check does not account for its min/max limit Key: FLINK-15300 URL: https://issues.apache.org/jira/browse/FLINK-15300 Project: Flink Issue Type: Bug Components: Runtime / Configuration Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin Fix For: 1.10.0 If a configuration results in the shuffle memory size being set to its min or max value rather than derived from the fraction, then during TM startup the generated dynamic properties are parsed and the sanity check (TaskExecutorResourceUtils#sanityCheckShuffleMemory) fails because it checks the exact fraction against the min/max value. For example, start a TM with the following Flink config: {code:java} taskmanager.memory.total-flink.size: 350m taskmanager.memory.framework.heap.size: 16m taskmanager.memory.shuffle.fraction: 0.1{code} It will result in the following extra program args: {code:java} taskmanager.memory.shuffle.max: 67108864b taskmanager.memory.framework.off-heap.size: 134217728b taskmanager.memory.managed.size: 146800642b taskmanager.cpu.cores: 1.0 taskmanager.memory.task.heap.size: 2097150b taskmanager.memory.task.off-heap.size: 0b taskmanager.memory.shuffle.min: 67108864b{code} where the size derived from the fraction was less than the shuffle memory min size (64mb), so it was set to the min value: 64mb. While the TM starts, TaskExecutorResourceUtils#sanityCheckShuffleMemory throws the following exception: {code:java} org.apache.flink.configuration.IllegalConfigurationException: Derived Shuffle Memory size(64 Mb (67108864 bytes)) does not match configured Shuffle Memory fraction (0.1000149011612). 
at org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.sanityCheckShuffleMemory(TaskExecutorResourceUtils.java:552) at org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.deriveResourceSpecWithExplicitTaskAndManagedMemory(TaskExecutorResourceUtils.java:183) at org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:135) {code} This can be fixed by checking whether the fraction to assert is within the min/max range. -- This message was sent by Atlassian Jira (v8.3.4#803005)
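The fix described above, accepting a derived size that was clamped to the min/max bound, can be sketched with a range-aware check. The helper below is hypothetical and not the actual TaskExecutorResourceUtils code:

```java
public class ShuffleMemorySanityCheck {

    // Returns true if the derived shuffle memory size is consistent with the
    // configured fraction, treating values clamped to min/max as valid
    // instead of demanding an exact fraction match.
    static boolean isConsistent(long derivedBytes, long totalBytes,
                                double fraction, long minBytes, long maxBytes) {
        long fromFraction = (long) (totalBytes * fraction);
        if (fromFraction <= minBytes) {
            return derivedBytes == minBytes;   // clamped to the lower bound
        }
        if (fromFraction >= maxBytes) {
            return derivedBytes == maxBytes;   // clamped to the upper bound
        }
        return derivedBytes == fromFraction;   // within range: exact match
    }
}
```

With the config from the issue (total 350m, fraction 0.1, min = max = 64mb), the fraction-derived size falls below the min, so the 64mb value is accepted instead of failing.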
[jira] [Created] (FLINK-15198) Remove deprecated mesos.resourcemanager.tasks.mem in 1.11
Andrey Zagrebin created FLINK-15198: --- Summary: Remove deprecated mesos.resourcemanager.tasks.mem in 1.11 Key: FLINK-15198 URL: https://issues.apache.org/jira/browse/FLINK-15198 Project: Flink Issue Type: Task Components: Deployment / Mesos Affects Versions: 1.11.0 Reporter: Andrey Zagrebin Fix For: 1.11.0 In FLINK-15082, we deprecated 'mesos.resourcemanager.tasks.mem' in favour of the new unified option 'taskmanager.memory.total-process.size' from FLIP-49. We should remove it now in 1.11. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15094) Warning about using private constructor of java.nio.DirectByteBuffer in Java 11
Andrey Zagrebin created FLINK-15094: --- Summary: Warning about using private constructor of java.nio.DirectByteBuffer in Java 11 Key: FLINK-15094 URL: https://issues.apache.org/jira/browse/FLINK-15094 Project: Flink Issue Type: Improvement Components: Runtime / Task Reporter: Andrey Zagrebin The unsafe off-heap in [FLINK-13985|https://jira.apache.org/jira/browse/FLINK-13985] was implemented by hacking into a private constructor of java.nio.DirectByteBuffer. This causes undesirable warnings in Java 11: {code:java} WARNING: An illegal reflective access operation has occurred WARNING: Illegal reflective access by org.apache.flink.core.memory.MemoryUtils (file:/C:/Development/repos/flink/flink-core/target/classes/) to constructor java.nio.DirectByteBuffer(long,int) WARNING: Please consider reporting this to the maintainers of org.apache.flink.core.memory.MemoryUtils WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations WARNING: All illegal access operations will be denied in a future release{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-14901) Throw Error in MemoryUtils if there is problem with using system classes over reflection
Andrey Zagrebin created FLINK-14901: --- Summary: Throw Error in MemoryUtils if there is problem with using system classes over reflection Key: FLINK-14901 URL: https://issues.apache.org/jira/browse/FLINK-14901 Project: Flink Issue Type: Improvement Components: Runtime / Task Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin Fix For: 1.10.0 If something goes wrong when using system classes via reflection in MemoryUtils or JavaGcCleanerWrapper, Flink cannot really recover from it. We should throw an Error in this case and leave it unhandled. -- This message was sent by Atlassian Jira (v8.3.4#803005)
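A hedged sketch of the suggested pattern (helper and names are illustrative, not the actual MemoryUtils code): any reflective failure is escalated to an Error, since a JVM whose internals do not match expectations is unrecoverable for Flink:

```java
import java.lang.reflect.Method;

public class ReflectionUtil {
    // Illustrative helper: resolve a method reflectively; if the JDK internals
    // are not as expected, Flink cannot recover, so escalate to an Error
    // instead of a checked exception that callers might swallow.
    public static Method getClassMethod(String className, String methodName, Class<?>... args) {
        try {
            return Class.forName(className).getMethod(methodName, args);
        } catch (ReflectiveOperationException e) {
            throw new Error("Failed to access " + className + "#" + methodName
                + " reflectively; this JVM is incompatible with Flink's memory management", e);
        }
    }
}
```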
[jira] [Created] (FLINK-14898) Enable background cleanup of state with TTL by default
Andrey Zagrebin created FLINK-14898: --- Summary: Enable background cleanup of state with TTL by default Key: FLINK-14898 URL: https://issues.apache.org/jira/browse/FLINK-14898 Project: Flink Issue Type: Improvement Components: Runtime / State Backends Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin Fix For: 1.10.0 So far, we have been conservative about enabling background cleanup strategies for state with TTL. In general, if state is configured to have TTL, most users would expect the background cleanup to kick in. Since no issues have been reported since the backend-specific cleanups were released, and since they do not affect any state without TTL, this issue suggests enabling background cleanup by default for all backends. -- This message was sent by Atlassian Jira (v8.3.4#803005)
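The effect of background cleanup can be illustrated with a toy TTL map in plain Java (a sketch of the idea only, not Flink's actual state backend code): without a sweep, expired entries are invisible to reads but still occupy storage; an incremental background sweep actually removes them:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.function.LongSupplier;

// Toy illustration of TTL state with incremental background cleanup.
public class TtlMap<K, V> {
    private final Map<K, Long> timestamps = new HashMap<>();
    private final Map<K, V> values = new HashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injectable clock for testing

    public TtlMap(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    public void put(K key, V value) {
        values.put(key, value);
        timestamps.put(key, clock.getAsLong());
    }

    public V get(K key) {
        Long ts = timestamps.get(key);
        if (ts == null || clock.getAsLong() - ts > ttlMillis) {
            return null; // expired entries are never visible to reads
        }
        return values.get(key);
    }

    // Background cleanup: sweep up to maxEntries expired entries per call,
    // analogous to the incremental cleanup strategy.
    public int cleanupIncrementally(int maxEntries) {
        int removed = 0;
        Iterator<Map.Entry<K, Long>> it = timestamps.entrySet().iterator();
        while (it.hasNext() && removed < maxEntries) {
            Map.Entry<K, Long> e = it.next();
            if (clock.getAsLong() - e.getValue() > ttlMillis) {
                values.remove(e.getKey());
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    public int storedEntries() { return values.size(); }
}
```

Enabling cleanup by default means users who configure TTL get the sweep behaviour without opting in, while state without TTL is untouched.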
[jira] [Created] (FLINK-14637) Introduce framework off heap memory config
Andrey Zagrebin created FLINK-14637: --- Summary: Introduce framework off heap memory config Key: FLINK-14637 URL: https://issues.apache.org/jira/browse/FLINK-14637 Project: Flink Issue Type: Sub-task Components: Runtime / Task Reporter: Andrey Zagrebin Fix For: 1.10.0 At the moment, after [-FLINK-13982-|https://issues.apache.org/jira/browse/FLINK-13982], we do not account for ad-hoc direct memory allocations made by the Flink framework (except network buffers) or by libraries used in Flink. In general, we expect these allocations to stay under a certain reasonably low limit, but we have to keep some margin for them so that the JVM direct memory limit is not exactly equal to the network buffers and allocations do not fail. We can address this by introducing a framework off-heap memory config option. -- This message was sent by Atlassian Jira (v8.3.4#803005)
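The intended effect on the JVM direct memory limit is a simple sum (names and numbers below are illustrative, not the final config keys):

```java
public class DirectMemoryLimit {
    // -XX:MaxDirectMemorySize should cover the network buffers plus a margin
    // for ad-hoc direct allocations by the framework and its libraries.
    public static long maxDirectMemoryBytes(long networkBufferBytes, long frameworkOffHeapBytes) {
        return networkBufferBytes + frameworkOffHeapBytes;
    }
}
```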
[jira] [Created] (FLINK-14633) Account for netty direct allocations in direct memory limit (Queryable state)
Andrey Zagrebin created FLINK-14633: --- Summary: Account for netty direct allocations in direct memory limit (Queryable state) Key: FLINK-14633 URL: https://issues.apache.org/jira/browse/FLINK-14633 Project: Flink Issue Type: Sub-task Components: Runtime / Queryable State, Runtime / Task Reporter: Andrey Zagrebin Fix For: 1.10.0 At the moment after [-FLINK-13982-|https://issues.apache.org/jira/browse/FLINK-13982], when we calculate JVM direct memory limit, we do not account for direct allocations from netty arenas in org.apache.flink.queryablestate.network.NettyBufferPool. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-14631) Account for netty direct allocations in direct memory limit
Andrey Zagrebin created FLINK-14631: --- Summary: Account for netty direct allocations in direct memory limit Key: FLINK-14631 URL: https://issues.apache.org/jira/browse/FLINK-14631 Project: Flink Issue Type: Sub-task Components: Runtime / Task Reporter: Andrey Zagrebin Fix For: 1.10.0 At the moment after FLINK-13982, when we calculate JVM direct memory limit, we only account for memory segment network buffers but not for direct allocations from netty arenas in org.apache.flink.runtime.io.network.netty.NettyBufferPool. We should include netty arenas into shuffle memory calculations. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-14522) Adjust GC Cleaner for unsafe memory and Java 11
Andrey Zagrebin created FLINK-14522: --- Summary: Adjust GC Cleaner for unsafe memory and Java 11 Key: FLINK-14522 URL: https://issues.apache.org/jira/browse/FLINK-14522 Project: Flink Issue Type: Sub-task Components: Runtime / Task Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin Fix For: 1.10.0 sun.misc.Cleaner is not available in Java 11. It was moved to jdk.internal.ref.Cleaner of java.base module. -- This message was sent by Atlassian Jira (v8.3.4#803005)
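A hedged sketch of the version-dependent lookup (illustrative code, not the actual JavaGcCleanerWrapper): try the legacy location first and fall back to the Java 9+ ones:

```java
public class CleanerLookup {
    // Returns the first Cleaner class resolvable on this JVM, or null.
    public static Class<?> findCleanerClass() {
        String[] candidates = {
            "sun.misc.Cleaner",           // Java 8
            "jdk.internal.ref.Cleaner",   // Java 9+, java.base module
            "java.lang.ref.Cleaner"       // public API since Java 9 (different shape)
        };
        for (String name : candidates) {
            try {
                return Class.forName(name);
            } catch (ClassNotFoundException | LinkageError e) {
                // not present on this JVM; try the next known location
            }
        }
        return null;
    }
}
```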
[jira] [Created] (FLINK-14400) Shrink the scope of MemoryManager from TaskExecutor to slot
Andrey Zagrebin created FLINK-14400: --- Summary: Shrink the scope of MemoryManager from TaskExecutor to slot Key: FLINK-14400 URL: https://issues.apache.org/jira/browse/FLINK-14400 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination, Runtime / Task Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin MemoryManager currently manages the memory bookkeeping for all slots/tasks inside one TaskExecutor. For better abstraction and isolation of slots, we can shrink its scope and make it per slot. The memory limits are now fixed per slot at the moment of slot creation. All operators sharing the slot will get their fixed fractional limits. In the future, we might make it possible for operators to over-allocate beyond their fractional limit if there is some free memory available in the slot, but it should be possible to reclaim the over-allocated memory at any time if another operator decides to claim its fair share within its limit. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-14399) Add memory chunk reservation API to MemoryManager
Andrey Zagrebin created FLINK-14399: --- Summary: Add memory chunk reservation API to MemoryManager Key: FLINK-14399 URL: https://issues.apache.org/jira/browse/FLINK-14399 Project: Flink Issue Type: Sub-task Components: Runtime / Task Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin MemoryManager allocates paged segments from the provided memory pools of different types (on-/off-heap). Additionally, it can manage the reservation and release of arbitrarily sized chunks of memory from the same memory pools, respecting their overall limit. How the memory is allocated, used, and freed is then up to the memory user. MemoryManager is just a bookkeeping and limit-checking component. -- This message was sent by Atlassian Jira (v8.3.4#803005)
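The reservation side described above amounts to bookkeeping against a fixed budget; a minimal sketch (illustrative only, not Flink's actual MemoryManager API):

```java
public class MemoryBudget {
    private final long totalBytes;
    private long reservedBytes;

    public MemoryBudget(long totalBytes) { this.totalBytes = totalBytes; }

    // Reserve an arbitrarily sized chunk; how the memory is actually
    // allocated and used afterwards is up to the caller.
    public synchronized boolean reserve(long bytes) {
        if (reservedBytes + bytes > totalBytes) {
            return false; // would exceed the overall limit
        }
        reservedBytes += bytes;
        return true;
    }

    public synchronized void release(long bytes) {
        reservedBytes = Math.max(0, reservedBytes - bytes);
    }

    public synchronized long available() { return totalBytes - reservedBytes; }
}
```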
[jira] [Created] (FLINK-13963) Consolidate Hadoop file systems usage and Hadoop integration docs
Andrey Zagrebin created FLINK-13963: --- Summary: Consolidate Hadoop file systems usage and Hadoop integration docs Key: FLINK-13963 URL: https://issues.apache.org/jira/browse/FLINK-13963 Project: Flink Issue Type: Improvement Components: Connectors / FileSystem, Connectors / Hadoop Compatibility, Documentation, FileSystems Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin Fix For: 1.10.0 We have hadoop related docs in several places at the moment: * *dev/batch/connectors.md* (Hadoop FS implementations and setup) * *dev/batch/hadoop_compatibility.md* (not valid any more that Flink always has Hadoop types out of the box as we do not build and provide Flink with Hadoop by default) * *ops/filesystems/index.md* (plugins, Hadoop FS implementations and setup revisited) * *ops/deployment/hadoop.md* (Hadoop classpath) * *ops/config.md* (deprecated way to provide Hadoop configuration in Flink conf) We could consolidate all these pieces of docs into a consistent structure to help users to navigate through the docs to well-defined spots depending on which feature they are trying to use. The places in docs which should contain the information about Hadoop: * *dev/batch/hadoop_compatibility.md* (only Dataset API specific stuff about integration with Hadoop) * *ops/filesystems/index.md* (Flink FS plugins and Hadoop FS implementations) * *ops/deployment/hadoop.md* (Hadoop configuration and classpath) How to setup Hadoop itself should be only in *ops/deployment/hadoop.md*. All other places dealing with Hadoop/HDFS should contain only their related things and just reference it 'how to configure Hadoop'. Like all chapters about writing to file systems (batch connectors and streaming file sinks) should just reference file systems. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (FLINK-13927) Add note about hadoop dependencies for local debug
Andrey Zagrebin created FLINK-13927: --- Summary: Add note about hadoop dependencies for local debug Key: FLINK-13927 URL: https://issues.apache.org/jira/browse/FLINK-13927 Project: Flink Issue Type: Improvement Components: Connectors / FileSystem, Documentation, FileSystems Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin Currently, if a user tries to run a job locally (e.g. from the IDE) and uses a Hadoop file system, it will not work if the Hadoop dependencies are not on the classpath, which is the case for the example from the quick start. We can add a hint about adding the provided Hadoop dependencies to: [https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/hadoop.html] and cross-reference it with: [https://ci.apache.org/projects/flink/flink-docs-master/ops/filesystems/index.html#hadoop-configuration] -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (FLINK-13820) Breaking long function argument lists and chained method calls
Andrey Zagrebin created FLINK-13820: --- Summary: Breaking long function argument lists and chained method calls Key: FLINK-13820 URL: https://issues.apache.org/jira/browse/FLINK-13820 Project: Flink Issue Type: Sub-task Components: Documentation, Project Website Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin Breaking the line of too-long statements (the exact line length limit is yet to be defined) to improve code readability in case of * Long function argument lists (declaration or call): void func(type1 arg1, type2 arg2, ...) * Long sequence of chained calls: list.stream().map(...).reduce(...).collect(...)... Rules: * Break the list of arguments/calls if the line exceeds the limit, or earlier if you believe that breaking would improve code readability * If you break the line, then each argument/call should be on a separate line, including the first one * Each new argument/call line should have one extra indentation relative to the line of the parent function name or called entity * The opening parenthesis always stays on the line of the parent function name * The possible thrown exception list is never broken and stays on the same last line * The dot of a chained call is always on the line of that chained call, at the beginning of that line Examples of breaking: * Function arguments {code:java} public void func( int arg1, int arg2, ...) throws E1, E2, E3 { }{code} * Chained method calls: {code:java} values .stream() .map(...) .collect(...);{code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (FLINK-13819) Introduce RpcEndpoint State
Andrey Zagrebin created FLINK-13819: --- Summary: Introduce RpcEndpoint State Key: FLINK-13819 URL: https://issues.apache.org/jira/browse/FLINK-13819 Project: Flink Issue Type: Improvement Components: Runtime / Coordination Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin Fix For: 1.10.0, 1.9.1 To better reflect the lifecycle of RpcEndpoint, we suggest introducing a state for it: * created * started * stopping We can use the state, e.g., to decide how to react to API calls if it is already known that the RpcEndpoint is terminating, as required e.g. for FLINK-13769. -- This message was sent by Atlassian Jira (v8.3.2#803003)
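The proposed lifecycle could be sketched as an enum with its allowed transitions (a sketch of the suggestion, not the final design):

```java
public enum EndpointState {
    CREATED, STARTED, STOPPING;

    // created -> started -> stopping; no other transitions are legal
    public boolean canTransitionTo(EndpointState next) {
        return (this == CREATED && next == STARTED)
            || (this == STARTED && next == STOPPING);
    }
}
```

An API call arriving while the state is STOPPING could then be rejected or answered with a terminal error instead of being processed.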
[jira] [Created] (FLINK-13812) Code style for the usage of Java Optional
Andrey Zagrebin created FLINK-13812: --- Summary: Code style for the usage of Java Optional Key: FLINK-13812 URL: https://issues.apache.org/jira/browse/FLINK-13812 Project: Flink Issue Type: Sub-task Components: Documentation, Project Website Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (FLINK-13804) Collections initial capacity
Andrey Zagrebin created FLINK-13804: --- Summary: Collections initial capacity Key: FLINK-13804 URL: https://issues.apache.org/jira/browse/FLINK-13804 Project: Flink Issue Type: Sub-task Components: Documentation, Project Website Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin The code style conclusion to add to the web site: Set the initial capacity only if there is a good, proven reason to do so. Otherwise, do not clutter the code with it. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (FLINK-13802) Flink code style guide
Andrey Zagrebin created FLINK-13802: --- Summary: Flink code style guide Key: FLINK-13802 URL: https://issues.apache.org/jira/browse/FLINK-13802 Project: Flink Issue Type: Task Components: Documentation, Project Website Reporter: Andrey Zagrebin This is an umbrella issue to introduce and improve Flink code style guide. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (FLINK-13642) Refactor ShuffleMaster to optionally provide preferred TM location for produced partitions
Andrey Zagrebin created FLINK-13642: --- Summary: Refactor ShuffleMaster to optionally provide preferred TM location for produced partitions Key: FLINK-13642 URL: https://issues.apache.org/jira/browse/FLINK-13642 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Andrey Zagrebin -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13641) Consider removing UnknownInputChannel in NettyShuffleEnvironment
Andrey Zagrebin created FLINK-13641: --- Summary: Consider removing UnknownInputChannel in NettyShuffleEnvironment Key: FLINK-13641 URL: https://issues.apache.org/jira/browse/FLINK-13641 Project: Flink Issue Type: Improvement Components: Runtime / Network Reporter: Andrey Zagrebin UnknownInputChannel is basically a placeholder used while the producer is unknown; it carries some partition parameters to be used later for creating the known channel. If NettyShuffleEnvironment#updatePartitionInfo provides enough information to add the known channel, one could consider removing UnknownInputChannel and refactoring the indexed channel array into a dynamic list. Initially suggested in [https://github.com/apache/flink/pull/8362#discussion_r290308989] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13640) Consider TaskDeploymentDescriptorFactory to be Execution specific
Andrey Zagrebin created FLINK-13640: --- Summary: Consider TaskDeploymentDescriptorFactory to be Execution specific Key: FLINK-13640 URL: https://issues.apache.org/jira/browse/FLINK-13640 Project: Flink Issue Type: Improvement Components: Runtime / Coordination Affects Versions: 1.9.0 Reporter: Andrey Zagrebin suggested in [https://github.com/apache/flink/pull/8362#discussion_r290297974] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13639) Consider refactoring of IntermediateResultPartitionID to consist of IntermediateDataSetID and partitionIndex
Andrey Zagrebin created FLINK-13639: --- Summary: Consider refactoring of IntermediateResultPartitionID to consist of IntermediateDataSetID and partitionIndex Key: FLINK-13639 URL: https://issues.apache.org/jira/browse/FLINK-13639 Project: Flink Issue Type: Improvement Components: Runtime / Coordination Reporter: Andrey Zagrebin -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13638) Refactor RemoteChannelStateChecker#isProducerConsumerReadyOrAbortConsumption to return result action
Andrey Zagrebin created FLINK-13638: --- Summary: Refactor RemoteChannelStateChecker#isProducerConsumerReadyOrAbortConsumption to return result action Key: FLINK-13638 URL: https://issues.apache.org/jira/browse/FLINK-13638 Project: Flink Issue Type: Improvement Components: Runtime / Network Affects Versions: 1.9.0, 1.10.0 Reporter: Andrey Zagrebin RemoteChannelStateChecker#isProducerConsumerReadyOrAbortConsumption either triggers some action (fail or cancel) or returns a decision (trigger a new partition check or not). It would be more symmetric if this class did not trigger any action but only returned a decision about what to do: {code:java} // FAIL carries a cause and CANCEL a message, so the decision is // a small value object rather than a bare enum final class Decision { enum Action { FAIL, CANCEL, TRIGGER_PARTITION_CHECK, NOOP } Action action; Throwable failCause; /* set for FAIL */ String cancelMessage; /* set for CANCEL */ } {code} Then the caller would be responsible for performing the action. That way, this class would only need access to responseHandle.getProducerExecutionState() and not responseHandle itself. From the PR discussion: [https://github.com/apache/flink/pull/8463#discussion_r288290783] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13581) BatchFineGrainedRecoveryITCase failed on Travis
Andrey Zagrebin created FLINK-13581: --- Summary: BatchFineGrainedRecoveryITCase failed on Travis Key: FLINK-13581 URL: https://issues.apache.org/jira/browse/FLINK-13581 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.9.0 Reporter: Andrey Zagrebin Fix For: 1.9.0 [https://travis-ci.com/flink-ci/flink/jobs/221567908] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13435) Remove ShuffleDescriptor.ReleaseType and make release semantics fixed per partition type
Andrey Zagrebin created FLINK-13435: --- Summary: Remove ShuffleDescriptor.ReleaseType and make release semantics fixed per partition type Key: FLINK-13435 URL: https://issues.apache.org/jira/browse/FLINK-13435 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination, Runtime / Network Affects Versions: 1.9.0 Reporter: Andrey Zagrebin Fix For: 1.9.0, 1.10.0 In the long term, we do not need auto-release semantics for blocking (persistent) partitions. We expect them always to be released externally by the JM and assume they can be consumed multiple times. Pipelined partitions always have only one consumer and one consumption attempt; afterwards, they can always be released automatically. ShuffleDescriptor.ReleaseType was introduced to make the release semantics more flexible, but it is not needed in the long term. FORCE_PARTITION_RELEASE_ON_CONSUMPTION was introduced as a safety net to be able to fall back to the 1.8 behaviour, without the partition tracker and without the JM taking care of blocking partition release. We can make this option specific to NettyShuffleEnvironment, which was the only existing shuffle service before. If it is activated, then the blocking partition is also auto-released on a consumption attempt, as before. In this case, fine-grained recovery will simply not find the partition after the job restart and will restart the producer. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13408) Schedule StandaloneResourceManager.setFailUnfulfillableRequest whenever the leadership is acquired
Andrey Zagrebin created FLINK-13408: --- Summary: Schedule StandaloneResourceManager.setFailUnfulfillableRequest whenever the leadership is acquired Key: FLINK-13408 URL: https://issues.apache.org/jira/browse/FLINK-13408 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Affects Versions: 1.9.0 Reporter: Andrey Zagrebin Fix For: 1.9.0, 1.10.0 We introduced _StandaloneResourceManager.setFailUnfulfillableRequest_ to give task executors some time to register their available slots before slot requests are checked for whether they can be fulfilled. _setFailUnfulfillableRequest_ is currently scheduled only once, when the RM is initialised, but the task executors will register themselves every time this RM gains leadership. Hence, _setFailUnfulfillableRequest_ should be scheduled after each leader election. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13407) Fix thread visibility of checked SlotManager.failUnfulfillableRequest in StandaloneResourceManagerTest
Andrey Zagrebin created FLINK-13407: --- Summary: Fix thread visibility of checked SlotManager.failUnfulfillableRequest in StandaloneResourceManagerTest Key: FLINK-13407 URL: https://issues.apache.org/jira/browse/FLINK-13407 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Affects Versions: 1.9.0 Reporter: Andrey Zagrebin Fix For: 1.9.0, 1.10.0 One of the potential reasons for the test instability described in FLINK-13242 is that the test checks _failUnfulfillableRequest_ directly in its own thread, while the flag is set in the main RPC thread by _StandaloneResourceManager#initialize_. This can lead to a visibility problem for _SlotManager.failUnfulfillableRequest_ in _StandaloneResourceManagerTest#testStartupPeriod_. The idea for the fix is to read _SlotManager.failUnfulfillableRequest_ in the main thread of _StandaloneResourceManager_, obtain the future result, and check it in _StandaloneResourceManagerTest.assertHappensUntil_. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
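The proposed fix follows a common pattern: read the flag inside the component's own single "main" thread and hand the result back through a future, so the read happens on the same thread as the write (a generic sketch with hypothetical names, not the actual test code):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Supplier;

public class MainThreadReader {
    // Run the read in the component's single main-thread executor so the
    // write (also done in that thread) is visible without extra synchronization.
    public static <T> CompletableFuture<T> readInMainThread(
            ExecutorService mainThreadExecutor,
            Supplier<T> read) {
        return CompletableFuture.supplyAsync(read, mainThreadExecutor);
    }
}
```

The test thread then blocks on the future (or polls it in an assertHappensUntil-style loop) instead of reading the field directly.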
[jira] [Created] (FLINK-13371) Release partitions in JM if producer gets restarted
Andrey Zagrebin created FLINK-13371: --- Summary: Release partitions in JM if producer gets restarted Key: FLINK-13371 URL: https://issues.apache.org/jira/browse/FLINK-13371 Project: Flink Issue Type: Bug Components: Runtime / Coordination, Runtime / Network Affects Versions: 1.9.0 Reporter: Andrey Zagrebin As discussed in FLINK-13245, there can be a case where the producer does not even detect any consumption attempt because the consumer fails before the connection is established. This means we cannot fully rely on the shuffle service for release-on-consumption in case of consumer failure. When the producer restarts, it will leak partitions from the previous attempt. Previously, we had an explicit release call for this case in Execution.cancel/suspend. Basically, the JM has to explicitly release all partitions produced by the previous task execution attempt in case of a producer restart, including 'released on consumption' partitions. For this change, we might need to track all partitions in PartitionTrackerImpl. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13169) IT test for fine-grained recovery (task executor failures)
Andrey Zagrebin created FLINK-13169: --- Summary: IT test for fine-grained recovery (task executor failures) Key: FLINK-13169 URL: https://issues.apache.org/jira/browse/FLINK-13169 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin Fix For: 1.9.0 The BatchFineGrainedRecoveryITCase can be extended with an additional test failure strategy which abruptly shuts down the task executor. This leads to the loss of all previously completed and in-progress mapper result partitions. The fail-over strategy should subsequently restart the current in-progress mapper and all previous mappers, because the previous results are unavailable. When the source is recomputed, all mappers have to be restarted again to recalculate their lost results. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-13019) IT test for fine-grained recovery
Andrey Zagrebin created FLINK-13019: --- Summary: IT test for fine-grained recovery Key: FLINK-13019 URL: https://issues.apache.org/jira/browse/FLINK-13019 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin Fix For: 1.9.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12960) Move ResultPartitionDeploymentDescriptor#releasedOnConsumption to PartitionDescriptor#releasedOnConsumption
Andrey Zagrebin created FLINK-12960: --- Summary: Move ResultPartitionDeploymentDescriptor#releasedOnConsumption to PartitionDescriptor#releasedOnConsumption Key: FLINK-12960 URL: https://issues.apache.org/jira/browse/FLINK-12960 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin Fix For: 1.9.0 ResultPartitionDeploymentDescriptor#releasedOnConsumption shows the intention how the partition is going to be used by the shuffle user. If it is not supported by the shuffle service for a certain type of partition, ShuffleMaster#registerPartitionWithProducer and ShuffleEnvironment#createResultPartitionWriters should throw an exception. ShuffleMaster#registerPartitionWithProducer takes PartitionDescriptor. ResultPartitionDeploymentDescriptor#releasedOnConsumption should be part of PartitionDescriptor so that not only ShuffleEnvironment but also ShuffleMaster is already aware about releasedOnConsumption. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12890) Add partition lifecycle related Shuffle API
Andrey Zagrebin created FLINK-12890: --- Summary: Add partition lifecycle related Shuffle API Key: FLINK-12890 URL: https://issues.apache.org/jira/browse/FLINK-12890 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Andrey Zagrebin Assignee: Andrey Zagrebin Fix For: 1.9.0 At the moment we have ShuffleEnvironment.releasePartitions, which is used to release the locally occupied resources of a partition. The JM can also use it by calling TaskExecutorGateway.releasePartitions. To support lifecycle management of partitions (FLINK-12069, relevant mostly for batch and blocking partitions), we need to extend the Shuffle API: * ShuffleDescriptor.hasLocalResources() indicates that this partition occupies local resources on a TM and requires the TM to be running in order to consume the produced data (e.g. true for the default NettyShuffleEnvironment and false for externally stored partitions). If a partition needs external lifecycle management and is not released after the first consumption is done (ResultPartitionDeploymentDescriptor.isReleasedOnConsumption()), then the RM/JM should keep the TMs which produce these partitions running for as long as the partition still needs to be consumed. The connection to these TMs should also be kept, to issue the RPC call TaskExecutorGateway.releasePartitions once the partition is not needed any more. * ShuffleMaster.removePartitionExternally(): the JM should call this whenever the partition does not need to be consumed any more. This call releases partition resources possibly occupied externally, outside of the TM, and should not depend on ShuffleDescriptor.hasLocalResources. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12873) Create a separate maven module for Shuffle API
Andrey Zagrebin created FLINK-12873: --- Summary: Create a separate maven module for Shuffle API Key: FLINK-12873 URL: https://issues.apache.org/jira/browse/FLINK-12873 Project: Flink Issue Type: Sub-task Components: Runtime / Configuration, Runtime / Network Reporter: Andrey Zagrebin At the moment, shuffle service API is a part of flink-runtime maven module. The implementers of other shuffle services will have to depend on the fat dependency of flink-runtime. We should consider factoring out the shuffle API interfaces to a separate maven module which depends only on flink-core. Later we can consider the same for the custom high availability services. The final structure could be e.g. (up to discussion): * flink-runtime (already includes default shuffle and high availability implementations) * flink-runtime-extensions ** flink-runtime-extensions-core ** flink-shuffle-extensions-api ** flink-high-availability-extensions-api -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12731) Load shuffle service implementations from plugin manager
Andrey Zagrebin created FLINK-12731: --- Summary: Load shuffle service implementations from plugin manager Key: FLINK-12731 URL: https://issues.apache.org/jira/browse/FLINK-12731 Project: Flink Issue Type: Sub-task Reporter: Andrey Zagrebin The simple way to load a shuffle service is to do it from the classpath with the default class loader. Additionally, shuffle service implementations can be loaded as plugins, with their own class loaders, using the PluginManager. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
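The classpath-based variant can use the standard java.util.ServiceLoader mechanism (a sketch; the SPI interface name here is illustrative, the real one lives in flink-runtime):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class ShuffleServiceLoaderSketch {
    // Illustrative SPI interface standing in for the real shuffle factory.
    public interface ShuffleServiceFactory {}

    // Discover all implementations registered on the classpath via
    // META-INF/services/<fully-qualified-interface-name>.
    public static List<ShuffleServiceFactory> loadFromClasspath() {
        List<ShuffleServiceFactory> found = new ArrayList<>();
        for (ShuffleServiceFactory f : ServiceLoader.load(ShuffleServiceFactory.class)) {
            found.add(f);
        }
        return found;
    }
}
```

The plugin-based variant differs only in the class loader handed to ServiceLoader.load, which the PluginManager supplies per plugin.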
[jira] [Created] (FLINK-12706) Introduce ShuffleManager interface and its configuration
Andrey Zagrebin created FLINK-12706: --- Summary: Introduce ShuffleManager interface and its configuration Key: FLINK-12706 URL: https://issues.apache.org/jira/browse/FLINK-12706 Project: Flink Issue Type: Sub-task Components: Runtime / Configuration, Runtime / Coordination, Runtime / Network Reporter: Andrey Zagrebin Fix For: 1.9.0 ShuffleManager should be a factory for the ShuffleMaster in the JM and the ShuffleService in the TM. The default implementation is already available as the former NetworkEnvironment. To make it pluggable, we need to provide service loading for the ShuffleManager implementation class configured in the Flink configuration. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
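The factory relationship described above could be sketched as follows (the interface names follow the ticket; the method shapes are assumptions, not the final API):

```java
// Sketch of the pluggable shuffle factory: one entry point produces the
// JM-side master and the TM-side service. Method signatures are illustrative.
public class ShuffleApiSketch {
    public interface ShuffleMaster {}
    public interface ShuffleService {}

    public interface ShuffleManager {
        ShuffleMaster createShuffleMaster();   // used by the JobManager
        ShuffleService createShuffleService(); // used by the TaskManager
    }

    // Trivial default implementation standing in for the Netty-based one.
    public static class DefaultShuffleManager implements ShuffleManager {
        public ShuffleMaster createShuffleMaster() { return new ShuffleMaster() {}; }
        public ShuffleService createShuffleService() { return new ShuffleService() {}; }
    }
}
```

The configured implementation class would then be resolved by service loading, with DefaultShuffleManager-like behaviour as the fallback.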