[
https://issues.apache.org/jira/browse/FLINK-22069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317955#comment-17317955
]
Chesnay Schepler commented on FLINK-22069:
------------------------------------------
h3. JM
This seems unnecessary (maybe replace it with a line marking the start of the
ResourceManager as a whole):
{code}
2021-04-09 10:57:15,245 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager []
- Starting the slot manager.
{code}
Can't really fix it, but this will probably raise some eyebrows:
{code}
2021-04-09 13:11:09,002 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - DataSink
(collect()) (1/1) (c8368e6fff97b7439058fdbdc6a9fd3d) switched from DEPLOYING to
RECOVERING.
{code}
We should adjust this to say "Received resource requirements":
{code}
StandaloneResourceManager [] - Received resource declaration for job
e07f276fa0ab138c33fe27f39ebfbe2e:
{code}
Additionally, if the requirements are empty, we could rephrase it to something
like "Clearing resource requirements for job ...".
This line ends with two periods:
{code}
JobMaster [] - Close ResourceManager connection
20b01ee5165e96fed972ddf74e9e710b [...]
{code}
Quotes could be neat here:
{code}
JobMaster [] - Checkpoint storage is set to JobManager
Starting execution of job State machine job [..]
{code}
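I.e., something along these lines (sketch only; the names are made up, not the
actual JobMaster fields):
{code}
// Sketch only; names are made up.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class QuotedJobLogging {

    private static final Logger LOG = LoggerFactory.getLogger(QuotedJobLogging.class);

    static void logJobStart(String checkpointStorageName, String jobName, String jobId) {
        LOG.info("Checkpoint storage is set to '{}'.", checkpointStorageName);
        LOG.info("Starting execution of job '{}' ({}).", jobName, jobId);
    }
}
{code}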
The debug messages for loaded FileSystems could be improved by logging just the
factory class name instead of its default toString() representation:
{code}
FileSystem [] - Added file system
maprfs:org.apache.flink.runtime.fs.maprfs.MapRFsFactory@38e79ae3
{code}
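E.g., log the class name rather than the factory object (sketch only, not the
actual FileSystem code):
{code}
// Sketch only; not the actual FileSystem code.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class FileSystemFactoryLogging {

    private static final Logger LOG = LoggerFactory.getLogger(FileSystemFactoryLogging.class);

    static void logAddedFileSystem(String scheme, Object factory) {
        // class name instead of the default "ClassName@hashCode" representation
        LOG.debug("Added file system '{}' via factory {}.", scheme, factory.getClass().getName());
    }
}
{code}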
Batch job; the last line looks weird:
{code}
2021-04-09 14:10:07,909 DEBUG org.apache.flink.runtime.jobmaster.JobMaster
[] - Send next input split GenericSplit (0/1).
2021-04-09 14:10:07,919 DEBUG
org.apache.flink.api.common.io.DefaultInputSplitAssigner [] - No more input
splits available
2021-04-09 14:10:07,919 DEBUG org.apache.flink.runtime.jobmaster.JobMaster
[] - Send next input split null.
{code}
h3. TM
This line shows up three times whenever a TM is started:
{code}
org.apache.flink.runtime.externalresource.ExternalResourceUtils [] - Enabled
external resources: []
{code}
It's not clear what the difference between these two lines is:
{code}
org.apache.flink.runtime.io.disk.FileChannelManagerImpl [] -
FileChannelManager uses directory
/tmp/flink-io-cbd9feba-f02e-469d-9abb-4bc1423d805c for spill files.
org.apache.flink.runtime.io.disk.FileChannelManagerImpl [] -
FileChannelManager uses directory
/tmp/flink-netty-shuffle-6f5b22e6-226b-43ba-aebd-e346ed641cb2 for spill files.
{code}
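The only visible hint is the directory prefix ({{flink-io}} vs.
{{flink-netty-shuffle}}); the message could state what the manager is used for,
e.g. (sketch only, the label values are made up):
{code}
// Sketch only; "usage" would be a short label such as "io-manager" or "netty-shuffle".
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class FileChannelManagerLogging {

    private static final Logger LOG = LoggerFactory.getLogger(FileChannelManagerLogging.class);

    static void logSpillDirectory(String usage, String directory) {
        LOG.info("FileChannelManager ({}) uses directory {} for spill files.", usage, directory);
    }
}
{code}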
The time formatting could be improved here ({{PT10S}} is just the raw
{{java.time.Duration}} string):
{code}
org.apache.flink.runtime.util.LeaderRetrievalUtils [] - TaskManager
will try to connect for PT10S before falling back to heuristics
{code}
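A small helper for rendering the timeout would go a long way, e.g. (sketch
only):
{code}
// Sketch only: render the timeout in a human-readable form instead of Duration#toString() ("PT10S").
import java.time.Duration;

class TimeoutFormatting {

    static String humanReadable(Duration timeout) {
        long millis = timeout.toMillis();
        // "10 s" instead of "PT10S"; fall back to milliseconds for sub-second values
        return millis % 1000 == 0 ? (millis / 1000) + " s" : millis + " ms";
    }
}
{code}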
Never quite understood what the value of these lines is supposed to be:
{code}
Task [] - Registering task at network: DataSink (collect())
(1/1)#0 (c8368e6fff97b7439058fdbdc6a9fd3d) [DEPLOYING].
{code}
Quotes could be neat here:
{code}
StreamTask [] - Checkpoint storage is set to JobManager
{code}
These errors are logged on debug, with excessive stack traces, and it is
unclear whether this is a problem or not:
{code}
DEBUG org.apache.flink.shaded.netty4.io.netty.util.internal.NativeLibraryLoader
[] - Unable to load the library
'org_apache_flink_shaded_netty4_netty_transport_native_epoll_x86_64'
DEBUG org.apache.flink.shaded.netty4.io.netty.util.internal.NativeLibraryLoader
[] - org_apache_flink_shaded_netty4_netty_transport_native_epoll_x86_64
{code}
But immediately afterwards this shows up:
{code}
NativeLibraryLoader [] - Successfully loaded the library
/tmp/liborg_apache_flink_shaded_netty4_netty_transport_native_epoll_x86_64670877569787969381.so
{code}
h3. Client
Surprisingly, on job submission we don't log where the job is submitted to or
what the job ID/name is.
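Even a single line like this would help (sketch only; names are made up, not
the actual client code):
{code}
// Sketch only; parameter names are made up, not the actual RestClusterClient code.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class SubmissionLogging {

    private static final Logger LOG = LoggerFactory.getLogger(SubmissionLogging.class);

    static void logSubmission(String jobName, String jobId, String clusterAddress) {
        LOG.info("Submitting job '{}' ({}) to {}.", jobName, jobId, clusterAddress);
    }
}
{code}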
Global config is loaded twice by the client for batch jobs.
We should finally update the default config in the distribution:
{code}
Configuration [] - Config uses fallback configuration key
'jobmanager.rpc.address' instead of key 'rest.address'
{code}
Submitting a job without enough slots available causes the mother of all stack
traces to show up (110 lines...).
When a submission times out because the cluster is not reachable, the REST
client shutdown also fails:
{code}
2021-04-09 14:18:05,160 WARN org.apache.flink.runtime.rest.RestClient
[] - Rest endpoint shutdown failed.
java.util.concurrent.TimeoutException: null
at
java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
~[?:1.8.0_222]
at
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
~[?:1.8.0_222]
at
org.apache.flink.runtime.rest.RestClient.shutdown(RestClient.java:180)
~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at
org.apache.flink.client.program.rest.RestClusterClient.close(RestClusterClient.java:239)
~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
...
{code}
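We could treat a timeout during shutdown as expected and log it on debug
without the stack trace; a rough sketch (not the actual RestClient code):
{code}
// Sketch only; not the actual RestClient#shutdown code.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class RestClientShutdownLogging {

    private static final Logger LOG = LoggerFactory.getLogger(RestClientShutdownLogging.class);

    static void awaitShutdown(CompletableFuture<Void> terminationFuture, long timeoutMillis) {
        try {
            terminationFuture.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // expected when the cluster was never reachable; no stack trace needed
            LOG.debug("Rest endpoint shutdown timed out after {} ms.", timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            LOG.warn("Interrupted while shutting down the rest endpoint.", e);
        } catch (ExecutionException e) {
            LOG.warn("Rest endpoint shutdown failed.", e);
        }
    }
}
{code}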
> Check Log Pollution for 1.13 release
> ------------------------------------
>
> Key: FLINK-22069
> URL: https://issues.apache.org/jira/browse/FLINK-22069
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Reporter: Stephan Ewen
> Assignee: Chesnay Schepler
> Priority: Blocker
> Fix For: 1.13.0
>
>
> We should check for log pollution and confusing log lines before the release.
> Below are some lines I stumbled over while using Flink during testing.
> -----------------------------
> These lines show up on any execution of a local job and make me think I
> forgot to configure something I probably should have, leaving me wondering
> whether this might cause problems later.
> They have been in Flink for a few releases now; it might still be worth
> rephrasing them, though.
> {code}
> 2021-03-30 17:57:22,483 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils [] - The
> configuration option taskmanager.cpu.cores required for local execution is
> not set, setting it to the maximal possible value.
> 2021-03-30 17:57:22,483 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils [] - The
> configuration option taskmanager.memory.task.heap.size required for local
> execution is not set, setting it to the maximal possible value.
> 2021-03-30 17:57:22,483 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils [] - The
> configuration option taskmanager.memory.task.off-heap.size required for local
> execution is not set, setting it to the maximal possible value.
> 2021-03-30 17:57:22,483 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils [] - The
> configuration option taskmanager.memory.network.min required for local
> execution is not set, setting it to its default value 64 mb.
> 2021-03-30 17:57:22,483 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils [] - The
> configuration option taskmanager.memory.network.max required for local
> execution is not set, setting it to its default value 64 mb.
> 2021-03-30 17:57:22,483 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils [] - The
> configuration option taskmanager.memory.managed.size required for local
> execution is not set, setting it to its default value 128 mb.
> {code}
> -------------------------------------
> These lines show up on every job start, even if there is no recovery but just
> a plain job start. They are not particularly problematic, but also not
> helping.
> {code}
> 2021-03-30 17:57:27,839 INFO
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate [] -
> Converting recovered input channels (8 channels)
> 2021-03-30 17:57:27,839 INFO
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate [] -
> Converting recovered input channels (8 channels)
> 2021-03-30 17:57:27,839 INFO
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate [] -
> Converting recovered input channels (8 channels)
> 2021-03-30 17:57:27,839 INFO
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate [] -
> Converting recovered input channels (8 channels)
> 2021-03-30 17:57:27,839 INFO
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate [] -
> Converting recovered input channels (8 channels)
> 2021-03-30 17:57:27,839 INFO
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate [] -
> Converting recovered input channels (8 channels)
> 2021-03-30 17:57:27,839 INFO
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate [] -
> Converting recovered input channels (8 channels)
> 2021-03-30 17:57:27,855 INFO
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate [] -
> Converting recovered input channels (8 channels)
> {code}
> --------------------------------
> When using {{DataStream.collect()}} we always have an exception in the log
> for the first fetch attempt, before the JM is ready.
> The loop retries and the program succeeds, but the exception in the log
> raises confusion about whether there is a swallowed but impactful error.
> {code}
> 7199 [main] WARN
> org.apache.flink.streaming.api.operators.collect.CollectResultFetcher [] - An
> exception occurs when fetching query results
> java.util.concurrent.ExecutionException:
> org.apache.flink.runtime.dispatcher.UnavailableDispatcherOperationException:
> Unable to get JobMasterGateway for initializing job. The requested operation
> is not available while the JobManager is initializing.
> at
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
> ~[?:?]
> at
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999) ~[?:?]
> at
> org.apache.flink.streaming.api.operators.collect.CollectResultFetcher.sendRequest(CollectResultFetcher.java:155)
> ~[classes/:?]
> at
> org.apache.flink.streaming.api.operators.collect.CollectResultFetcher.next(CollectResultFetcher.java:126)
> [classes/:?]
> at
> org.apache.flink.streaming.api.operators.collect.CollectResultIterator.nextResultFromFetcher(CollectResultIterator.java:106)
> [classes/:?]
> at
> org.apache.flink.streaming.api.operators.collect.CollectResultIterator.hasNext(CollectResultIterator.java:80)
> [classes/:?]
> at
> org.apache.flink.streaming.api.datastream.DataStream.executeAndCollect(DataStream.java:1320)
> [classes/:?]
> at
> org.apache.flink.streaming.api.datastream.DataStream.executeAndCollect(DataStream.java:1303)
> [classes/:?]
> at
> org.apache.flink.runtime.operators.coordination.OperatorEventSendingCheckpointITCase.testLostOperatorEventsLeadsToRecovery(OperatorEventSendingCheckpointITCase.java:88)
> [test-classes/:?]
> at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> ~[?:?]
> at
> jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> ~[?:?]
> at
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> ~[?:?]
> at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
> at
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> [junit-4.12.jar:4.12]
> at
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> [junit-4.12.jar:4.12]
> at
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> [junit-4.12.jar:4.12]
> at
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> [junit-4.12.jar:4.12]
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> [junit-4.12.jar:4.12]
> at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> [junit-4.12.jar:4.12]
> at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> [junit-4.12.jar:4.12]
> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> [junit-4.12.jar:4.12]
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> [junit-4.12.jar:4.12]
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> [junit-4.12.jar:4.12]
> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> [junit-4.12.jar:4.12]
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> [junit-4.12.jar:4.12]
> at
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> [junit-4.12.jar:4.12]
> at
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> [junit-4.12.jar:4.12]
> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> [junit-4.12.jar:4.12]
> at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
> [junit-4.12.jar:4.12]
> at
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
> [junit-rt.jar:?]
> at
> com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:33)
> [junit-rt.jar:?]
> at
> com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:220)
> [junit-rt.jar:?]
> at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:53)
> [junit-rt.jar:?]
> Caused by:
> org.apache.flink.runtime.dispatcher.UnavailableDispatcherOperationException:
> Unable to get JobMasterGateway for initializing job. The requested operation
> is not available while the JobManager is initializing.
> at
> org.apache.flink.runtime.dispatcher.Dispatcher.getJobMasterGateway(Dispatcher.java:892)
> ~[classes/:?]
> at
> org.apache.flink.runtime.dispatcher.Dispatcher.performOperationOnJobMasterGateway(Dispatcher.java:902)
> ~[classes/:?]
> at
> org.apache.flink.runtime.dispatcher.Dispatcher.deliverCoordinationRequestToCoordinator(Dispatcher.java:724)
> ~[classes/:?]
> at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> ~[?:?]
> at
> jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> ~[?:?]
> at
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> ~[?:?]
> at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305)
> ~[classes/:?]
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212)
> ~[classes/:?]
> at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
> ~[classes/:?]
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
> ~[classes/:?]
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> ~[scala-library-2.11.12.jar:?]
> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> ~[scala-library-2.11.12.jar:?]
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> ~[scala-library-2.11.12.jar:?]
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> ~[scala-library-2.11.12.jar:?]
> at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
> at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
> at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
> at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
> at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
> at
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
> at
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
> at
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
> {code}