Abacn commented on issue #21696:
URL: https://github.com/apache/beam/issues/21696#issuecomment-1317225676

   Well, after #24129 three ParDo load tests (Python and Go) are now passing, but the 
remaining six tests still fail for different reasons:
   
   beam_LoadTests_Go_SideInput_Flink_Batch: gRPC error
   ```
   07:00:41 Full error:
   07:00:41 while executing Process for Plan[2-2]:
   07:00:41 2: Discard
   07:00:41 3: PCollection[n8] Out:[2]
   07:00:41 4: ParDo[load.RuntimeMonitor] Out:[3]
   07:00:41 5: PCollection[n7] Out:[4]
   07:00:41 6: ParDo[main.iterSideInputFn] Out:[5]
   07:00:41 1: DataSource[S[fn/read/n6:0@localhost:35493], local_output] Coder:W;fn/wire/n6:0<KV;c2<bytes;c0,bytes;c0>>!GWC Out:6
   07:00:41     caused by:
   07:00:41 panic: broken stream: StateChannel[localhost:45275].Send(r1): context canceled
   07:00:41     caused by:
   07:00:41 rpc error: code = Internal desc = unexpected EOF goroutine 51 [running]:
   ```
   
   beam_LoadTests_Go_GBK_Flink_Batch
   beam_LoadTests_Go_Combine_Flink_Batch: Heartbeat timeout
   ```
   05:58:54 Caused by: java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id container_1668592968333_0001_01_000004(beam-loadtests-go-gbk-flink-batch-715-w-3.c.apache-beam-testing.internal:8026) timed out.
   05:58:54     ... 31 more
   05:58:54 2022/11/16 10:58:54  (): java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id container_1668592968333_0001_01_000004(beam-loadtests-go-gbk-flink-batch-715-w-3.c.apache-beam-testing.internal:8026) timed out.
   05:58:54 2022/11/16 10:58:54 Job state: FAILED
   ```
   
   beam_LoadTests_Go_CoGBK_Flink_batch: OOM
   ```
   03:39:28 2022/11/16 08:39:28  (): org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: 30932d0296729e9591b3ea8e710c2dc3)
   03:39:28     at org.apache.flink.client.deployment.ClusterClientJobClientAdapter.lambda$null$6(ClusterClientJobClientAdapter.java:130)
   ...
   03:39:28 Caused by: java.lang.OutOfMemoryError: Direct buffer memory. The 
direct out-of-memory error has occurred. This can mean two things: either 
job(s) require(s) a larger size of JVM direct memory or there is a direct 
memory leak. The direct memory can be allocated by user code or some of its 
dependencies. In this case 'taskmanager.memory.task.off-heap.size' 
configuration option should be increased. Flink framework and its dependencies 
also consume the direct memory, mostly for network communication. The most of 
network memory is managed by Flink and should not result in out-of-memory 
error. In certain special cases, in particular for jobs with high parallelism, 
the framework may require more direct memory which is not managed by Flink. In 
this case 'taskmanager.memory.framework.off-heap.size' configuration option 
should be increased. If the error persists then there is probably a direct 
memory leak in user code or some of its dependencies which has to be 
investigated and fixed. The task executor has to be shutdown...
   ```
   
   beam_LoadTests_Python_Combine_Flink_Streaming: timeout and OOM
   ```
   15:01:39 AssertionError: Job did not reach to a terminal state after waiting indefinitely. Console URL: https://console.cloud.google.com/dataflow/jobs/<RegionId>/2022-11-15_07_07_59-7787368501547085875?project
   ```
   Another failure of the same test:
   ```
   14:11:11 RuntimeError: Pipeline load-tests-python-flink-streaming-combine-4-1115182214_25b36822-a47b-4521-a445-6c31525fd9e9 failed in state FAILED: java.lang.OutOfMemoryError: Java heap space
   ```
   
   beam_LoadTests_Python_Combine_Flink_Batch
   - still running
   
   These all sound like real issues in a production environment (rather than 
the configuration issues the tests had before).
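
   To triage later runs, the failure categories above could be matched mechanically against console logs. The following is only a minimal sketch, assuming the error strings quoted in this comment are representative of each failure mode; the function name `classify_failure` and the category labels are hypothetical and not part of Beam's tooling.

   ```python
   import re

   # Signature patterns copied from the log excerpts quoted in this comment.
   # The labels are illustrative only (an assumption, not an official taxonomy).
   SIGNATURES = [
       ("grpc_error", re.compile(r"rpc error: code = \w+")),
       ("heartbeat_timeout",
        re.compile(r"Heartbeat of TaskManager with id .* timed out", re.S)),
       ("oom_direct", re.compile(r"OutOfMemoryError: Direct buffer memory")),
       ("oom_heap", re.compile(r"OutOfMemoryError: Java heap space")),
       ("no_terminal_state", re.compile(r"did not reach to a terminal state")),
   ]


   def classify_failure(log_text):
       """Return the labels of every known signature found in log_text."""
       return [label for label, pattern in SIGNATURES if pattern.search(log_text)]
   ```

   A log that matches none of the patterns yields an empty list, which would flag it for manual inspection.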

