voonhous commented on issue #9615:
URL: https://github.com/apache/hudi/issues/9615#issuecomment-1709869920

   We encountered this internally too, here's the Spark-UI screenshots:
   
   # Duplicated files
   1. Notice that the instants are the same
   2. Take note of the write tokens
   3. There are two duplicated fileIDs: `00000014-fbad-473e-9975-b8619cb93e2d` and `00000038-bb64-4a32-a930-74efee799a0d`
   4. The format of the write token is: `{index}_{irrelevant}_{taskID}`
   
   ```
   -rw-rw-r--   3 test test     447941 2023-07-16 07:03 
hdfs://path_to_table/part_date=2023-07-16/bucket_id=0/00000038-bb64-4a32-a930-74efee799a0d_10-3-130_20230716000128334.parquet
   -rw-rw-r--   3 test test          4 2023-07-16 07:04 
hdfs://path_to_table/part_date=2023-07-16/bucket_id=0/00000038-bb64-4a32-a930-74efee799a0d_10-3-76_20230716000128334.parquet
   
   -rw-rw-r--   3 test test          4 2023-07-16 07:04 
hdfs://path_to_table/part_date=2023-07-16/bucket_id=0/00000014-fbad-473e-9975-b8619cb93e2d_38-3-104_20230716000128334.parquet
   -rw-rw-r--   3 test test     446996 2023-07-16 07:03 
hdfs://path_to_table/part_date=2023-07-16/bucket_id=0/00000014-fbad-473e-9975-b8619cb93e2d_38-3-131_20230716000128334.parquet
   ```
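   To make the naming convention above concrete, here is a small illustrative Python sketch (names are my own) that splits a base file name into fileID, write token, and instant, assuming the `{fileId}_{writeToken}_{instant}.parquet` layout shown in the listing, with the task ID as the last component of the write token:

```python
import re

# Assumed layout, per the listing above:
# {fileId}_{writeToken}_{instantTime}.parquet
# where the write token's last component is the Spark task ID.
NAME_RE = re.compile(
    r"(?P<file_id>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})_"
    r"(?P<write_token>\d+-\d+-\d+)_"
    r"(?P<instant>\d+)\.parquet$"
)

def parse_base_file(name: str) -> dict:
    """Extract fileID, write token, instant, and task ID from a base file name."""
    m = NAME_RE.search(name)
    if m is None:
        raise ValueError(f"not a Hudi base file name: {name!r}")
    parts = m.groupdict()
    # Last component of the write token is the task ID.
    parts["task_id"] = parts["write_token"].rsplit("-", 1)[-1]
    return parts

info = parse_base_file(
    "00000038-bb64-4a32-a930-74efee799a0d_10-3-76_20230716000128334.parquet"
)
```

   Applied to the size-4 file above, this yields task ID `76`, matching the speculative task in the screenshots below.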
   
   # Screenshots
   
   With speculative execution enabled, the task IDs `76` and `104` match the write tokens in my previous comment (the parquet files with size 4).
   
   
![image](https://github.com/apache/hudi/assets/6312314/91e70d24-e7f1-44b6-bdef-5a1a7f66f5cb)
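   For context, speculative execution is governed by Spark settings like the following. This is only an illustrative `spark-submit` sketch; the values and the job file name are examples, not our actual configuration:

```shell
# Illustrative flags only; values and job name are placeholders.
spark-submit \
  --conf spark.speculation=true \
  --conf spark.speculation.interval=100ms \
  --conf spark.speculation.multiplier=1.5 \
  --conf spark.speculation.quantile=0.75 \
  your_hudi_job.py
```

   With `spark.speculation=true`, Spark launches duplicate attempts of tasks it deems slow, which is what produced the second set of write tokens here.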
   
   The affected host that was stuck in the running stage: `10-169-32-6`.
   
   The launch time for the containers of the affected task IDs is the same: `2023-07-16 07:03:45`.
   
   As the containers running task IDs `76` and `104` had not succeeded yet, speculative execution started another batch of writes for the same set of data with task IDs `130` and `131`.
   
   The launch time for the SUCCESSFUL containers of these task IDs is the same: `2023-07-16 07:03:55`.
   
   The second batch, with task IDs `130` and `131`, was successful.
   
   
![image](https://github.com/apache/hudi/assets/6312314/01ab51f9-9cb0-48c9-a4b7-7b47a7d9734a)
   
   
![image](https://github.com/apache/hudi/assets/6312314/c1d8b303-7f49-4eff-a485-ca68706aba4b)
   
   After the second batch succeeded, Spark attempts to kill all tasks that may still be running due to speculative execution. However, there is no callback or ACK from the containers confirming that they have actually been killed before reconcile-against-markers is executed; that stage starts right after.
   
   Reconcile against markers started at `2023-07-16 07:03:57` and finished promptly, in under 1s.
   
![image](https://github.com/apache/hudi/assets/6312314/b5f6f327-a000-4a3d-9691-cc169a32435f)
   
   As the containers running task IDs `76` and `104` had not been killed yet, they could still attempt to write files after reconcile against markers.
   
   Apologies, I do not have a screenshot for this, but I do have the logs and the creation time, down to the millisecond, of when each marker was created. Markers are created before the actual flushing of data to parquet files.
   
   As can be seen, the markers were created after reconcile against markers had finished executing.
   
   ```
   2023-07-16 07:03:58,962: 
00000038-bb64-4a32-a930-74efee799a0d_10-3-76_20230716000128334.parquet.marker.CREATE
   2023-07-16 07:03:59,833: 
00000014-fbad-473e-9975-b8619cb93e2d_38-3-104_20230716000128334.parquet.marker.CREATE
   ```
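   The race can be stated mechanically. Below is a small Python sketch using the marker creation times from the log excerpt above; the reconcile cut-off of `07:03:58,000` is my assumption, derived from the reported start of `07:03:57` plus the stated sub-second duration:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S,%f"

# Assumed upper bound on when reconcile-against-markers finished:
# it started at 07:03:57 and took under a second (per the Spark UI).
reconcile_finished_by = datetime.strptime("2023-07-16 07:03:58,000", FMT)

# Marker creation times, taken verbatim from the logs above.
markers = {
    "00000038-bb64-4a32-a930-74efee799a0d_10-3-76_20230716000128334"
    ".parquet.marker.CREATE":
        datetime.strptime("2023-07-16 07:03:58,962", FMT),
    "00000014-fbad-473e-9975-b8619cb93e2d_38-3-104_20230716000128334"
    ".parquet.marker.CREATE":
        datetime.strptime("2023-07-16 07:03:59,833", FMT),
}

# Markers created after reconciliation ran were never cleaned up.
late = [name for name, ts in markers.items() if ts > reconcile_finished_by]
```

   Both markers land in `late`: they were created after the cleanup pass, so the files they announce escaped reconciliation entirely.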
   
   After reconcile against markers completed, the commit was executed and the driver exited.
   
   At this point, the containers running task IDs `76` and `104` were still flushing data to their parquet files. Since the driver had terminated, all executors were terminated along with it, leaving parquet files of size 4 (corrupt parquet files).
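   Leftovers like these can be found after the fact by grouping base files by fileID and instant: more than one write token per group indicates an orphaned speculative write. A hypothetical Python sketch (function name is mine), fed with the file names from the listing above:

```python
from collections import defaultdict

def find_duplicate_base_files(file_names):
    """Group base files by (fileId, instant); more than one write token
    in a group means a leftover speculative write survived."""
    groups = defaultdict(list)
    for name in file_names:
        base = name.rsplit("/", 1)[-1].removesuffix(".parquet")
        file_id, write_token, instant = base.split("_")
        groups[(file_id, instant)].append(write_token)
    return {k: v for k, v in groups.items() if len(v) > 1}

# File names from the HDFS listing above.
listing = [
    "00000038-bb64-4a32-a930-74efee799a0d_10-3-130_20230716000128334.parquet",
    "00000038-bb64-4a32-a930-74efee799a0d_10-3-76_20230716000128334.parquet",
    "00000014-fbad-473e-9975-b8619cb93e2d_38-3-104_20230716000128334.parquet",
    "00000014-fbad-473e-9975-b8619cb93e2d_38-3-131_20230716000128334.parquet",
]
dupes = find_duplicate_base_files(listing)
```

   For this listing, both fileIDs are flagged: each carries two write tokens for the same instant, one from the killed speculative attempt and one from the successful one.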

