sydneyhoran commented on issue #12734:
URL: https://github.com/apache/hudi/issues/12734#issuecomment-2657453441

   Hi @ad1happy2go - I am @sweir-thescore's teammate. We don't have any live examples, as we had to repair them all in real time, but this is what it looks like when we have an incomplete rollback: only the top two files, `.rollback.requested` and `.rollback.inflight`, are present (you can ignore the highlighted file). Once these two files are manually deleted, jobs typically succeed, but the table may eventually get out of sync again or throw one of the other types of errors below.
   
   <img width="1643" alt="Image" src="https://github.com/user-attachments/assets/e5c57f09-ee43-4812-8262-f4c18be9da32" />
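
   For reference, a minimal sketch of how we identify the dangling rollback files before deleting them. This is a hypothetical helper, not part of Hudi's API; it assumes the standard timeline layout under `.hoodie/`, where a completed rollback leaves a plain `<instant>.rollback` file alongside the `.requested`/`.inflight` markers:

   ```python
   from pathlib import Path

   def pending_rollback_files(timeline_dir):
       """Return .rollback.requested / .rollback.inflight files whose instant
       has no completed .rollback file in the timeline directory.

       Assumption: standard Hudi timeline naming, i.e. files named
       <instant>.rollback, <instant>.rollback.requested, <instant>.rollback.inflight.
       """
       timeline = Path(timeline_dir)
       # Instants with a completed rollback (plain "<instant>.rollback" file).
       completed = {p.name[: -len(".rollback")] for p in timeline.glob("*.rollback")}
       pending = []
       for suffix in (".rollback.requested", ".rollback.inflight"):
           for p in timeline.glob(f"*{suffix}"):
               instant = p.name[: -len(suffix)]
               if instant not in completed:
                   # Dangling marker: rollback was scheduled/started but never completed.
                   pending.append(p)
       return sorted(pending)
   ```

   In practice we run this against a local copy of the timeline listing and then delete the reported objects from GCS by hand; for the metadata table's timeline the same check would apply under `.hoodie/metadata/.hoodie/`.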
   
   Every subsequent job will then fail with errors like:
   
   `Caused by: org.apache.hudi.exception.HoodieRollbackException: Found commits 
after time :20240913164753886, please rollback greater commits first` 
   
   `org.apache.hudi.timeline.service.RequestHandler: Bad request response due 
to client view behind server view`
   
   `HoodieMetadataException: Metadata table's deltacommits exceeded 1000: this 
is likely caused by a pending instant in the data table`
   
   `Caused by: org.apache.hudi.exception.HoodieIOException: Failed to read 
footer for parquet 
gs://.../inserted_at_date=2025-01-19/..._20250129044440519.parquet`
   
   `Caused by: java.io.FileNotFoundException: File not found: 
gs://.../inserted_at_date=2025-01-19/..._20250129044440519.parquet`
   
   (although these may be separate issues on their own)
   
   Of note, we only see these errors when running on a GCE cluster with a dedicated driver pool. After switching back to the regular GCE node type, we no longer face this issue when a job is cancelled or fails. The Spark drivers on the dedicated driver pool also required about 5x more memory (i.e. 5GB instead of 1GB on the current cluster), and still sometimes hit OOMs (which lead to the errors above).
   
   We are also investigating this within GCP/Dataproc and rethinking how we want to architect the cluster, but these metadata/timeline issues were the primary reason we could not switch to the new cluster configuration, so we wanted to check if you have any thoughts here as well.
   
   Thanks in advance!
