MrAladdin commented on issue #11178: URL: https://github.com/apache/hudi/issues/11178#issuecomment-2107197799
> @MrAladdin
>
> 1. Ideally this should not be the reason for this exception; it looks more like the parquet file simply got corrupted. Are you facing this issue frequently?
> 2. Not very sure about it. Adding @xushiyan in case he knows.
> 3. If individual hfile files are too large, you can increase the file group count. It seems too many record keys are assigned to each file group. Once you restart the writer (Spark streaming job), it will take effect for new writes. To fix the size of the already existing index files, you may need to recreate the record index.

1. I occasionally ran into this problem in version 0.12, where the workaround was to delete the damaged files with `hadoop fs -rm -r`. Now, after upgrading, this is the first time the issue has appeared in version 0.14.

3. Ideally, should each hfile in the record_index stay around 1 GB in size? And how do I rebuild an overly large record_index: is there a simple command for it, or does it require rewriting the data?
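For point 3, the file group count is controlled through the metadata table write configs. A minimal sketch of the relevant writer options is below; the key names are assumed from Hudi 0.14.x metadata config naming and the counts shown are illustrative, so verify both against your version's documentation before use:

```python
# Hedged sketch: spreading record index keys across more (smaller) HFiles
# by raising the file group count. Passed as Hudi write options on the
# Spark streaming writer; takes effect for new writes after a restart.
hudi_options = {
    "hoodie.metadata.record.index.enable": "true",
    # Assumed 0.14.x keys: minimum and maximum number of file groups
    # the record index can use. Example values only.
    "hoodie.metadata.record.index.min.filegroup.count": "20",
    "hoodie.metadata.record.index.max.filegroup.count": "100",
}

for key, value in sorted(hudi_options.items()):
    print(f"{key}={value}")
```

As the maintainer's reply notes, this only applies to new writes once the writer restarts; already-written oversized index files still require rebuilding the record index itself.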
