jenu9417 commented on issue #7991: URL: https://github.com/apache/hudi/issues/7991#issuecomment-1438812225
@umehrot2 I checked the `.hoodie/archived` folder. There are no files present under that folder. I also tried running with Hive sync turned off (by omitting the `--enable-sync` flag from the command). The number of requests came down significantly, by roughly 95%.

I filtered the API requests for the prefix `/data/testfolder` while ingesting 1000 records (900 inserts + 100 updates).

With Hive Sync Enabled:
```
HEAD - 799
GET - 86
PUT - 359
DELETE - 78
LIST - 1271 (Happening in the bucket at the same time. Not for the same prefix)
```
Without Hive Sync:
```
HEAD - 35
GET - 8
PUT - 3
DELETE - 7
LIST - 1076 (Happening in the bucket at the same time. Not for the same prefix)
```
All request types have dropped except LIST. The LIST requests are not issued against the target prefix (/data/testfolder) but against the entire bucket (like /data), at the same time as the ingestion. I verified that no other writes are happening to this bucket. There are other prefixes / tables inside the same bucket that hold data (like /data/newtestfolder/), but nothing is actively reading them. Is Hudi, by any chance, trying to list all the files under the parent prefix (/data)? I am not sure, but could this be a reason?

a) What could be the reason for the high number of LIST operations? Is it possible to reduce them?

b) Now that we have more or less established that Hive sync is the root cause of the problem, what could be the solution for us here? Is there any workaround? Will downgrading to a lower version help? Is there a particular stable EMR version you could suggest?

c) How can we correlate the number of different API calls with the write operation? We are trying to get the request counts for writing 1 record, to understand how they scale to 1000 records. Looking at the current numbers, there doesn't seem to be much correlation.
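For reference, the roughly 95% figure can be checked against the counts listed above. This is a minimal Python sketch that uses only the numbers from this comment; the variable and function names are illustrative and are not part of Hudi or any AWS tooling.

```python
# Request counts for the prefix /data/testfolder, copied from the lists above.
with_sync = {"HEAD": 799, "GET": 86, "PUT": 359, "DELETE": 78}
without_sync = {"HEAD": 35, "GET": 8, "PUT": 3, "DELETE": 7}

# LIST is tracked separately because it was observed at bucket level,
# not for the target prefix.
list_with_sync = 1271
list_without_sync = 1076


def reduction_pct(before, after):
    """Return the percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before


# Per-method reduction for the prefix-scoped requests.
per_method = {
    method: reduction_pct(with_sync[method], without_sync[method])
    for method in with_sync
}

# Overall reduction across HEAD, GET, PUT and DELETE.
total_before = sum(with_sync.values())
total_after = sum(without_sync.values())
overall = reduction_pct(total_before, total_after)

# Reduction in the bucket-level LIST requests.
list_reduction = reduction_pct(list_with_sync, list_without_sync)

if __name__ == "__main__":
    for method, pct in per_method.items():
        print(f"{method}: {pct:.1f}% fewer requests")
    print(f"Non-LIST total: {total_before} -> {total_after} ({overall:.1f}% fewer)")
    print(f"LIST: {list_with_sync} -> {list_without_sync} ({list_reduction:.1f}% fewer)")
```

With these counts, the non-LIST requests drop from 1322 to 53 (about 96%), whereas LIST drops only from 1271 to 1076 (about 15%), which is why LIST stands out.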
Is there any particular documentation or blog that could be helpful here? We are trying to evaluate the feasibility of using S3 as primary storage. For this we need to understand this API call usage.
