jenu9417 commented on issue #7991:
URL: https://github.com/apache/hudi/issues/7991#issuecomment-1438812225

   @umehrot2 
   I checked the `.hoodie/archived` folder. There are no files present under that folder.
   I also tried running with Hive sync turned off (by omitting the `--enable-sync` flag from the command). The number of requests came down significantly, by roughly 95%.
   I then filtered the API requests for the prefix `/data/testfolder` while ingesting 1000 records (900 inserts + 100 updates):
   With Hive Sync Enabled:
   ```
   HEAD -  799
   GET -  86
   PUT - 359
   DELETE - 78
   LIST - 1271   (Happening in the bucket at the same time. Not for the same prefix)
   ```
   
   Without Hive Sync:
   ```
   HEAD -  35
   GET -  8
   PUT - 3
   DELETE - 7
   LIST - 1076  (Happening in the bucket at the same time. Not for the same prefix)
   ```
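
   As a quick sanity check on the "roughly 95%" figure, the two sets of counts can be compared per operation. This is only arithmetic on the numbers above; the helper names (`reduction_pct`, `total`) are mine and not part of any Hudi or AWS tooling.

```python
# Request counts copied from the two runs above (1000 records each).
with_sync = {"HEAD": 799, "GET": 86, "PUT": 359, "DELETE": 78, "LIST": 1271}
without_sync = {"HEAD": 35, "GET": 8, "PUT": 3, "DELETE": 7, "LIST": 1076}


def reduction_pct(before, after):
    """Percentage drop going from `before` to `after`."""
    return 100.0 * (before - after) / before


def total(counts, exclude=()):
    """Sum of all request counts, skipping any operations in `exclude`."""
    return sum(n for op, n in counts.items() if op not in exclude)


# Drop per operation type.
per_op = {
    op: round(reduction_pct(with_sync[op], without_sync[op]), 1)
    for op in with_sync
}

# Drop across everything except LIST, and across all requests.
non_list = round(
    reduction_pct(total(with_sync, {"LIST"}), total(without_sync, {"LIST"})), 1
)
overall = round(reduction_pct(total(with_sync), total(without_sync)), 1)

print("per operation:", per_op)
print("excluding LIST:", non_list)
print("including LIST:", overall)
```

   On these numbers the non-LIST requests drop by about 96%, while LIST drops by only about 15%, so the overall reduction including LIST is closer to 56%.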
   
   Here, all request types have dropped except LIST. The LIST requests are not issued against the target prefix (`/data/testfolder`) but against the wider bucket (for example `/data`), and they occur at the same time as the ingestion. I verified that there are no other writes happening to this bucket. There are other prefixes / tables inside the same bucket that hold data (like `/data/newtestfolder/`), but nothing is actively reading them.
   Is Hudi by any chance trying to list all the files under the parent prefix (`/data`)? I am not sure, but could this be the reason?
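
   One way to check where the LIST calls land would be to tally requests per operation, split by whether they fall under the target prefix. The sketch below assumes the request log has already been reduced to (operation, path) pairs. The sample records and the `tally` helper are made up for illustration and are not real data from this run.

```python
from collections import Counter

# Hypothetical (operation, key_or_prefix) pairs, as they might be extracted
# from a request log. These values are invented for illustration only.
records = [
    ("LIST", "data/"),
    ("LIST", "data/newtestfolder/"),
    ("LIST", "data/testfolder/.hoodie/"),
    ("HEAD", "data/testfolder/.hoodie/hoodie.properties"),
    ("PUT", "data/testfolder/part-0001.parquet"),
]


def tally(records, target_prefix):
    """Count requests per operation, split by whether the path is under target_prefix."""
    inside, outside = Counter(), Counter()
    for op, path in records:
        bucket = inside if path.startswith(target_prefix) else outside
        bucket[op] += 1
    return inside, outside


inside, outside = tally(records, "data/testfolder/")
print("under target prefix:", dict(inside))
print("elsewhere in bucket:", dict(outside))
```

   If most LIST calls end up in the "elsewhere" group, that would support the suspicion that the parent prefix is being listed.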
   
   a) What could be the reason for the high number of LIST operations? Is it possible to reduce them?
   
   b) Now that we have more or less established that Hive sync is the root cause of the problem, what could the solution be for us? Is there any workaround? Will downgrading to a lower version help? Is there any particular stable EMR version you could suggest?
   
   c) How can we correlate the number of different API calls with the write operation? We are trying to get the numbers for writing 1 record, to understand how they scale to 1000 records. Looking at the current numbers, there doesn't seem to be much correlation. Is there any particular documentation or blog that could be helpful here? We are trying to evaluate the feasibility of using S3 as primary storage, and for this we need to understand the API call usage.

