manojpec commented on pull request #4114:
URL: https://github.com/apache/hudi/pull/4114#issuecomment-979410799


   @nsivabalan @vinothchandar, here are the nuances in the existing code:
   
   1. The SparkRDDWriteClient constructor attempts to create the MT writer (and thereby trigger the bootstrapping) only if no upgrade is needed. In our case, on the first run after upgrading to 0.10, this code path will not be taken.
   2. However, for each table operation like insert*/upsert/bulkInsert, the client gets the table and initializes the context via `SparkRDDWriteClient#getTableAndInitCtx()`. This is usually what triggers the actual table upgrade for the first time.
      2.1 After the upgrade, it creates the MT writer, which then triggers the bootstrapping.
      2.2 MT bootstrapping fails because there are pending actions on the timeline (a guard of this shape is sketched after point 6 below).
      2.3 The writer then goes on to initiate the respective action executors.
   3. At commit/finalizeWrite time, an attempt is made to write to the MT: the action executors call `HoodieSparkTable#getMetadataWriter()`. This is where the next problem lies.
   4. Before the fix, as a perf optimization, the flag `isMetadataAvailabilityUpdated` is used to record whether the MT availability has already been checked. After the first check, if for any reason the MT is not ready (which is my testing case: after the upgrade, bootstrapping failed due to a pending table service action), the other flag `isMetadataTableAvailable` is set to false. Now we have `isMetadataAvailabilityUpdated=true` and `isMetadataTableAvailable=false`. This is the Spark table that the write client will use for all subsequent table operations, so all of them are short-circuited by the flags and never try to re-read the MT availability (as sketched right below).
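
   To make the short-circuit in point 4 concrete, here is a minimal standalone sketch of that caching behaviour. It is not the actual Hudi code; apart from the two flag names, the class, method, and return types are illustrative stand-ins.

```java
import java.util.Optional;

// Sketch of the pre-fix availability caching described in point 4.
class PreFixSparkTableSketch {
  // Once the first probe runs, its result is cached for the lifetime of this table object.
  private boolean isMetadataAvailabilityUpdated = false;
  private boolean isMetadataTableAvailable = false;

  Optional<String> getMetadataWriter() {
    if (!isMetadataAvailabilityUpdated) {
      // First (and only) probe of the metadata table.
      isMetadataTableAvailable = probeMetadataTable();
      isMetadataAvailabilityUpdated = true;
    }
    // Short-circuit: once a negative result is cached, every subsequent table
    // operation lands here and never re-probes, even after the timeline is repaired.
    return isMetadataTableAvailable ? Optional.of("metadata-writer") : Optional.empty();
  }

  // Stand-in for the real availability check; returns false when MT
  // bootstrapping failed (e.g. due to pending actions on the timeline).
  private boolean probeMetadataTable() {
    return false;
  }

  public static void main(String[] args) {
    PreFixSparkTableSketch table = new PreFixSparkTableSketch();
    System.out.println(table.getMetadataWriter()); // Optional.empty -> MT not ready
    System.out.println(table.getMetadataWriter()); // still Optional.empty, no re-probe
  }
}
```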
   
   5. Later, if async table services are enabled, a clustering thread would come along and repair the timeline for any pending actions. If async table services are not enabled, which is the case in my testing, there is no one to repair the pending action. Assuming the timeline has now been repaired, the table is good for MT bootstrapping again. But someone has to attempt the writer creation again.
   6. Let's say we have this single writer still running in continuous mode and no other writer client. Yes, each table operation calls `HoodieSparkTable#getMetadataWriter()` to get the writer, but it never attempts to create the MT again because of the flags.
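
   For reference, this is roughly the guard described in 2.2, again as a standalone sketch with hypothetical names rather than the real Hudi bootstrap code: bootstrapping is skipped while the data-table timeline still has pending instants, and nothing in the pre-fix flow comes back to retry it.

```java
import java.util.List;

// Sketch of the "pending actions block MT bootstrap" behaviour from 2.2.
class MetadataBootstrapGuardSketch {

  static boolean canBootstrapMetadataTable(List<String> pendingInstants) {
    if (!pendingInstants.isEmpty()) {
      // A pending table service action blocks bootstrapping; it has to be
      // retried after the timeline is repaired.
      System.out.println("Skipping MT bootstrap, pending instants: " + pendingInstants);
      return false;
    }
    return true;
  }

  public static void main(String[] args) {
    // Right after the upgrade: a pending table service action blocks the first attempt.
    System.out.println(canBootstrapMetadataTable(List.of("pending-table-service-action"))); // false
    // After the timeline is repaired a retry would succeed, but only if
    // something re-attempts the metadata writer creation (see point 6).
    System.out.println(canBootstrapMetadataTable(List.of())); // true
  }
}
```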
   
   What will save us:
   1. Creation of a new writer client
   2. A change in schema, which will force the existing writer client to re-initialize the context, including the MT writer
   3. Async table service operations repairing the timeline
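
   For contrast with the pre-fix sketch above, one possible shape of the behaviour after the fix (the actual patch in this PR may do this differently) is to cache only a positive availability result, so the very next table operation re-probes the MT once the timeline has been repaired:

```java
import java.util.Optional;

// Hedged sketch of one possible post-fix behaviour: a negative availability
// result is not cached, so later table operations keep re-probing.
class PostFixSparkTableSketch {
  // Only a positive result is remembered; a negative one triggers a re-probe next time.
  private boolean isMetadataTableAvailable = false;

  Optional<String> getMetadataWriter() {
    if (!isMetadataTableAvailable) {
      // Re-probe on every call until the metadata table becomes available.
      isMetadataTableAvailable = probeMetadataTable();
    }
    return isMetadataTableAvailable ? Optional.of("metadata-writer") : Optional.empty();
  }

  // Stand-in for the real check: bootstrapped MT present and no blocking pending actions.
  private boolean probeMetadataTable() {
    return false;
  }
}
```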
   
   

