Re: [PR] [HUDI-9025] perf: improve append performance by reducing avro schema comparisons [hudi]

via GitHub Mon, 10 Mar 2025 06:38:05 -0700


TheR1sing3un commented on PR #12839:
URL: https://github.com/apache/hudi/pull/12839#issuecomment-2710624995


   
   > We already have the reference check in the `schema::equals` method. Do 
you mean one of the schemas comes from the incoming new record?
   
   During the lifetime of the JVM, many tasks may run. The task that calls 
`getCachedSchema` first puts its own `Schema` instance into the cache. Later 
tasks invoke `get()` with their own `Schema` instances, so the reference check 
fails and the lookup falls back to a full `Schema::equals` comparison.
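
   To make the point above concrete, here is a minimal sketch. It assumes the 
cache behaves like a plain `java.util.HashMap`, which may not match the real 
cache, and it uses a hypothetical `FakeSchema` stand-in instead of Avro's 
`Schema`. It counts how often a lookup gets past the reference check and has to 
do the structural comparison.

```java
import java.util.HashMap;
import java.util.Map;

public class SchemaCacheDemo {
    // Hypothetical stand-in for an Avro Schema (not the real class).
    // It counts how often the structural comparison path is taken.
    static final class FakeSchema {
        static int deepComparisons = 0;
        private final String definition;

        FakeSchema(String definition) {
            this.definition = definition;
        }

        @Override
        public int hashCode() {
            return definition.hashCode();
        }

        @Override
        public boolean equals(Object other) {
            if (this == other) {
                return true; // cheap reference check
            }
            if (!(other instanceof FakeSchema)) {
                return false;
            }
            deepComparisons++; // the expensive structural path
            return definition.equals(((FakeSchema) other).definition);
        }
    }

    // Returns {comparisons for a lookup with the cached instance,
    //          comparisons for a lookup with an equal but distinct instance}.
    static int[] simulateTwoTasks() {
        // Assumption: a JVM-wide cache modelled as a plain HashMap.
        Map<FakeSchema, FakeSchema> cache = new HashMap<>();
        String definition = "{\"type\":\"record\",\"name\":\"r\",\"fields\":[]}";

        // The first task creates its own instance and populates the cache.
        FakeSchema fromTaskA = new FakeSchema(definition);
        cache.putIfAbsent(fromTaskA, fromTaskA);
        FakeSchema.deepComparisons = 0;
        cache.get(fromTaskA);
        int sameInstance = FakeSchema.deepComparisons;

        // A later task creates an equal schema, but as a new object.
        FakeSchema fromTaskB = new FakeSchema(definition);
        FakeSchema.deepComparisons = 0;
        cache.get(fromTaskB);
        int otherInstance = FakeSchema.deepComparisons;

        return new int[] {sameInstance, otherInstance};
    }

    public static void main(String[] args) {
        int[] counts = simulateTwoTasks();
        System.out.println("same instance: " + counts[0]
            + ", other instance: " + counts[1]);
    }
}
```

   In this sketch the lookup with the cached instance needs no structural 
comparison, while the lookup from the second task needs one on every call.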
   
   > Can a local `AvroSchemaCache`-like cache solve the problem? (For the 
schema from the input record and the schema from the append handle, we always 
fetch it from the cache.)
   
   I think that would be hard to implement well. For example, a thread-local 
cache does not help: Spark's executor uses a thread pool to schedule the tasks 
it receives, so one thread may run many tasks one after another, and each task 
still creates its own `Schema` instance. A per-thread cache therefore has the 
same problem as the JVM-wide one. If you want a local cache, its scope has to 
be the task, not the thread and not the JVM.
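
   As a rough illustration of what "task level" scope could mean, here is a 
sketch of an interning cache keyed by a task identifier and dropped when the 
task finishes. Everything in it is hypothetical: `TaskScopedCache`, `intern`, 
`onTaskCompleted`, and the `long` task ID (standing in for something like a 
Spark task attempt ID) are names invented for this example, and plain `String` 
objects stand in for `Schema` instances.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TaskScopedCacheDemo {
    // Hypothetical task-scoped interning cache (illustrative names only).
    static final class TaskScopedCache<T> {
        // Outer map is shared across threads. Each inner map is assumed to be
        // touched only by the single thread that runs the owning task.
        private final Map<Long, Map<T, T>> perTask = new ConcurrentHashMap<>();

        // Returns the canonical instance for this task, registering `value`
        // as canonical if the task has not seen an equal value yet.
        T intern(long taskId, T value) {
            Map<T, T> cache = perTask.computeIfAbsent(taskId, id -> new HashMap<>());
            T existing = cache.putIfAbsent(value, value);
            return existing != null ? existing : value;
        }

        // Drops everything the task cached, so the next task on the same
        // thread starts with an empty cache.
        void onTaskCompleted(long taskId) {
            perTask.remove(taskId);
        }

        int liveTasks() {
            return perTask.size();
        }
    }

    public static void main(String[] args) {
        TaskScopedCache<String> cache = new TaskScopedCache<>();

        // Two equal but distinct objects inside task 1 (for example the
        // record's schema and the handle's schema).
        String recordSchema = new String("schema-v1");
        String handleSchema = new String("schema-v1");
        String canonicalA = cache.intern(1L, recordSchema);
        String canonicalB = cache.intern(1L, handleSchema);
        System.out.println("same reference within task 1: " + (canonicalA == canonicalB));

        // Task 2 gets its own canonical instance, independent of task 1.
        String task2Schema = new String("schema-v1");
        System.out.println("task 2 keeps its own instance: "
            + (cache.intern(2L, task2Schema) == task2Schema));

        cache.onTaskCompleted(1L);
        System.out.println("live tasks after cleanup: " + cache.liveTasks());
    }
}
```

   In this sketch a task pays for the structural comparison only when it 
interns a second, distinct instance. Every later comparison between interned 
instances inside that task can succeed on the reference check. The cleanup call 
would have to be wired to the end of the task, for example from a task 
completion callback.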
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

