guanziyue edited a comment on pull request #4264:
URL: https://github.com/apache/hudi/pull/4264#issuecomment-994684618


   > Hi vinothchandar:
   The concurrent write to HoodieParquetWriter happens at the following code:
   
https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkMergeHelper.java#L103
   When speculation is triggered, we call mergeHandle.close, which calls the 
parquetWriter close method. At that moment the boundedInMemoryExecutor is still 
running, so the write method of mergeHandle, and through it the write method of 
parquetWriter, is invoked concurrently with close.
   The parquetWriter holds state that is not thread safe. It keeps a 
BytesInput, the internal storage for data in parquet column format, which is 
not thread safe and whose life cycle is managed by parquetWriter. The writer's 
state transitions must therefore be serialized, one thread at a time. If a 
reset is issued while a write is in progress, the buffer may not be fully 
cleared. Because such data structures are reused within the JVM, a BytesInput 
that was not fully cleared may return wrong results the next time it is used.
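
   To illustrate the kind of serialization described above, here is a minimal 
sketch. It is not the fix proposed in this PR and it uses no Hudi or Parquet 
classes: ToyWriter, GuardedWriter and closeWhileWriting are hypothetical names, 
and ToyWriter only stands in for a writer with unsynchronized, reusable 
buffers. The idea is that write and close take the same lock, and that a write 
arriving after close is rejected instead of touching the buffer.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class GuardedWriterSketch {

    /** Hypothetical stand-in for a writer with unsynchronized internal buffers. */
    static final class ToyWriter {
        private final StringBuilder buffer = new StringBuilder();
        private int written = 0;

        void write(String record) {
            buffer.append(record).append('\n');
            written++;
        }

        int close() {
            buffer.setLength(0); // reset the reusable buffer
            return written;
        }
    }

    /** Makes write and close mutually exclusive and rejects writes after close. */
    static final class GuardedWriter {
        private final ToyWriter delegate = new ToyWriter();
        private boolean closed = false;

        synchronized boolean write(String record) {
            if (closed) {
                return false; // never touch the buffer once it has been reset
            }
            delegate.write(record);
            return true;
        }

        synchronized int close() {
            closed = true;
            return delegate.close();
        }
    }

    /** Closes the writer while a background thread is still producing records. */
    static int closeWhileWriting(int records) {
        GuardedWriter writer = new GuardedWriter();
        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.submit(() -> {
            for (int i = 0; i < records; i++) {
                if (!writer.write("record-" + i)) {
                    return; // writer was closed underneath us: stop producing
                }
            }
        });
        int accepted = writer.close(); // simulates the close issued on the kill path
        executor.shutdown();
        try {
            if (!executor.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("writer thread did not stop");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return accepted;
    }

    public static void main(String[] args) {
        int accepted = closeWhileWriting(100_000);
        System.out.println("records accepted before close: " + accepted);
    }
}
```

   The number of records accepted before close depends on thread timing, so it 
varies between runs. What the lock guarantees is only that no write overlaps 
the reset. Waiting for the executor to stop before calling close would be 
another way to get the same guarantee.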



