guanziyue edited a comment on pull request #4264: URL: https://github.com/apache/hudi/pull/4264#issuecomment-994684618
> Hi vinothchandar: Concurrent writing to HoodieParquetWriter occurs at the following code: https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkMergeHelper.java#L103
>
> When speculation is triggered, we call mergeHandle.close, which calls the ParquetWriter close method. At the same time, the BoundedInMemoryExecutor is still running, so mergeHandle.write is invoked concurrently, which in turn calls the ParquetWriter write method. ParquetWriter carries state that is not thread safe: it holds a BytesInput, the internal data store for the Parquet column format, whose lifecycle is managed by the writer. Access to that state must therefore be serialized. If a reset runs while a write is still in flight, it may not fully clear the BytesInput as expected; since that structure is reused within the JVM, a partially cleared BytesInput can return wrong results on subsequent use.
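The race described above can be sketched as follows. This is a minimal, hypothetical illustration (the class and buffer here are illustrative stand-ins, not Hudi's or Parquet's actual API): a writer whose internal buffer plays the role of BytesInput, with close() and write() serialized on a lock so a speculative task's cleanup cannot interleave with an in-flight write from the executor.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: guard a non-thread-safe writer so close() (called by
// speculative-task cleanup) and write() (called by the still-running executor)
// cannot interleave on shared internal state.
class GuardedWriter {
    // Stand-in for Parquet's internal BytesInput storage.
    private final StringBuilder buffer = new StringBuilder();
    private final AtomicBoolean closed = new AtomicBoolean(false);

    // All access to the underlying state is serialized on `this`.
    public synchronized boolean write(String record) {
        if (closed.get()) {
            // Writer already closed: reject the write instead of
            // mutating a buffer that has been (or is being) reset.
            return false;
        }
        buffer.append(record).append('\n');
        return true;
    }

    public synchronized void close() {
        closed.set(true);
        // Reset internal storage exactly once, under the lock, so no
        // concurrent write can leave it partially cleared for reuse.
        buffer.setLength(0);
    }

    public synchronized boolean isClosed() {
        return closed.get();
    }
}
```

Without the synchronization, a close() racing a write() could reset the buffer mid-append, leaving stale bytes behind for the next user of the pooled structure, which is the corruption mode described in the comment.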
