[
https://issues.apache.org/jira/browse/HUDI-3301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493623#comment-17493623
]
Yue Zhang edited comment on HUDI-3301 at 2/17/22, 3:26 AM:
-----------------------------------------------------------
Hi [~manojg] and [~guoyihua]
A quick thought on this problem: for now we cache the logScanner, which holds
the shared `records` instance, at the <partition,slice> level, whether or not
`enable.full.scan.log.files` is enabled.
If we disable full scan, looking up only the entries of interest runs into the
concurrency issue.
How about a small change here:
1. If full scan is enabled ==> we cache the scanner at the partition level
2. If full scan is disabled ==> we create a new scanner each time, assuming that
the data set obtained by the partial read is not large; this may also avoid the
concurrency issue at its root.
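The two cases above could be sketched roughly as follows. This is only an
illustration of the proposal, not Hudi code: `ScannerProvider`, `scannerFor`,
and the `scannerFactory` parameter are hypothetical names, and the scanner type
is left generic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch; none of these names are real Hudi APIs.
final class ScannerProvider<S> {
    private final boolean fullScan;                    // stands in for enable.full.scan.log.files
    private final Function<String, S> scannerFactory;  // builds a scanner for a partition
    private final Map<String, S> cache = new ConcurrentHashMap<>();

    ScannerProvider(boolean fullScan, Function<String, S> scannerFactory) {
        this.fullScan = fullScan;
        this.scannerFactory = scannerFactory;
    }

    S scannerFor(String partition) {
        if (fullScan) {
            // Case 1: full scan enabled, so reuse one scanner per partition.
            return cache.computeIfAbsent(partition, scannerFactory);
        }
        // Case 2: full scan disabled, so every lookup gets its own scanner
        // and no per-request state is shared between threads.
        return scannerFactory.apply(partition);
    }
}
```

The trade-off is that case 2 pays the scanner construction cost on every
lookup, which is acceptable only if partial reads stay small, as assumed above.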
was (Author: zhangyue19921010):
Hi [~manojg] and [~guoyihua]
A quick thought on this problem: for now we cache the logScanner, which holds
the shared `records` instance, at the partition level, whether or not
`enable.full.scan.log.files` is enabled.
If we disable full scan, looking up only the entries of interest runs into the
concurrency issue.
How about a small change here:
1. If full scan is enabled ==> we cache the scanner at the partition level
2. If full scan is disabled ==> we cache the scanner at the slice level, assuming
that the data set obtained by the partial read is not large; this may also avoid
the concurrency issue at its root.
> MergedLogRecordReader inline reading should be stateless and thread safe
> ------------------------------------------------------------------------
>
> Key: HUDI-3301
> URL: https://issues.apache.org/jira/browse/HUDI-3301
> Project: Apache Hudi
> Issue Type: Bug
> Components: metadata
> Reporter: Manoj Govindassamy
> Assignee: Ethan Guo
> Priority: Blocker
> Labels: HUDI-bug
> Fix For: 0.11.0
>
>
> Metadata table inline reading (enable.full.scan.log.files = false) today
> alters instance member fields and is not thread safe.
>
> When inline reading is enabled, HoodieMetadataMergedLogRecordReader
> doesn't do a full read of the log and base files and doesn't fill in the
> ExternalSpillableMap records cache. Each getRecordsByKeys() call therefore
> re-reads the log and base files by design. The issue is that this read
> alters the instance members, while the filled-in records are relevant only
> for that request. Any concurrent getRecordsByKeys() call also modifies the
> same member variable, leading to an NPE.
>
> To avoid this, a temporary fix that makes getRecordsByKeys() a synchronized
> method has been pushed to master. But this fix doesn't solve all use cases. We
> need to make the whole class stateless and thread safe for inline reading.
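
As a rough illustration of the stateless lookup the description asks for, the
sketch below keeps each request's results in a local map instead of an instance
field. `StatelessLogReader` is a hypothetical stand-in, not the real
HoodieMetadataMergedLogRecordReader, and an in-memory map replaces the actual
log and base file reads.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a log record reader; not a Hudi class.
final class StatelessLogReader {
    // Read-only after construction, so it is safe to share across threads.
    private final Map<String, String> logEntries;

    StatelessLogReader(Map<String, String> logEntries) {
        this.logEntries = Collections.unmodifiableMap(new HashMap<>(logEntries));
    }

    // Each call builds and returns its own result map. No instance field is
    // written, so concurrent calls cannot observe each other's partial results
    // and the method needs no synchronization.
    Map<String, String> getRecordsByKeys(List<String> keys) {
        Map<String, String> result = new HashMap<>();
        for (String key : keys) {
            String value = logEntries.get(key);
            if (value != null) {
                result.put(key, value);
            }
        }
        return result;
    }
}
```

Compared with marking the method synchronized, this shape lets lookups run in
parallel, at the cost of allocating a result map per request.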