[
https://issues.apache.org/jira/browse/HDDS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192703#comment-17192703
]
Rakesh Radhakrishnan commented on HDDS-4222:
--------------------------------------------
Hi [~linyiqun], I have moved the lookup discussion to this jira task as this
requires more detailed and focussed discussion. Thanks a lot for the help!
Please refer to [~linyiqun]'s original comment
{quote}Here the directory cache is used for avoid the additional look up
overheads. Latest design of directory cache hasn't been attached but just some
thoughts from me:
Two type mapping cache will be useful I think:
<KeyName, KeyInfo>, like </vol1/buck1/a/b/c/d/file1, KeyInfo>, so that we can
skip the traverse search from dir table to key table.
<DirName, List<KeyInfo>>, this is used for the listStatus scenario, list files
call can be a very expensive call under Ozone fs namespace.
Cache introduced here can speed up the metadata access but also there are two
aspects we need to consider.
{quote}
Yes, the directory cache is most useful during path component traversal. IMHO,
this would be the first candidate to target and would give the largest
performance benefit during path lookups. That doesn't mean that other
entities like files, listings etc. are not important. I believe it depends on
many factors, such as workload, hardware (RAM, NVMe) capabilities, the
proportion of metadata (dirs, files) in the FS namespace, directory hierarchy
etc. To begin with, I am planning to implement a cache framework where OM
provides a facility to plug in different cache entities based on user
requirements. Based on the tradeoffs, users can then add more built-in cache
policies and configure and tune them accordingly.
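To make the pluggable idea concrete, here is a minimal sketch of a directory
cache consulted during top-down path traversal. Everything in it is an
assumption for illustration: CacheEntity, MapCache, DirTable and the
"parentId/name" key layout are hypothetical names, not existing OM classes or
the final table schema.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch; none of these types exist in Ozone Manager. */
final class PathLookupSketch {

  /** A pluggable cache entity; one could be registered per entity type. */
  interface CacheEntity<K, V> {
    Optional<V> get(K key);
    void put(K key, V value);
  }

  /** Simple unbounded map-backed cache, used here for directories only. */
  static final class MapCache<K, V> implements CacheEntity<K, V> {
    private final Map<K, V> map = new HashMap<>();
    public Optional<V> get(K key) { return Optional.ofNullable(map.get(key)); }
    public void put(K key, V value) { map.put(key, value); }
  }

  /** Stand-in for the directory table: "parentId/name" -> objectId. */
  static final class DirTable {
    final Map<String, Long> rows = new HashMap<>();
    int lookups = 0; // counts simulated DB reads
    Long get(String key) { lookups++; return rows.get(key); }
  }

  /** Resolves a path top-down, checking the cache before the table. */
  static long resolve(String path, long rootId, DirTable table,
                      CacheEntity<String, Long> dirCache) {
    long parent = rootId;
    for (String name : path.split("/")) {
      if (name.isEmpty()) {
        continue; // skip the empty component before the leading '/'
      }
      String key = parent + "/" + name;
      Optional<Long> cached = dirCache.get(key);
      if (cached.isPresent()) {
        parent = cached.get();
        continue;
      }
      Long id = table.get(key); // cache miss: one DB lookup
      if (id == null) {
        throw new IllegalArgumentException("No such component: " + name);
      }
      dirCache.put(key, id);
      parent = id;
    }
    return parent;
  }
}
```

With a warm cache, a repeated resolve of the same path needs no table lookups
at all, which is the benefit being targeted for path component traversal.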
{quote}Cache entry eviction policy for this, we cannot cache all the dir/file
entries.
Consistency between dir cache and underlying store. Cache entry will become
stale when db store updated but not synced in corresponding cache entry. The
cache refresh interval time can be introduced here. Only when the cache entry
not updated more than given refresh interval, then we trigger update cache
entry from querying the db table. Users can set different refresh interval time
to ensure the cache freshness based on their scenarios. Also they can disable
this cache by set interval to 0 that means each query will directly access to
db.
Current OM table cache seems not very helpful for dir cache so I came up with
above thoughts.
{quote}
Yes, the OM table cache is not helpful here. I completely agree with you that
the cache eviction policy is very important for keeping the useful entries
within the cache capacity. In the attached document, I proposed an optimized
directory cache (Approach-3) that stores minimal data per entry, so that more
entries fit in the cache and benefit the path component traversal.
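As an illustration of a bounded cache with eviction, the sketch below uses LRU
on top of java.util.LinkedHashMap and keeps only a small key and an object id
per entry. LRU is my assumption here, the attached document may choose a
different policy, and LruDirCache is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU directory cache: "parentId/name" -> objectId. */
final class LruDirCache extends LinkedHashMap<String, Long> {
  private final int capacity;

  LruDirCache(int capacity) {
    // accessOrder = true makes iteration order least-recently-used first
    super(16, 0.75f, true);
    this.capacity = capacity;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
    // Called after each insert; evict once capacity is exceeded
    return size() > capacity;
  }
}
```

This class is not thread-safe, so a real OM cache would need synchronization
or a concurrent cache implementation.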
For the consistency part, this is a very good point and I will take care of it
during the implementation phase. I was thinking of updating the cache on the
write and read paths to avoid an additional cache refresh cycle. But I don't
have concrete thoughts on this yet and need to look into the OM code to do a
deeper analysis.
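A rough sketch of how the two consistency ideas could fit together: the cache
is updated on the write path and filled on the read path, and the refresh
interval from the quoted comment acts as a fallback for entries changed behind
the cache, with 0 disabling the cache. ConsistentDirCache and its methods are
hypothetical, and the plain Map stands in for the DB table.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; the Map "store" stands in for the DB table. */
final class ConsistentDirCache {

  private static final class Entry {
    final long id;
    final long loadedAtMs;
    Entry(long id, long loadedAtMs) {
      this.id = id;
      this.loadedAtMs = loadedAtMs;
    }
  }

  private final Map<String, Long> store;
  private final Map<String, Entry> cache = new HashMap<>();
  private final long refreshIntervalMs; // 0 disables caching
  int storeReads = 0; // counts simulated DB reads

  ConsistentDirCache(Map<String, Long> store, long refreshIntervalMs) {
    this.store = store;
    this.refreshIntervalMs = refreshIntervalMs;
  }

  /** Write path: update the store and the cache together. */
  void put(String key, long id, long nowMs) {
    store.put(key, id);
    if (refreshIntervalMs > 0) {
      cache.put(key, new Entry(id, nowMs));
    }
  }

  /** Read path: serve fresh entries from cache, otherwise reload. */
  Long get(String key, long nowMs) {
    if (refreshIntervalMs > 0) {
      Entry e = cache.get(key);
      if (e != null && nowMs - e.loadedAtMs < refreshIntervalMs) {
        return e.id;
      }
    }
    storeReads++;
    Long id = store.get(key);
    if (id != null && refreshIntervalMs > 0) {
      cache.put(key, new Entry(id, nowMs));
    } else {
      cache.remove(key);
    }
    return id;
  }
}
```

Time is passed in explicitly only to keep the sketch deterministic. Note that
a change made directly to the store can stay invisible for up to one refresh
interval, which is the staleness tradeoff raised in the quoted comment.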
> [OzoneFS optimization] Provide a mechanism for efficient path lookup
> --------------------------------------------------------------------
>
> Key: HDDS-4222
> URL: https://issues.apache.org/jira/browse/HDDS-4222
> Project: Hadoop Distributed Data Store
> Issue Type: New Feature
> Reporter: Rakesh Radhakrishnan
> Assignee: Rakesh Radhakrishnan
> Priority: Major
> Attachments: Ozone FS Optimizations - Efficient Lookup using cache.pdf
>
>
> With the new file-system-like semantics design (HDDS-2939), multiple DB
> lookups are required to traverse the path components in top-down fashion.
> This task is to discuss use cases and proposals to reduce the performance
> penalty during path lookups.