[ 
https://issues.apache.org/jira/browse/HDFS-12624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16209529#comment-16209529
 ] 

Elek, Marton commented on HDFS-12624:
-------------------------------------

I just did a quick test. I have about 5 000 000 entries in the ksm database. 
The full scan without parsing took about 40 seconds.

{code}
long count = this.store.getRangeKVs(null, Integer.MAX_VALUE,
          MetadataKeyFilters.getNormalKeyFilter()).parallelStream().count();
{code}

(Note: currently it's not possible to scan through the database without 
loading every entry into memory, because getRangeKVs returns the whole 
result as a collection. I propose to extend the MetadataStore interface 
with new methods that return a Stream.)
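
A minimal sketch of what such a Stream-returning method could look like. The names StreamingStore, InMemoryStreamingStore and StreamCountSketch are hypothetical and are not part of the existing MetadataStore API; the in-memory TreeMap only stands in for the real leveldb/rocksdb store so that the example runs on its own.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;
import java.util.stream.Stream;

// Hypothetical extension point: expose entries lazily instead of as a List.
interface StreamingStore {
  Stream<Map.Entry<String, byte[]>> entries();
}

// Stand-in for the real store (assumption: keys are strings, sorted).
final class InMemoryStreamingStore implements StreamingStore {
  private final TreeMap<String, byte[]> data = new TreeMap<>();

  void put(String key, String value) {
    data.put(key, value.getBytes(StandardCharsets.UTF_8));
  }

  @Override
  public Stream<Map.Entry<String, byte[]>> entries() {
    return data.entrySet().stream();
  }
}

final class StreamCountSketch {
  // Counts the entries whose key matches the filter, one entry at a time,
  // without collecting the matching entries into a list first.
  static long count(StreamingStore store, Predicate<String> keyFilter) {
    try (Stream<Map.Entry<String, byte[]>> s = store.entries()) {
      return s.filter(e -> keyFilter.test(e.getKey())).count();
    }
  }
}
```

A real implementation would probably wrap the database iterator in the Stream and release it in onClose, which is why the sketch uses try-with-resources on the stream.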

So it is feasible to count all the keys during startup, even with a 
synchronous scan.
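
A rough sketch of the "scan once at startup, then update on create/delete" idea. Everything here is an assumption for illustration: the class name KsmCountersSketch, its methods, and especially the key layout ("/volume", "/volume/bucket", "/volume/bucket/key"), which may not match how KSM really encodes its keys.

```java
import java.util.concurrent.atomic.AtomicLong;

final class KsmCountersSketch {
  enum Kind { VOLUME, BUCKET, KEY }

  private final AtomicLong volumes = new AtomicLong();
  private final AtomicLong buckets = new AtomicLong();
  private final AtomicLong keys = new AtomicLong();

  // Assumed layout: one path segment is a volume, two a bucket, more a key.
  // The db key is expected to start with "/".
  static Kind classify(String dbKey) {
    String[] parts = dbKey.substring(1).split("/", 3);
    if (parts.length == 1) {
      return Kind.VOLUME;
    }
    return parts.length == 2 ? Kind.BUCKET : Kind.KEY;
  }

  private AtomicLong counter(Kind kind) {
    switch (kind) {
      case VOLUME: return volumes;
      case BUCKET: return buckets;
      default: return keys;
    }
  }

  // One-time scan at startup: classify every db key and count it.
  void initFromScan(Iterable<String> dbKeys) {
    for (String dbKey : dbKeys) {
      counter(classify(dbKey)).incrementAndGet();
    }
  }

  // Called from the create/delete code paths after the startup scan.
  void onCreate(Kind kind) { counter(kind).incrementAndGet(); }
  void onDelete(Kind kind) { counter(kind).decrementAndGet(); }

  long get(Kind kind) { return counter(kind).get(); }
}
```

If the initial scan were skipped, the same counters would simply start at zero and only reflect changes since the last restart.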



> Ozone: number of keys/values/buckets to KSMMetrics
> --------------------------------------------------
>
>                 Key: HDFS-12624
>                 URL: https://issues.apache.org/jira/browse/HDFS-12624
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ozone
>    Affects Versions: HDFS-7240
>            Reporter: Elek, Marton
>            Assignee: Elek, Marton
>
> During my last Ozone test on a 100-node cluster I found it hard to track 
> how many keys/volumes/buckets I have.
> I opened this jira to start a discussion about extending the KSM metrics 
> (but let me know if this is already planned somewhere else) by adding the 
> number of keys/volumes/buckets to the metrics interface.
> These counters could be exposed somewhere else (for example as a client 
> call), but I think these are important numbers and worth monitoring.
> I see multiple ways to achieve it:
> 1. Extend the `org.apache.hadoop.utils.MetadataStore` class with an 
> additional count() method. As far as I know there is no easy way to 
> implement it with leveldb, but with rocksdb it is possible to get the 
> _estimated_ number of keys.
> On the other hand, KSM stores volumes/buckets/keys in the same db, so we 
> can't use it without splitting ksm.db into separate dbs.
> 2. Create a background task to iterate over all the keys and count the ozone 
> key/volume/bucket numbers:
> pro: it would be independent of the existing program flow
> con: it doesn't provide up-to-date information
> con: it uses more resources, since it scans the whole db frequently
> 3. During startup we can iterate over the whole ksm.db and compute the 
> current counts, and later update the numbers on each create/delete call. 
> This uses additional resources during startup (we should check how long it 
> takes to scan a db with millions of keys) but after that it would be fast. 
> We could also introduce a new configuration variable to skip the initial 
> scan. In that case the numbers would only cover changes since the last 
> restart, but the startup would be fast.
> I suggest using the 3rd approach; could you please share your opinion?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
