[389-devel] Re: Trying to understand entryrdn.db
2017-08-04 16:03 GMT+03:00 Ludwig Krispenz:

> On 08/04/2017 02:08 PM, Ilias Stamatis wrote:
>
>> Okay, now that I have read and understood dbscan's code, I have a few more
>> questions.
>>
>> 2017-08-03 10:10 GMT+03:00 Ludwig Krispenz:
>>
>>> Hi, now that I know the context, here are some more comments.
>>> If the purpose is to create a useful LDIF file, which could eventually be
>>> used for import, then formatting an entry correctly is not enough. The order
>>> of entries matters: parents need to come before children. We already handle
>>> this in db2ldif or replication total update.
>>> That said, whenever you write an entry you have always seen the parent
>>> and could stack the DN with the parentid and create the DN without using
>>> the entryrdn index.
>>> You do not even need to keep track of all the entry RDNs/DNs; only the ones
>>> with children will be needed later, and the presence of "numsubordinates"
>>> identifies a parent.
>>
>> Is it guaranteed that parents are going to appear before children in
>> id2entry.db?
>
> No. That's what I said before: it is possible that parentid > entryid. It
> happens if an entry is moved by modrdn to another subtree.

Ooh, you're right. I got confused, sorry.

I'm also having a hard time finding where this functionality is implemented in db2ldif. :/

If I tried to do it "from scratch", I think we go back to this (because we need to grab something that is located after where the cursor is currently pointing):

On 08/02/2017 09:12 PM, Mark Reynolds wrote:

> I have not looked closely into it, so it might not be necessary to use
> entryrdn. I thought it might be more efficient to use it. If you just use
> id2entry, you have to keep scanning it over and over, starting over
> every time you need to read the next entry. Maybe not though, maybe you
> can just "search" it and not have to scan it sequentially when trying to
> find parents and entries. I'll leave that up to you to find out ;-)

BDB has this method: https://docs.oracle.com/cd/E17275_01/html/api_reference/C/dbget.html

It allows you to retrieve a key/data pair directly, without a need for iterating over cursor->c_get(cursor, , , DB_NEXT). The thing is that I don't know how it is implemented. Does it scan the DB sequentially, or is it faster than that (I hope and guess it's the latter)? If it's not that efficient, maybe it does make sense to use entryrdn after all?

___
389-devel mailing list -- 389-devel@lists.fedoraproject.org
To unsubscribe send an email to 389-devel-le...@lists.fedoraproject.org
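On the question of whether a direct key lookup scans sequentially: for a database using the Berkeley DB Btree access method, a keyed get descends the tree rather than walking every record. The sketch below is not BDB code. It assumes id2entry is a B-tree keyed by entry ID (an assumption here, not confirmed in this thread) and uses a binary search over a sorted Python list as a stand-in for the tree descent, only to show the difference in the number of steps. The function names are hypothetical.

```python
def scan_lookup(keys, wanted):
    """Sequential scan, like stepping a cursor record by record.

    Returns the number of records visited, or None if the key is absent.
    """
    for steps, key in enumerate(keys, start=1):
        if key == wanted:
            return steps
    return None


def tree_lookup(keys, wanted):
    """Binary search over sorted keys, standing in for a B-tree descent.

    Returns the number of probes made, or None if the key is absent.
    """
    lo, hi, probes = 0, len(keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if keys[mid] == wanted:
            return probes
        if keys[mid] < wanted:
            lo = mid + 1
        else:
            hi = mid
    return None


# One million entry IDs, looking up the last one.
keys = list(range(1, 1_000_001))
scan_steps = scan_lookup(keys, 1_000_000)
tree_steps = tree_lookup(keys, 1_000_000)
print(scan_steps, tree_steps)
```

The scan visits every record before the target, while the search needs a number of probes that grows with the logarithm of the database size. A real B-tree has a much higher fan-out than 2, so it typically touches only a handful of pages.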
[389-devel] Re: Trying to understand entryrdn.db
On 08/04/2017 02:08 PM, Ilias Stamatis wrote:

> Okay, now that I have read and understood dbscan's code, I have a few more
> questions.
>
> 2017-08-03 10:10 GMT+03:00 Ludwig Krispenz:
>
>> Hi, now that I know the context, here are some more comments.
>> If the purpose is to create a useful LDIF file, which could eventually be
>> used for import, then formatting an entry correctly is not enough. The order
>> of entries matters: parents need to come before children. We already handle
>> this in db2ldif or replication total update.
>> That said, whenever you write an entry you have always seen the parent
>> and could stack the DN with the parentid and create the DN without using
>> the entryrdn index.
>> You do not even need to keep track of all the entry RDNs/DNs; only the ones
>> with children will be needed later, and the presence of "numsubordinates"
>> identifies a parent.
>
> Is it guaranteed that parents are going to appear before children in
> id2entry.db?

No. That's what I said before: it is possible that parentid > entryid. It happens if an entry is moved by modrdn to another subtree.

> If so, here's what could probably work:
>
> - Start reading entries from id2entry sequentially.
> - For each entry, if it has a numSubordinates attribute, it is a parent of
>   other entries, so we can store its ID-DN pair in a hash map.
> - For entries that have a parentid, where we need to figure out the parent's
>   DN, we just look up hashmap[parentid].
>
> To make it even more efficient (only if really needed, because it will make
> things more complicated), we can somehow store the value of numSubordinates
> with each parent in the map. Every time a parentid is looked up in the map,
> we decrease the value of numSubordinates by 1. When it becomes 0, there are
> no more children of this ID, so we can safely remove it from the map.
>
> However, I don't know if we would really need this last thing. In a 100
> million entry db, approximately how many parents would we expect to have?
>
> Also, do we have a hash map implemented somewhere?
>
> If parents are not guaranteed to appear before children in id2entry.db, then
> we would have to alter the above strategy.
>
> Thanks!

--
Red Hat GmbH, http://www.de.redhat.com/, Registered seat: Grasbrunn, Commercial register: Amtsgericht Muenchen, HRB 153243, Managing Directors: Charles Cachera, Michael Cunningham, Michael O'Neill, Eric Shander
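The hash-map proposal quoted above can be sketched roughly as follows. This is a minimal illustration in Python, not code from 389-ds-base: entries are modelled as plain dicts, and the field names ("id", "rdn", "parentid", "numsubordinates") and the function name build_dns are hypothetical stand-ins for whatever the real id2entry reader returns. It deliberately keeps the proposal's assumption that every parent is read before its children, which the reply above says does not always hold.

```python
def build_dns(entries):
    """Return ([(entry_id, dn), ...], leftover_parent_map).

    Assumes, as the proposal does, that each parent is read before its
    children; a KeyError signals that this assumption was violated.
    """
    parents = {}  # entry ID -> [dn, children still expected]
    result = []
    for entry in entries:
        parent_id = entry.get("parentid")
        if parent_id is None:
            dn = entry["rdn"]  # suffix entry: its RDN is already the full DN
        else:
            slot = parents[parent_id]
            dn = entry["rdn"] + "," + slot[0]
            slot[1] -= 1
            if slot[1] == 0:
                del parents[parent_id]  # last child seen, free the memory
        subordinates = entry.get("numsubordinates", 0)
        if subordinates > 0:
            # Only entries with children need to be remembered.
            parents[entry["id"]] = [dn, subordinates]
        result.append((entry["id"], dn))
    return result, parents


entries = [
    {"id": 1, "rdn": "dc=example,dc=com", "numsubordinates": 1},
    {"id": 3, "rdn": "ou=Groups", "parentid": 1, "numsubordinates": 2},
    {"id": 6, "rdn": "cn=Accounting Managers", "parentid": 3},
    {"id": 7, "rdn": "cn=HR Managers", "parentid": 3},
]
dns, leftover = build_dns(entries)
for entry_id, dn in dns:
    print(entry_id, dn)
```

With the countdown in place, the map only ever holds parents whose children have not all been seen yet, so its size depends on the shape of the tree rather than on the total number of entries. Whether numsubordinates always matches the number of children that will actually be read is something to verify against real data.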
[389-devel] Re: Trying to understand entryrdn.db
Ok, thanks for the update.

> On Aug 4, 2017 at 08:08, stamatis.ili...@gmail.com wrote:
>
> Okay, now that I have read and understood dbscan's code, I have a few more
> questions.
>
> 2017-08-03 10:10 GMT+03:00 Ludwig Krispenz (lkris...@redhat.com):
>
>> Hi, now that I know the context, here are some more comments.
>> If the purpose is to create a useful LDIF file, which could eventually be
>> used for import, then formatting an entry correctly is not enough. The order
>> of entries matters: parents need to come before children. We already handle
>> this in db2ldif or replication total update.
>> That said, whenever you write an entry you have always seen the parent and
>> could stack the DN with the parentid and create the DN without using the
>> entryrdn index.
>> You do not even need to keep track of all the entry RDNs/DNs; only the ones
>> with children will be needed later, and the presence of "numsubordinates"
>> identifies a parent.
>
> Is it guaranteed that parents are going to appear before children in
> id2entry.db?
>
> If so, here's what could probably work:
>
> - Start reading entries from id2entry sequentially.
> - For each entry, if it has a numSubordinates attribute, it is a parent of
>   other entries, so we can store its ID-DN pair in a hash map.
> - For entries that have a parentid, where we need to figure out the parent's
>   DN, we just look up hashmap[parentid].
>
> To make it even more efficient (only if really needed, because it will make
> things more complicated), we can somehow store the value of numSubordinates
> with each parent in the map. Every time a parentid is looked up in the map,
> we decrease the value of numSubordinates by 1. When it becomes 0, there are
> no more children of this ID, so we can safely remove it from the map.
>
> However, I don't know if we would really need this last thing. In a 100
> million entry db, approximately how many parents would we expect to have?
>
> Also, do we have a hash map implemented somewhere?
>
> If parents are not guaranteed to appear before children in id2entry.db, then
> we would have to alter the above strategy.
>
> Thanks!
[389-devel] Re: Trying to understand entryrdn.db
Let's discuss more on it.

> On Aug 3, 2017 at 07:33, lkris...@redhat.com wrote:
>
> On 08/03/2017 12:24 PM, Ilias Stamatis wrote:
>
>>> That said, whenever you write an entry you have always seen the parent
>>> and could stack the DN with the parentid and create the DN without using
>>> the entryrdn index.
>>> You do not even need to keep track of all the entry RDNs/DNs; only the
>>> ones with children will be needed later, and the presence of
>>> "numsubordinates" identifies a parent.
>>
>> Interesting. I think I now understand better how to approach this problem.
>
> Great. Just one more hint: if you iterate over the entries in id2entry, you
> have the entryid and the parentid of each entry. If parentid > entryid, you
> need to get and export the parent first (and track that you did it already).
>
>> I'll get back to it soon.
>>
>> Thanks so much!
>>
>>> Last but not least, since I think dbscan is broken for entryrdn,
>>> investigating and fixing this would also be nice
>>
>> Sure. I'll open a ticket so it gets tracked.
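The hint quoted above (export the parent first when parentid > entryid, and remember that it was exported) could look roughly like this. It is a sketch under stated assumptions, not how db2ldif does it: entries_by_id is a hypothetical dict standing in for a direct lookup by entry ID in id2entry, the "parentid" field name is illustrative, and the data is assumed to contain no parent cycles.

```python
def parent_first_order(entries_by_id):
    """Return entry IDs in an order where each parent precedes its children."""
    exported = set()  # IDs already written out
    order = []
    for entry_id in sorted(entries_by_id):  # sequential pass in ID order
        # Climb towards the suffix, collecting ancestors not yet exported.
        chain = []
        current = entry_id
        while current is not None and current not in exported:
            chain.append(current)
            current = entries_by_id[current].get("parentid")
        # Export from the topmost missing ancestor down to the entry itself.
        for missing_id in reversed(chain):
            exported.add(missing_id)
            order.append(missing_id)
    return order


# Entry 4 was moved by modrdn under entry 9, so its parentid > entryid.
entries_by_id = {
    1: {"rdn": "dc=example,dc=com"},
    4: {"rdn": "cn=moved", "parentid": 9},
    9: {"rdn": "ou=New Home", "parentid": 1},
}
order = parent_first_order(entries_by_id)
print(order)
```

The exported set is the "track that you did it already" part: when the sequential pass later reaches entry 9, it is skipped because it was already written ahead of its child.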
[389-devel] Re: Trying to understand entryrdn.db
> That said, whenever you write an entry you have always seen the parent and
> could stack the DN with the parentid and create the DN without using the
> entryrdn index.
> You do not even need to keep track of all the entry RDNs/DNs; only the ones
> with children will be needed later, and the presence of "numsubordinates"
> identifies a parent.

Interesting. I think I now understand better how to approach this problem.

I'll get back to it soon.

Thanks so much!

> Last but not least, since I think dbscan is broken for entryrdn,
> investigating and fixing this would also be nice

Sure. I'll open a ticket so it gets tracked.
[389-devel] Re: Trying to understand entryrdn.db
On 08/02/2017 09:12 PM, Mark Reynolds wrote:

> On 08/02/2017 02:19 PM, Ilias Stamatis wrote:
>
>> I see now, thank you both very much!
>>
>> Follow-up:
>>
>>> [1] Get entry from id2entry and use its ID
>>> [2] Look in entryrdn for the parent of the ID
>>> [3] Keep looking for parents, building the DN as you go along
>>>
>>> Example:
>>>
>>> [1] Get entry from id2entry: ID 6 --> "cn=Accounting Managers"
>>> [2] Check entryrdn for "P". In this case it's "P6" which is "ou=Groups"
>>>     with ID 3
>>> [3] So find "P3", which is "dc=example,dc=com" with ID 1, and look for
>>>     "P1". But there is no P1, so we stop the process/loop.
>>
>> Why do we need to look at entryrdn for the parent's id? Is it faster?
>
> I have not looked closely into it, so it might not be necessary to use
> entryrdn. I thought it might be more efficient to use it. If you just use
> id2entry, you have to keep scanning it over and over, starting over
> every time you need to read the next entry. Maybe not though, maybe you
> can just "search" it and not have to scan it sequentially when trying to
> find parents and entries. I'll leave that up to you to find out ;-)

Hi, now that I know the context, here are some more comments.

If the purpose is to create a useful LDIF file, which could eventually be used for import, then formatting an entry correctly is not enough. The order of entries matters: parents need to come before children. We already handle this in db2ldif or replication total update.

That said, whenever you write an entry you have always seen the parent and could stack the DN with the parentid and create the DN without using the entryrdn index.

You do not even need to keep track of all the entry RDNs/DNs; only the ones with children will be needed later, and the presence of "numsubordinates" identifies a parent.

Last but not least, since I think dbscan is broken for entryrdn, investigating and fixing this would also be nice.

>> I mean the same information can be found in id2entry (?). Or is this not
>> the case, and dbscan does the exact same process you just described in
>> order to print "parentid: X" for each entry when you do
>> "dbscan -f id2entry.db"?
>>
>> Thanks again,
[389-devel] Re: Trying to understand entryrdn.db
On 08/02/2017 11:49 AM, Ilias Stamatis wrote:

> Hello,
>
> I would like some help in order to understand entryrdn.db. When I do
> "dbscan -f entryrdn.db" I get something like:
>
> 3
>   ID: 3; RDN: "ou=Groups"; NRDN: "ou=groups"
>
> C3
>   ID: 6; RDN: "cn=Accounting Managers"; NRDN: "cn=accounting managers"
>
> P6
>   ID: 3; RDN: "ou=Groups"; NRDN: "ou=groups"
>
> I understand that 3 is this entry's ID, C3 means child of entry 3 and
> P6 means parent of entry 6.
>
> What I don't understand however is why those entries are repeated
> again and again. For example " ID: 7; RDN: "cn=HR Managers"; NRDN:
> "cn=hr managers" is repeated about a dozen times in my entryrdn.
> And I don't mean as a parent, child, or whatever. It is repeated
> lots of times as ID 7 for example (but also many times as C3, etc.).
>
> I attach the complete output of what I get when I run "dbscan -f
> entryrdn.db", in order to demonstrate what I mean (my db contains
> almost default entries only).
>
> So my question is: how is this database filled?

Since I know that you are trying to work on ticket https://pagure.io/389-ds-base/issue/47567, here is the short answer...

This is how you use id2entry and entryrdn together to achieve the desired result...

id2entry:

id 1
    rdn: dc=example,dc=com
    objectClass: top
    ...

id 3
    rdn: ou=Groups
    ...
    ...
    parentid: 1

id 6
    rdn: cn=Accounting Managers
    ...
    ...
    parentid: 3 > points to ID 3 ("ou=Groups"), then "groups" parent points to ID 1 ("dc=example,dc=com"). Final result "cn=Accounting Managers, ou=Groups, dc=example,dc=com"

For the purposes of the ticket listed above you need to recreate the full DN of each entry found in id2entry and then print it in LDIF format. So you grab an entry from id2entry, find its parent id, then you recursively keep looking at each parent in entryrdn, building up the DN as you go, until there is no parent id found.

[1] Get entry from id2entry and use its ID
[2] Look in entryrdn for the parent of the ID
[3] Keep looking for parents, building the DN as you go along

Example:

[1] Get entry from id2entry: ID 6 --> "cn=Accounting Managers"
[2] Check entryrdn for "P". In this case it's "P6" which is "ou=Groups" with ID 3
[3] So find "P3", which is "dc=example,dc=com" with ID 1, and look for "P1". But there is no P1, so we stop the process/loop.

Final result "cn=Accounting Managers, ou=Groups, dc=example,dc=com"

We just want it to be efficient, and not use a lot of memory. This needs to work on a 100 million entry db without consuming a lot of resources.

I hope that helps.

Mark

> Thank you very much,
> Ilias
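Steps [1] to [3] above can be sketched as a small loop. This is an illustration only: the entryrdn index is modelled as a Python dict that mirrors the "P<id>" records shown in the dbscan output, and the names ENTRYRDN and build_dn are hypothetical. A real implementation would read these records from entryrdn.db instead.

```python
# Hypothetical in-memory model of the parent records in entryrdn:
# key "P<id>" -> (parent_id, parent_rdn), as in the dbscan output above.
ENTRYRDN = {
    "P3": (1, "dc=example,dc=com"),
    "P6": (3, "ou=Groups"),
}


def build_dn(entry_id, rdn, entryrdn):
    """Build the full DN by following parent records until none is found."""
    parts = [rdn]
    current = entry_id
    while True:
        parent = entryrdn.get("P%d" % current)
        if parent is None:
            break  # no "P<id>" record: we reached the suffix
        current, parent_rdn = parent
        parts.append(parent_rdn)
    return ",".join(parts)


dn = build_dn(6, "cn=Accounting Managers", ENTRYRDN)
print(dn)
```

For entry 6 this follows P6 to "ou=Groups" (ID 3), then P3 to "dc=example,dc=com" (ID 1), and stops because there is no P1, which matches the worked example in the message.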
[389-devel] Re: Trying to understand entryrdn.db
Hi,

I think this is a problem with dbscan, which tries to pretty-print the entryrdn index and seems to loop a bit. If you do db_dump -d a entryrdn.db you get the raw contents of the file, and you get far fewer records.

Ludwig

On 08/02/2017 05:49 PM, Ilias Stamatis wrote:

> Hello,
>
> I would like some help in order to understand entryrdn.db. When I do
> "dbscan -f entryrdn.db" I get something like:
>
> 3
>   ID: 3; RDN: "ou=Groups"; NRDN: "ou=groups"
>
> C3
>   ID: 6; RDN: "cn=Accounting Managers"; NRDN: "cn=accounting managers"
>
> P6
>   ID: 3; RDN: "ou=Groups"; NRDN: "ou=groups"
>
> I understand that 3 is this entry's ID, C3 means child of entry 3 and
> P6 means parent of entry 6.
>
> What I don't understand however is why those entries are repeated
> again and again. For example " ID: 7; RDN: "cn=HR Managers"; NRDN:
> "cn=hr managers" is repeated about a dozen times in my entryrdn.
> And I don't mean as a parent, child, or whatever. It is repeated
> lots of times as ID 7 for example (but also many times as C3, etc.).
>
> I attach the complete output of what I get when I run "dbscan -f
> entryrdn.db", in order to demonstrate what I mean (my db contains
> almost default entries only).
>
> So my question is: how is this database filled?
>
> Thank you very much,
> Ilias