2017-08-17 3:55 GMT+03:00 William Brown <wibr...@redhat.com>: > On Tue, 2017-08-15 at 22:03 +0300, Ilias Stamatis wrote: > > 2017-08-15 9:56 GMT+03:00 William Brown <wibr...@redhat.com>: > > > > > On Fri, 2017-08-11 at 17:49 +0300, Ilias Stamatis wrote: > > > > Hi everybody, > > > > > > > > Following Ludwig's and Mark's suggestions on how to perform a > database > > > dump > > > > in LDIF format from dbscan, I have come up with a strategy. I'm > talking > > > > about ticket #47567: https://pagure.io/389-ds-base/issue/47567 > > > > > > > > I'd like to discuss this strategy and get some feedback. > > > > > > > > The general idea is: > > > > > > > > - We are cursing though id2entry.db printing entries in order. > > > > - Parents should be printed before children. > > > > - Hence if for some entry parenitd > entryid, we have to print its > parent > > > > first (out of order) and track that we did so. > > > > - In order to do the above, we don't need to move the db cursor. We > can > > > > just go fetch something from a random place in the db and the cursor > will > > > > remain in its place, so we can continue from where we left off after > > > we're > > > > done with printing a parent. > > > > - We also need to construct and print the DN of each entry using its > RDN > > > > and the DN of its father. > > > > - Let's use something like a hash map to pair entryids with dns > (only for > > > > entries that has children), e.g. map[1] = "dc=example,dc=com", > map[4] = > > > > "ou=People,dc=example,dc=com", etc. > > > > > > > > I'll present the algorithm that I came up with in python-like > > > pseudo-code. > > > > > > > > First, the following function constructs the entry's DN and updates > the > > > > hash map if needed. We can know whether an entry is a parent or not, > by > > > the > > > > presence of the numSubordinates attribute. > > > > > > > > # assumes that parentid already exists in the map > > > > function display_entry(e, map): > > > > if not e.parentid: > > > > e.dn = e.rdn > > > > else: > > > > e.dn = e.rdn + map[e.parentid] > > > > if isparent(e): > > > > map[e.entryid] = e.dn > > > > print_to_ldif_format(e) > > > > > > > > Then, the main loop: > > > > > > > > map = new(hashmap) > > > > printed_in_advance = [] > > > > > > > > for e in entries: > > > > if e.entryid in printed_in_advance: > > > > continue # skip this, we have already displayed it > > > > > > > > if e.parentid < e.entryid: > > > > display_entry(e, map) > > > > > > > > else: > > > > # we need to display parents before children > > > > > > > > list_of_parents = [] > > > > > > > > p = e > > > > while p.parentid > p.entryid: > > > > p = get_from_db(key = p.parentid) > > > > list_of_parents.append(p) # see note below (*) > > > > > > > > for p in reversed(list_of_parents): > > > > display_entry(p, map) > > > > printed_in_advance.append(p.entryid) > > > > > > > > > > > > * here we store the whole entry in the list (aka its data) and not > just > > > > its id, in order to avoid fetching it from the database again > > > > > > > > As a map, we can use a B+ tree implementation from libsds. > > > > > > > > I would like to know how the above idea sounds to you before going > ahead > > > > and implementing it. > > > > > > > > > > Looks like a pretty good plan to me. > > > > > > What happens if you have this situation? > > > > > > rdn: a > > > entryid: 1 > > > parentid: 2 > > > > > > rdn: b > > > entryid: 2 > > > parentid: 3 > > > > > > rdn: c > > > entryid: 3 > > > parentid: 4 > > > > > > etc. > > > > > > Imagine we have to continue to work up ids to get to the parent. Can > > > your algo handle this case? > > > > > > > Unless I'm mistaken, yes. Provided of course that there is a root entry > > which has no parentid. > > > > This is the part that handles this: > > > > p = e > > while p.parentid > p.entryid: > > p = get_from_db(key = p.parentid) > > list_of_parents.append(p) > > > > We may jump at a few random points in the db for this and we will have to > > keep just a few entries in memory. But I think this number of entries is > > always going to be very small e.g. 1-2 or maybe even 7-8 in more > "extreme" > > cases, but never more than let's say 15. I might as well be completely > > wrong about this, so, please correct me if that's the case. > > > > > > > It seems like the way you could approach this is to sort the id order > > > you need to display them in *before* you start parsing entries. We can > > > file allids in memory anyway, because it's just a set of 32bit ints, so > > > even at 50,000,000 entries, this only takes 190MB of ram to fit allids. > > > So I would approach this as an exercise for a comparison function to > the > > > set. > > > > > > universal_entry_set = [....] # entryrdn / id2entry > > > # The list of entries to print > > > print_order = [] > > > > > > for e in universal_entry_set: > > > printer_order_insert_sorted(e) > > > > > > So we can imagine that this would invoke a sorting function. Something > > > that would work is: > > > > > > compare(e1, e2): > > > # if e1 parent is 0, return -1 > > > # if e2 parent is 0, return 1 > > > # If e1 is parent to e2, return -1 > > > # If e2 is parent to e1, return 1 > > > # return compare of eid. > > > > > > Then this would create a list in order of parent relationships, and you > > > can just do: > > > > > > for e in print_order: > > > > > > Which despite jumping about in the cursor, will print in order. > > > > > > So you'll need to look at entryrdn once, and then id2entry once for > > > this. > > > > > > If you want to do this without entryrdn, you'll need to pass id2entry > > > twice. But You'll only need 1 entry in memory at a time to achieve it I > > > think. It also doesn't matter about order of ids at all here. > > > > > > > Hmm, I just didn't understand what's wrong with what I proposed > previously. > > As you said with the technique you just described we either have to pass > > both entryrdn and id2entry once, or pass id2entry twice in any case. With > > what I described previously we have to pass id2entry only once (and > > probably for a few entries we will have to skip them if we come across > them > > a second time). > > > > Could you explain why do you think the first one would be better / more > > efficient? > > > > I think your algo doesn't handle the case where you have *multiple* outo > of order parents. Even though you have to do two passes, the way I > proposed will guaranteed correct display regardless of id number because > they are solely sorted on parent relationships. >
I insist that it does. Please, allow me to use an example case to illustrate this. > > I think your one will get to > > > > while p.parentid > p.entryid: > > > > p = get_from_db(key = p.parentid) > > > > list_of_parents.append(p) # see note below (*) > > > > > > > > for p in reversed(list_of_parents): > > > > display_entry(p, map) > > > > printed_in_advance.append(p.entryid) > > ^ Here, what if the parent's parent is not yet printed? display_entry > would need to be recursive to handle this (and it appears to be > iterative). > Say you have the following case: id 1 rdn: A entry: 1 id 2 rdn: B parentid: 1 id 3 rdn: C parentid: 4 id 4 rdn: D parentid: 5 id 5 rdn: E parentid: 1 id 6 rdn: F parentid: 1 The algorithm will first print the first 2 entries without entering the else clause of the loop because we don't have out of order entries. Then when it comes to entry C it will enter the else clause. The while loop you quoted will create this list: list_of_parents = [C, D, E] Then the next for loop calls display_entry(), but after *reversing* the list. So it will call display_entry() on [E, D, C]. E can be displayed since it's parent has already been displayd. Then D can be displayed since E has already been displayed, etc. We have also added entries D and E in the printed_in_advance list. We now exit the else clause and we can continue with the outer for loop. We have stopped in entry C so now we examine entry D. We see that we have already printed that (because it's in printed_in_advance list) so we skip this entry. We also skip entry E for the same reason. Then we reach entry F and we call display_entry to it since it's not out of order. Does it make sense? > > Does that help? > > -- > Sincerely, > > William Brown > Software Engineer > Red Hat, Australia/Brisbane > > > _______________________________________________ > 389-devel mailing list -- 389-devel@lists.fedoraproject.org > To unsubscribe send an email to 389-devel-le...@lists.fedoraproject.org > > 2017-08-17 3:55 GMT+03:00 William Brown <wibr...@redhat.com>: > On Tue, 2017-08-15 at 22:03 +0300, Ilias Stamatis wrote: > > 2017-08-15 9:56 GMT+03:00 William Brown <wibr...@redhat.com>: > > > > > On Fri, 2017-08-11 at 17:49 +0300, Ilias Stamatis wrote: > > > > Hi everybody, > > > > > > > > Following Ludwig's and Mark's suggestions on how to perform a > database > > > dump > > > > in LDIF format from dbscan, I have come up with a strategy. I'm > talking > > > > about ticket #47567: https://pagure.io/389-ds-base/issue/47567 > > > > > > > > I'd like to discuss this strategy and get some feedback. > > > > > > > > The general idea is: > > > > > > > > - We are cursing though id2entry.db printing entries in order. > > > > - Parents should be printed before children. > > > > - Hence if for some entry parenitd > entryid, we have to print its > parent > > > > first (out of order) and track that we did so. > > > > - In order to do the above, we don't need to move the db cursor. We > can > > > > just go fetch something from a random place in the db and the cursor > will > > > > remain in its place, so we can continue from where we left off after > > > we're > > > > done with printing a parent. > > > > - We also need to construct and print the DN of each entry using its > RDN > > > > and the DN of its father. > > > > - Let's use something like a hash map to pair entryids with dns > (only for > > > > entries that has children), e.g. map[1] = "dc=example,dc=com", > map[4] = > > > > "ou=People,dc=example,dc=com", etc. > > > > > > > > I'll present the algorithm that I came up with in python-like > > > pseudo-code. > > > > > > > > First, the following function constructs the entry's DN and updates > the > > > > hash map if needed. We can know whether an entry is a parent or not, > by > > > the > > > > presence of the numSubordinates attribute. > > > > > > > > # assumes that parentid already exists in the map > > > > function display_entry(e, map): > > > > if not e.parentid: > > > > e.dn = e.rdn > > > > else: > > > > e.dn = e.rdn + map[e.parentid] > > > > if isparent(e): > > > > map[e.entryid] = e.dn > > > > print_to_ldif_format(e) > > > > > > > > Then, the main loop: > > > > > > > > map = new(hashmap) > > > > printed_in_advance = [] > > > > > > > > for e in entries: > > > > if e.entryid in printed_in_advance: > > > > continue # skip this, we have already displayed it > > > > > > > > if e.parentid < e.entryid: > > > > display_entry(e, map) > > > > > > > > else: > > > > # we need to display parents before children > > > > > > > > list_of_parents = [] > > > > > > > > p = e > > > > while p.parentid > p.entryid: > > > > p = get_from_db(key = p.parentid) > > > > list_of_parents.append(p) # see note below (*) > > > > > > > > for p in reversed(list_of_parents): > > > > display_entry(p, map) > > > > printed_in_advance.append(p.entryid) > > > > > > > > > > > > * here we store the whole entry in the list (aka its data) and not > just > > > > its id, in order to avoid fetching it from the database again > > > > > > > > As a map, we can use a B+ tree implementation from libsds. > > > > > > > > I would like to know how the above idea sounds to you before going > ahead > > > > and implementing it. > > > > > > > > > > Looks like a pretty good plan to me. > > > > > > What happens if you have this situation? > > > > > > rdn: a > > > entryid: 1 > > > parentid: 2 > > > > > > rdn: b > > > entryid: 2 > > > parentid: 3 > > > > > > rdn: c > > > entryid: 3 > > > parentid: 4 > > > > > > etc. > > > > > > Imagine we have to continue to work up ids to get to the parent. Can > > > your algo handle this case? > > > > > > > Unless I'm mistaken, yes. Provided of course that there is a root entry > > which has no parentid. > > > > This is the part that handles this: > > > > p = e > > while p.parentid > p.entryid: > > p = get_from_db(key = p.parentid) > > list_of_parents.append(p) > > > > We may jump at a few random points in the db for this and we will have to > > keep just a few entries in memory. But I think this number of entries is > > always going to be very small e.g. 1-2 or maybe even 7-8 in more > "extreme" > > cases, but never more than let's say 15. I might as well be completely > > wrong about this, so, please correct me if that's the case. > > > > > > > It seems like the way you could approach this is to sort the id order > > > you need to display them in *before* you start parsing entries. We can > > > file allids in memory anyway, because it's just a set of 32bit ints, so > > > even at 50,000,000 entries, this only takes 190MB of ram to fit allids. > > > So I would approach this as an exercise for a comparison function to > the > > > set. > > > > > > universal_entry_set = [....] # entryrdn / id2entry > > > # The list of entries to print > > > print_order = [] > > > > > > for e in universal_entry_set: > > > printer_order_insert_sorted(e) > > > > > > So we can imagine that this would invoke a sorting function. Something > > > that would work is: > > > > > > compare(e1, e2): > > > # if e1 parent is 0, return -1 > > > # if e2 parent is 0, return 1 > > > # If e1 is parent to e2, return -1 > > > # If e2 is parent to e1, return 1 > > > # return compare of eid. > > > > > > Then this would create a list in order of parent relationships, and you > > > can just do: > > > > > > for e in print_order: > > > > > > Which despite jumping about in the cursor, will print in order. > > > > > > So you'll need to look at entryrdn once, and then id2entry once for > > > this. > > > > > > If you want to do this without entryrdn, you'll need to pass id2entry > > > twice. But You'll only need 1 entry in memory at a time to achieve it I > > > think. It also doesn't matter about order of ids at all here. > > > > > > > Hmm, I just didn't understand what's wrong with what I proposed > previously. > > As you said with the technique you just described we either have to pass > > both entryrdn and id2entry once, or pass id2entry twice in any case. With > > what I described previously we have to pass id2entry only once (and > > probably for a few entries we will have to skip them if we come across > them > > a second time). > > > > Could you explain why do you think the first one would be better / more > > efficient? > > > > I think your algo doesn't handle the case where you have *multiple* outo > of order parents. Even though you have to do two passes, the way I > proposed will guaranteed correct display regardless of id number because > they are solely sorted on parent relationships. > > I think your one will get to > > > > while p.parentid > p.entryid: > > > > p = get_from_db(key = p.parentid) > > > > list_of_parents.append(p) # see note below (*) > > > > > > > > for p in reversed(list_of_parents): > > > > display_entry(p, map) > > > > printed_in_advance.append(p.entryid) > > ^ Here, what if the parent's parent is not yet printed? display_entry > would need to be recursive to handle this (and it appears to be > iterative). > > Does that help? > > -- > Sincerely, > > William Brown > Software Engineer > Red Hat, Australia/Brisbane > > > _______________________________________________ > 389-devel mailing list -- 389-devel@lists.fedoraproject.org > To unsubscribe send an email to 389-devel-le...@lists.fedoraproject.org > >
_______________________________________________ 389-devel mailing list -- 389-devel@lists.fedoraproject.org To unsubscribe send an email to 389-devel-le...@lists.fedoraproject.org