[389-devel] Re: Strategy proposal for making DB dump in LDIF format from dbscan

Ilias Stamatis Mon, 21 Aug 2017 05:04:45 -0700

2017-08-17 3:55 GMT+03:00 William Brown <[email protected]>:

> On Tue, 2017-08-15 at 22:03 +0300, Ilias Stamatis wrote:
> > 2017-08-15 9:56 GMT+03:00 William Brown <[email protected]>:
> >
> > > On Fri, 2017-08-11 at 17:49 +0300, Ilias Stamatis wrote:
> > > > Hi everybody,
> > > >
> > > > Following Ludwig's and Mark's suggestions on how to perform a
> database
> > > dump
> > > > in LDIF format from dbscan, I have come up with a strategy. I'm
> talking
> > > > about ticket #47567: https://pagure.io/389-ds-base/issue/47567
> > > >
> > > > I'd like to discuss this strategy and get some feedback.
> > > >
> > > > The general idea is:
> > > >
> > > > - We are cursing though id2entry.db printing entries in order.
> > > > - Parents should be printed before children.
> > > > - Hence if for some entry parenitd > entryid, we have to print its
> parent
> > > > first (out of order) and track that we did so.
> > > > - In order to do the above, we don't need to move the db cursor. We
> can
> > > > just go fetch something from a random place in the db and the cursor
> will
> > > > remain in its place, so we can continue from where we left off after
> > > we're
> > > > done with printing a parent.
> > > > - We also need to construct and print the DN of each entry using its
> RDN
> > > > and the DN of its father.
> > > > - Let's use something like a hash map to pair entryids with dns
> (only for
> > > > entries that has children), e.g. map[1] = "dc=example,dc=com",
> map[4] =
> > > > "ou=People,dc=example,dc=com", etc.
> > > >
> > > > I'll present the algorithm that I came up with in python-like
> > > pseudo-code.
> > > >
> > > > First, the following function constructs the entry's DN and updates
> the
> > > > hash map if needed. We can know whether an entry is a parent or not,
> by
> > > the
> > > > presence of the numSubordinates attribute.
> > > >
> > > > # assumes that parentid already exists in the map
> > > > function display_entry(e, map):
> > > >     if not e.parentid:
> > > >         e.dn = e.rdn
> > > >     else:
> > > >         e.dn = e.rdn + map[e.parentid]
> > > >     if isparent(e):
> > > >         map[e.entryid] = e.dn
> > > >     print_to_ldif_format(e)
> > > >
> > > > Then, the main loop:
> > > >
> > > > map = new(hashmap)
> > > > printed_in_advance = []
> > > >
> > > > for e in entries:
> > > >     if e.entryid in printed_in_advance:
> > > >         continue # skip this, we have already displayed it
> > > >
> > > >     if e.parentid < e.entryid:
> > > >         display_entry(e, map)
> > > >
> > > >     else:
> > > >         # we need to display parents before children
> > > >
> > > >         list_of_parents = []
> > > >
> > > >         p = e
> > > >         while p.parentid > p.entryid:
> > > >             p = get_from_db(key = p.parentid)
> > > >             list_of_parents.append(p) # see note below (*)
> > > >
> > > >         for p in reversed(list_of_parents):
> > > >             display_entry(p, map)
> > > >             printed_in_advance.append(p.entryid)
> > > >
> > > >
> > > > * here we store the whole entry in the list (aka its data) and not
> just
> > > > its id, in order to avoid fetching it from the database again
> > > >
> > > > As a map, we can use a B+ tree implementation from libsds.
> > > >
> > > > I would like to know how the above idea sounds to you before going
> ahead
> > > > and implementing it.
> > > >
> > >
> > > Looks like a pretty good plan to me.
> > >
> > > What happens if you have this situation?
> > >
> > > rdn: a
> > > entryid: 1
> > > parentid: 2
> > >
> > > rdn: b
> > > entryid: 2
> > > parentid: 3
> > >
> > > rdn: c
> > > entryid: 3
> > > parentid: 4
> > >
> > > etc.
> > >
> > > Imagine we have to continue to work up ids to get to the parent. Can
> > > your algo handle this case?
> > >
> >
> > Unless I'm mistaken, yes. Provided of course that there is a root entry
> > which has no parentid.
> >
> > This is the part that handles this:
> >
> >          p = e
> >          while p.parentid > p.entryid:
> >              p = get_from_db(key = p.parentid)
> >              list_of_parents.append(p)
> >
> > We may jump at a few random points in the db for this and we will have to
> > keep just a few entries in memory. But I think this number of entries is
> > always going to be very small e.g. 1-2 or maybe even 7-8 in more
> "extreme"
> > cases, but never more than let's say 15. I might as well be completely
> > wrong about this, so, please correct me if that's the case.
> >
> >
> > > It seems like the way you could approach this is to sort the id order
> > > you need to display them in *before* you start parsing entries. We can
> > > file allids in memory anyway, because it's just a set of 32bit ints, so
> > > even at 50,000,000 entries, this only takes 190MB of ram to fit allids.
> > > So I would approach this as an exercise for a comparison function to
> the
> > > set.
> > >
> > > universal_entry_set = [....] # entryrdn / id2entry
> > > # The list of entries to print
> > > print_order = []
> > >
> > > for e in universal_entry_set:
> > >     printer_order_insert_sorted(e)
> > >
> > > So we can imagine that this would invoke a sorting function. Something
> > > that would work is:
> > >
> > > compare(e1, e2):
> > >     # if e1 parent is 0, return -1
> > >     # if e2 parent is 0, return 1
> > >     # If e1 is parent to e2, return -1
> > >     # If e2 is parent to e1, return 1
> > >     # return compare of eid.
> > >
> > > Then this would create a list in order of parent relationships, and you
> > > can just do:
> > >
> > > for e in print_order:
> > >
> > > Which despite jumping about in the cursor, will print in order.
> > >
> > > So you'll need to look at entryrdn once, and then id2entry once for
> > > this.
> > >
> > > If you want to do this without entryrdn, you'll need to pass id2entry
> > > twice. But You'll only need 1 entry in memory at a time to achieve it I
> > > think. It also doesn't matter about order of ids at all here.
> > >
> >
> > Hmm, I just didn't understand what's wrong with what I proposed
> previously.
> > As you said with the technique you just described we either have to pass
> > both entryrdn and id2entry once, or pass id2entry twice in any case. With
> > what I described previously we have to pass id2entry only once (and
> > probably for a few entries we will have to skip them if we come across
> them
> > a second time).
> >
> > Could you explain why do you think the first one would be better / more
> > efficient?
> >
>
> I think your algo doesn't handle the case where you have *multiple* outo
> of order parents. Even though you have to do two passes, the way I
> proposed will guaranteed correct display regardless of id number because
> they are solely sorted on parent relationships.
>


I insist that it does. Please, allow me to use an example case to
illustrate this.


>
> I think your one will get to
>
> > >         while p.parentid > p.entryid:
> > > >             p = get_from_db(key = p.parentid)
> > > >             list_of_parents.append(p) # see note below (*)
> > > >
> > > >         for p in reversed(list_of_parents):
> > > >             display_entry(p, map)
> > > >             printed_in_advance.append(p.entryid)
>
> ^ Here, what if the parent's parent is not yet printed? display_entry
> would need to be recursive to handle this (and it appears to be
> iterative).
>

Say you have the following case:

id 1 rdn: A
entry: 1

id 2 rdn: B
parentid: 1

id 3 rdn: C
parentid: 4

id 4 rdn: D
parentid: 5

id 5 rdn: E
parentid: 1

id 6 rdn: F
parentid: 1

The algorithm will first print the first 2 entries without entering the
else clause of the loop because we don't have out of order entries.

Then when it comes to entry C it will enter the else clause. The while loop
you quoted will create this list:

list_of_parents = [C, D, E]

Then the next for loop calls display_entry(), but after *reversing* the
list. So it will call display_entry() on [E, D, C]. E can be displayed
since it's parent has already been displayd. Then D can be displayed since
E has already been displayed, etc.

We have also added entries D and E in the printed_in_advance list.

We now exit the else clause and we can continue with the outer for loop.

We have stopped in entry C so now we examine entry D. We see that we have
already printed that (because it's in printed_in_advance list) so we skip
this entry. We also skip entry E for the same reason. Then we reach entry F
and we call display_entry to it since it's not out of order.

Does it make sense?


>
> Does that help?
>
> --
> Sincerely,
>
> William Brown
> Software Engineer
> Red Hat, Australia/Brisbane
>
>
> _______________________________________________
> 389-devel mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
>
>

2017-08-17 3:55 GMT+03:00 William Brown <[email protected]>:

> On Tue, 2017-08-15 at 22:03 +0300, Ilias Stamatis wrote:
> > 2017-08-15 9:56 GMT+03:00 William Brown <[email protected]>:
> >
> > > On Fri, 2017-08-11 at 17:49 +0300, Ilias Stamatis wrote:
> > > > Hi everybody,
> > > >
> > > > Following Ludwig's and Mark's suggestions on how to perform a
> database
> > > dump
> > > > in LDIF format from dbscan, I have come up with a strategy. I'm
> talking
> > > > about ticket #47567: https://pagure.io/389-ds-base/issue/47567
> > > >
> > > > I'd like to discuss this strategy and get some feedback.
> > > >
> > > > The general idea is:
> > > >
> > > > - We are cursing though id2entry.db printing entries in order.
> > > > - Parents should be printed before children.
> > > > - Hence if for some entry parenitd > entryid, we have to print its
> parent
> > > > first (out of order) and track that we did so.
> > > > - In order to do the above, we don't need to move the db cursor. We
> can
> > > > just go fetch something from a random place in the db and the cursor
> will
> > > > remain in its place, so we can continue from where we left off after
> > > we're
> > > > done with printing a parent.
> > > > - We also need to construct and print the DN of each entry using its
> RDN
> > > > and the DN of its father.
> > > > - Let's use something like a hash map to pair entryids with dns
> (only for
> > > > entries that has children), e.g. map[1] = "dc=example,dc=com",
> map[4] =
> > > > "ou=People,dc=example,dc=com", etc.
> > > >
> > > > I'll present the algorithm that I came up with in python-like
> > > pseudo-code.
> > > >
> > > > First, the following function constructs the entry's DN and updates
> the
> > > > hash map if needed. We can know whether an entry is a parent or not,
> by
> > > the
> > > > presence of the numSubordinates attribute.
> > > >
> > > > # assumes that parentid already exists in the map
> > > > function display_entry(e, map):
> > > >     if not e.parentid:
> > > >         e.dn = e.rdn
> > > >     else:
> > > >         e.dn = e.rdn + map[e.parentid]
> > > >     if isparent(e):
> > > >         map[e.entryid] = e.dn
> > > >     print_to_ldif_format(e)
> > > >
> > > > Then, the main loop:
> > > >
> > > > map = new(hashmap)
> > > > printed_in_advance = []
> > > >
> > > > for e in entries:
> > > >     if e.entryid in printed_in_advance:
> > > >         continue # skip this, we have already displayed it
> > > >
> > > >     if e.parentid < e.entryid:
> > > >         display_entry(e, map)
> > > >
> > > >     else:
> > > >         # we need to display parents before children
> > > >
> > > >         list_of_parents = []
> > > >
> > > >         p = e
> > > >         while p.parentid > p.entryid:
> > > >             p = get_from_db(key = p.parentid)
> > > >             list_of_parents.append(p) # see note below (*)
> > > >
> > > >         for p in reversed(list_of_parents):
> > > >             display_entry(p, map)
> > > >             printed_in_advance.append(p.entryid)
> > > >
> > > >
> > > > * here we store the whole entry in the list (aka its data) and not
> just
> > > > its id, in order to avoid fetching it from the database again
> > > >
> > > > As a map, we can use a B+ tree implementation from libsds.
> > > >
> > > > I would like to know how the above idea sounds to you before going
> ahead
> > > > and implementing it.
> > > >
> > >
> > > Looks like a pretty good plan to me.
> > >
> > > What happens if you have this situation?
> > >
> > > rdn: a
> > > entryid: 1
> > > parentid: 2
> > >
> > > rdn: b
> > > entryid: 2
> > > parentid: 3
> > >
> > > rdn: c
> > > entryid: 3
> > > parentid: 4
> > >
> > > etc.
> > >
> > > Imagine we have to continue to work up ids to get to the parent. Can
> > > your algo handle this case?
> > >
> >
> > Unless I'm mistaken, yes. Provided of course that there is a root entry
> > which has no parentid.
> >
> > This is the part that handles this:
> >
> >          p = e
> >          while p.parentid > p.entryid:
> >              p = get_from_db(key = p.parentid)
> >              list_of_parents.append(p)
> >
> > We may jump at a few random points in the db for this and we will have to
> > keep just a few entries in memory. But I think this number of entries is
> > always going to be very small e.g. 1-2 or maybe even 7-8 in more
> "extreme"
> > cases, but never more than let's say 15. I might as well be completely
> > wrong about this, so, please correct me if that's the case.
> >
> >
> > > It seems like the way you could approach this is to sort the id order
> > > you need to display them in *before* you start parsing entries. We can
> > > file allids in memory anyway, because it's just a set of 32bit ints, so
> > > even at 50,000,000 entries, this only takes 190MB of ram to fit allids.
> > > So I would approach this as an exercise for a comparison function to
> the
> > > set.
> > >
> > > universal_entry_set = [....] # entryrdn / id2entry
> > > # The list of entries to print
> > > print_order = []
> > >
> > > for e in universal_entry_set:
> > >     printer_order_insert_sorted(e)
> > >
> > > So we can imagine that this would invoke a sorting function. Something
> > > that would work is:
> > >
> > > compare(e1, e2):
> > >     # if e1 parent is 0, return -1
> > >     # if e2 parent is 0, return 1
> > >     # If e1 is parent to e2, return -1
> > >     # If e2 is parent to e1, return 1
> > >     # return compare of eid.
> > >
> > > Then this would create a list in order of parent relationships, and you
> > > can just do:
> > >
> > > for e in print_order:
> > >
> > > Which despite jumping about in the cursor, will print in order.
> > >
> > > So you'll need to look at entryrdn once, and then id2entry once for
> > > this.
> > >
> > > If you want to do this without entryrdn, you'll need to pass id2entry
> > > twice. But You'll only need 1 entry in memory at a time to achieve it I
> > > think. It also doesn't matter about order of ids at all here.
> > >
> >
> > Hmm, I just didn't understand what's wrong with what I proposed
> previously.
> > As you said with the technique you just described we either have to pass
> > both entryrdn and id2entry once, or pass id2entry twice in any case. With
> > what I described previously we have to pass id2entry only once (and
> > probably for a few entries we will have to skip them if we come across
> them
> > a second time).
> >
> > Could you explain why do you think the first one would be better / more
> > efficient?
> >
>
> I think your algo doesn't handle the case where you have *multiple* outo
> of order parents. Even though you have to do two passes, the way I
> proposed will guaranteed correct display regardless of id number because
> they are solely sorted on parent relationships.
>
> I think your one will get to
>
> > >         while p.parentid > p.entryid:
> > > >             p = get_from_db(key = p.parentid)
> > > >             list_of_parents.append(p) # see note below (*)
> > > >
> > > >         for p in reversed(list_of_parents):
> > > >             display_entry(p, map)
> > > >             printed_in_advance.append(p.entryid)
>
> ^ Here, what if the parent's parent is not yet printed? display_entry
> would need to be recursive to handle this (and it appears to be
> iterative).
>
> Does that help?
>
> --
> Sincerely,
>
> William Brown
> Software Engineer
> Red Hat, Australia/Brisbane
>
>
> _______________________________________________
> 389-devel mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
>
>

_______________________________________________
389-devel mailing list -- [email protected]
To unsubscribe send an email to [email protected]

[389-devel] Re: Strategy proposal for making DB dump in LDIF format from dbscan

Reply via email to