[389-devel] Re: Trying to understand entryrdn.db

2017-08-04 Thread Ilias Stamatis
2017-08-04 16:03 GMT+03:00 Ludwig Krispenz <lkris...@redhat.com>:

>
> On 08/04/2017 02:08 PM, Ilias Stamatis wrote:
>
> Okay, now that I have read and understood dbscan's code, I have a few more
> questions.
>
> 2017-08-03 10:10 GMT+03:00 Ludwig Krispenz <lkris...@redhat.com>:
>
>> Hi, now that I know the context, here are some more comments.
>> If the purpose is to create a useful LDIF file, which could eventually be
>> used for import, then formatting an entry correctly is not enough. The order
>> of entries matters: parents need to come before children. We already handle
>> this in db2ldif or replication total update.
>> That said, whenever you write an entry you have always already seen its
>> parent, so you could stack the DN with the parentid and create the DN
>> without using the entryrdn index.
>> You do not even need to keep track of all the entry RDNs/DNs - only the ones
>> with children will be needed later; the presence of "numsubordinates"
>> identifies a parent.
>>
>
> Is it guaranteed that parents are going to appear before children in
> id2entry.db?
>
> No. That's what I said before: it is possible that parentid > entryid. It
> happens if an entry is moved by modrdn to another subtree.
>

Ooh, you're right. I got confused, sorry.
I'm also having a hard time finding where this functionality is implemented
in db2ldif. :/

If I tried to do it "from scratch", I think we go back to this (because we
need to grab something that is located after where the cursor is currently
pointing):

On 08/02/2017 09:12 PM, Mark Reynolds wrote:

> I have not looked closely into it - so it might not be necessary to use
> entryrdn.  I thought it might be more efficient to use it.  If you just use
> id2entry, you have to keep scanning it over and over, and starting over
> every time you need to read the next entry.  Maybe not though, maybe you
> can just "search" it and not have to scan it sequentially when trying to
> find parents and entries.  I'll leave that up to you to find out ;-)
>

BDB has this method:
https://docs.oracle.com/cd/E17275_01/html/api_reference/C/dbget.html
It allows you to retrieve a key/data pair directly, without needing to
iterate with cursor->c_get(cursor, &key, &data, DB_NEXT).

The thing is that I don't know how it is implemented. Does it scan the DB
sequentially, or is it faster than that (I hope and guess it's the
latter)?

If it's not that efficient, maybe it does make sense to use entryrdn
after all?
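
For what it's worth, a keyed get on a Btree database descends the tree rather
than walking every record, so its cost grows roughly with the logarithm of
the number of records. The sketch below is only an illustration of that
difference and does not use BDB at all: a sorted Python list stands in for
the keys, a plain loop stands in for stepping a cursor with DB_NEXT, and a
binary search stands in for the tree descent. The function names are
hypothetical, and it assumes id2entry is stored with the Btree access method.

```python
def scan_lookup(keys, wanted):
    """Cursor-style lookup: step through records until the key matches.

    Returns the number of records visited, or None if the key is absent.
    """
    steps = 0
    for key in keys:
        steps += 1
        if key == wanted:
            return steps
    return None


def keyed_lookup(keys, wanted):
    """Keyed lookup on sorted keys: halve the search range on each step.

    Returns the number of probes made, or None if the key is absent.
    """
    lo, hi, steps = 0, len(keys), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if keys[mid] == wanted:
            return steps
        if keys[mid] < wanted:
            lo = mid + 1
        else:
            hi = mid
    return None


# One million entry IDs, already sorted as they would be in a Btree.
ids = list(range(1, 1_000_001))
print("sequential steps:", scan_lookup(ids, 900_000))
print("keyed probes:    ", keyed_lookup(ids, 900_000))
```

With a million keys the sequential version visits 900,000 records to reach
ID 900000, while the keyed version needs at most 20 probes.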
___
389-devel mailing list -- 389-devel@lists.fedoraproject.org
To unsubscribe send an email to 389-devel-le...@lists.fedoraproject.org


[389-devel] Re: Trying to understand entryrdn.db

2017-08-04 Thread Ludwig Krispenz


On 08/04/2017 02:08 PM, Ilias Stamatis wrote:
Okay, now that I have read and understood dbscan's code, I have a few 
more questions.


2017-08-03 10:10 GMT+03:00 Ludwig Krispenz <lkris...@redhat.com>:


Hi, now that I know the context, here are some more comments.
If the purpose is to create a useful LDIF file, which could
eventually be used for import, then formatting an entry correctly
is not enough. The order of entries matters: parents need to come
before children. We already handle this in db2ldif or replication
total update.
That said, whenever you write an entry you have always already seen
its parent, so you could stack the DN with the parentid and create
the DN without using the entryrdn index.
You do not even need to keep track of all the entry RDNs/DNs - only
the ones with children will be needed later; the presence of
"numsubordinates" identifies a parent.


Is it guaranteed that parents are going to appear before children in 
id2entry.db?
No. That's what I said before: it is possible that parentid > entryid.
It happens if an entry is moved by modrdn to another subtree.


If so, here's what could probably work:

- Start reading entries from id2entry sequentially.
- For each entry, if it has a numSubordinates attribute, it is a parent
of other entries, so we can store its ID - DN pair in a hash map.
- For entries that have a parentid, where we need to figure out the
parent's DN, we just look up hashmap[parentid].


To make it even more efficient (if really needed though, because it
will make things more complicated) we can somehow store the value of
numSubordinates with each parent in the map as well. Every time a
parentid is looked up in the map we can decrease that value by 1. When
it becomes 0, there are no more children of this ID, so we can safely
remove it from the map.


However, I don't know if we would really need this last thing. In a 
100 million entry db how many parents would we expect to have 
approximately?


Also, do we have a hash map implemented somewhere?

If parents are not guaranteed to appear before children in 
id2entry.db, then we would have to alter the above strategy.
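
A rough sketch of how the strategy could be altered for the parentid >
entryid case, under several assumptions: a Python dict stands in for
id2entry (so a lookup by ID plays the role of a keyed get), the field names
are simply lowercased attribute names, and numsubordinates is assumed to be
an accurate count of direct children. The function name and sample data are
hypothetical, and tombstones or inconsistent databases are not handled.

```python
# Hypothetical sample: entry 4 was moved by modrdn under entry 9,
# so its parentid is larger than its own entryid.
ENTRIES = {
    1: {"rdn": "dc=example,dc=com", "numsubordinates": 2},
    3: {"rdn": "ou=Groups", "parentid": 1, "numsubordinates": 1},
    4: {"rdn": "cn=HR Managers", "parentid": 9},
    6: {"rdn": "cn=Accounting Managers", "parentid": 3},
    9: {"rdn": "ou=Archive", "parentid": 1, "numsubordinates": 1},
}


def export_in_parent_order(id2entry):
    """Yield (entryid, dn) pairs so that parents come before children."""
    parents = {}   # parent id -> [parent dn, children still to be written]
    ahead = set()  # ids written before their turn in the sequential scan

    def emit(eid, position):
        entry = id2entry[eid]
        pid = entry.get("parentid")
        # Everything below the scan position is already written. A parent
        # above it has to be fetched and written first, exactly once.
        if pid is not None and pid > position and pid not in ahead:
            ahead.add(pid)
            yield from emit(pid, position)
        if pid is None:
            dn = entry["rdn"]
        else:
            slot = parents[pid]
            dn = entry["rdn"] + "," + slot[0]
            slot[1] -= 1
            if slot[1] == 0:      # last child seen: forget this parent
                del parents[pid]
        if entry.get("numsubordinates", 0) > 0:
            parents[eid] = [dn, entry["numsubordinates"]]
        yield eid, dn

    for eid in sorted(id2entry):  # sequential scan in entryid order
        if eid in ahead:          # already written when a child needed it
            ahead.discard(eid)
            continue
        yield from emit(eid, eid)


for eid, dn in export_in_parent_order(ENTRIES):
    print(eid, dn)
```

Only parents that still have unwritten children stay in the map, and only
entries written ahead of their turn are tracked, so neither structure has to
grow with the total number of entries.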


Thanks!





--
Red Hat GmbH, http://www.de.redhat.com/, Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Michael Cunningham, Michael O'Neill, Eric 
Shander



[389-devel] Re: Trying to understand entryrdn.db

2017-08-04 Thread Nishan Boroian
 
 
 
Ok, thanks for the update.

> On Aug 4, 2017 at 08:08, Ilias Stamatis <stamatis.ili...@gmail.com> wrote:
>
> Okay, now that I have read and understood dbscan's code, I have a few more 
> questions.
>  
>  
>
>  
> 2017-08-03 10:10 GMT+03:00 Ludwig Krispenz <lkris...@redhat.com>:
>
> > Hi, now that I know the context, here are some more comments.
> >
> > If the purpose is to create a useful LDIF file, which could eventually be
> > used for import, then formatting an entry correctly is not enough. The
> > order of entries matters: parents need to come before children. We already
> > handle this in db2ldif or replication total update.
> > That said, whenever you write an entry you have always already seen its
> > parent, so you could stack the DN with the parentid and create the DN
> > without using the entryrdn index.
> > You do not even need to keep track of all the entry RDNs/DNs - only the
> > ones with children will be needed later; the presence of "numsubordinates"
> > identifies a parent.
> >
>
>  
> Is it guaranteed that parents are going to appear before children in 
> id2entry.db?
>  
>  
> If so, here's what could probably work:
>  
>  
> - Start reading entries from id2entry sequentially.
>
> - For each entry, if it has a numSubordinates attribute, it is a parent of
> other entries, so we can store its ID - DN pair in a hash map.
>
> - For entries that have a parentid, where we need to figure out the
> parent's DN, we just look up hashmap[parentid].
>
>
> To make it even more efficient (if really needed though, because it will make
> things more complicated) we can somehow store the value of numSubordinates
> with each parent in the map as well. Every time a parentid is looked up in
> the map we can decrease that value by 1. When it becomes 0, there are no
> more children of this ID, so we can safely remove it from the map.
>  
>  
> However, I don't know if we would really need this last thing. In a 100 
> million entry db how many parents would we expect to have approximately?
>  
>  
> Also, do we have a hash map implemented somewhere?
>  
>  
> If parents are not guaranteed to appear before children in id2entry.db, then 
> we would have to alter the above strategy.
>
>
>  
> Thanks!
>  
>


[389-devel] Re: Trying to understand entryrdn.db

2017-08-03 Thread Nishan Boroian
 
 
 
Let's discuss more on it.

> On Aug 3, 2017 at 07:33, Ludwig Krispenz <lkris...@redhat.com> wrote:
>
> On 08/03/2017 12:24 PM, Ilias Stamatis wrote:
>
> > > That said, whenever you write an entry you have always already seen its
> > > parent, so you could stack the DN with the parentid and create the DN
> > > without using the entryrdn index.
> > > You do not even need to keep track of all the entry RDNs/DNs - only the
> > > ones with children will be needed later; the presence of
> > > "numsubordinates" identifies a parent.
> >
> > Interesting. I think I now understand better how to approach this problem.
>
> Great. Just one more hint: if you iterate over the entries in id2entry, you
> have the entryid and the parentid of each entry. If parentid > entryid, you
> need to get and export the parent first (and track that you already did so).
>
> > I'll get back to it soon.
> >
> > Thanks so much!
> >
> > > Last but not least, since I think dbscan is broken for entryrdn,
> > > investigating and fixing this would also be nice
> >
> > Sure. I'll open a ticket so it gets tracked.


[389-devel] Re: Trying to understand entryrdn.db

2017-08-03 Thread Ilias Stamatis
>
> That said, whenever you write an entry you have always already seen its
> parent, so you could stack the DN with the parentid and create the DN
> without using the entryrdn index.
> You do not even need to keep track of all the entry RDNs/DNs - only the ones
> with children will be needed later; the presence of "numsubordinates"
> identifies a parent.
>

Interesting. I think I now understand better how to approach this problem.
I'll get back to it soon.
Thanks so much!

> Last but not least, since I think dbscan is broken for entryrdn,
> investigating and fixing this would also be nice

Sure. I'll open a ticket so it gets tracked.


[389-devel] Re: Trying to understand entryrdn.db

2017-08-03 Thread Ludwig Krispenz


On 08/02/2017 09:12 PM, Mark Reynolds wrote:



On 08/02/2017 02:19 PM, Ilias Stamatis wrote:

I see now, thank you both very much!

Follow-up:

[1]  Get entry from id2entry and use its ID
[2]  Look in entryrdn for the parent of the ID
[3]  Keep looking for parents, building the DN as you go along


Example:

[1]  Get entry from id2entry:  ID 6 --> "cn=Accounting Managers"
[2]  Check entryrdn for "P".  In this case it's "P6" which is
"ou=Groups" with ID 3
[3]  So find "P3", which is "dc=example,dc=com" with ID 1, and
look for "P1".  But there is no P1, so we stop the process/loop.


Why do we need to look at entryrdn for parent's id? Is it faster?
I have not looked closely into it - so it might not be necessary to 
use entryrdn.  I thought it might be more efficient to use it. If you 
just use id2entry, you have to keep scanning it over and over, and 
starting over every time you need to read the next entry.  Maybe not 
though, maybe you can just "search" it and not have to scan it 
sequentially when trying to find parents and entries.  I'll leave that 
up to you to find out ;-)

Hi, now that I know the context, here are some more comments.
If the purpose is to create a useful LDIF file, which could eventually
be used for import, then formatting an entry correctly is not enough.
The order of entries matters: parents need to come before children. We
already handle this in db2ldif or replication total update.
That said, whenever you write an entry you have always already seen its
parent, so you could stack the DN with the parentid and create the DN
without using the entryrdn index.
You do not even need to keep track of all the entry RDNs/DNs - only the
ones with children will be needed later; the presence of
"numsubordinates" identifies a parent.

Last but not least, since I think dbscan is broken for entryrdn,
investigating and fixing this would also be nice.


I mean the same information can be found in id2entry (?). Or this is 
not the case and dbscan does the exact same process you just 
described in order to print "parentid: X" for each entry when you do 
"dbscan -f id2entry.db"?


Thanks again,







[389-devel] Re: Trying to understand entryrdn.db

2017-08-02 Thread Mark Reynolds


On 08/02/2017 11:49 AM, Ilias Stamatis wrote:
> Hello,
>
> I would like some help in order to understand entryrdn.db. When I do
> "dbscan -f entryrdn.db" I get something like:
>
> 3
>   ID: 3; RDN: "ou=Groups"; NRDN: "ou=groups"
>
> C3
> ID: 6; RDN: "cn=Accounting Managers"; NRDN: "cn=accounting managers"
>
> P6
> ID: 3; RDN: "ou=Groups"; NRDN: "ou=groups"
>
> I understand that 3 is this entry's ID, C3 means child of entry 3 and
> P6 means parent of entry 6.
>
> What I don't understand however is why those entries are repeated
> again and again. For example, ID: 7; RDN: "cn=HR Managers"; NRDN:
> "cn=hr managers" is repeated about a dozen times in my entryrdn.
> And I don't mean as a parent, child, or whatever. It is repeated
> lots of times as ID 7, for example (but also many times as C3, etc.).
>
> I attach the complete output of what I get when I run "dbscan -f
> entryrdn.db", in order to demonstrate what I mean (my db contains
> almost default entries only).
>
> So my question is: how is this database filled?
Since I know that you are trying to work on ticket
https://pagure.io/389-ds-base/issue/47567, here is the short answer...

This is how you use id2entry and entryrdn together to achieve the desired
result...

id2entry:

id 1
rdn: dc=example,dc=com
objectClass: top
...

id 3
rdn: ou=Groups
...
...
parentid: 1

id 6
rdn: cn=Accounting Managers
...
...
parentid: 3   --> points to ID 3 ("ou=Groups"), then the parentid of
"ou=Groups" points to ID 1 ("dc=example,dc=com").  Final result
"cn=Accounting Managers, ou=Groups, dc=example,dc=com"


For the purposes of the ticket listed above you need to recreate the
full DN of each entry found in id2entry and then print its LDIF format. 
So you grab an entry from id2entry, find its parent id, then you
recursively keep looking at each parent in entryrdn, building up the DN
as you go, until there is no parent id found.


[1]  Get entry from id2entry and use its ID
[2]  Look in entryrdn for the parent of the ID
[3]  Keep looking for parents, building the DN as you go along

Example:

[1]  Get entry from id2entry:  ID 6 --> "cn=Accounting Managers"
[2]  Check entryrdn for "P".  In this case it's "P6" which is
"ou=Groups" with ID 3
[3]  So find "P3", which is "dc=example,dc=com" with ID 1, and look for
"P1".  But there is no P1, so we stop the process/loop.

Final result "cn=Accounting Managers, ou=Groups, dc=example,dc=com"
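
The walk described above could be sketched roughly as follows. This is a toy
model only: a Python dict stands in for the "P" records that dbscan prints
for entryrdn, the function name is hypothetical, and a real implementation
would read these records from the database instead.

```python
# "P<id>" -> (parent id, parent RDN), as in the dbscan output above.
# Only the parent records are modelled; there is no "P1" for the suffix.
ENTRYRDN_PARENTS = {
    "P3": (1, "dc=example,dc=com"),
    "P6": (3, "ou=Groups"),
}


def build_dn(entry_id, rdn, parent_records=ENTRYRDN_PARENTS):
    """Follow P<id> records upwards, appending each parent's RDN."""
    parts = [rdn]
    current = entry_id
    while True:
        parent = parent_records.get("P%d" % current)
        if parent is None:        # no P record: we reached the top
            break
        current, parent_rdn = parent
        parts.append(parent_rdn)
    return ",".join(parts)


print(build_dn(6, "cn=Accounting Managers"))
```

Each entry costs one lookup per level of the tree, and nothing has to be
kept in memory between entries.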


We just want it to be efficient, and not use a lot of memory.  This
needs to work on a 100 million entry db without consuming a lot of
resources.

I hope that helps.

Mark

>
> Thank you very much,
> Ilias
>
>


[389-devel] Re: Trying to understand entryrdn.db

2017-08-02 Thread Ludwig Krispenz

Hi,

I think this is a problem with dbscan, which tries to pretty-print the
entryrdn index and seems to loop a bit.

If you do

 db_dump -d a entryrdn.db

you get the raw contents of the file, and you get far fewer records.

Ludwig

On 08/02/2017 05:49 PM, Ilias Stamatis wrote:

Hello,

I would like some help in order to understand entryrdn.db. When I do 
"dbscan -f entryrdn.db" I get something like:


3
  ID: 3; RDN: "ou=Groups"; NRDN: "ou=groups"

C3
ID: 6; RDN: "cn=Accounting Managers"; NRDN: "cn=accounting managers"

P6
ID: 3; RDN: "ou=Groups"; NRDN: "ou=groups"

I understand that 3 is this entry's ID, C3 means child of entry 3 and 
P6 means parent of entry 6.


What I don't understand however is why those entries are repeated
again and again. For example, ID: 7; RDN: "cn=HR Managers"; NRDN:
"cn=hr managers" is repeated about a dozen times in my entryrdn.
And I don't mean as a parent, child, or whatever. It is repeated
lots of times as ID 7, for example (but also many times as C3, etc.).


I attach the complete output of what I get when I run "dbscan -f 
entryrdn.db", in order to demonstrate what I mean (my db contains 
almost default entries only).


So my question is: how is this database filled?

Thank you very much,
Ilias

