Re: [Dev] HOWTO: fast index-based item lookup

Morgen Sagen Mon, 30 Jan 2006 11:31:34 -0800

Something else to be aware of: if you are using a 'compare'-style index as in this example, and you want the index to be kept up to date when the attribute values change, you need to add a 'monitor' keyword to the addIndex() call, like so:

emailAddressCollection.rep.addIndex('emailAddress', 'compare',
                                    compare='_compareAddr',
                                    monitor=('emailAddress'))

This lets the index code know which attribute(s) it needs to monitor. Other index styles (such as 'attribute') have monitors added for you automatically, and thus you don't need the 'monitor' keyword.
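To see why a sorted index has to be told about attribute changes, here is a minimal, self-contained sketch in plain Python 3. It does not use the repository API: Item, SortedIndex and reindex() are made-up names for illustration only, and reindex() merely plays the role of whatever a monitor would trigger. The point is that an index sorted on a value goes stale when that value changes behind its back, so a binary search can miss the item.

```python
class Item:
    """Hypothetical stand-in for a repository item."""

    def __init__(self, emailAddress):
        self.emailAddress = emailAddress


class SortedIndex:
    """Toy sorted index, NOT Chandler's implementation."""

    def __init__(self, key):
        self._key = key
        self._items = []

    def _position(self, value):
        # Binary search for the leftmost slot where value would fit.
        lo, hi = 0, len(self._items)
        while lo < hi:
            mid = (lo + hi) // 2
            if self._key(self._items[mid]) < value:
                lo = mid + 1
            else:
                hi = mid
        return lo

    def add(self, item):
        self._items.insert(self._position(self._key(item)), item)

    def reindex(self, item):
        # Assumed to be called when a monitored attribute has changed.
        self._items.remove(item)
        self.add(item)

    def find(self, value):
        # Only correct while the list really is sorted by key.
        pos = self._position(value)
        if pos < len(self._items) and self._key(self._items[pos]) == value:
            return self._items[pos]
        return None


index = SortedIndex(key=lambda item: item.emailAddress.lower())
alice = Item("alice@example.com")
bob = Item("bob@example.com")
carol = Item("carol@example.com")
for item in (alice, bob, carol):
    index.add(item)

alice.emailAddress = "zed@example.com"   # the index is not told
stale = index.find("zed@example.com")    # search runs on a stale order

index.reindex(alice)                     # what a monitor would trigger
fresh = index.find("zed@example.com")
```

In this toy, the first lookup comes back empty because the changed item still sits at its old position, and the second one succeeds once the item has been re-inserted.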


~morgen

On Jan 24, 2006, at 4:18 PM, Andi Vajda wrote:

The text below is a cross-post to this list from a recent chandlerdb blog post
http://blogs.osafoundation.org/chandlerdb/2006/01/using_collectio.html
as per Ted's request.

Andi..
---------------------------------------------------------------------------
January 24, 2006
Using collection indexes to find items
During last week's sprints I was asked to see how easy it was to import mail into Chandler from a local mailbox. Thanks to Python's mailbox and email packages, the mailbox parsing was trivial. Similarly, Chandler's domain model can represent email items and has a number of APIs that make creating such items from a raw email string very easy.

Unexpectedly, however, importing mail messages was very slow, and was getting worse as more messages were added to the repository. The first cause of slowness was easily resolved by turning off 'bz2' compression of all the text LOBs being created.

The second cause of slowness was more surprising, though. It had to do with looking up existing email address items from the email addresses occurring in the headers of the messages being imported. The EmailAddress item class defined here has a class method called getEmailAddress() that will linearly iterate over all EmailAddress instances every time such an item is sought. The mailbox I was trying to import contains about 6,500 mails with 800 unique email addresses. Linearly searching them won't scale.

In the early days of the repository we had planned on having a query system, and Ted even implemented a query language not unlike one you'd encounter in an SQL database. Last year, while working on Chandler 0.6, it occurred to us that a formal query language in a Python-based system was somewhat redundant and that we should be using computed collections instead of queries altogether to accomplish the same goals. Hence abstract sets and their item wrappers were implemented.

By combining abstract sets, collections and indexes - something I had added to bi-directional reference collections during the Chandler 0.5 release - we realized we could get very decent item look-up performance as well as cache the computed aspect of abstract sets, making membership tests and iteration also considerably faster.

This technique can be illustrated with the example below, taken from the EmailAddress index I added during last week's sprints.
    * First, a collection needs to be set up. This can be as simple as a
      bi-directional reference collection, an abstract collection of all
      items of a given kind or a more complicated computed collection
      combining or filtering other collections. In the EmailAddress
      example, I just created a KindCollection instance based on the
      EmailAddress kind. In the app parcel I added yet another collection
      called emailAddressCollection.

      emailAddressCollection = \
          KindCollection.update(parcel, 'emailAddressCollection',
                                kind=pim.mail.EmailAddress.getKind(view),
                                recursive=True)
    * Then, a sorted index needs to be added to the collection. I wanted to
      be able to find existing email addresses such that no two email
      address items in the collection have the same lowercase internet
      email address string. For this purpose, I added the _compareAddr()
      method to the EmailAddress class in the mail parcel:

          def _compareAddr(self, other):
              return cmp(self.emailAddress.lower(),
                         other.emailAddress.lower())
      and the index creation call after the code creating the collection:
      emailAddressCollection.rep.addIndex('emailAddress', 'compare',
                                          compare='_compareAddr')

      which creates a 'compare' index called 'emailAddress', an index
      calling a method called '_compareAddr' on the items it is comparing.
    * And finally, to use this collection and the index to find items, I
      added a findEmailAddress() class method to the EmailAddress class:

          @classmethod
          def findEmailAddress(cls, view, emailAddress):
              collection = schema.ns("osaf.app",
                                     view).emailAddressCollection.rep
              emailAddress = emailAddress.lower()

              def compareAddr(key):
                  return cmp(emailAddress, view[key].emailAddress.lower())

              uuid = collection.findInIndex('emailAddress', 'exact',
                                            compareAddr)
              if uuid is None:
                  return None

              return view[uuid]
      The findInIndex() call looks for an exact match as per the
      compareAddr local function, which compares lowercase versions of
      internet address strings.
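How a one-argument compare function can drive an exact-match lookup may be easier to see in a self-contained sketch. This is plain Python 3, not the repository's code: find_in_index(), the sorted list of keys and the dict standing in for a view are all hypothetical, and because cmp() no longer exists in Python 3 a small replacement is defined. The assumption here is that the callback returns a negative number, zero or a positive number depending on whether the sought value sorts before, at or after the item stored under the given key.

```python
def cmp(a, b):
    # Python 3 replacement for the Python 2 built-in cmp().
    return (a > b) - (a < b)


def find_in_index(sorted_keys, compare):
    """Binary search over keys already sorted in the index's order.

    compare(key) returns <0, 0 or >0 when the sought value sorts
    before, at or after the item stored under key.
    """
    lo, hi = 0, len(sorted_keys) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        result = compare(sorted_keys[mid])
        if result == 0:
            return sorted_keys[mid]
        if result < 0:
            hi = mid - 1
        else:
            lo = mid + 1
    return None


# Hypothetical stand-in for a view: key -> email address string.
view = {
    "k1": "Carol@Example.com",
    "k2": "alice@example.com",
    "k3": "Bob@example.com",
}

# Keys ordered by lowercase address, as a sorted index would keep them.
index = sorted(view, key=lambda key: view[key].lower())


def find_email_address(address):
    address = address.lower()

    def compare_addr(key):
        return cmp(address, view[key].lower())

    return find_in_index(index, compare_addr)


found = find_email_address("BOB@EXAMPLE.COM")
missing = find_email_address("dave@example.com")
```

The lookup only works because the callback compares in the same order the index was sorted in, here lowercase addresses on both sides.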
The technique above replaces the linear search with a binary search, yielding a very noticeable performance improvement in the overall importing of email messages into the repository.
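For a rough sense of the difference, the sketch below counts worst-case comparisons for a single lookup among the 800 unique addresses mentioned earlier. It is only arithmetic on that figure, not a measurement of the repository, and actual import time depends on much more than comparison counts.

```python
import math

unique_addresses = 800   # figure quoted above for the test mailbox

# A linear scan may have to look at every address before giving up.
linear_worst_case = unique_addresses

# A binary search halves the candidate range with each comparison.
binary_worst_case = math.floor(math.log2(unique_addresses)) + 1

print(linear_worst_case, binary_worst_case)
```

Under these assumptions that is 800 comparisons against 10 per lookup, and the gap widens as more addresses are added.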
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Open Source Applications Foundation "Dev" mailing list
http://lists.osafoundation.org/mailman/listinfo/dev

