Something else to be aware of: if you are using a 'compare'-style
index as in this example, and you want the index to be kept up to
date when the attribute values change, you need to add a 'monitor'
keyword to the addIndex( ) call, like so:
emailAddressCollection.rep.addIndex('emailAddress', 'compare',
compare='_compareAddr', monitor=('emailAddress'))
This lets the index code know which attribute(s) it needs to
monitor. Other index styles (such as 'attribute') have monitors
added for you automatically, and thus you don't need the 'monitor'
keyword.
~morgen
On Jan 24, 2006, at 4:18 PM, Andi Vajda wrote:
The text below is a cross-post to this list from the a recent
chandlerdb blog post http://blogs.osafoundation.org/chandlerdb/
2006/01/using_collectio.html
as per Ted's request.
Andi..
----------------------------------------------------------------------
-----
January 24, 2006
Using collection indexes to find items
During last week's sprints I was asked to see how easy it was to
import mail
into Chandler from a local mailbox. Thanks to Python's mailbox and
email
packages, the mailbox parsing was trivial. Similarly, Chandler's
domain
model can represent email items and has a number of APIs that make
creating
such items from a raw email string very easy.
Unexpectedly, importing mail messages was very slow however, and
was getting
worse as more messages were added to the repository. The first
cause of
slowness was easily resolved by turning off 'bz2' compression of
all the
text LOBs being created.
The second cause of slowness was more surprising though. It had to
do with
looking up existing email address items from the email addresses
occurring
in the headers of the messages being imported. The EmailAddress
item class
defined here has a class method called getEmailAddress() that will
linearly
iterate all EmailAddresses instances every time such an item is
sought. The
mailbox I was trying to import contains about 6,500 mails with 800
unique
email addresses. Linearly searching them won't scale.
In the early days of the repository we had planned on having a
query system
and Ted even implemented a query language not unlike one you'd
encounter in
an SQL database. Last year, while working on Chandler 0.6, it
occurred to us
that a formal query language in a Python-based system was somewhat
redundant
and that we should be using computed collections instead of queries
altogether to accomplish the same goals . Hence abstract sets and
their item
wrappers were implemented.
By combining abstract sets, collections and indexes - something I
had added
to bi-directional reference collections during the Chandler 0.5
release - we
realized we could get very decent item look-up performance as well
as cache
the computed aspect of abstract sets making membership tests and
iteration
also considerably faster.
This technique can be illustrated with the example below taken from
the
EmailAddress index I added during last week's sprints.
* First, a collection needs to be setup. This can be as simple
as a
bi-directional reference collection, an abstract collection
of all
items of a given kind or a more complicated computed collection
combining or filtering other collections. In the EmailAddress
example,
I just created a KindCollection instance based on the
EmailAddress
kind. In the app parcel I added yet another collection called
emailAddressCollection.
emailAddressCollection = \
KindCollection.update(parcel, 'emailAddressCollection',
kind=pim.mail.EmailAddress.getKind
(view),
recursive=True)
* Then, a sorted index needs be added to the collection. I
wanted to be
able to find existing email addresses such that no two email
address
items in the collection have the same lowercase internet
email address
string. For this purpose, I added the _compareAddr() method
to the
EmailAddress class in the mail parcel:
def _compareAddr(self, other):
return cmp(self.emailAddress.lower(),
other.emailAddress.lower())
and the index creation call after the code creating the
collection:
emailAddressCollection.rep.addIndex('emailAddress', 'compare',
compare='_compareAddr')
which creates a 'compare' index called 'emailAddress', an index
calling a method called '_compareAddr' on the items it is
comparing.
* And finally to use this collection and the index to find
items I added
a findEmailAddress() class method to the EmailAddress class:
@classmethod
def findEmailAddress(cls, view, emailAddress):
collection = schema.ns("osaf.app",
view).emailAddressCollection.rep
emailAddress = emailAddress.lower()
def compareAddr(key):
return cmp(emailAddress, view
[key].emailAddress.lower())
uuid = collection.findInIndex('emailAddress',
'exact', compareAddr)
if uuid is None:
return None
return view[uuid]
The findInIndex() call looks for an exact match as per the
compareAddr
local function which compares lowercase versions of internet
address
strings.
The technique above replaces the linear search with a binary search
yielding
a very noticeable performance improvement in the overall importing
of email
messages into the repository.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Open Source Applications Foundation "Dev" mailing list
http://lists.osafoundation.org/mailman/listinfo/dev
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Open Source Applications Foundation "Dev" mailing list
http://lists.osafoundation.org/mailman/listinfo/dev