Re: [elephant-devel] db-postmodern: performance considerations

Ian Eslick Sat, 08 Dec 2007 05:53:56 -0800

Hi Alex & all,

1) In general we seem to have data stores that have different sweet spots and different optimal uses of an API. We should probably come up with a matrix of features and suggested uses for each of them. This includes which types are in lisp-sort-order and which are not for each data store.


We should also have a spec of the common/safe features.

2) In retrospect it was a mistake to expose the cursor API to the users, but Elephant started as a low-level interface to BDB with some minimal support for persistent objects and has grown from there, so at the time it made sense. I recommend we consider deprecating the cursor API for user use, find the use cases that people care about, and implement a higher-level API for them (abstract data structures, a query language, etc.). Then we drop support for the cursor API (but document what works for those who want to shoot themselves in the foot).

The map API is intended to provide a slightly higher-level interface to cursors, so that you can imagine a traversal of a large set in which only the current element 'must' be in memory, and you can reasonably expect GCs to occur during a single map operation.

What prospects are there for having a reasonable map implementation in postmodern?

3) get-instances-by-range should be implementable for integers and strings on postmodern, but should flag a datastore-specific error if a different type is used. Then you should be able to do a SQL query that returns all instances > some value and < another value. That shouldn't be hard to implement.

4) btrees are lightweight in BDB, but heavy in SQL - sounds like psets need a postmodern-specific implementation. You need a table that implements a many-to-many relation between pset ids and object ids.
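
A minimal sketch of the many-to-many table suggested in point 4, written in Python with an in-memory SQLite database standing in for PostgreSQL. The table name `pset_members`, its columns, and the helper functions are hypothetical; nothing here is taken from db-postmodern itself.

```python
import sqlite3

# Hypothetical schema: one shared table for all psets instead of one
# btree table per pset. SQLite is only a stand-in for PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pset_members ("
    " pset_id INTEGER NOT NULL,"
    " object_id INTEGER NOT NULL,"
    " PRIMARY KEY (pset_id, object_id))"
)

def pset_insert(pset_id, object_id):
    # INSERT OR IGNORE is SQLite syntax; it keeps set semantics by
    # silently dropping a (pset_id, object_id) pair that already exists.
    conn.execute(
        "INSERT OR IGNORE INTO pset_members VALUES (?, ?)",
        (pset_id, object_id),
    )

def pset_remove(pset_id, object_id):
    conn.execute(
        "DELETE FROM pset_members WHERE pset_id = ? AND object_id = ?",
        (pset_id, object_id),
    )

def pset_list(pset_id):
    # Return the object ids that belong to one pset, in id order.
    rows = conn.execute(
        "SELECT object_id FROM pset_members"
        " WHERE pset_id = ? ORDER BY object_id",
        (pset_id,),
    )
    return [row[0] for row in rows]

pset_insert(1, 100)
pset_insert(1, 101)
pset_insert(1, 100)  # duplicate, ignored
pset_insert(2, 100)  # the same object may belong to several psets
pset_remove(1, 101)
```

With a layout like this, creating another pset costs only rows in an existing table rather than a new SQL table, which is the property point 4 is after.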

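For point 3, the range lookup could look roughly like the following Python sketch, again with in-memory SQLite standing in for PostgreSQL. The `slot_index` table, the `instances_by_range` function, and the `UnsupportedRangeKeyError` class are assumptions made for illustration, not Elephant or db-postmodern names. The bounds are strict, as described above.

```python
import sqlite3

class UnsupportedRangeKeyError(TypeError):
    """Hypothetical data-store-specific error for unsupported key types."""

conn = sqlite3.connect(":memory:")
# Hypothetical index table mapping an indexed slot value to an object id.
conn.execute("CREATE TABLE slot_index (key_value, object_id INTEGER)")
conn.execute("CREATE INDEX slot_index_key ON slot_index (key_value)")
conn.executemany(
    "INSERT INTO slot_index VALUES (?, ?)",
    [(5, 1), (10, 2), (15, 3), (20, 4), (25, 5)],
)

def instances_by_range(low, high):
    for bound in (low, high):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(bound, bool) or not isinstance(bound, (int, str)):
            raise UnsupportedRangeKeyError(
                "range keys must be integers or strings, got %r" % (bound,)
            )
    if type(low) is not type(high):
        raise UnsupportedRangeKeyError("both bounds must have the same type")
    # Let the database do the filtering and ordering in a single query.
    rows = conn.execute(
        "SELECT object_id FROM slot_index"
        " WHERE key_value > ? AND key_value < ?"
        " ORDER BY key_value",
        (low, high),
    )
    return [row[0] for row in rows]

in_range = instances_by_range(5, 25)

try:
    instances_by_range(1.5, 9.5)
    float_rejected = False
except UnsupportedRangeKeyError:
    float_rejected = True
```

The type check happens before any SQL is issued, so an unsupported key fails with a store-specific error instead of returning rows in a surprising order.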

Thanks for all the hard work Alex!

Cheers,
Ian

PS - Has anyone validated my new default map-index under postmodern, or is that moot now that you have a specialized version?



On Dec 8, 2007, at 6:08 AM, Alex Mizrahi wrote:

hello
unfortunately the rewrite of the pm-cursor implementation is postponed for now; it turned out to be more complex than it seemed originally. and i do not have test cases anyway..
however i've made a patch that improves performance of
get-instances-by-value (i've sent it to Henrik so it should be available
when he synchronizes with upstream).
so i'll describe profiling results here -- what's fast and what's not with
the postmodern backend.

1. individual operations in a btree -- setting and getting a value -- are
converted directly into SQL queries, and so they are quite fast (as fast as
a query on an indexed SQL table can be). however, there's considerable
client/server communication overhead (system calls, TCP/IP stack..), so
making tons of individual sets/gets is not very fast.
however, it's possible to cache get operations on the client with the
postmodern backend.
two caching options are available -- per-transaction-cache caches only
inside one transaction.
global-sync-cache caches data between transactions, synchronizing it with
other working instances -- it tracks changes and invalidates cache entries
accordingly. synchronization brings some overhead, and thus it's good only
for certain types of workload (mostly-read).
it's possible to implement a global-cache for single-instance mode that will
do no synchronization, but this mode is not implemented because
db-postmodern focuses on safety of use with multiple instances working with
the database simultaneously.

get-instance-by-value, after the patches suggested by Alain, uses btree
get-value, and thus is cached too.

there's a gotcha when using caches with large data sets -- cache entries
don't get garbage collected, so it's possible to run out of memory. it's
possible to use a weak hash table in this case (see make-backend-cache in
pm-cache, no configuration option), but i don't know how good caching will
be with it. a patch for some smarter solution is welcome.
2. get-instances-by-value, with my specialized map-index implementation,
uses a SQL query to retrieve data directly, so it should be pretty fast.
however it doesn't use caching.
it doesn't use a cursor, so all instances are always returned; it may be a
problem if there are a lot of instances for some key.

3. cursors: in general, they do not work very well for now.
first of all, they do not always return values in the order specified by the
"lisp sorter". moreover, if you do not have keys of the same type (either all
integers or all strings) you get a random order. (with integers and strings
you get order according to SQL comparison rules). we are not going to fix
this issue (besides, maybe, allowing NILs to be mixed with integers or
strings safely), since we believe it's more important to support good
performance in practical cases (having all keys of the same type). we could
implement some emulation mode that will retrieve all data and sort it on the
lisp side, but probably it's better to use the CLSQL backend if "lisp sort"
ordering is a requirement of an application.
then, probably the only thing that cursors are good at is iterating the
whole sequence from start to end.
iterating some limited set of values from the start *almost* works fine. to
do it efficiently on a large table it's required to configure PostgreSQL --
disable hashjoin and mergejoin, otherwise it thinks that it's faster to
process the whole table rather than doing it incrementally. processing a
table with 10000 items costs 60-70 milliseconds. (additionally i suspect some
small patch to db-postmodern is required to build correct indices for
cursors to work incrementally).
iterating from the end doesn't work well -- it scans through the whole table
(probably even twice) due to a bug in the cursor implementation. this should
be more-or-less easily fixable, though.
cursor-set performance depends on how far the key you search for is from the
start of the table. (!). the counting happens on the postgresql side, though,
so for a 10000-item table it should take about 70 msecs in the worst case.
multiple cursor-set calls will all be equally slow.
i have some ideas how to fix the cursor implementation so it will not depend
on the size of the table (to the extent PostgreSQL does not depend, of
course), but i don't know when i'll have a chance to implement it. as i've
mentioned, i currently do not even have a practical test case where cursor
iteration is used.
4. get-instances-by-range uses cursors, and so it inherits all their problems.
5. psets: the default implementation of psets makes an instance of btree for
each of them. in db-postmodern a btree has its own SQL table, so it's not a
good idea to have thousands of psets. although they might be quite useful to
have in large numbers, so probably we'll invent something better for them in
the future.
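
To illustrate the per-transaction caching idea from point 1, here is a minimal Python sketch. `CountingStore` and `Transaction` are hypothetical classes invented for this example; they only model the idea that repeated gets inside one transaction hit the server once, and they are not the db-postmodern implementation.

```python
class CountingStore:
    """Stand-in for a remote btree; counts simulated server round trips."""

    def __init__(self, data):
        self._data = dict(data)
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1
        return self._data.get(key)

    def put(self, key, value):
        self.round_trips += 1
        self._data[key] = value


class Transaction:
    """Caches reads for the lifetime of one transaction only."""

    def __init__(self, store):
        self._store = store
        self._cache = {}

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Nothing survives the transaction, so no cross-instance
        # synchronization is needed for this cache.
        self._cache.clear()
        return False

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._store.get(key)
        return self._cache[key]

    def put(self, key, value):
        # Write through to the store and keep the cache consistent.
        self._store.put(key, value)
        self._cache[key] = value


store = CountingStore({"a": 1})

with Transaction(store) as txn:
    first = txn.get("a")   # goes to the store
    second = txn.get("a")  # served from the cache
trips_after_first_txn = store.round_trips

with Transaction(store) as txn:
    third = txn.get("a")   # new transaction, cache starts empty
trips_after_second_txn = store.round_trips
```

A cache that outlives the transaction would save the second round trip as well, but then it has to learn about writes made by other instances, which is the synchronization overhead described for global-sync-cache.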

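The weak hash table mentioned in point 1 can be illustrated in Python with `weakref.WeakValueDictionary`. This is only an analogy: the message does not say whether the Lisp table would be weak on keys or on values, and the `Instance` class is hypothetical.

```python
import gc
import weakref


class Instance:
    """Hypothetical persistent object; only its id matters here."""

    def __init__(self, oid):
        self.oid = oid


# Values are held weakly: an entry disappears once nothing else in the
# program references the cached object.
cache = weakref.WeakValueDictionary()

kept = Instance(1)
temp = Instance(2)
cache[1] = kept
cache[2] = temp

del temp      # drop the only strong reference to instance 2
gc.collect()  # make collection explicit rather than relying on timing

remaining_keys = sorted(cache.keys())
```

The trade-off raised in the message is visible here: memory stays bounded, but an object that the application is not holding on to is evicted and has to be fetched again on the next get.
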
with best regards, Alex 'killer_storm' Mizrahi.



_______________________________________________
elephant-devel site list
elephant-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/elephant-devel

