Eric Ridge <[EMAIL PROTECTED]> writes:Xapian has it's own storage subsystem, and that's what I'm using to store the index... not using anything internal to postgres (although this could change).
I would say you have absolutely zero chance of making it work that way.
I still think this is one of the best quotes I've heard in awhile. :)
It might be worth pointing out here than an index AM is not bound to use
exactly the typical Postgres page layout.
Thanks again for this little bit of info. It was just enough to get me thinking about how to make "it work".
Xapian is basically a big btree index, except it's 5 btree indexes. One for terms, one for posts (terms with positions), one for positions, one for arbitrary document values, and one for the documents themselves. Each index is made up of 3 physical files on disk. All told there's 17 files for a single Xapian index (15 db files, a versioninfo file, and a lock file).
I couldn't think of a way to create a whole new database type for Xapian that could deal with managing 5 btree indexes inside of Postgres (other than using tables w/ standard postgres btree index on certain fields), so instead, I dug into Xapian and abstracted out it's filesystem i/o (open, read, write, etc).
(as an aside, I did spend some time pondering ways to adapt Postgres' nbtree AM to handle this, but I just don't understand how it works)
Once I had Xapian's filesystem i/o encapsulated into a nice little C++ class, I embarked on creating a mini "filesystem" ontop of Postgres' storage subsystem. In essence, I've now got a Postgres access method that mirrors the basics of a filesystem, from creating/open files to reading from and writing to them, in addition to truncation and deletion.
After that, it was just a matter of the glue code to teach Xapian to use this "filesystem" for all its filesystem i/o, and voila!, Xapian works ontop of Postgres' storage subsystem and I didn't have to rewrite Xapian from scratch. And surprisingly, despite the additional overhead of this filesystem abstraction layer, it's still very fast... esp. once Buffers get cached.
I've still got more work to do (like dealing with locking and general concurrency issues, not to mention bugs I haven't found yet), but it's working *really* well in a single-user environment.
So here's the important question: How stupid is this?
I've done some benchmarking against tsearch2. Attached are the queries and execution times on my dual 2gig G5 w/ 2gig ram.
The table contains 51,160 records. It's every text file contained on my computer (which includes multiple copies of all my java projects). All told, it's 337,343,569 bytes of data, with an average file size of 6,594 bytes. The Xapian operator is "=>", and tsearch2's operator is "@@". I ran each query 6 times, and just took the best execution time.
It's also worth noting that my document parser is much different than tsearch2's. I'm splitting words on non-alphanumerics (and currently am not using stopwords), and it seems that tsearch2 tries to do something more intelligent, so the # of results returned vary widely between tsearch2 and Xapian. I'm not offering an opinion on which way is "better".
I've got a few more questions about transactions, locking, and a few other things, but I just thought I'd throw this out as a status report and to see if there's any kind of reaction.
thanks for your time.
eric
query | # rows | execution time | -------------------------------------------------------+------------+-----------------+ contents_fti @@ 'java'; | 1032 | 487.542 ms | contents_fti @@ 'java & util & hashmap'; | 15 | 125.572 ms | contents_fti @@ 'java.util.hashmap'; | 865 | 773.792 ms | contents_fti @@ 'java.util.hashmap & java.util.list' | 113 | 293.114 ms | contents_fti @@ 'java.sql.connection' | 547 | 398.399 ms | contents_fti @@ 'socket & server &! http' | 718 | 364.860 ms | contents_fti @@ 'md5util' | 15 | 401.552 ms | contents_fti @@ 'intellij & idea & license' | 0 | 119.122 ms |
contents => 'java' | 33548 | 1665.731 ms | contents => 'java AND util AND hashmap' | 1977 | 129.529 ms | contents => '"java util hashmap"' | 870 | 279.489 ms | contents => '"java util hashmap" AND "java util list"' | 113 | 344.612 ms | contents => '"java sql connection"' | 574 | 219.617 ms | contents => 'socket AND server -http' | 864 | 73.478 ms | contents => 'md5util' | 159 | 27.023 ms | contents => 'intellij AND idea AND license' | 15 | 8.591 ms |
---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html