Philippe Laflamme wrote:
I've worked on an implementation for Postgres. I used the Large Object API
provided by the Postgres JDBC driver. It works fine but I doubt it is very
scalable because the number of open connections during indexing can become
very high.

Lucene opens many different files when writing to an index. This results in
opening one PG connection per open file. While working on a small index
(30,000 files), I saw the number of open connections become quite high
(approx. 150). If you don't have a lot of RAM, this is problematic.

A connection per file sounds very heavyweight.


The way I would try to implement Directory with SQL is to have a single
table of buffers per index, e.g., with columns ID, BLOCK_NUMBER, and
DATA.  The contents of a file are the DATA values with the same ID,
concatenated in BLOCK_NUMBER order.  The table would be indexed by ID
and BLOCK_NUMBER, which together form a unique key.
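For Postgres, a minimal sketch of creating that table through JDBC might
look like the following; all table and column names here are hypothetical,
and BYTEA is Postgres-specific:

    import java.sql.Connection;
    import java.sql.Statement;

    // Sketch only: the per-index buffers table described above.
    // The (id, block_number) primary key gives the unique index.
    class SqlDirectorySchema {
      static void createBuffersTable(Connection conn) throws Exception {
        try (Statement st = conn.createStatement()) {
          st.executeUpdate(
              "CREATE TABLE buffers ("
            + " id BIGINT NOT NULL,"
            + " block_number INT NOT NULL,"
            + " data BYTEA NOT NULL,"
            + " PRIMARY KEY (id, block_number))");
        }
      }
    }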

The BLOCK_NUMBER field indicates which part of the file the row
concerns.  Thus the DATA of BLOCK_NUMBER=0 might hold the first 1024
bytes, the DATA of BLOCK_NUMBER=1 might hold the next 1024 bytes, and so
on.  This would permit efficient random access.
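To illustrate the random access this buys, here is a hedged sketch of
fetching the block containing a given file position, assuming the
hypothetical buffers table above and a fixed 1024-byte block size:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // Sketch: a seek to any position costs one indexed single-row lookup.
    class BlockReader {
      static final int BLOCK_SIZE = 1024;  // assumed, not prescribed

      static byte[] readBlock(Connection conn, long fileId, long position)
          throws Exception {
        int blockNumber = (int) (position / BLOCK_SIZE);
        try (PreparedStatement ps = conn.prepareStatement(
            "SELECT data FROM buffers WHERE id = ? AND block_number = ?")) {
          ps.setLong(1, fileId);
          ps.setInt(2, blockNumber);
          try (ResultSet rs = ps.executeQuery()) {
            return rs.next() ? rs.getBytes(1) : null;
          }
        }
      }
    }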

You'll need another table with NAME, ID, and MODIFIED_DATE columns, with
a single entry per file.  The length of a file can then be computed with
a query that finds the row with the highest BLOCK_NUMBER for an ID: the
length is that BLOCK_NUMBER times the block size, plus the length of the
last row's DATA (which may be a partial block).
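A hedged sketch of that computation, again with hypothetical names and
the 1024-byte block size assumed above:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // Sketch: length = (last block's offset) + (bytes in the last block).
    // Postgres's length() on a BYTEA column returns its size in bytes.
    class FileLength {
      static long length(Connection conn, long fileId) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
            "SELECT block_number, length(data) FROM buffers"
          + " WHERE id = ? ORDER BY block_number DESC LIMIT 1")) {
          ps.setLong(1, fileId);
          try (ResultSet rs = ps.executeQuery()) {
            if (!rs.next()) return 0L;  // no blocks: empty file
            return (long) rs.getInt(1) * 1024 + rs.getInt(2);
          }
        }
      }
    }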

I would initially cache a single connection to the database and
serialize requests over it.  A pool of connections might be more
efficient when multiple threads are searching, but I would benchmark
that before investing much in such an implementation.
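A minimal sketch of that single cached connection, with requests
serialized by synchronizing on it (the locking scheme here is my
assumption, not a requirement):

    import java.sql.Connection;

    // Sketch: every Directory operation funnels its JDBC work through
    // one shared connection, executing one request at a time.
    class SharedConnection {
      private final Connection conn;

      SharedConnection(Connection conn) { this.conn = conn; }

      <T> T withConnection(SqlWork<T> work) throws Exception {
        synchronized (conn) {
          return work.run(conn);
        }
      }

      interface SqlWork<T> {
        T run(Connection conn) throws Exception;
      }
    }

If benchmarks later favor a pool, only this class needs to change: hand
out a pooled connection inside withConnection instead of locking one.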

Has anyone yet implemented an SQL Directory this way?

Doug


