On Fri, Feb 06, 2004 at 11:22:07AM -0500, Philippe Laflamme wrote:
> > A connection per file sounds very heavyweight.
>
> Indeed it is. Using Postgres' LargeObjects to represent a file has its
> limitations: every time Lucene requires a stream on a file, a connection
> is required (and cannot be shared). Implementing it this way was quick,
> but not at all optimal.

But large objects have much better read/write performance than using
regular text fields.
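They do come with the connection-affinity problem you describe, though.
For anyone following along, the large object API looks roughly like the
sketch below. This is untested and from memory, so check the pgsql JDBC
driver docs; the oid value and the connection URL are made up:

import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.largeobject.LargeObject;
import org.postgresql.largeobject.LargeObjectManager;

public class LargeObjectReadDemo {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
            "jdbc:postgresql:test", "user", "pw"); // made-up URL/creds
        // Large objects must be accessed inside a transaction.
        conn.setAutoCommit(false);

        long oid = 12345; // hypothetical oid of a stored index file

        LargeObjectManager lom =
            ((PGConnection) conn).getLargeObjectAPI();
        // The handle is bound to this connection's transaction, so every
        // concurrently open "file" needs its own connection.
        LargeObject obj = lom.open(oid, LargeObjectManager.READ);
        byte[] buf = new byte[1024];
        int n = obj.read(buf, 0, buf.length);
        obj.close();
        conn.commit();
        System.out.println("read " + n + " bytes");
    }
}

Every LargeObject handle lives inside that one connection's transaction,
which is why you end up with a connection per open file.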
> Your suggestion is quite interesting. It would not require the use of
> Blobs, which are not very portable. It could be implemented using
> standard SQL types and would make an elegant SQLDirectory (and not an
> RDBMS-specific Directory).

I suspect you're going to get lousy performance compared to using regular
files. Why is it that you want to save the index files in a database?
It's not as though you'll gain any additional metadata or functionality.
The only advantage that I can think of is that you get control of
read/write locking across machines. In other words, you can have one
machine doing the writing and one or more machines doing the
reading/searching.

> Has anyone tried something like this? I could modify the PgSQLDirectory
> I've created, but hearing about previous experiences would be very
> helpful...
>
> Regards,
> Phil
>
> > -----Original Message-----
> > From: Doug Cutting [mailto:[EMAIL PROTECTED]
> > Sent: February 5, 2004 16:51
> > To: Lucene Users List
> > Subject: Re: SQLDirectory
> >
> > Philippe Laflamme wrote:
> > > I've worked on an implementation for Postgres. I used the Large
> > > Object API provided by the Postgres JDBC driver. It works fine, but
> > > I doubt it is very scalable because the number of open connections
> > > during indexing can become very high.
> > >
> > > Lucene opens many different files when writing to an index. This
> > > results in opening one PG connection per open file. While working
> > > on a small index (30,000 files), I saw the number of open
> > > connections become quite high (approx. 150). If you don't have a
> > > lot of RAM, this is problematic.
> >
> > A connection per file sounds very heavyweight.
> >
> > The way I would try to implement Directory with SQL is to have a
> > single table of buffers per index, e.g., with columns ID,
> > BLOCK_NUMBER and DATA. The contents of a file are the appended DATA
> > columns with the same ID, ordered by the BLOCK_NUMBER field. This
> > would be indexed by ID and BLOCK_NUMBER, together forming a unique
> > key.
> >
> > The BLOCK_NUMBER field indicates which part of the file the row
> > concerns. Thus the DATA of BLOCK_NUMBER=0 might hold the first 1024
> > bytes, the DATA of BLOCK_NUMBER=1 might hold the next 1024 bytes,
> > and so on. This would permit efficient random access.
> >
> > You'll need another table with NAME, ID, and MODIFIED_DATE, with a
> > single entry per file. The length of a file can be computed with a
> > query that finds the length of DATA in the last BLOCK_NUMBER for a
> > given ID.
> >
> > I would initially cache a single connection to the database and
> > serialize requests over it. A pool of connections might be more
> > efficient when multiple threads are searching, but I would benchmark
> > that before investing much in such an implementation.
> >
> > Has anyone yet implemented an SQL Directory this way?
> >
> > Doug
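For what it's worth, Doug's layout is easy to sketch. Below is a rough,
untested cut at it in Java/JDBC. The table and column names, the
1024-byte block size, and the Postgres-isms (BYTEA, octet_length,
LIMIT) are my own guesses, not anything that exists in Lucene or that
Doug specified beyond the column list:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class SqlDirectorySketch {

    // Block size is arbitrary; Doug's example uses 1024 bytes.
    static final int BLOCK_SIZE = 1024;

    // Schema, per Doug's description (names are mine):
    //
    //   CREATE TABLE blocks (
    //     id           INTEGER NOT NULL,
    //     block_number INTEGER NOT NULL,
    //     data         BYTEA,
    //     PRIMARY KEY (id, block_number)
    //   );
    //
    //   CREATE TABLE files (
    //     name          VARCHAR(255) PRIMARY KEY,
    //     id            INTEGER NOT NULL,
    //     modified_date TIMESTAMP
    //   );

    private final Connection conn;  // the single cached connection

    public SqlDirectorySketch(Connection conn) {
        this.conn = conn;
    }

    // Random access: the byte at `pos` lives in block pos / BLOCK_SIZE,
    // so a seek only ever costs one indexed lookup. The methods are
    // synchronized to serialize all requests over the one connection,
    // as Doug suggests.
    public synchronized byte[] readBlock(int fileId, long pos)
            throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
            "SELECT data FROM blocks WHERE id = ? AND block_number = ?");
        try {
            ps.setInt(1, fileId);
            ps.setInt(2, (int) (pos / BLOCK_SIZE));
            ResultSet rs = ps.executeQuery();
            return rs.next() ? rs.getBytes(1) : null;
        } finally {
            ps.close();
        }
    }

    // File length = full blocks * BLOCK_SIZE + size of the last block's
    // DATA, i.e. the query Doug describes against the highest
    // BLOCK_NUMBER for an ID.
    public synchronized long fileLength(int fileId) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
            "SELECT block_number, octet_length(data) FROM blocks " +
            "WHERE id = ? ORDER BY block_number DESC LIMIT 1");
        try {
            ps.setInt(1, fileId);
            ResultSet rs = ps.executeQuery();
            if (!rs.next()) {
                return 0;
            }
            return (long) rs.getInt(1) * BLOCK_SIZE + rs.getInt(2);
        } finally {
            ps.close();
        }
    }
}

You'd still have to hide something like this behind Lucene's Directory
plumbing, and my point above stands: I'd benchmark it against a plain
FSDirectory early on, because I'd expect the extra round trips to hurt.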
--
Dror Matalon
Zapatec Inc
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]