I should probably direct this to Doug Cutting, but following that thread I come to Doug's post at http://www.mail-archive.com/[email protected]/msg12709.html .
Doug says:
> 1. On the index master, periodically checkpoint the index. Every minute or
> so the IndexWriter is closed and a 'cp -lr index index.DATE' command is
> executed from Java, where DATE is the current date and time. This
> efficiently makes a copy of the index when it's in a consistent state by
> constructing a tree of hard links. If Lucene re-writes any files (e.g., the
> segments file) a new inode is created and the copy is unchanged.
How can that be so? When the segments file is re-written, surely it will
clobber the copy rather than creating a new inode, because it has the same
name... wouldn't it?
What makes it different from (say)...
mkdir x
echo original > x/x.txt
cp -lr x x.copy
echo update > x/x.txt
diff x/x.txt x.copy/x.txt
...where x.copy/x.txt has "update" rather than "original" (certainly on
Linux).
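One explanation (assuming Lucene replaces the segments file by writing a new file and then renaming it over the old name, rather than truncating it in place — I believe that is what it does, but I haven't verified it) is that rename swaps the directory entry, not the inode. A small shell sketch of the difference:

```shell
mkdir y
echo original > y/y.txt
cp -lr y y.copy            # y.copy/y.txt is a hard link to the same inode
# Overwriting in place (echo update > y/y.txt) would change both links,
# because both directory entries point at the same inode.
# Write-then-rename instead points y/y.txt at a brand-new inode,
# while y.copy/y.txt still references the old one:
echo update > y/y.txt.new
mv y/y.txt.new y/y.txt
cat y/y.txt                # update
cat y.copy/y.txt           # original
```

So the copy is only safe against files that are replaced by rename; files modified in place would indeed be clobbered, as in the `echo update > x/x.txt` example above.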
-----Original Message-----
From: James Pine [mailto:[EMAIL PROTECTED]]
Sent: 06 July 2006 20:09
To: [email protected]
Subject: RE: Managing a large archival (and constantly changing) database
Hey,
I found this thread to be very useful when deciding upon an indexing
strategy.
http://www.mail-archive.com/[email protected]/msg12700.html
The system I work on has 3 million or so documents, and it was (until a
non-Lucene performance issue came up) set up to add/delete documents every
15 minutes in a manner similar to that described in the thread. We were
adding/deleting a few thousand documents every 15 minutes during peak
traffic. We have a dedicated indexing machine and distribute portions of our
index across multiple machines, but you could still follow the same pattern
all on one box, just with separate processes/threads.
Even though Lucene allows certain types of index operations to happen
concurrently with search activity, IMHO, if you can decouple the indexing
process from the searching process, your system as a whole will be more
flexible and scalable, with only a little extra maintenance overhead.
JAMES
--- Larry Ogrodnek <[EMAIL PROTECTED]> wrote:
> We have a similar setup, although probably only 1/5th the number of
> documents and updates. I'd suggest just making periodic index
> backups.
>
> I've been storing my index as follows:
>
> <workdir>/<index-name>/data/ (lucene index
> directory)
> <workdir>/<index-name>/backups/
>
> The "data" is what's passed into
> IndexWriter/IndexReader. Additionally, I create/update a .last_update
> file, which just contains the timestamp of when the last update was
> started, so when the app starts up it only needs to retrieve updates
> from the db since then.
>
> Periodically the app copies the contents of data into a new directory
> in backups named by the date/time, e.g.
> backups/2007-07-04.110051. If
> needed, I can delete data and replace the contents with the latest
> backup, and the app will only retrieve records updated since the
> backup was made (using the backup's .last_update)...
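The layout and backup/restore cycle described above could be sketched roughly as follows (the work directory, index name, and file contents are hypothetical placeholders, not the poster's actual setup):

```shell
# Hypothetical sketch of the timestamped-backup scheme described above.
WORKDIR=./workdir
NAME=articles
mkdir -p "$WORKDIR/$NAME/data" "$WORKDIR/$NAME/backups"
echo "some index file" > "$WORKDIR/$NAME/data/seg0"   # stand-in for Lucene files

# Periodic backup: copy data into a directory named by date/time,
# e.g. backups/2007-07-04.110051.
STAMP=$(date +%Y-%m-%d.%H%M%S)
cp -r "$WORKDIR/$NAME/data" "$WORKDIR/$NAME/backups/$STAMP"

# Restore: replace data with the most recent backup; the app then
# re-fetches from the db anything newer than the backup's .last_update.
LATEST=$(ls "$WORKDIR/$NAME/backups" | sort | tail -n 1)
rm -rf "$WORKDIR/$NAME/data"
cp -r "$WORKDIR/$NAME/backups/$LATEST" "$WORKDIR/$NAME/data"
```

Note that copying a live Lucene directory is only safe while no writer is modifying it, which is presumably why the copy is taken between update passes.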
>
> I'd recommend making complete index creation from scratch a normal
> operation as much as possible (though you're right, for that number of
> documents it will take a while). It's been really helpful here when
> doing additional deploys for testing, or when deciding we want to index
> things differently, etc...
>
> -larry
>
>
> -----Original Message-----
> From: Scott Smith [mailto:[EMAIL PROTECTED]
>
> Sent: Thursday, July 06, 2006 1:48 PM
> To: [email protected]
> Subject: Managing a large archival (and constantly
> changing) database
>
> I've been asked to do a project which provides full-text search for a
> large database of articles. The expectation is that most of the
> articles are fairly small (<2k bytes). There will be an initial
> population of around 400,000 articles, with approximately 2,000 new
> articles added each day. They need to be added in "real time" (within
> a few minutes of arrival), but will be spread out during the day. So,
> roughly another 700,000 articles each year.
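(For what it's worth, that yearly figure checks out:)

```shell
# 2,000 new articles/day over a year
echo $((2000 * 365))   # 730000, i.e. roughly 700,000 per year
```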
>
>
>
> I've read enough to believe that having a Lucene database of several
> million articles is doable, and adding 2,000 articles per day wouldn't
> seem to be that many. My concern is the real-time nature of the
> application. I'm a bit nervous (perhaps without justification) at
> simply growing one monolithic Lucene database. Should there be a
> crash, the database will be unusable and I'll have to rebuild from
> scratch (which, based on my experience, would take hours).
>
>
>
> Some of my thoughts were:
>
> 1) Have monthly databases and use MultiSearcher to search across
> them. That way my exposure to a corrupted database is limited to this
> month's database. This would also seem to give me somewhat better
> control: if a search generated lots of hits, I could display the
> results a month at a time and not bury the user in output. It would
> also spread the search CPU load out better and not prevent other
> individuals from doing a search. If there were very few results, I
> could sleep between each month's search and, again, not lock everyone
> else out from searches.
>
> 2) Have a "this month" searchable and an "everything else"
> searchable. At the beginning of each month, I would consolidate the
> previous month's database into the "everything else" searchable. This
> would give more consistent results for relevancy-ranked searches,
> but it means that a bad search could return lots of results.
>
>
>
> Has anyone else dealt with a similar problem? Am I expecting too much
> from Lucene running on a single machine (or should I be looking at
> Hadoop)? Any comments or links to previous discussions on this topic
> would be appreciated.
>
>
>
> Scott
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
