Otis, thanks for the pointer. I think the question can be rephrased as: how can TermEnum or TermInfos be accessed during indexing?
If this is possible, things would be easier.

--
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 Million Euro funding!

On Mon, Dec 29, 2008 at 10:41 AM, Otis Gospodnetic <[email protected]> wrote:

> Chris,
>
> Mark Miller & Co. are working on (Near) Duplicate Detection. I think the
> work is in Solr's JIRA, but some of it might be applicable to Lucene.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> > From: Chris Lu <[email protected]>
> > To: "[email protected]" <[email protected]>
> > Sent: Monday, December 29, 2008 4:55:14 AM
> > Subject: duplication checking while indexing
> >
> > I am wondering whether there is an easy way to avoid duplication while
> > indexing, just using the index being created, without creating other data
> > structures.
> > In some cases, the incoming document list can have duplicates. For example,
> > when creating spell checking indexes for phrases. Each phrase is one
> > document. So I want to check whether the phrase is already indexed or not.
> >
> > To do so, I can either create a hash map for all the indexed phrases. But
> > the hash map would consume a lot of memory.
> > A possible alternative is to search existing index. But remember the index
> > is being created, and not all contents are flushed to disk yet.
> >
> > Is it possible to query the not-yet-closed index?
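
For reference, the in-memory approach from the original question (remembering every indexed phrase) could look roughly like the sketch below. It uses a HashSet rather than a hash map, since only membership matters. The class name PhraseDeduplicator, the method addIfAbsent, and the Consumer callback are all hypothetical; the callback merely stands in for whatever code builds the document and hands it to the index writer. This is only an illustration of the baseline, and it still pays the memory cost mentioned in the question because every distinct phrase stays in the set.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical helper: forwards each distinct phrase to an indexing callback once. */
final class PhraseDeduplicator {
    // Every phrase already handed to the indexer; this set is the memory cost in question.
    private final Set<String> seen = new HashSet<>();
    // Assumed stand-in for the real indexing call (building a document and adding it).
    private final Consumer<String> indexer;

    PhraseDeduplicator(Consumer<String> indexer) {
        this.indexer = indexer;
    }

    /** Indexes the phrase if it is new; returns false when it was a duplicate. */
    boolean addIfAbsent(String phrase) {
        // Set.add returns false when the element is already present.
        if (!seen.add(phrase)) {
            return false;
        }
        indexer.accept(phrase);
        return true;
    }

    public static void main(String[] args) {
        // Collect "indexed" phrases in a list instead of a real index.
        List<String> indexed = new ArrayList<>();
        PhraseDeduplicator dedup = new PhraseDeduplicator(indexed::add);
        for (String phrase : new String[] {"new york", "san jose", "new york"}) {
            dedup.addIfAbsent(phrase);
        }
        // The repeated phrase is passed to the callback only once.
        System.out.println(indexed);
    }
}
```

The duplicate check happens before the callback runs, so the index itself never needs to be queried while it is still being written.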
