I use JDBM store document's key ID. 2008/12/30 Chris Lu <chris...@gmail.com>
> Otis, thanks for the pointer. > I think the question can be: > > How to access TermEnum or TermInfos during indexing. > > If this is possible, things would be easier. > > -- > Chris Lu > ------------------------- > Instant Scalable Full-Text Search On Any Database/Application > site: http://www.dbsight.net > demo: http://search.dbsight.com > Lucene Database Search in 3 minutes: > > http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes > DBSight customer, a shopping comparison site, (anonymous per request) got > 2.6 Million Euro funding! > > On Mon, Dec 29, 2008 at 10:41 AM, Otis Gospodnetic < > otis_gospodne...@yahoo.com> wrote: > > > Chris, > > > > Mark Miller & Co. are working on (Near) Duplicate Detection. I think the > > work is in Solr's JIRA, but some of it might be applicable to Lucene. > > > > Otis > > -- > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > > > > > > ----- Original Message ---- > > > From: Chris Lu <chris...@gmail.com> > > > To: "java-user@lucene.apache.org" <java-user@lucene.apache.org> > > > Sent: Monday, December 29, 2008 4:55:14 AM > > > Subject: duplication checking while indexing > > > > > > I am wondering whether there is an easy way to avoid duplication while > > > indexing, just using the index being created, without creating other > data > > > structures. > > > In some cases, the incoming document list can have duplicates. For > > example, > > > when creating spell checking indexes for phrases. Each phrase is one > > > document. So I want to check whether the phrase is already indexed or > > not. > > > > > > To do so, I can either create a hash map for all the indexed phrases. > But > > > the hash map would consume a lot of memory. > > > A possible alternative is to search existing index. But remember the > > index > > > is being created, and not all contents are flushed to disk yet. > > > > > > Is it possible to query the not-yet-closed index? > > > > > > -- > > > Chris Lu > > > ------------------------- > > > Instant Scalable Full-Text Search On Any Database/Application > > > site: http://www.dbsight.net > > > demo: http://search.dbsight.com > > > Lucene Database Search in 3 minutes: > > > > > > http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes > > > DBSight customer, a shopping comparison site, (anonymous per request) > got > > > 2.6 Million Euro funding! > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > -- 刘平华 广州从兴电子开发有限公司 :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: *:ialy2005 at gmail.com +:广州市广州大道南368号12F 510300 :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: 网站:http://www.searchfull.net Blog:http://www.searchfull.net/blog/