I found that the data I'm searching contains a lot of duplicated content. The only difference is the URL, e.g., one says http://localhost/sample.html and the other http://localhost/sample2.html. However, sample.html and sample2.html are different files; there is no redirection or linking or anything like that involved. They are two different pages copied on different dates but with exactly the same content.
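One way to detect these duplicates is to fingerprint the page content (e.g., with a SHA-256 digest) and keep a map from fingerprint to the URLs that carried it; only the first URL per fingerprint needs to be indexed. This is a minimal sketch of that idea in plain Java; the class and method names (`ContentDedup`, `fingerprint`, `add`) are my own, not Lucene APIs. Inside Lucene you could, I believe, store the digest as an untokenized keyword field on each document and check for it with a term lookup before adding a new document.

```java
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: collapse identical pages fetched under different URLs into
// one "content" entry keyed by a digest of the page body.
public class ContentDedup {
    // content fingerprint -> every URL that carried that exact content
    private final Map<String, List<String>> byContent = new HashMap<>();

    // Hex-encoded SHA-256 of the page content.
    static String fingerprint(String content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(content.getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /** Records the URL; returns true only if this content is new
     *  and therefore still needs to be indexed. */
    public boolean add(String url, String content) {
        List<String> urls = byContent.computeIfAbsent(
                fingerprint(content), k -> new ArrayList<>());
        urls.add(url);
        return urls.size() == 1; // first copy wins, later copies are skipped
    }
}
```

With this in front of the indexer, sample.html would be indexed and sample2.html (same bytes) would only be recorded as an extra URL for the same content entry.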
I think this accounts for something like 20% of the cases, so it seems worthwhile to avoid indexing all of it. I'm thinking of building a pair of databases, link/location + content: one holds the list of links/URLs and the other only the content, giving me a star structure around each piece of content... But I wonder if there is a smarter way to do this in the current Lucene 1.4 codebase...

--
Mario Alejandro Montoya
http://sourceforge.net/projects/mutis
MUTIS: The open source Delphi search engine
AnyNET: Convert from ANY .NET assembly to Delphi code
