Mario, Lucene != web indexer, so Lucene doesn't know anything about files or URLs, etc. It just indexes what it's told. You should check how Nutch does it; I believe it does it by comparing "fingerprints" of web pages. The fingerprints are MD5 checksums of page content, but I believe recent changes there allow you to define your own mechanism.
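To illustrate the fingerprint idea (this is not Nutch's actual code; the class and method names below are made up), page content can be hashed with MD5 and only content whose digest hasn't been seen before gets indexed:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

public class ContentDedup {
    // Hex MD5 fingerprint of page content (same content -> same fingerprint).
    static String fingerprint(String content) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Set<String> seen = new HashSet<>();
        String[] pages = {"same content", "same content", "other content"};
        int indexed = 0;
        for (String page : pages) {
            // Set.add returns false for a fingerprint we've already seen,
            // so duplicated content is skipped instead of indexed twice.
            if (seen.add(fingerprint(page))) {
                indexed++;
            }
        }
        System.out.println(indexed); // prints 2
    }
}
```

A custom fingerprint mechanism would just swap the MD5 step for something else (e.g. a hash over normalized text, so trivially different markup still collides).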
In any case, this is not really a question for [EMAIL PROTECTED]; nutch-user@ may be a better place to ask.

Otis

----- Original Message ----
From: Mario Alejandro M. <[EMAIL PROTECTED]>
To: Lucene Developers List <[email protected]>
Sent: Fri 20 Jan 2006 05:27:01 PM EST
Subject: Indexing Urls pointing to same content

I found that the data I'm searching contains a lot of duplicated content. The only difference is the URL, i.e., one says http://localhost/sample.html and the other http://localhost/sample2.html. However, sample1 and sample2 are different files; that is, there is no redirection or linking or anything like that involved. Sample1 and Sample2 are two different pages copied on different dates but with exactly the same content. I think this accounts for something like 20% of the cases, so I think it's worthwhile to avoid indexing all of it.

So I'm thinking of building link/location + content databases: one holds the list of links/URLs and the other holds only content, giving me a star structure around the content... But I'm wondering if a smarter way to do this exists in the current Lucene 1.4 codebase...

--
Mario Alejandro Montoya
http://sourceforge.net/projects/mutis
MUTIS: The Open source Delphi search engine
AnyNET: Convert from ANY .NET assembly to Delphi code

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
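The link/content split Mario describes could be sketched like this (a minimal in-memory stand-in for the two databases; all names are illustrative, and a real version would use a content fingerprint such as MD5 rather than `hashCode`):

```java
import java.util.HashMap;
import java.util.Map;

public class StarIndex {
    // Content id -> the content itself; each distinct body is stored once.
    final Map<String, String> contentById = new HashMap<>();
    // URL -> content id; many URLs may point at the same content entry,
    // forming the "star structure around the content".
    final Map<String, String> urlToContentId = new HashMap<>();

    void add(String url, String content) {
        // Stand-in for a real fingerprint (e.g. MD5 of the content).
        String id = Integer.toHexString(content.hashCode());
        contentById.putIfAbsent(id, content); // store/index content only once
        urlToContentId.put(url, id);          // every URL still gets an entry
    }

    public static void main(String[] args) {
        StarIndex idx = new StarIndex();
        idx.add("http://localhost/sample.html", "same page body");
        idx.add("http://localhost/sample2.html", "same page body");
        System.out.println(idx.contentById.size());    // prints 1
        System.out.println(idx.urlToContentId.size()); // prints 2
    }
}
```

With this layout the expensive work (analyzing and indexing the body) happens once per distinct content, while the cheap URL table keeps every location searchable.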
