Approach it in whatever way you want as long as it solves your problem <G>.
My first question is: why use Lucene? Would a database suit your needs better? Of course, I can't say. Lucene shines at full-text searching, so it's a closer call if you aren't searching on parts of text. By that I mean that if you're not searching on *parts* of your links, you may want to consider a DB solution.

That said, and if I understand your requirement, you have a pretty simple design. Each document has two fields, incominglinks and outgoinglinks. But see the note below.

Lucene indexes what you give it, so the fact that some of the links aren't hypertext links is immaterial to Lucene. Since you control both the indexer and the searcher, these conform to whatever your requirements are. It's up to you to map semantics onto these entities.

One common trap DB-savvy people fall into is thinking of documents as entries in a table, all with the same fields. There is nothing requiring you to have the *same* fields in each document in an index. You could have an index in which no two documents shared *any* common field if you chose.

So, if you want to find out, say, which documents have link X as an incoming link, just search on incominglinks:X. If you want to find the documents in which any of the incoming links X, Y, Z appears as an outgoing link, just search on the OR of these in outgoinglinks, e.g. outgoinglinks:(X OR Y OR Z).

If you want some kind of map of the whole web of links, you'll have to write some iterative loop and keep track of what you've already visited. There's nothing built in that I know of that lets you answer "Given link X, show me all the documents no more than 3 hops away". Lucene is an *engine*, designed to have apps built on top of it. Lucene doesn't deal with relations between documents, just with searching what you've indexed.

It's easy enough to store a variable number of links in your incominglinks or outgoinglinks field. Just be sure they're tokenized appropriately.
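To make the "no more than 3 hops away" point concrete, here is a rough sketch of the kind of iterative loop I mean, in plain Java with no Lucene in it. The names LinkHops and withinHops are made up for illustration, and the in-memory map is only a stand-in for whatever lookup you'd really do against your index (presumably one search per document to fetch its outgoing links). Treat it as one possible shape for the loop, not as anything Lucene provides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LinkHops {

    // Breadth-first walk over outgoing links, one "hop" per pass.
    // 'outgoing' is a hypothetical stand-in for an index lookup:
    // document id -> ids of the documents it links to.
    static Set<String> withinHops(Map<String, List<String>> outgoing,
                                  String start, int maxHops) {
        Set<String> visited = new LinkedHashSet<>(); // keeps discovery order
        visited.add(start);
        List<String> frontier = Collections.singletonList(start);

        for (int hop = 0; hop < maxHops && !frontier.isEmpty(); hop++) {
            List<String> next = new ArrayList<>();
            for (String doc : frontier) {
                List<String> targets =
                    outgoing.getOrDefault(doc, Collections.<String>emptyList());
                for (String target : targets) {
                    // add() returns false for documents already seen,
                    // which is what stops cycles from looping forever
                    if (visited.add(target)) {
                        next.add(target);
                    }
                }
            }
            frontier = next;
        }

        visited.remove(start); // report only the documents reached
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> outgoing = new HashMap<>();
        outgoing.put("A", Arrays.asList("B", "C"));
        outgoing.put("B", Arrays.asList("D"));
        outgoing.put("D", Arrays.asList("E"));
        outgoing.put("E", Arrays.asList("A", "F")); // cycle back to A; F is 4 hops out

        System.out.println(withinHops(outgoing, "A", 3)); // prints [B, C, D, E]
    }
}
```

The "keep track" part is the visited set: without it, a cycle such as E linking back to A would send the loop around forever. Swapping the map for real searches should not change the structure of the loop, only where the targets come from.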
You can add the links any way you choose: either concatenate them all into one big string and index that, or add them to the same field one at a time. With the Lucene 2.0 API that looks something like this (the Store/Index settings are up to you):

    Document doc = new Document();
    doc.add(new Field("incominglinks", "link1", Field.Store.YES, Field.Index.TOKENIZED));
    doc.add(new Field("incominglinks", "link2", Field.Store.YES, Field.Index.TOKENIZED));
    . . .
    writer.addDocument(doc);

According to a discussion from a while ago, this is the same as adding a single incominglinks field with the value "link1 link2", in terms of how it all gets handled internally.

NOTE: I'm skipping most of the question of which Analyzer you use. This will almost surely trip you up sometime. I'd suggest starting with WhitespaceAnalyzer, as that's more intuitive. Some of the other analyzers will break your links up in ways you don't expect.

Really, really, really get a copy of Luke to see what's actually *in* your index and how searches work, and how the analyzer you choose changes what's searched for as well as what's indexed. Google "lucene luke" and you'll find it.

Anyway, hope this all helps.

Erick

On 10/8/06, Cam Bazz <[EMAIL PROTECTED]> wrote:
Hello,

I would like to make a link database using Lucene, similar to the one that Nutch uses. I have read the basic documentation and understood how document indexing, search, and scoring work. But what I would like is for different documents to have different kinds of links (semantic links) to each other. I would like to be able to search the database with queries like incominglinksofdocument(id) and outgoinglinksofdocument(id). The links I am talking about might not necessarily be hypertext links.

How would I approach a problem like this?

Best Regards,
-C.B.