Re: lucene link database

Erick Erickson Sun, 08 Oct 2006 07:23:11 -0700

Approach it in whatever way you want as long as it solves your problem <G>.


My first question is: why use Lucene? Would a database suit your needs
better? Of course, I can't say. Lucene shines at full-text searching, so
it's a closer call if you aren't searching on parts of text. By that I mean
that if you're not searching on *parts* of your links, you may want to
consider a DB solution.

That said, and if I understand your requirement, you have a pretty simple
design. Each document has two fields, incominglinks and outgoinglinks. But
see the note below. Lucene indexes what you give it, so the fact that some
of the links aren't hypertext links is immaterial to Lucene. Since you
control both the indexer and searcher, these conform to whatever your
requirements are. It's up to you to map semantics onto these entities.

One common trap DB-savvy people fall into is that they think of documents as
entries in a table, all with the same fields. There is nothing requiring you
to have the *same* fields in each document in an index. You could have an
index for which no two documents shared *any* common field if you choose.

So, if you want to find out, say, which documents have link X as an
incoming link, just search on incominglinks:X. If you want to find the
documents whose outgoing links match any of the incoming links X, Y, Z of
another document, just search the OR of these in outgoinglinks.
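
As a rough sketch of "search the OR of these", the helper below builds a
query string of the form field:(X OR Y OR Z), which is the grouping syntax
the standard Lucene query parser accepts. The class and method names are
made up for illustration, and it assumes the link values contain no
characters that are special to the query parser (a real URL would need
escaping first).

```java
import java.util.List;

// Hypothetical helper, not part of Lucene: builds a query string only.
class LinkQueries {
    // Returns e.g. "outgoinglinks:(X OR Y OR Z)" for the given field and links.
    static String anyOf(String field, List<String> links) {
        if (links.isEmpty()) {
            throw new IllegalArgumentException("need at least one link");
        }
        return field + ":(" + String.join(" OR ", links) + ")";
    }

    public static void main(String[] args) {
        // Prints the query string you would hand to the query parser.
        System.out.println(anyOf("outgoinglinks", List.of("X", "Y", "Z")));
    }
}
```

The resulting string would then be parsed with whatever analyzer you used
at index time, so that the terms match what is actually in the index.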

If you want some kind of map of the whole web of links, you'll have to write
some iterative loop and keep track of what you've already visited. There's
nothing built in that I know of
that lets you answer "Given link X, show me all the documents no more than 3
hops away". Lucene is an *engine*, designed to have apps built on top of it.
Lucene doesn't deal with relations between documents, just searching what
you've indexed.
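
The "iterative loop" for something like "no more than 3 hops away" could
look roughly like the breadth-first sketch below. It is plain Java with no
Lucene calls: the outgoing map is an assumed stand-in for "search for
document id and read its outgoinglinks field", and the class and method
names are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: breadth-first walk over links, bounded by hop count.
class LinkHops {
    // Returns every id reachable from start in at most maxHops hops,
    // excluding start itself. In a real app, outgoing.get(id) would be
    // replaced by an index lookup per document.
    static Set<String> within(Map<String, List<String>> outgoing,
                              String start, int maxHops) {
        Set<String> seen = new HashSet<>();
        seen.add(start);
        List<String> frontier = List.of(start);
        for (int hop = 0; hop < maxHops && !frontier.isEmpty(); hop++) {
            List<String> next = new ArrayList<>();
            for (String id : frontier) {
                for (String target : outgoing.getOrDefault(id, List.of())) {
                    // seen.add returns false for ids already visited,
                    // which keeps cycles from looping forever.
                    if (seen.add(target)) {
                        next.add(target);
                    }
                }
            }
            frontier = next;
        }
        seen.remove(start);
        return new TreeSet<>(seen); // sorted only to make results predictable
    }
}
```

Each pass of the outer loop costs one lookup per document on the current
frontier, so the number of searches grows with the fan-out of your links.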

It's easy enough to store a variable number of links in your incominglinks
or outgoinglinks field. Just be sure they're tokenized appropriately. You
can add them any way you choose, either concatenate them all into a big
string and index that, or index them into the same field, e.g.
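Document doc = new Document();
// Lucene 2.0-era Field constructor; add the same field name once per link.
doc.add(new Field("incoming", "link1", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("incoming", "link2", Field.Store.YES, Field.Index.TOKENIZED));
.
.
.
writer.addDocument(doc);

According to a discussion from a while ago, this is the same as
doc.add(new Field("incoming", "link1 link2", Field.Store.YES, Field.Index.TOKENIZED));
in terms of how it all gets handled internally.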


NOTE: I'm skipping most of the question of which Analyzer you use. This will
almost surely trip you up sometime. I'd suggest starting with
WhitespaceAnalyzer as that's more intuitive. Some of the other analyzers
will break your links up in ways you don't expect. Really, really, really
get a copy of Luke to see what's actually *in* your index and how searches
work. And how the analyzer you choose changes what's searched for, as well
as what's indexed. Google lucene luke and you'll find it.
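
To make the analyzer point concrete, here is a plain-Java imitation of two
tokenizing strategies. These are not Lucene's analyzers, just an assumed
approximation: splitting on whitespace only (the idea behind
WhitespaceAnalyzer) versus lowercasing and splitting at every non-letter
(roughly what a letter-based analyzer does). The class and method names are
invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-ins for two analyzer behaviours; no Lucene involved.
class TokenizeDemo {
    // Splits on runs of whitespace only, so a link stays in one piece.
    static List<String> byWhitespace(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return List.of();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Lowercases and splits at every non-letter, so a link is chopped up.
    static List<String> byLetters(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String link = "http://example.com/a-b";
        System.out.println(byWhitespace(link)); // one token: the whole link
        System.out.println(byLetters(link));    // http, example, com, a, b
    }
}
```

With the second strategy a search for the whole link no longer matches a
single term, which is the kind of surprise Luke makes visible.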

Anyway, hope this all helps.
Erick

On 10/8/06, Cam Bazz <[EMAIL PROTECTED]> wrote:


Hello,

I would like to make a link database using Lucene, similar to the one that
Nutch uses. I have read the basic documentation and understood how
document indexing, search, and scoring work. But what I would like is for
different documents to have different kinds of links (semantic links) to
each other. I would like to be able to search the database with queries like
incominglinksofdocument(id), outgoinglinksofdocument(id). The links I am
talking about might not necessarily be hypertext links.

How would I approach a problem like this?

Best Regards,
-C.B.
