Re: lucene link database

Cam Bazz Sun, 08 Oct 2006 10:15:19 -0700

Dear Erick;

Thank you for your detailed insight. I have been trying to code a graph object database for some time. I have prototyped on relational as well as object-oriented databases, including open source and commercial implementations (so far I have tried hibernate, objectivity/db, and db4o). While object databases excel at traversing links, they are poor at searching.

Lucene so far solves the problem of searching. I am thinking of a document as a list of tuples (a sequence of fields), and I can do searches with Lucene; it is really nice.

Now I have to solve the problem of linking. If I keep the nodes in a Lucene index and can fetch documents by a doc_id or some sort of surrogate identifier, and use those identifiers as node_id in an object graph, that will be what I want. But in order to do that I need to be able to query the Lucene index by document_id.

I was referring to the link db of Nutch. They do have some sort of link db implementation that runs with Hadoop, but I have not understood the full code. I am trying to understand the structure of this link database. I was thinking of using documents with src and dst fields that have document ids as values (one idea; I will try it tomorrow).
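
To make the src/dst idea concrete, here is a minimal sketch in plain Java rather than Lucene. It assumes one "document" per link with two fields, src and dst; the class and method names (EdgeStore, addLink, outgoing, incoming) are hypothetical, and the linear scans stand in for what would be term searches on the src or dst field in a real index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for "one index document per link".
final class EdgeStore {
    // Each Edge plays the role of a document with two fields, src and dst.
    static final class Edge {
        final String src;
        final String dst;

        Edge(String src, String dst) {
            this.src = src;
            this.dst = dst;
        }
    }

    private final List<Edge> edges = new ArrayList<>();

    void addLink(String src, String dst) {
        edges.add(new Edge(src, dst));
    }

    // Stands in for searching src:<id> and collecting the dst values.
    List<String> outgoing(String id) {
        List<String> out = new ArrayList<>();
        for (Edge e : edges) {
            if (e.src.equals(id)) {
                out.add(e.dst);
            }
        }
        return out;
    }

    // Stands in for searching dst:<id> and collecting the src values.
    List<String> incoming(String id) {
        List<String> in = new ArrayList<>();
        for (Edge e : edges) {
            if (e.dst.equals(id)) {
                in.add(e.src);
            }
        }
        return in;
    }
}
```

With this layout both directions of a link come from the same set of edge documents, so nothing has to be kept in sync between a source node and a target node.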


Again thanks a bunch.

Best Regards,
C.B.

Erick Erickson wrote:

Approach it in whatever way you want as long as it solves your problem <G>.


My first question is why use Lucene? Would a database suit your needs
better? Of course, I can't say. Lucene shines at full-text searching, so
it's a closer call if you aren't searching on parts of text. By that I mean
that if you're not searching on *parts* of your links, you may want to
consider a DB solution.

That said, and if I understand your requirement, you have a pretty simple
design. Each document has two fields, incominglinks and outgoinglinks. But
see the note below. Lucene indexes what you give it, so the fact that some
of the links aren't hypertext links is immaterial to Lucene. Since you
control both the indexer and searcher, these conform to whatever your
requirements are. It's up to you to map semantics onto these entities.

One common trap DB-savvy people have is that they think of documents as
entries in a table, all with the same fields. There is nothing requiring you
to have the *same* fields in each document in an index. You could have an
index for which no two documents shared *any* common field if you choose.

So, if you want to find out, say, which documents have link X as an
incoming link, just search on incominglinks:X. If you wanted to find the
documents in which any of the incoming links X, Y, Z appears as an outgoing
link, just search for the OR of these in outgoinglinks.
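
As a rough illustration of those two searches, the sketch below models the index as a map from "field:term" to document ids. This is plain Java, not Lucene, and the names TinyLinkIndex, add, term and anyOf are made up for the example; it only shows that the OR query is the union of the single-term lookups.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index: "field:term" -> ids of documents containing it.
final class TinyLinkIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index the given terms for one document under one field.
    void add(int docId, String field, String... terms) {
        for (String t : terms) {
            postings.computeIfAbsent(field + ":" + t, k -> new TreeSet<>()).add(docId);
        }
    }

    // Single-term lookup, comparable to searching incominglinks:X.
    Set<Integer> term(String field, String term) {
        Set<Integer> hits = postings.get(field + ":" + term);
        return hits == null ? Collections.<Integer>emptySet() : hits;
    }

    // OR query: the union of the single-term lookups.
    Set<Integer> anyOf(String field, String... terms) {
        Set<Integer> result = new TreeSet<>();
        for (String t : terms) {
            result.addAll(term(field, t));
        }
        return result;
    }
}
```

Note that documents are free to carry only one of the two fields, which matches the point above that documents in an index need not share the same fields.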

If you want some kind of map of the whole web of links, you'll have to write some iterative loop and keep track. There's nothing built in that I know of that lets you answer "Given link X, show me all the documents no more than 3 hops away". Lucene is an *engine*, designed to have apps built on top of it.
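
One way to write that iterative loop is a breadth-first search with a hop limit. The sketch below is an assumption about how it could look, not anything built into Lucene: the outgoing map stands in for one index lookup per node, and HopSearch and within are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class HopSearch {
    // Returns every node reachable from start in at most maxHops links,
    // excluding start itself. In a real application, outgoing.get(node)
    // would be replaced by a search against the index.
    static Set<String> within(Map<String, List<String>> outgoing, String start, int maxHops) {
        Map<String, Integer> depth = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        Set<String> found = new LinkedHashSet<>();
        depth.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            int d = depth.get(node);
            if (d == maxHops) {
                continue; // do not expand past the hop limit
            }
            List<String> next = outgoing.get(node);
            if (next == null) {
                next = Collections.emptyList();
            }
            for (String n : next) {
                // Visit each node once; this also keeps cycles from looping forever.
                if (!depth.containsKey(n)) {
                    depth.put(n, d + 1);
                    found.add(n);
                    queue.add(n);
                }
            }
        }
        return found;
    }
}
```

The "keep track" part is the depth map: it records which nodes have been seen and how many hops away they were first reached.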

Lucene doesn't deal with relations between documents, just searching what
you've indexed.

It's easy enough to store a variable number of links in your incominglinks
or outgoinglinks field. Just be sure they're tokenized appropriately. You
can add them any way you choose, either concatenate them all into a big
string and index that, or index them into the same field, e.g.
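// Lucene 2.0-era Field API; later versions spell this differently.
Document doc = new Document();
// Adding the same field name more than once gives a multi-valued field.
doc.add(new Field("incoming", "link1", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("incoming", "link2", Field.Store.YES, Field.Index.TOKENIZED));
.
.
.
writer.addDocument(doc);

According to a discussion from a while ago, this is the same as
doc.add(new Field("incoming", "link1 link2", Field.Store.YES, Field.Index.TOKENIZED));
in terms of how it all gets handled internally.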

NOTE: I'm skipping most of the question of which Analyzer you use. This will
almost surely trip you up sometime. I'd suggest starting with
WhitespaceAnalyzer as that's more intuitive. Some of the other analyzers
will break your links up in ways you don't expect. Really, really, really
get a copy of Luke to see what's actually *in* your index and how searches
work. And how the analyzer you choose changes what's searched for, as well
as what's indexed. Google lucene luke and you'll find it.
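
To see why the analyzer matters for link values, here is a small plain-Java simulation, not Lucene's actual analyzer code. byWhitespace splits only on whitespace, which is the behavior WhitespaceAnalyzer is named for; byLetters is an assumption standing in for analyzers that keep only runs of letters and lowercase them. The sample ids and URL are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class TokenizeDemo {
    // Splits on runs of whitespace only, so ids and URLs survive intact.
    static List<String> byWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Keeps only runs of letters, lowercased; digits and punctuation
    // act as separators, so a link is broken into several terms.
    static List<String> byLetters(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }
}
```

For the input "doc_17 http://example.org/a", byWhitespace yields the two tokens doc_17 and http://example.org/a, while byLetters yields doc, http, example, org, a. A search for the exact link would only match terms produced the first way.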

Anyway, hope this all helps.
Erick

On 10/8/06, Cam Bazz <[EMAIL PROTECTED]> wrote:


Hello,

I would like to make a link database using lucene. Similar to one that
nutch uses. I have read the basic documentation and understood how
document indexing, search, and scoring work. But what I would like is
different documents having different kinds of links (semantic links) to
each other. I would like to be able to search in the database like
incominglinksofdocument(id), outgoinglinksofdocument(id). The links I am
talking about might not necessarily be hypertext links.

How would I approach a problem like this?

Best Regards,
-C.B.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


