Dear Erick;
Thank you for your detailed insight. I have been trying to code a graph
object database for sometime.
I have prototyped on relational as well as object oriented databases,
including opensource and commercial implementations.
(so far, I have tried hibernate, objectivity/db, db4o) while object
databases excel in traversing links, they are poor when searching.
lucene so far solves the problem of solving. I am thinking of a document
as a list of tuples. (sequence of fields) and I can do searches with
lucene, it is really nice.
now I have to solve the problem of linking. if I keep the nodes with a
lucene index, and I can fetch documents with a doc_id, or some sort of
surrogate identifier, and
use those identifiers as node_id in an object graph, that will be what I
want. but in order to do that I need to be able to query the lucene
index by document_id.
I was referring to the link db of the nutch. They do have some sort of
link db implementation, that runs with hadoop, but I have not understood
the full code.
I am trying to understand the structure of this link database. I was
thinking of using documents with src and dst fields, that have document
id's as values. (one idea, I will try it tomorrow)
Again thanks a bunch.
Best Regards,
C.B.
Erick Erickson wrote:
Aproach it in whatever way you want as long as it solves your problem
<G>.
My first question is why use lucene? Would a database suit your needs
better? Of course, I can't say. Lucene shines at full-text searching, so
it's a closer call if you aren't searching on parts of text. By that I
mean
that if you're not searching on *parts* of your links, you may want to
consider a DB solution.
That said, and if I understand your requirement, you have a pretty simple
design. Each document has two fields, incominglinks and outgoing
links. But
see the note below. Lucene indexes what you give it, so the fact that
some
of the links aren't hypertext links is immaterial to Lucene. Since you
control both the indexer and searcher, these confrom to whatever your
requirements are. It's up to you to map semantics onto these entities.
One common trap DB-savvy people have is that they think of documents as
entries in a table, all with the same fields. There is nothing
requiring you
to have the *same* fields in each document in an index. You could have an
index for which no two documents shared *any* common field if you choose.
So, if you want to find out what, say, which documents have link X as an
incoming link, just search on incominglinks:X. If you wanted to find the
documents that had any incoming links X, Y, Z that matched an outgoing
link
in another document, just search the OR of these in outgoinglinks.
If you want some kind of map of the whole web of links, you'll have to
write
some iterative loop and keep track. There's nothing built in that I
know of
that lets you answer "Given link X, show me all the documents no more
than 3
hops away". Lucene is an *engine*, designed to have apps built on top
of it.
Lucene doesn't deal with relations between documents, just searching what
you've indexed.
It's easy enough to store a variable number of links in your
incominglinks
or outgoinglinks field. Just be sure they're tokenized appropriately. You
can add them any way you choose, either concatenate them all into a big
string and index that, or index them into the same field, e.g.
Document doc = new Document();
doc.add("incoming", "link1");
doc.add("incoming", "link2");
.
.
.
writer.add(doc);
According to a discussion from a while ago, this is the same as
doc.add("incoming", "link1 link2");
in terms of how it all gets handled internally.
NOTE: I'm skipping most of the question of which Analyzer you use.
This will
almost surely trip you up sometime. I'd suggest starting with
WhitespaceAnalyzer as that's more intuitive. Some of the other analyzers
will break your links up in ways you don't expect. Really, really, really
get a copy of Luke to see what's actually *in* your index and how
searches
work. And how the analyzer you choose changes what's searched for, as
well
as what's indexec. Google lucene luke and you'll find it.
Anyway, hope this all helps.
Erick
On 10/8/06, Cam Bazz <[EMAIL PROTECTED]> wrote:
Hello,
I would like to make a link database using lucene. Similar to one that
nutch uses. I have read the basic documentation and understood how
document indexing, search, and scoring works. But what I like is
different documents having different kind of links (semantic links) to
each other. I would like to be able to search in the database like
incominglinksofdocument(id), outgoinglinksofdocument(id). the links I am
talking about, might not necessarily be hypertext links.
How would I approach to a problem like this?
Best Regards,
-C.B.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]