Vikas:

WebDB, Segments -- you are correct on both, but look at their documentation
as well.
 
For WebDB, please read the EXCELLENT documentation in the wiki.
For Segments, read the wiki and search the archives.
For the Index, look at the Lucene (Apache project) documentation. Read this
first: Nutch is built on Lucene, so getting a good handle on Lucene will
make the process easier.

The Lucene add-ons (again, on the Lucene site) should give you an idea of
how to bold terms and do other good stuff.

NOTE: I am not sure you can change the Summary in the Index without
affecting the segments (the index gets its data from the Segment). Now, if
you just want to bold at search time, Nutch already does this.
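
To make "bold at search time" concrete, here is a minimal sketch in plain
Java. It is NOT Nutch's summarizer code; the class and method names
(BoldSketch, boldTerms) are made up for illustration, and it only wraps
whole-word, case-insensitive matches of the query terms in <b> tags.

```java
import java.util.regex.Pattern;

public class BoldSketch {
    // Hypothetical helper: wraps every whole-word, case-insensitive match
    // of any query term in <b>...</b>. All terms go into one alternation
    // so the inserted tags are never re-scanned.
    static String boldTerms(String summary, String... terms) {
        if (terms.length == 0) {
            return summary;
        }
        StringBuilder alternation = new StringBuilder();
        for (String term : terms) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(term));
        }
        Pattern p = Pattern.compile("\\b(?:" + alternation + ")\\b",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(summary).replaceAll("<b>$0</b>");
    }

    public static void main(String[] args) {
        // prints: Nutch is built on <b>Lucene</b>
        System.out.println(boldTerms("Nutch is built on Lucene", "lucene"));
    }
}
```

Because this works on the summary text at display time, nothing in the index
or the segments has to change.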

To get you started with the algorithm, the Lucene FAQ has a similar
question: "How are documents scored" or something like that. You'll see
terms like "idf" and "coord" (so you can do a word find using the Google
toolbar :-)
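
As a rough guide to what those terms mean, the sketch below writes the
factors out in plain Java. The formulas are how I recall Lucene's default
Similarity implementation of this era; treat them as an assumption and check
them against the source of the version you are using. The class name
ScoreFactorsSketch is made up.

```java
public class ScoreFactorsSketch {
    // tf: a term that occurs more often in a document counts for more,
    // but with diminishing returns (assumed: square root of the frequency).
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // idf: a term found in few documents is worth more than a common one
    // (assumed: log(numDocs / (docFreq + 1)) + 1).
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // coord: fraction of the query terms that the document matched.
    static double coord(int overlap, int maxOverlap) {
        return overlap / (double) maxOverlap;
    }

    // lengthNorm: a match in a short field counts for more than one in a
    // long field (assumed: 1 / sqrt(number of terms in the field)).
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        System.out.println("tf(4)            = " + tf(4));
        System.out.println("idf(1, 1000)     = " + idf(1, 1000));
        System.out.println("idf(500, 1000)   = " + idf(500, 1000));
        System.out.println("coord(1, 2)      = " + coord(1, 2));
        System.out.println("lengthNorm(100)  = " + lengthNorm(100));
    }
}
```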

The other part of ranking is the boost factor (as set by Nutch when it
indexes in Lucene). The boost factor we get from "nutch analyze". The last
step affecting the ranking happens AT SEARCH TIME: the same process that is
used for a document is applied to the query string. Thus, you are correct,
it is using the vector space model. As Nutch calls Lucene search, look at
the scoring algo in Lucene.
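
To illustrate the vector space idea in general terms (this is a textbook
sketch, not Lucene's actual scoring formula), the Java below turns both the
query and the document into term-count vectors using the same tokenizing
step, takes the cosine between them, and multiplies by a document boost. The
names VectorSpaceSketch, toVector and docBoost are made up; docBoost simply
stands in for whatever boost was stored at index time.

```java
import java.util.HashMap;
import java.util.Map;

public class VectorSpaceSketch {
    // The same analysis is applied to documents and to the query string:
    // lowercase, split on non-alphanumerics, count each term.
    static Map<String, Double> toVector(String text) {
        Map<String, Double> vector = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                vector.merge(token, 1.0, Double::sum);
            }
        }
        return vector;
    }

    // Cosine of the angle between two sparse term vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Hypothetical final score: similarity scaled by the document's boost.
    static double score(String query, String doc, double docBoost) {
        return docBoost * cosine(toVector(query), toVector(doc));
    }

    public static void main(String[] args) {
        String doc = "Nutch is built on Lucene";
        System.out.println(score("lucene nutch", doc, 1.0));
        System.out.println(score("lucene nutch", doc, 2.0)); // boosted page
    }
}
```

With everything else equal, a page with a higher boost outranks one with a
lower boost, which is why the boost matters as much as the text match.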

In short, 80% of the stuff you asked will be answered by turning attention
to Lucene.

Hope this helps.

CC-


 

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Vikas
Gupta
Sent: Wednesday, November 17, 2004 12:49 AM
To: [EMAIL PROTECTED]
Subject: [JNK] [Nutch-dev] adding more fields to indices

Hi nutch developers,

I am new to Nutch and Lucene. I am using Nutch to do some heuristic testing
for web search.

Could you verify the following?

I guess right now, the way indices are organized is

db - has the web graph

segments - has the rest of the stuff like actual page content, anchor texts,
title etc.

(0) Is this correct?

I need to add fields to the indices like

"texts in bold" - ...
"description keywords" ...

(1) So, this will only affect the stuff in the segments indices, right, and
not the db index at all?

(2) Also, could you point me to the name of the real algorithm used in
Lucene to compute the score of a query wrt the indices? I did take a look
at the Similarity classes. It looks like some sort of vector space model, as
there are funcs like queryNorm.

Thanks.


____________________________________________________________________
Vikas Gupta
Final Year Masters Student,   http://www.cs.utexas.edu/users/vgupta
Dept. of Computer Sciences,
Univ. of Texas at Austin, USA
____________________________________________________________________


-------------------------------------------------------
This SF.Net email is sponsored by: InterSystems CACHE FREE OODBMS DOWNLOAD -
A multidimensional database that combines robust object and relational
technologies, making it a perfect match for Java, C++,COM, XML, ODBC and
JDBC. www.intersystems.com/match8
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers




