Vikas:

WebDB, Segments -- you are correct on both, but look at their documentation as well.

For WebDB, please read the EXCELLENT documentation in the wiki.
For Segments, read the wiki and search the archives.
For the Index, look at the Lucene (Apache project) documentation. Read this first: Nutch is built on Lucene, so getting a good handle on Lucene will make the process easier.
The Lucene add-ons should give you an idea of how to handle bold text and other good stuff (again, on the Lucene site). NOTE: I am not sure you can change the summary in the index without affecting the segments (the index gets its data from the segments). Now, if you just want to bold terms at search time, Nutch already does this.

To get you started with the algorithm, the Lucene FAQ has a similar question: "How are documents scored" or something like that. You'll see terms like "idf" and "coord" (so you can do a word find using the Google toolbar :-)

The other part of ranking is the boost factor, which Nutch sets when it indexes into Lucene. We get the boost factor from "nutch analyze". The last step affecting the ranking happens AT SEARCH TIME: the same process that is used for a document is applied to the query string. Thus, you are correct, it is using the vector space model. As Nutch calls Lucene search, look at the scoring algorithm in Lucene.

In short, 80% of what you asked will be answered by turning your attention to Lucene.

Hope this helps.

CC-

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Vikas Gupta
Sent: Wednesday, November 17, 2004 12:49 AM
To: [EMAIL PROTECTED]
Subject: [JNK] [Nutch-dev] adding more fields to indices

Hi nutch developers,

I am new to nutch and lucene. I am using nutch to do some heuristic testing for web search. Could you verify the following?

I guess right now the indices are organized like this:
db - has the web graph
segments - has the rest of the stuff like actual page content, anchor texts, title etc.

(0) Is this correct?

I need to add fields to the indices like "texts in bold" - ... "description keywords" ...

(1) So, this will only affect the stuff in the segments indices and not the db index at all, right?

(2) Also, could you point me to the name of the actual algorithm used in lucene to find the score of a query with respect to the indices? I did take a look at the Similarity classes.
Looks like some sort of vector space model as there are funcs like queryNorm.

Thanks.

____________________________________________________________________
Vikas Gupta
Final Year Masters Student, http://www.cs.utexas.edu/users/vgupta
Dept. of Computer Sciences, Univ. of Texas at Austin, USA
____________________________________________________________________

-------------------------------------------------------
This SF.Net email is sponsored by: InterSystems CACHE
FREE OODBMS DOWNLOAD - A multidimensional database that combines
robust object and relational technologies, making it a perfect match
for Java, C++,COM, XML, ODBC and JDBC. www.intersystems.com/match8
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
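
As a footnote to the scoring discussion in this thread: the factors named above ("idf", "coord", the boost factor, queryNorm) can be put together in a small sketch. This is only a rough Python illustration, loosely modeled on the formulas of Lucene's classic default Similarity as I understand them; it is not Lucene's or Nutch's actual code. The function names (tf, idf, score), the doc_boost parameter, and the toy corpus are all made up for illustration, and field-level details are ignored.

```python
import math
from collections import Counter


def tf(freq):
    # Term-frequency factor: grows with the square root of the raw count.
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    # Inverse document frequency: rarer terms get a larger weight.
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def score(query_terms, doc_terms, corpus, doc_boost=1.0):
    """Score one document (a list of tokens) against a query.

    corpus is a list of token lists and is used only for document
    frequencies. doc_boost is a hypothetical stand-in for the boost
    that gets stored with a document at index time.
    """
    if not query_terms or not doc_terms:
        return 0.0

    unique_query = list(dict.fromkeys(query_terms))  # dedupe, keep order
    num_docs = len(corpus)
    counts = Counter(doc_terms)

    # idf per query term, based on how many documents contain the term.
    idfs = {}
    for term in unique_query:
        doc_freq = sum(1 for doc in corpus if term in doc)
        idfs[term] = idf(doc_freq, num_docs)

    # queryNorm: makes scores comparable across different queries.
    query_norm = 1.0 / math.sqrt(sum(w * w for w in idfs.values()))

    # Length normalization: shorter documents get a larger factor.
    length_norm = 1.0 / math.sqrt(len(doc_terms))

    # coord: fraction of the query terms that the document contains.
    matched = [t for t in unique_query if counts[t] > 0]
    coord = len(matched) / len(unique_query)

    total = sum(
        tf(counts[t]) * idfs[t] ** 2 * doc_boost * length_norm
        for t in matched
    )
    return coord * query_norm * total


if __name__ == "__main__":
    corpus = [
        ["nutch", "crawler"],
        ["lucene", "index", "search"],
        ["nutch", "lucene", "search", "search"],
    ]
    query = ["nutch", "search"]
    for doc in corpus:
        print(doc, round(score(query, doc, corpus), 4))
```

The same weighting is applied to both the query terms and the document terms, which is why the result behaves like a vector space model; a larger doc_boost simply scales a document's score up.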
