Hi all:
I recently wrote Doug and volunteered to help with the parameter tweaking.
Doug gave me pointers on where to find the parameters. I have been
studying the code and documentation intensively--as well as doing searches
and reading the explanations. Now I'm ready to ask what I can do. I'm also
volunteering for documentation work, but will send out a separate email on
that.

----------------------------------------------------------------------

Doug answered in part:
"There are many approaches to parameter tweaking.  The simplest is the
"seat of the pants" approach.  Here you build an index, run some
searches, examine the results and their explanations, then tweak things
to make them better.  That's all that's been done so far, and we can
still probably make some progress this way.

The more scientific approach is to try to improve the performance of a
large set of randomly selected queries by altering parameter values.
For example, Mike has written some code which compares Nutch's results
to those of other web search engines, but hasn't yet written the
training component which does this repeatedly with different parameters.
A variation of this is to get users to hand rate hits for a large set
of queries. These sort of approaches are what we need long term."
------------------------------------------------------------------------

What do you all think I should get started on? The code works really well
now, so I don't think we want to do much tweaking on small databases--due
to scaling factors. But we probably do want to get some tools/benchmarks
working so we can tweak on a very large database.

1. I spent some time reading the documentation on Mike's QualityTestTool.
Code at:
http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/quality/QualityTestTool.java

I liked the concept, and liked the questions and approach he used (100
questions from Inktomi), rating on overlap in the top 10 returned results.
It would be very useful to extend this to a training approach. I could
code a training extension, though I would probably need some help.
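To make the overlap metric concrete, here is a minimal sketch of how the top-10 comparison could be computed. The class and method names (`Top10Overlap`, `overlap`) are my own for illustration, not part of QualityTestTool, and I'm assuming results are available as ordered lists of URL strings:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch: count URLs shared by two engines' top-10 lists. */
public class Top10Overlap {

    /** Returns how many of hitsB's top 10 also appear in hitsA's top 10. */
    static int overlap(List<String> hitsA, List<String> hitsB) {
        Set<String> topA = new HashSet<>(hitsA.subList(0, Math.min(10, hitsA.size())));
        int common = 0;
        for (String url : hitsB.subList(0, Math.min(10, hitsB.size()))) {
            if (topA.contains(url)) {
                common++;
            }
        }
        return common;
    }

    public static void main(String[] args) {
        List<String> nutch = Arrays.asList("a", "b", "c", "d");
        List<String> other = Arrays.asList("b", "d", "e");
        System.out.println(overlap(nutch, other)); // prints 2
    }
}
```

A training loop would then just be: re-run the query set after each parameter change and keep the settings that maximize the average overlap across all 100 questions.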

2. I could spend time just doing queries and looking for deficiencies.
I was able to find problems this way with the Yahoo Labs implementation
(9 months old?), but the problems I found have since been fixed. This
approach probably would not be too productive until somebody gets a really
big database. For right now the publicly available databases
(objectssource.com) are small and Nutch works very well already.

3. I could code some sort of user interface that would allow manual
tweaking of the parameters and examination of the queries. An
easy-to-use testing interface would let a lot of people test and tweak.

4. I could systematically hand rate hits. I could probably do a couple of
hundred a week--and that is enough to produce useful results. This is also
pretty necessary as a reality check--even if Google and Nutch produce
similar results, they could both be garbage.

Again, this might be better to wait on until things are scaled up. However,
it might also be worth doing at intranet scale: that would be useful for
intranet deployments, as well as giving a baseline comparison for bigger
projects.
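Once hits are hand rated, turning the ratings into a number is straightforward: the standard precision-at-10 measure is just the fraction of the top 10 hits rated relevant. A minimal sketch (the class name `HandRatedPrecision` is my own; I'm assuming ratings arrive as an ordered list of booleans, one per returned hit):

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch: precision@10 from hand ratings (true = relevant). */
public class HandRatedPrecision {

    /** Fraction of the first (up to) 10 hits that were rated relevant. */
    static double precisionAt10(List<Boolean> ratings) {
        int n = Math.min(10, ratings.size());
        if (n == 0) {
            return 0.0;
        }
        int relevant = 0;
        for (int i = 0; i < n; i++) {
            if (ratings.get(i)) {
                relevant++;
            }
        }
        return (double) relevant / n;
    }

    public static void main(String[] args) {
        // Rater judged hits 1, 3, and 4 relevant out of 4 returned.
        List<Boolean> ratings = Arrays.asList(true, false, true, true);
        System.out.println(precisionAt10(ratings)); // prints 0.75
    }
}
```

Averaging this over 30-50 queries would give the baseline number that a Google comparison (or a later parameter sweep) could be measured against.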

One possibility: Doug, if you could get a query set from the OSU intranet,
30-50 random questions, I could hand-compare the results to Google's and
report back.

Any other suggestions?
What should I start with?

Thanks
Lyle

_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
