Hi,
There is some functionality for Nutch that I've either implemented
or am planning to implement, and I'm curious whether other people are
interested, so that maybe the changes could get into the main line.
1. A String[] HitDetails.getValues(String field) method that
returns an array of all the values for a field. The current method
only returns a single string, and Lucene indexes can have multiple
values per field.
2. In Link.java, add a field (parentURL) for the URL of the page that
contains the link. Right now it seems we just have the links themselves
and can't trace back where they come from. Being able to backtrack
through the links is handy for doing something like categorization. For
example, if you see that all the links to a page come from pages about
poodles, you might categorize the linked page as a poodle page. It might
also come in handy for doing something like a Google TrustRank scoring,
where you penalize certain sites if their links come from a known link
farm, or boost them if the links come from some place respected like
DMOZ.
3. Get sorting to work on multiple fields. Lucene already supports
sorting on multiple fields, so it shouldn't be difficult to get this
working. Just change the places where it passes down a String field so
that they accept an array. The sort fields could be read from the query
string in order:
search.jsp?sort=score&reverse=true&sort=date&reverse=false
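To make item 3 a bit more concrete, here is a rough sketch of how the
repeated parameters could be paired up. SortParams and SortSpec are
made-up names for illustration, not existing Nutch classes, and pairing
each sort with the reverse value at the same position (defaulting to
false when it is missing) is just my assumption about how the query
string would be read:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: pairs up repeated
// "sort" and "reverse" query parameters by position.
final class SortParams {

    /** One sort key plus its direction flag. */
    static final class SortSpec {
        final String field;
        final boolean reverse;

        SortSpec(String field, boolean reverse) {
            this.field = field;
            this.reverse = reverse;
        }
    }

    /**
     * Builds the ordered list of sort keys. The i-th reverse value
     * applies to the i-th sort field; a missing value means false.
     */
    static List<SortSpec> parse(String[] sorts, String[] reverses) {
        List<SortSpec> specs = new ArrayList<>();
        if (sorts == null) {
            return specs; // no sort parameter given
        }
        for (int i = 0; i < sorts.length; i++) {
            boolean reverse = reverses != null
                    && i < reverses.length
                    && Boolean.parseBoolean(reverses[i]);
            specs.add(new SortSpec(sorts[i], reverse));
        }
        return specs;
    }

    public static void main(String[] args) {
        // Values as they would appear in the example URL above.
        List<SortSpec> specs = parse(
                new String[] {"score", "date"},
                new String[] {"true", "false"});
        for (SortSpec s : specs) {
            System.out.println(s.field + " reverse=" + s.reverse);
        }
    }
}
```

In the JSP the two arrays would presumably come from
request.getParameterValues("sort") and
request.getParameterValues("reverse"), and the resulting list would be
what gets passed down in place of the single String field.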
Is anybody interested in these things? It would be nice to get them
merged into the main code.
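For item 1, this is roughly what I have in mind. SimpleHitDetails is a
stand-in class I wrote for this mail, not the real HitDetails; it
assumes the details are kept as parallel arrays of field names and
values, which may not match the actual class exactly:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for illustration only; assumes field names and values
// are stored as parallel arrays, one entry per stored value.
final class SimpleHitDetails {
    private final String[] fields;
    private final String[] values;

    SimpleHitDetails(String[] fields, String[] values) {
        if (fields.length != values.length) {
            throw new IllegalArgumentException("fields and values differ in length");
        }
        this.fields = fields;
        this.values = values;
    }

    /** Single-value lookup: returns the first match, or null if none. */
    String getValue(String field) {
        for (int i = 0; i < fields.length; i++) {
            if (fields[i].equals(field)) {
                return values[i];
            }
        }
        return null;
    }

    /** Proposed lookup: every value for the field, in stored order. */
    String[] getValues(String field) {
        List<String> matches = new ArrayList<>();
        for (int i = 0; i < fields.length; i++) {
            if (fields[i].equals(field)) {
                matches.add(values[i]);
            }
        }
        return matches.toArray(new String[0]);
    }

    public static void main(String[] args) {
        SimpleHitDetails details = new SimpleHitDetails(
                new String[] {"url", "category", "category"},
                new String[] {"http://example.com/", "dogs", "poodles"});
        System.out.println(details.getValue("category"));
        System.out.println(String.join(", ", details.getValues("category")));
    }
}
```

Returning an empty array rather than null when the field is missing is
a choice I made for the sketch so callers can loop without a null check.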
Howie