Hello,

I was reading the code and implementing some features today, and I want to
summarize the work as I promised Andrzej and Michael. This email is a bit long, but I promised some details.


Status of related features in the current Nutch codebase:
- The "site" field added by SiteIndexingFilter cannot be used for hostname matching because it is not tokenized. As I understand the purpose of this plugin (limiting answers to a given site), it should stay untokenized, whereas we need a tokenized host.
- There is a "title" field added by the index-basic plugin, but it is not indexed; it is stored only for display purposes.
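
To illustrate why the untokenized "site" field does not help here, below is a minimal sketch of what tokenizing a hostname buys us. It is not the Nutch analyzer: the class name HostTokens and the rule of splitting on every non-alphanumeric character are assumptions made only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration only; not part of Nutch. */
public class HostTokens {

    /**
     * Splits a hostname into lower-cased terms on every run of characters
     * that is not a letter or digit (an assumed rule, the real analyzer
     * may tokenize differently).
     */
    public static List<String> tokenize(String host) {
        List<String> terms = new ArrayList<>();
        for (String t : host.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Untokenized, the whole hostname is a single term, so a query for
        // "apache" cannot match it. Tokenized, each part is its own term.
        System.out.println(tokenize("lucene.apache.org"));
    }
}
```

With a tokenized field, a query term such as "apache" can match the host of lucene.apache.org, which an exact-match "site" field would not allow.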


There are two sets of changes required to add host and title fields to
the index and use them during search.

Indexing changes:

- index-basic plugin:
- I assume index-basic is to be changed to include an indexed, tokenized, unstored "host" field and an indexed, tokenized, stored "title" field, and to exclude the title from the "anchor" field.

- NutchDocumentAnalyzer:
- for "host" and "title" use the same analyzer as for "anchor" and "url".



- NutchSimilarity:
- length normalization should treat host as url and title as anchor for now.



Searching:
- BasicQueryFilter:
- Add host and title fields, handled exactly like all other fields. To start, I will set TITLE_BOOST=1.5 and HOST_BOOST=2 (since the host will be matched twice, in the "host" and in the "url" fields, it will influence the score considerably). After implementation I will run some tests to choose boost values that look OK (at least to me).
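
As a rough picture of what the two new boosts mean for a single query term, here is a small sketch that prints the term expanded over the new fields in Lucene query-string notation (field:term^boost). It is not the BasicQueryFilter code: the class FieldBoostSketch and its methods are hypothetical, and only the two boost values come from this email. The existing fields and their boosts are left out on purpose.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration only; not the BasicQueryFilter implementation. */
public class FieldBoostSketch {

    // Starting values proposed in this email; they are expected to change after testing.
    public static final float HOST_BOOST = 2.0f;
    public static final float TITLE_BOOST = 1.5f;

    /** New fields and their boosts, in a fixed order. */
    public static Map<String, Float> newFieldBoosts() {
        Map<String, Float> boosts = new LinkedHashMap<>();
        boosts.put("host", HOST_BOOST);
        boosts.put("title", TITLE_BOOST);
        return boosts;
    }

    /** Expands one term into one boosted clause per new field. */
    public static String expand(String term) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Float> e : newFieldBoosts().entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(e.getKey()).append(':').append(term).append('^')
              .append(String.format(Locale.ROOT, "%.1f", e.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("nutch")); // host:nutch^2.0 title:nutch^1.5
    }
}
```

Keeping the boosts in one place like this would make it easy to reindex nothing and only rerun queries while trying different values.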



I have already implemented all these changes (not much work once I figured out what to change) and I will do basic tests tomorrow. After basic verification of the implementation I will send the patch so that others who are interested can try it and comment on the results.


The changes introduced by this patch modify the index structure (adding new fields) and change the default query. I think it should be possible to use the new code with an old index (it should behave like the old code, since the new query fields would not be present in the documents), but mixing new and old segments might be a problem. So I think this change requires reindexing.


During implementation I came up with two additional ideas:

1) Do not index the url (keep it as a stored-only field) and add separate host and path fields as indexed fields. This would not index the protocol, port and some other parts of the url, but I am not sure indexing them makes sense. It will be easier to control the effect of weights and length normalization if the host is not counted twice. However, this would require reindexing, as some old fields would be used differently in the query, so it would not work as before with an old index.

2) I do not have any evidence yet, but looking at the data I have a
feeling that the "not host" part of a url is not as important as its current
boost factor indicates. It should probably be treated more like a
title (as it is settable by the page owner and easy to spam). I will look at
the parameters once I have a tested implementation, so I can index the same
segments with different parameters and compare results.
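
For the first idea (separate host and path fields instead of an indexed url), the split itself could look roughly like the sketch below. It only uses java.net.URI from the JDK. The class UrlParts and the method hostAndPath are hypothetical names for this example and are not in the patch.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical illustration only; not part of the patch. */
public class UrlParts {

    /**
     * Returns a two-element array {host, path}. Protocol, port, query
     * string and fragment are dropped, matching the idea of indexing
     * only the host and the path.
     */
    public static String[] hostAndPath(String url) {
        URI uri = URI.create(url); // throws IllegalArgumentException on a malformed url
        String host = uri.getHost() == null ? "" : uri.getHost().toLowerCase(Locale.ROOT);
        String path = uri.getPath() == null ? "" : uri.getPath();
        return new String[] { host, path };
    }

    public static void main(String[] args) {
        String[] parts = hostAndPath("http://www.example.com:8080/docs/index.html?x=1");
        System.out.println("host=" + parts[0]); // host=www.example.com
        System.out.println("path=" + parts[1]); // path=/docs/index.html
    }
}
```

Each part would then go into its own indexed field, so the host is counted once and the path can get its own boost and length normalization.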

Do you think it makes sense to add such functionality? If so, I can make these two additional changes before posting a patch.

Regards,
Piotr

