Re: [Nutch-dev] Adding title and site to scoring

Piotr Kosiorowski Wed, 23 Mar 2005 06:18:48 -0800

Hello,

I spent today reading the code and implementing some features, and I want to summarize the results as I promised Andrzej and Michael. This email is a bit long because I promised some details.

Status of related features in the current Nutch codebase:

- The "site" field added by SiteIndexingFilter cannot be used for hostname
  storage because it is not tokenized. As I understand the purpose of this
  plugin (limiting answers to a given site), it should not be tokenized,
  but we need a tokenized host.

- There is a "title" field added by the index-basic plugin, but it is not
  indexed; it is stored only for display purposes.

There are two sets of changes required to add host and title fields to
the index and use them during search.

Indexing changes:

- index-basic plugin:
  I assume the index-basic functionality is to be changed to include an
  indexed, tokenized, unstored "host" field and an indexed, tokenized,
  stored "title" field, and to exclude the title from the "anchor" field.

- NutchDocumentAnalyzer:
  For "host" and "title", use the same analyzer as for "anchor" and "url".

- NutchSimilarity:
  Length normalization should treat host as url and title as anchor for now.

Searching:

- BasicQueryFilter:
  Add host and title fields, handled exactly like all other fields. To
  start, I will set TITLE_BOOST=1.5 and HOST_BOOST=2 (the host would be
  matched twice, in the "host" and "url" fields, so it will influence the
  score very much). After implementation I will run some tests to choose
  boost values that look OK (at least to me).
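
To make the "host is matched twice" concern concrete, here is a minimal sketch in plain Java. It is not the BasicQueryFilter code and does not use Lucene: it simply adds up the boosts of every field in which a term matches, ignoring tf/idf and length normalization. TITLE_BOOST and HOST_BOOST are the starting values above; the class name, the method name, and the url boost of 4.0 are assumptions made only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class FieldBoostSketch {
    // Starting values proposed in this email.
    static final float TITLE_BOOST = 1.5f;
    static final float HOST_BOOST = 2.0f;

    /**
     * Sums the boosts of all fields in which a query term matched.
     * This is a deliberate simplification: a real scorer also applies
     * tf/idf and per-field length normalization.
     */
    static float combinedBoost(Map<String, Float> boosts, Set<String> matchedFields) {
        float sum = 0f;
        for (String field : matchedFields) {
            Float boost = boosts.get(field);
            if (boost != null) {
                sum += boost;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Float> boosts = new LinkedHashMap<>();
        boosts.put("title", TITLE_BOOST);
        boosts.put("host", HOST_BOOST);
        boosts.put("url", 4.0f); // hypothetical value, for illustration only

        // A term found in the hostname matches both "host" and "url".
        Set<String> hostHit = new HashSet<>(Arrays.asList("host", "url"));
        // A term found only in the title matches just "title".
        Set<String> titleHit = new HashSet<>(Arrays.asList("title"));

        System.out.println("host term boost:  " + combinedBoost(boosts, hostHit));
        System.out.println("title term boost: " + combinedBoost(boosts, titleHit));
    }
}
```

Under these assumed numbers a hostname term collects the host boost on top of the url boost, which is why the boost values need tuning against real segments.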


I have already implemented all these changes (not a lot of work once I
figured out what actually had to change) and I will do basic tests
tomorrow. After basic verification of the implementation I will send the
patch so that others who are interested can try it and comment on the
results.


The changes introduced by this patch would modify the index structure
(addition of new fields) and would change the default query. I think it
should be possible to use the new code with an old index (it should
behave like the old code, as the new fields in the query would not be
present in the documents), but mixing new and old segments might be a
problem. So I think this change requires reindexing.


During implementation I came up with two additional ideas:
1) Do not index url (keep it as stored only field) - add separate host
and path fields as indexed  (it will not index protocol, port and some
other parts of url but I am not sure if indexing them makes sense). It
will be easier to control effect of weights and length normalization if
host is not counted twice, but this would require reindexing as some old
fields would be used differently in query - so it will not work as
before with old index.

2) I do not have any evidence yet, but looking at the data I have a
feeling that the "not host" part of a url is not as important as its
current boost factor indicates. Probably it should be treated more like
a title (as it is settable by the page owner and easy to spam). I will
look at the parameters once I have a tested implementation, so that I
can index the same segments with different parameters and compare
results.
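
To illustrate idea 1, here is a minimal sketch of splitting a url into separate host and path values using the standard java.net.URI class. It is not taken from the patch; the class and method names are hypothetical, and lowercasing the host is my assumption. The protocol, port and query string are simply dropped, as described above.

```java
import java.net.URI;
import java.util.Locale;

public class UrlPartsSketch {
    /**
     * Returns {host, path} for the given url. Protocol, port and query
     * string are not included. Missing parts become empty strings.
     */
    static String[] hostAndPath(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost() == null
                ? ""
                : uri.getHost().toLowerCase(Locale.ROOT); // assumption: host is case-insensitive
        String path = uri.getPath() == null ? "" : uri.getPath();
        return new String[] { host, path };
    }

    public static void main(String[] args) {
        String[] parts = hostAndPath("http://www.example.com:8080/docs/index.html?x=1");
        System.out.println("host: " + parts[0]);
        System.out.println("path: " + parts[1]);
    }
}
```

With the two values separated like this, each can get its own field, boost and length normalization, and the host is no longer counted a second time through the url field.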

Do you think it makes sense to add such functionality? If so I can change these two additional things before posting a patch.

Regards,
Piotr

