> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]] On behalf of Gilles Detillieux
> Sent: Wednesday, 09 January 2002 21:21
> To: Neal Richter
[snip]
> >
> > This method is pretty common in the IR research
> > community. Although it is worth mentioning that there is some controversy
> > over whether stemming increases or decreases accuracy in general.
> > Probably very dataset and desired results dependent...
>
True. Stemming can increase recall, but it can also reduce precision,
especially in boolean searches, which is our case with htdig. Vector-space
search algorithms can sometimes benefit more from stemming, but see below.

BTW, 'accuracy' is a concept under debate and research. Several different
measures exist today, for different applications of IR. None of them - the
measures, I mean :) - seems to satisfy everybody. Accuracy is always
subjective, some say.

> All the more reason to stick with the existing htfuzzy framework, so that
> stemming can be turned on/off at search time. The actual stemming technique
> can be applied either way - whether it's done at indexing time (within
> htdig) or just after indexing by htfuzzy. The framework used by algorithms
> like soundex, metaphone and accents could be applied equally well to the
> stemming algorithms, so I'd recommend studying these algorithms to figure
> out how to add the new stemming code to the mix.
>

I'd like to remark that stemming, just like other 'fuzzy' techniques, is by
its nature language dependent. Thus, to apply it correctly, whether at index
time or at search time, it is necessary to identify the content language of
each document. You see the implications. Without that identification,
stemming is more a source of noise than a help in multilingual document
collections.

Regards,
--
Quim

_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
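
For illustration only, here is a minimal Python sketch of the recall/precision trade-off mentioned above. Everything in it is an assumption made up for the example: `toy_stem` is a crude suffix stripper (not the Porter algorithm and not anything from htdig or htfuzzy), and the four documents and the relevance judgments are invented. It runs a one-word boolean query with and without stemming and compares the two measures.

```python
def toy_stem(word):
    """Crude English suffix stripper (hypothetical, NOT a real stemmer)."""
    if word.endswith("s") and len(word) > 3:
        word = word[:-1]          # meetings -> meeting, meets -> meet
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]          # meeting -> meet
    return word


def search(query, docs, stem=False):
    """Boolean single-term search; returns the set of matching doc ids."""
    norm = toy_stem if stem else (lambda w: w)
    term = norm(query.lower())
    return {
        doc_id
        for doc_id, text in docs.items()
        if term in {norm(w) for w in text.lower().split()}
    }


def precision_recall(hits, relevant):
    """Precision and recall of a result set against relevance judgments."""
    true_pos = len(hits & relevant)
    precision = true_pos / len(hits) if hits else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    return precision, recall


# Invented toy collection; docs 1 and 2 are judged relevant to "meeting".
DOCS = {
    1: "agenda for the board meeting",
    2: "weekly meetings are on monday",
    3: "the design meets all requirements",
    4: "search engine configuration",
}
RELEVANT = {1, 2}

exact_hits = search("meeting", DOCS)               # only doc 1 matches
stemmed_hits = search("meeting", DOCS, stem=True)  # docs 1, 2 and 3 match

print("exact:  ", precision_recall(exact_hits, RELEVANT))
print("stemmed:", precision_recall(stemmed_hits, RELEVANT))
```

In this toy collection, stemming lets the query find doc 2 ("meetings"), so recall goes from 0.5 to 1.0, but it also pulls in doc 3 ("meets"), so precision drops from 1.0 to 2/3. The same English-only rules applied to text in another language would conflate words more or less at random, which is the multilingual noise problem described above.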
