Thanks Hernán!

Stef
On Jan 17, 2013, at 5:02 PM, Hernán Morales Durand wrote:

> Hi guys,
>
> For those working in information retrieval, for example doing tf-idf
> ranking, you can find the adapted packages "Hapax" and "CodeFu" in the
> BioSmalltalk repository http://ss3.gemstone.com/ss/BioSmalltalk.html . I have
> translated some VW-specific code to Pharo 1.4 (under Windows it requires the
> ProcessWrapper package) and adapted some Hapax methods to work with corpora in
> different languages.
>
> This is an example script for a corpus in Spanish:
>
> | corpus tdm documents |
>
> corpus := HXSpanishCorpus new.
>
> documents := 'el río Danubio pasa por Viena, su color es azul
> el caudal de un río asciende en Invierno
> el río Rhin y el río Danubio tienen mucho caudal
> si un río es navegable, es porque tiene mucho caudal'.
>
> documents lines doWithIndex: [:doc :index |
>     corpus
>         addDocument: index asString
>         with: (Terms new
>             addString: doc
>             using: CamelcaseScanner;
>             yourself)].
> corpus removeStopwords.
> corpus stemAll.
> tdm := TermDocumentMatrix on: corpus.
>
> Feel free to integrate it into any repository. If you want to add a language,
> just see the methods with selectors including "spanish".
>
> Cheers,
>
> Hernán
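
For readers new to tf-idf, the ranking mentioned in the message can be sketched with core Pharo collection classes alone. This is only an illustration under stated assumptions, not the Hapax implementation: it does not use HXSpanishCorpus or TermDocumentMatrix, it skips stopword removal and stemming, it tokenizes naively on spaces and commas, and it assumes the common weighting tf * ln(N / df), which may differ from what Hapax computes.

```smalltalk
"Illustrative tf-idf sketch with core classes only (assumption: idf = ln(N / df))."
| documents bags docFreq n scores |
documents := #(
    'el río Danubio pasa por Viena, su color es azul'
    'el caudal de un río asciende en Invierno'
    'el río Rhin y el río Danubio tienen mucho caudal'
    'si un río es navegable, es porque tiene mucho caudal').
n := documents size.

"One Bag of lowercased tokens per document: the raw term frequencies."
bags := documents collect: [:doc |
    (doc asLowercase findTokens: ' ,') asBag].

"Document frequency: the number of documents each term appears in."
docFreq := Bag new.
bags do: [:bag | docFreq addAll: bag asSet].

"tf-idf score of every distinct term in the first document."
scores := Dictionary new.
(bags at: 1) asSet do: [:term |
    scores
        at: term
        put: ((bags at: 1) occurrencesOf: term)
            * (n / (docFreq occurrencesOf: term)) ln].
scores
```

Evaluated in a workspace, a term that occurs in every document (such as 'río') scores zero, while a term found only in the first document (such as 'viena') scores ln 4, roughly 1.386. That is the effect tf-idf ranking relies on: ubiquitous terms carry no weight, distinctive ones do.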
