Thanks, Hernán!

Stef

On Jan 17, 2013, at 5:02 PM, Hernán Morales Durand wrote:

> 
> Hi guys,
> For those working in information retrieval, for example on tf-idf
> ranking, adapted versions of the "Hapax" and "CodeFu" packages are in the
> BioSmalltalk repository: http://ss3.gemstone.com/ss/BioSmalltalk.html . I have
> ported some VisualWorks-specific code to Pharo 1.4 (on Windows it requires the
> ProcessWrapper package) and adapted some Hapax methods to work with corpora in
> different languages.
> 
> This is an example script for a corpus in Spanish:
> 
> | corpus tdm documents |
> 
> corpus := HXSpanishCorpus new.
> 
> documents := 'el río Danubio pasa por Viena, su color es azul
> el caudal de un río asciende en Invierno
> el río Rhin y el río Danubio tienen mucho caudal
> si un río es navegable, es porque tiene mucho caudal'.
> 
> documents lines doWithIndex: [:doc :index |
>       corpus
>               addDocument: index asString
>               with: (Terms new
>                       addString: doc
>                       using: CamelcaseScanner;
>                       yourself)].
> corpus removeStopwords.
> corpus stemAll.
> tdm := TermDocumentMatrix on: corpus.
> 
> Feel free to integrate this into any repository. To add a language, use the
> methods whose selectors include "spanish" as a template.
> Cheers,
> 
> Hernán
> 
> 

