According to Pietro Palladino:
> Ok, I found Italian dictionary (ispell), I read the FAQ 4.10 and...surprise!!
> There's nothing about which are the name of the dictionaries to merge :-(((
Unfortunately, there's not a lot of consistency in the way dictionaries
for various languages are packaged. For many languages, the word lists
are separate and must be concatenated, sorted and uniq'ed (i.e. merged
together), but for others, the word list is complete in one file.
There's also no consistency as to which file name extension is used to
identify the word lists, so you pretty much have to look through the files
and see which are appropriate. Select lists that have just words, one per
line, with many words having a slash (/) followed by capital letter flags
(e.g. abbagliare/AFP). These are the ones that htfuzzy endings needs,
and it ignores the ones without flags. If there are multiple files like
that, take the ones you want and leave out the others, e.g. any that have
specialised jargon inappropriate for your site.
The "cat *" in the FAQ must be taken with a grain of salt, as you don't
merge all files in the dictionary package, only the word lists. We had
to say "*", though, because there's no consistent set of file names or
extensions.
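As a rough sketch of that selection step (the sample file names and
their contents here are made up for illustration, and the pattern is only
my guess at what a flagged word line looks like), you could let grep
pick out the files that contain word/FLAGS lines and merge only those:

```shell
# Work in a scratch directory with made-up sample files (names are hypothetical).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'abbagliare/AFP\nabbaiare/AF\ncasa\n' > sample1.words
printf 'abbaiare/AF\nzucchero/S\n' > sample2.words
printf 'This package contains an Italian dictionary.\n' > README

# List the files that contain at least one "word/FLAGS" line.
candidates=$(grep -lE '^[^/[:space:]]+/[A-Za-z]+$' ./* | sort)
echo "$candidates"

# Merge only those files; "sort -u" does the work of "sort | uniq".
# $candidates is left unquoted on purpose so it splits into file names
# (this assumes the names contain no whitespace).
cat $candidates | LC_ALL=C sort -u > lang.0
cat lang.0
```

Here the README is skipped because none of its lines look like a flagged
word, while both sample word lists are merged into lang.0. You would
still want to look over the candidate list by hand and drop any files
you don't want before merging.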
> I've no idea about which they are, if the dictionaries are the result of the
> installation or they are the ones in the installation package. Could someone
> help me?
In the case of the ispell-it2001.tgz package, there's only one word list
file, italian.words, initially. Alternatively, you can run "make", which
builds an italian.words+ file that appears to be a bit more complete.
Either will do, I think, but my gut feeling is that the latter would be
the better choice.
> However, there's a funny thing I noticed... before merging the "partial
> dictionaries" in a ".0" file, I wanted to try executing the command "cat
> * | sort | uniq > lang.0". So I created 3 files ".txt" with few names in each
> one...
> (1.txt)
> Mario
> Gennaro
> Antonio
> (2.txt)
> Carolina
> Nicola
> nicola
> Paolo
> (3.txt)
> Pietro
> Tiziana
> daniela
> Pierpaolo
> Paola
> Antonio
> Mario
> Then I wrote: cat * | sort | uniq > all.txt and this is the result....
> (all.txt)
>
> Antonio
> Antonio
> Carolina
> daniela
> Gennaro
> Mario
> Mario
> nicola
> Nicola
> Paola
> Paolo
> Pierpaolo
> Pietro
> Tiziana
> >>> NOTE the whitespace at the beginning of the file...
> What's happened? Why uniq didn't work? I tried to add the -u flag too, but the
> result was the same :-((((
uniq worked fine, as far as I can see. The apparently duplicate lines
above differ in trailing whitespace: one copy has 1 trailing space and
the other has no space at the end of the line, so as far as uniq is
concerned, they're different. uniq compares entire lines. The -u flag
doesn't help here, because it tells uniq to print only the lines that
are not repeated. As for the blank line at the beginning, I suspect you
may have had an extra newline at the end of one of your .txt files.
It's a moot point, though, because you don't need to do this merging
for italian.words.
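If you do want to check this for yourself, here is a small sketch using
made-up test files like yours (the contents are my assumption of what
happened, not your actual files). It makes the trailing spaces visible
and then strips them before sorting:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the situation: "Mario " (trailing space) vs "Mario",
# plus a stray blank line at the end of one file.
printf 'Mario \nAntonio\n' > 1.txt
printf 'Mario\nAntonio \n\n' > 3.txt

# sed's "l" command prints each line unambiguously and marks the end of
# the line with "$", so a trailing space shows up as "Mario $".
cat 1.txt 3.txt | sed -n 'l'

# Strip trailing whitespace, drop empty lines, then sort and remove duplicates.
cat 1.txt 3.txt | sed 's/[[:space:]]*$//' | grep -v '^$' | LC_ALL=C sort -u > all.txt
cat all.txt
```

With the whitespace removed, each name appears only once in all.txt and
the blank line is gone.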
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
FAQ: http://htdig.sourceforge.net/FAQ.html