According to Radoy Pavlov:
> Gilles Detillieux wrote:
> > By "polish doesn't work", do you mean you're having problems indexing
> > the Polish text with htdig, as well as having problems with building
> > the Polish endings database?  The endings database is only used for
> > the "endings" fuzzy match algorithm, and isn't absolutely essential.
> > 
> Yes, I'm having problems indexing Polish text and I can't build the endings
> database as well. I'm running: htdig -i -vvv -c /path/to/htdig_pl.conf
> The robot goes through all Polish pages. No database.

That's very bizarre!  If it goes through all the pages, there really
should be a database created.  Is htdig finding words in these pages?
Try running "htdig -i -vvvv -c /path/to/htdig_pl.conf" (one extra
-v option) to get feedback on word parsing.  Is there a db.wordlist
created in your database_dir?  If all the words in all the documents
contained lots of accented letters, such that htdig never saw three or
more unaccented letters in a row, and if the LC_CTYPE table for your
locale isn't working, it's conceivable that htdig would not find a single
word longer than minimum_word_length, but that seems pretty unlikely.
There must be something else odd happening here.
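
To make that scenario concrete, here is a rough sketch in Python.  It is
not htdig's actual parser, only an illustration under assumptions: the
function names are made up, the minimum length of 3 is taken from the
"three or more" above, and broken_ctype() stands in for a locale whose
LC_CTYPE table only recognises plain ASCII letters.

```python
import string

ASCII_LETTERS = set(string.ascii_letters)

def split_words(text, is_word_char, min_length=3):
    """Split text into runs of word characters; keep runs of min_length or more."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        else:
            # A non-word character ends the current run.
            if len(current) >= min_length:
                words.append("".join(current))
            current = []
    if len(current) >= min_length:
        words.append("".join(current))
    return words

def broken_ctype(ch):
    # Hypothetical stand-in: a character table that only knows ASCII letters.
    return ch in ASCII_LETTERS

def working_ctype(ch):
    # Hypothetical stand-in: accented letters count as letters too.
    return ch.isalpha()

text = "żółć następny źdźbło"
print(split_words(text, working_ctype))  # whole words are kept
print(split_words(text, broken_ctype))   # only ASCII fragments survive
```

With the broken table, "następny" is cut into "nast" and "pny", and the
other two words yield nothing at all.  A whole site would have to consist
of words like those two to end up with no words indexed, which is why it
seems unlikely.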

> From my conf file:
> 
> locale:               pl_PL.ISO_8859-2
> lang_dir:             ${common_dir}/polish
> # bad_word_list:        ${lang_dir}/bad_words
> endings_affix_file:   ${lang_dir}/polish.aff
> endings_dictionary:   ${lang_dir}/polish.0
> endings_root2word_db: ${lang_dir}/root2word.db
> endings_word2root_db: ${lang_dir}/word2root.db
> 
> > If polish accented letters aren't indexed properly, then it may be
> > because the pl_PL locale on your system doesn't define a proper
> > LC_CTYPE file for the ISO-8859-2 character set.  If they are indexed
> > properly, you could simply take the endings algorithm out of your
> > search_algorithms attribute setting until you manage to build your
> > endings database.
> 
> ls -al /usr/share/locale/ | grep pl_
> drwxr-xr-x   2 root     wheel         512 Feb 22  2000 pl_PL.ISO_8859-2
> 
> in /usr/share/locale/pl_PL.ISO_8859-2
> lrwxrwxrwx   1 root     wheel          30 Feb 22  2000 LC_COLLATE -> ../lt_LN.ISO_8859-2/LC_COLLATE
> lrwxrwxrwx   1 root     wheel          28 Feb 22  2000 LC_CTYPE -> ../lt_LN.ISO_8859-2/LC_CTYPE
> -rw-r--r--   1 root     wheel         285 Dec 28  1999 LC_TIME
> 
> That's ok, isn't it ?

It certainly looks OK from the file listing.  Is the lt_LN.ISO_8859-2
locale installed on your system, so that the symbolic links resolve?
Of course, the presence of the LC_CTYPE file doesn't guarantee that it's
correct, but it does rule out an undefined locale.
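
If you want a quick way to confirm that the links resolve, something
along these lines should do.  This is only a sketch: dangling_links() is
a helper name I made up, and the directory is the one from your listing,
so adjust it as needed.  It only tells you whether the link targets
exist, not whether the LC_CTYPE table itself is correct.

```python
from pathlib import Path

def dangling_links(locale_dir):
    """Return the names of symbolic links in locale_dir whose targets are missing."""
    return sorted(
        entry.name
        for entry in Path(locale_dir).iterdir()
        # exists() follows the link, so it is False when the target is missing.
        if entry.is_symlink() and not entry.exists()
    )

if __name__ == "__main__":
    # Path taken from the listing above; adjust for your system.
    locale_dir = Path("/usr/share/locale/pl_PL.ISO_8859-2")
    if locale_dir.is_dir():
        print(dangling_links(locale_dir) or "all links resolve")
    else:
        print(f"{locale_dir} not found")
```

Any name it prints is a link that points at a file that isn't there.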

> > With htfuzzy -vv, you should be getting much more output than that.
> > Is there anything in your polish.0 file?  You should get a message
> > for each word processed from that file.
> 
> The output is just the same. I can see rich output for any of the other
> 6 languages on my site. Everything is just fine, except with Polish.
> 
> cat polish.0 | wc -l
> 52038
> 
> cat polish.aff | wc -l
> 4735

Can you send in these two files, for someone else to try htfuzzy on them?

By the way, on which platform are you running htdig?  (I.e. which OS version?
Which distribution version if it's Linux?)  Also, which version of htdig
are you running?  Did you build it yourself from source?

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
