Hello,
I use htdig to index Polish pages (iso-8859-2, 8-bit). I found the
following strange behaviour. When I search for the word "międzynarodówka"
(don't laugh if you know what that means), htdig cannot find it. The
document is indexed (I can find it by searching for other words from the
document). Now the most interesting part of the story. I tried to index
only that file, so I prepared a special htdig.conf with a URL list
file containing the URL of this file. Then I ran:
$ htdig -vvvvv -s -i -c ./htdig.conf
[...]
Read a total of 1568 bytes
Tag: HTML
>, matched -1
Tag: HEAD
>, matched -1
Tag: TITLE
>, matched 0
word: NEP@23
word: MIĘDZYNARODÓWKA@29
Tag: /TITLE
>, matched 1
title: [NEP]: MIĘDZYNARODÓWKA,
Tag: META
http-equiv="Content-type"
content="text/html; charset=iso-8859-2"
>, matched 20
Tag: /META
>, matched -1
Tag: META
NAME="robots"
CONTENT="nofollow"
[...]
Tag: center>, matched -1
Tag: P
>, matched -1
Tag: FONT
SIZE="+2"
>, matched -1
Tag: B
>, matched -1
word: MIĘDZYNARODÓWKA@716
Tag: /B
>, matched -1
Tag: /FONT
>, matched -1
Tag: I
>, matched -1
word: Wyklęty@750
[...]
Tag: /P
>, matched -1
Tag: BR>, matched -1
Tag: /BODY
>, matched -1
Tag: /HTML
>, matched -1
head: MIĘDZYNARODÓWKA, Wyklęty powstań ludu ziemi, międzynar. hymn proletariatu i partii komunist.; tekst franc. E. Pottiera (drukowany 1887), muzyka P. Degeytera (1888); przekład pol. nieznanego autora z końca XIX w.; do 1944 hymn państw. ZSRR.
size = 1568
pick: omega, # servers = 1
htdig: Run complete
htdig: 1 server seen:
htdig: omega:80 1 document
The file db.wordlist has the content:
tekst i:0 l:830 w:170
państw i:0 l:961 w:39
międzynarodó i:0 l:29 w:97384 c:2
partii i:0 l:813 w:187
pol i:0 l:910 w:90
proletariatu i:0 l:798 w:202
muzyka i:0 l:873 w:127
degeytera i:0 l:883 w:117
ludu i:0 l:765 w:235
międzynar i:0 l:783 w:217
drukowany i:0 l:857 w:143
ziemi i:0 l:770 w:230
powstań i:0 l:758 w:242
komunist i:0 l:820 w:180
wyklęty i:0 l:750 w:250
zsrr i:0 l:969 w:31
końca i:0 l:934 w:66
autora i:0 l:926 w:74
hymn i:0 l:793 w:251 c:2
nep i:0 l:23 w:97700
pottiera i:0 l:847 w:153
franc i:0 l:837 w:163
nieznanego i:0 l:915 w:85
przekład i:0 l:900 w:100
Why is the word międzynarodówka not there? htdig knows that Ę is the
capital of ę (and Ó of ó), because it changes MIĘDZYNARODÓWKA to
międzynarodówka (there is an appropriate line in polish.aff).
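
One observation that may narrow this down: the entry "międzynarodó" in the dump looks like a prefix of the missing word, which hints at a cap on word length rather than a charset problem. Below is a minimal Python sketch (not part of htdig; the word list is copied by hand from the db.wordlist dump above, with the i:/l:/w: fields dropped) that checks how long the expected word is, how long the longest indexed word is, and which indexed entries are prefixes of the expected word.

```python
# Words copied by hand from the db.wordlist dump (location/weight fields dropped).
indexed_words = [
    "tekst", "państw", "międzynarodó", "partii", "pol", "proletariatu",
    "muzyka", "degeytera", "ludu", "międzynar", "drukowany", "ziemi",
    "powstań", "komunist", "wyklęty", "zsrr", "końca", "autora", "hymn",
    "nep", "pottiera", "franc", "nieznanego", "przekład",
]

# The word as it appears in the page, lowercased the way the index stores words.
expected = "MIĘDZYNARODÓWKA".lower()

# ISO-8859-2 is a single-byte encoding, so byte length equals character length.
assert len(expected.encode("iso-8859-2")) == len(expected)

# Length of the longest word that made it into the index.
longest = max(len(word) for word in indexed_words)

# Indexed entries that are proper prefixes of the expected word.
prefix_hits = [
    word for word in indexed_words
    if expected.startswith(word) and word != expected
]

print("expected word:", expected, "length:", len(expected))
print("longest indexed word length:", longest)
print("indexed prefixes of the expected word:", prefix_hits)
```

Note that "międzynar" also shows up as a prefix, but that one is a genuine abbreviation in the text ("międzynar. hymn"). If no indexed word is longer than "międzynarodó", the maximum_word_length attribute in htdig.conf would be the first setting to check; that it is the cause here is an assumption, not something the log confirms.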
The system is RH Linux 6.2
$ rpm -q htdig
htdig-3.1.5-0glibc21
$ rpm -q apache
apache-1.3.14-2.6.2
(but it shouldn't matter).
From htdig.conf:
[...]
max_head_length: 10000
max_doc_size: 200000
[...]
locale: pl_PL
lang_dir: ${common_dir}/polish
bad_word_list: ${lang_dir}/bad_words
endings_affix_file: ${lang_dir}/polish.aff
endings_dictionary: ${lang_dir}/polish.0
endings_root2word_db: ${lang_dir}/root2word.db
endings_word2root_db: ${lang_dir}/word2root.db
The file is an shtml file, but that shouldn't matter. The head of the html file is:
<HTML
><HEAD
><TITLE
>[NEP]: MIĘDZYNARODÓWKA, </TITLE
>
[...]
Then the text:
[...]
<BODY
><!--#include virtual="/moduly/header_NEP.inc" --><P
><FONT
SIZE="+2"
><B
>MIĘDZYNARODÓWKA, </B
></FONT
><I
>Wyklęty powstań ludu ziemi</I
>,
[...]
Any ideas??
Mirek