[General] Webboard: mnogosearch-3.4.1 on FreeBSD 10.3

2016-06-02
Author: Dmitriy Kulikov
Thank you very much, it works!

Cyrillic words are now found correctly in the database. 
The first lines of each result are now also shown in the correct encoding. 
That's fine!

навигатор : 610 
Results 1-10 of 89 ( 0.009 seconds) 
1   Главная   [ 11.193% Popularity: 0.89705 ] 

I insert in indexer.conf:

And I set in search.htm:
  string BrowserCharset= "windows-1251";
  string LocalCharset= "UTF-8";

My MySQL client used the default settings (we still don't actually use UTF-8 
databases); I have just changed it to UTF-8.
| 30летних   | 3330D0BBD0B5D182D0BDD0B8D185 |
| 3летний| 33D0BBD0B5D182D0BDD0B8D0B9   |


General mailing list

2016-06-02
Author: Alexander Barkov
Email: b...@mnogosearch.org
Hmm. It seems your MySQL client is not configured well.
It's using latin1 as the connection character set, while the
display is obviously utf8. So it prints garbage instead of
Cyrillic letters.

You can check this using "show variables like 'character_set%';".
It seems character_set_connection is latin1.

In order to see Cyrillic letters, you can try:

- mysql --default-character-set=utf8
- or put default-character-set=utf8 into my.cnf
- or run "SET NAMES utf8" immediately after connecting

Note, this does not affect how indexer works.
It's only for the "mysql" client.

> The results are the same for both bases.

They are not. Hex codes are different.
The old database contains Cyrillic codes,
the new database contains something different for the same
words.
This is wrong:

| 30летних   | 3330C390C2BBC390C2B5C391E2809AC390C2BDC390C2B8C391E280A6 |

This is correct:
| 30летних   | 3330D0BBD0B5D182D0BDD0B8D185 |
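The difference between the two hex strings is consistent with double encoding: the UTF-8 bytes are read as a single-byte Western code page and then encoded to UTF-8 a second time. A minimal Python sketch of that effect, assuming the intermediate step behaves like Windows-1252 (which MySQL's latin1 largely corresponds to); the helper name `double_encode` is only for illustration:

```python
def double_encode(word: str) -> bytes:
    """Encode to UTF-8, misread the bytes as cp1252, then encode to UTF-8 again."""
    utf8_bytes = word.encode("utf-8")
    # Assumption: the wrong connection charset behaves like cp1252 for these bytes.
    misread = utf8_bytes.decode("cp1252")
    return misread.encode("utf-8")


word = "30летних"
print(word.encode("utf-8").hex().upper())  # hex of the clean UTF-8 form
print(double_encode(word).hex().upper())   # hex after double encoding
```

For this word the two printed values should match the "correct" and "wrong" hex strings above. Words whose UTF-8 form contains a byte that Python's cp1252 codec leaves undefined (for example 0x81) would raise an error in this sketch, so it is only a demonstration, not a repair tool.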

Try adding "SetNames=utf8" to the DBAddr string in indexer.conf for the 
new database, like this:

DBAddr mysql://root@localhost/test/?SetNames=utf8

then clean the database and crawl and index again.
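
One way to double-check which SetNames value a DBAddr line actually carries is to parse it as a URL. A minimal Python sketch; the helper name `setnames_value` is hypothetical and not part of mnogosearch, and it assumes that DBAddr follows ordinary URL query-string syntax as in the example above:

```python
from urllib.parse import urlsplit, parse_qs


def setnames_value(dbaddr: str):
    """Return the SetNames parameter of a DBAddr URL, or None if it is absent."""
    query = urlsplit(dbaddr).query
    # Compare parameter names case-insensitively (an assumption of this sketch).
    params = {key.lower(): values for key, values in parse_qs(query).items()}
    values = params.get("setnames")
    return values[0] if values else None


print(setnames_value("mysql://root@localhost/test/?SetNames=utf8"))
```

MySQL writes its character set names without a hyphen (utf8, not UTF-8), so it may be worth comparing the returned value with the names listed by "SHOW CHARACTER SET".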


2016-06-01
Author: Dmitriy Kulikov
The results are the same for both bases.

mysql> use mnogosearch_new;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> SELECT word, hex(word) FROM bdict WHERE word NOT RLIKE '^[a-z0-9?#_]*$' LIMIT 30;
| word | hex(word)  
| 000в| 303030C390C2B2 
| 099в| 303939C390C2B2 
| 107рѕ  | 313037C391E282ACC391E280A2 
| 10млн | 3130C390C2BCC390C2BBC390C2BD   
| 11в | 3131C390C2B2   
| 18в | 3138C390C2B2   
| 1970Ñ…   | 31393730C391E280A6 
| 1980г   | 31393830C390C2B3   
| 1в  | 31C390C2B2 
| 1Ñ€  | 31C391E282AC   
| 2001г   | 32303031C390C2B3   
| 2002рі | 32303032C391E282ACC391E28093   
| 2004г   | 32303034C390C2B3   
| 2006г   | 32303036C390C2B3   
| 2008г   | 32303038C390C2B3   
| 2009г   | 32303039C390C2B3   
| 2009рі | 32303039C391E282ACC391E28093   
| 2011г   | 32303131C390C2B3   
| 2012рі | 32303132C391E282ACC391E28093   
| 20Ñ | 3230C391C281
| 30летних   | 3330C390C2BBC390C2B5C391E2809AC390C2BDC390C2B8C391E280A6 |
| 3летний| 33C390C2BBC390C2B5C391E2809AC390C2BDC390C2B8C390C2B9 |
| 40в | 3430C390C2B2   
| 41в | 3431C390C2B2   
| 48в | 3438C390C2B2   
| 599в| 353939C390C2B2 
| 59в | 3539C390C2B2   
| 600в| 363030C390C2B2 
| 60в | 3630C390C2B2   
| 90Ñ… | 3930C391E280A6 
30 rows in set (0,00 sec)

mysql> use mnogosearch;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> SELECT word, hex(word) FROM bdict WHERE word NOT RLIKE '^[a-z0-9?#_]*$' LIMIT 30;
| word | hex(word)|
| 000в| 303030D0B2   |
| 099в| 303939D0B2   |
| 107рѕ  | 313037D180D195   |
| 10млн | 3130D0BCD0BBD0BD |
| 11в | 3131D0B2 |
| 18в | 3138D0B2 |
| 1970Ñ…   | 31393730D185 |
| 1980г   | 31393830D0B3 |
| 1в  | 31D0B2   |
| 1Ñ€  | 31D180   |
| 2001г   | 32303031D0B3 |
| 2002рі | 32303032D180D196 |
| 2004г   | 32303034D0B3 |
| 2006г   | 32303036D0B3 |
| 2008г   | 32303038D0B3 |
| 2009г   | 32303039D0B3  

2016-06-01
Author: Alexander Barkov
Email: b...@mnogosearch.org
Can you try this one:

SELECT word, hex(word) FROM bdict WHERE word NOT RLIKE '^[a-z0-9?#_]*$' LIMIT 30;

The idea is to get words with Cyrillic letters and see
their HEX representation.
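
The pattern tried first, '^[^a-z]$', can only match a word that is exactly one character long, which would explain the empty result. The new pattern selects any word containing a character outside the listed ASCII set. A small Python sketch of the difference, using Python's `re` module as a stand-in for MySQL's RLIKE (an assumption: the two engines differ in details such as byte versus character matching and case sensitivity):

```python
import re

single_char = re.compile(r"^[^a-z]$")       # exactly one character that is not a-z
ascii_only = re.compile(r"^[a-z0-9?#_]*$")  # whole word built from the listed ASCII chars

for word in ["navigator", "2016", "агент", "30летних"]:
    matches_old = bool(single_char.search(word))
    selected_new = not ascii_only.search(word)  # mirrors NOT RLIKE
    print(f"{word!r}: old pattern matches={matches_old}, new query selects={selected_new}")
```

In this sketch none of the sample words match the old pattern, while the new query selects only the two words that contain Cyrillic letters.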

> I got "Empty set" for both databases.
> mysql> use mnogosearch_new;
> Reading table information for completion of table and column names
> You can turn off this feature to get a quicker startup with -A
> Database changed
> mysql> SELECT word, hex(word) FROM bdict WHERE word RLIKE '^[^a-z]$' LIMIT 30;
> Empty set (0,02 sec)
> mysql> use mnogosearch;
> Reading table information for completion of table and column names
> You can turn off this feature to get a quicker startup with -A
> Database changed
> mysql> SELECT word, hex(word) FROM bdict WHERE word RLIKE '^[^a-z]$' LIMIT 30;
> Empty set (0,02 sec)



2016-06-01
Author: Dmitriy Kulikov
I got "Empty set" for both databases.

mysql> use mnogosearch_new;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> SELECT word, hex(word) FROM bdict WHERE word RLIKE '^[^a-z]$' LIMIT 30;
Empty set (0,02 sec)

mysql> use mnogosearch;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> SELECT word, hex(word) FROM bdict WHERE word RLIKE '^[^a-z]$' LIMIT 30;
Empty set (0,02 sec)



2016-06-01
Author: Alexander Barkov
Email: b...@mnogosearch.org
> Thank you!
> The problem with search.cgi was really because of the changed format 
> search.htm
> But I have problems with encodings (e.g. Cyrillic windows-1251 or UTF-8).
> I installed both versions of mnogosearch with separate bases, but with the 
> same settings.
> The old version works fine, but the new one has problems.
> Encoding settings:
> indexer.conf
>   RemoteCharset windows-1251
>   LocalCharset UTF-8
> search.htm
>   string BrowserCharset= "windows-1251";
>   string LocalCharset= "UTF-8";

Please start investigating the problem by checking the data
in the database. It's important to make sure that indexer
collects data in true utf8.

What does this query return:

SELECT word, hex(word) FROM bdict WHERE word RLIKE '^[^a-z]$' LIMIT 30;


> 1) The New version requires that the base encoding by default coincided with 
> LocalCharset:
> utf8_unicode_ci;
> Otherwise, you get the message in stderr:
> An error occurred!
> DB: MySQL driver: #1267: Illegal mix of collations 
> (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='
> 2) With the same settings in  indexer.conf and search.htm  the search in the 
> Cyrillic is not working in the new version of mnogosearch.
> Setting of BrowserCharset= "UTF-8" does not change anything.
> Your search - "агент" - did not match any documents.
> Debug log:
> May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start UdmFind
> May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start Prepare
> May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Stop  Prepare  
> May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start FindWords
> May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start FindWordsDB for 
> mysql://mnogosearch_new:***@localhost/mnogosearch_new/?dbmode=blob&SetNames=UTF-8
> May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start loading limits
> May 31 22:31:10 *** search.cgi[79240]: [79240]{--} WHERE limit loaded. 149 
> URLs found
> May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Stop  loading limits   
>0.01 (149 URLs found)
> May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start fetching words
> May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start search for 
> 'агенСM-^B'
> May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start fetching
> May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Stop  FindWordsDB: 
> May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start UdmQueryConvert
> May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Stop  UdmQueryConvert: 
> May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start Excerpts
> May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Stop  Excerpts:
> May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start WordInfo
> May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Stop  WordInfo:
> May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Stop  UdmFind: 
> 3) When searching for words in the Latin, the base gives the text fragments 
> in the correct Cyrillic, but the header of each retrieved document is always 
> issued in the wrong encoding:
> navigator : 405   
> Results 1-10 of 99 ( 0.021 seconds)
>   ?“?»?°?°??   [ 15.095% Popularity: 0.89705 ]
> ... сети Интернет по адресу: http://navigator***.ru Прежде чем приобрести ...
> I would be very grateful for help with solving the last two problems.
> Generally, when we install programs, they have the possibility of issuing 
> various warning messages.
> It would be nice if a new version of mnogosearch will warn about occurred 
> serious changes.
> I set up our old CMS to the new server and there are possible experiments. 
> But if a new version of mnogosearch will installed as one of the updates to 
> the server under working loads, then there would be a complete disaster.
> Regarding to a long hang of mnogosearch indexing.
> I found that this is due to the very slow network retrieval of large PDF 
> documents.
> I tried to set minimum limits of timeouts, but it does not help.
> MaxNetErrors 10
> ReadTimeOut 10s
> DocTimeOut 30s
> For example, I tried to set a time limit of 300s indexing, but indexing took 
> 1360s. Moreover, the document was not indexed.
> /usr/local/bin/indexer -ob -v6 -N 1 -c 300 
> /usr/local/etc/mnogosearch/indexer.conf 2> /var/log/mnogosearch.log
> --
> Done (1360 seconds, 1 documents, 11049522 bytes,  7.93 Kbytes/sec.)
> I sent you the log of attempt of indexing this one document.
> When I set: 
> Disallow *.pdf
> indexing is fast.
> Why is setting of time limits doesn't help? How can avoid such lockups of the 
> indexing process?



2016-06-01
Author: Dmitriy Kulikov
Thank you!

The problem with search.cgi was indeed caused by the changed format of search.htm.
But I have problems with encodings (e.g. Cyrillic windows-1251 or UTF-8).
I installed both versions of mnogosearch with separate bases, but with the same 
settings.
The old version works fine, but the new one has problems.

Encoding settings:
indexer.conf
  RemoteCharset windows-1251
  LocalCharset UTF-8
search.htm
  string BrowserCharset= "windows-1251";
  string LocalCharset= "UTF-8";

1) The new version requires the default database encoding to match 
LocalCharset:
utf8_unicode_ci;
Otherwise, you get the message in stderr:
An error occurred!
DB: MySQL driver: #1267: Illegal mix of collations (latin1_swedish_ci,IMPLICIT) 
and (utf8_general_ci,COERCIBLE) for operation '='

2) With the same settings in indexer.conf and search.htm, Cyrillic search 
does not work in the new version of mnogosearch.
Setting BrowserCharset= "UTF-8" does not change anything.

Your search - "агент" - did not match any documents.

Debug log:
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start UdmFind
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start Prepare
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Stop  Prepare
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start FindWords
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start FindWordsDB for mysql://mnogosearch_new:***@localhost/mnogosearch_new/?dbmode=blob&SetNames=UTF-8
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start loading limits
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} WHERE limit loaded. 149 URLs found
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Stop  loading limits 0.01 (149 URLs found)
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start fetching words
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start search for 'агенСM-^B'
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start fetching
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Stop  FindWordsDB:   
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start UdmQueryConvert
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Stop  UdmQueryConvert:   
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start Excerpts
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Stop  Excerpts:  
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start WordInfo
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Stop  WordInfo:  
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Stop  UdmFind:   

3) When searching for Latin words, the database returns the text fragments in 
correct Cyrillic, but the header of each retrieved document is always 
shown in the wrong encoding:
navigator : 405 
Results 1-10 of 99 ( 0.021 seconds)
?“?»?°?°??   [ 15.095% Popularity: 0.89705 ]
... сети Интернет по адресу: http://navigator***.ru Прежде чем приобрести ...

I would be very grateful for help with solving the last two problems.

Generally, when programs are installed, they are able to issue 
various warning messages.
It would be nice if a new version of mnogosearch warned about 
serious changes.
I am setting up our old CMS on the new server, so experiments are possible there. But 
if a new version of mnogosearch were installed as one of the updates on a 
server under production load, it would be a complete disaster.

Regarding the long hang of mnogosearch indexing:
I found that this is due to very slow network retrieval of large PDF 
documents.
I tried setting minimal timeouts, but it does not help.
MaxNetErrors 10
ReadTimeOut 10s
DocTimeOut 30s

For example, I tried to set an indexing time limit of 300s, but indexing took 
1360s. Moreover, the document was not indexed.
/usr/local/bin/indexer -ob -v6 -N 1 -c 300 /usr/local/etc/mnogosearch/indexer.conf 2> /var/log/mnogosearch.log
Done (1360 seconds, 1 documents, 11049522 bytes,  7.93 Kbytes/sec.)

I sent you the log of the attempt to index this one document.

When I set: 
Disallow *.pdf
indexing is fast.

Why doesn't setting the time limits help? How can I avoid such lockups of the 
indexing process?



2016-05-31
Author: Alexander Barkov
Email: b...@mnogosearch.org
> I just tried unsuccessfully to install and configure the mnogosearch-3.4.1 on 
> FreeBSD 10.3
> I lost a lot of time, because it turned out that "search.cgi" fundamentally 
> does not work, and without any diagnostic information.
> I many times check it out by different ways. The search base is created 
> successfully, but it is impossible to use it. 
> There is no difference when building a program from the ports or from the 
> archive on your website.
> The test script, recommended by you, gives an empty output when run in the 
> console. 
> --
> #!/bin/sh
> echo Content-Type: text/plain
> echo
> /usr/local/bin/search.cgi navigator 2>&1
> --

What does your search.htm look like?

It should start with a processing instruction, like this:

> The cgi script log contains only:
> --
> %% [Fri May 27 18:54:09 2016] GET /cgi-bin/test_search.cgi HTTP/1.1
> %% 500 /data/sites/cgi-bin/test_search.cgi
> %request
> Host: www.***
> User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:46.0) Gecko/20100101 Firefox/46.0
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.5,en;q=0.3
> Accept-Encoding: gzip, deflate
> DNT: 1
> Cookie: user_city=1
> X-Compress: 1
> Proxy-Authorization: 
> d05a8777e173c2b13a81d919589dd9b2b9bf9911f681b2c82b1d3c9db748cfb33b300d2021dac648
> Connection: keep-alive
> %response
> When I try to use the search, the log contains this:
> --
> %% [Fri May 27 18:54:07 2016] GET 
> /cgi-bin/search.cgi?ul=http://www.***/&q=%EA%EE%EC%EF%E0%ED%E8%FF&ps=10&wf=2221&m=all&np=0&sy=1&sp=1&wm=wrd
>  HTTP/1.1
> %% 500 /data/sites/cgi-bin/search.cgi
> %request
> Host: www.***
> Accept: */*
> %response
> Apache24 error log contains only:
> End of script output before headers: search.cgi
> I was forced to install and use the old version of the program from your 
> website.
> Can You report the problem to the package maintainer of this FreeBSD port or 
> I must to do this?
> Additional question.
> I noticed that the program hangs for a very long time without consuming 
> system resources.
> When you start indexing, the system load is slightly increased, but it 
> decreases rapidly to zero, although the indexing process lasts a long time.
> For example, I place a limit of indexing 10min, but the program runs about 
> 12min, moreover, without consuming system resources.
> --
> #!/bin/sh
> /usr/local/mnogosearch/sbin/indexer -l -Cw 
> /usr/local/mnogosearch/etc/indexer.conf > /dev/null 2>&1
> /usr/local/mnogosearch/sbin/indexer -ob -v5 -N 1 -c 600 
> /usr/local/mnogosearch/etc/indexer.conf 2> /var/log/mnogosearch.log
> /usr/local/mnogosearch/sbin/indexer -l --index
> --
> What is the reason of this apparent anomaly?

Are you crawling some public site? Which URL does it get stuck on?

Can you please send mnogosearch.log to b...@mnogosearch.org?




2016-05-31
Author: Dmitriy Kulikov
I have just tried, unsuccessfully, to install and configure mnogosearch-3.4.1 on 
FreeBSD 10.3.
I lost a lot of time, because it turned out that "search.cgi" does not work 
at all and gives no diagnostic information.
I checked it many times in different ways. The search base is created 
successfully, but it is impossible to use it. 
It makes no difference whether the program is built from the ports or from the 
archive on your website.

The test script you recommended gives empty output when run in the 
console.
--
#!/bin/sh
echo Content-Type: text/plain
echo
/usr/local/bin/search.cgi navigator 2>&1
--

The cgi script log contains only:
%% [Fri May 27 18:54:09 2016] GET /cgi-bin/test_search.cgi HTTP/1.1
%% 500 /data/sites/cgi-bin/test_search.cgi
Host: www.***
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:46.0) Gecko/20100101 Firefox/46.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
DNT: 1
Cookie: user_city=1
X-Compress: 1
Connection: keep-alive

When I try to use the search, the log contains this:
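%% [Fri May 27 18:54:07 2016] GET /cgi-bin/search.cgi?ul=http://www.***/&q=%EA%EE%EC%EF%E0%ED%E8%FF&ps=10&wf=2221&m=all&np=0&sy=1&sp=1&wm=wrd HTTP/1.1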
%% 500 /data/sites/cgi-bin/search.cgi
Host: www.***
Accept: */*

Apache24 error log contains only:
End of script output before headers: search.cgi

I was forced to install and use the old version of the program from your 
website.
Can you report the problem to the maintainer of this FreeBSD port, or 
must I do this?

Additional question.
I noticed that the program hangs for a very long time without consuming system 
resources.
When indexing starts, the system load increases slightly, but it 
quickly drops to zero, although the indexing process lasts a long time.
For example, I set an indexing limit of 10 min, but the program runs for about 
12 min, again without consuming system resources.

--
#!/bin/sh
/usr/local/mnogosearch/sbin/indexer -l -Cw /usr/local/mnogosearch/etc/indexer.conf > /dev/null 2>&1
/usr/local/mnogosearch/sbin/indexer -ob -v5 -N 1 -c 600 /usr/local/mnogosearch/etc/indexer.conf 2> /var/log/mnogosearch.log
/usr/local/mnogosearch/sbin/indexer -l --index
--

What is the reason for this apparent anomaly?

