[General] Webboard: mnogosearch-3.4.1 on FreeBSD 10.3

2016-06-02 Thread bar
Author: Dmitriy Kulikov
Email: 
Message:
Thank you very much, it works!

Words in Cyrillic are now correctly searched in the database. 
The first line of each result is now also in the correct encoding.
That's fine!

навигатор : 610 
Results 1-10 of 89 ( 0.009 seconds) 
1   Главная   [ 11.193% Popularity: 0.89705 ] 


I inserted into indexer.conf:
DBAddr mysql://mnogosearch_new:@localhost/mnogosearch_new/?SetNames=utf8?dbmode=blob=/tmp/mysql.sock=yes

And I set in search.htm:
  string BrowserCharset= "windows-1251";
  string LocalCharset= "UTF-8";


My MySQL client used the default settings (we still don't actually use UTF-8 
databases); I have changed it to UTF-8 just now.
| 30летних | 3330D0BBD0B5D182D0BDD0B8D185 |
| 3летний  | 33D0BBD0B5D182D0BDD0B8D0B9   |


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: mnogosearch-3.4.1 on FreeBSD 10.3

2016-06-02 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
Hmm. It seems your MySQL client is not configured well.
It's using latin1 as the connection character set, while the
display is obviously utf8. So it prints garbage instead of
Cyrillic letters.

You can check this using "show variables like 'character_set%';".
It seems character_set_connection is latin1.

In order to see Cyrillic letters, you can try:

- mysql --default-character-set=utf8
- or put default-character-set=utf8 into my.cnf
- or run "SET NAMES utf8" immediately after connecting

Note that this does not affect how indexer works.
It only applies to the "mysql" client.


> The results are the same for both bases.

They are not. Hex codes are different.
The old database contains Cyrillic codes,
the new database contains something different for the same
strings:


This is wrong:

| 30летних | 3330C390C2BBC390C2B5C391E2809AC390C2BDC390C2B8C391E280A6 |

This is correct:
| 30летних   | 3330D0BBD0B5D182D0BDD0B8D185 |



Try adding "SetNames=utf8" to the DBAddr string in indexer.conf for the 
new database, like this:

DBAddr mysql://root@localhost/test/?SetNames=utf8

then clean the database and crawl and index again.
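
The difference between the "wrong" and "correct" hex strings above is consistent with double encoding: UTF-8 bytes read as latin1 and then converted to UTF-8 a second time. The following is only a minimal sketch of that idea in Python, not something taken from mnoGoSearch itself. It assumes that MySQL's latin1 behaves like cp1252, and the helper names double_encode and repair are made up for illustration.

```python
# Sketch: reproduce the "wrong" hex string by double-encoding the "correct" one.
# Assumption: the damage is UTF-8 bytes interpreted as cp1252 (MySQL's latin1)
# and then encoded to UTF-8 again.

def double_encode(utf8_bytes: bytes) -> bytes:
    """Misread UTF-8 bytes as cp1252, then encode the result as UTF-8."""
    return utf8_bytes.decode("cp1252").encode("utf-8")

def repair(double_encoded: bytes) -> bytes:
    """Reverse the damage: decode UTF-8, then map characters back to cp1252 bytes."""
    return double_encoded.decode("utf-8").encode("cp1252")

correct = "30летних".encode("utf-8")
wrong = double_encode(correct)

print(correct.hex().upper())     # 3330D0BBD0B5D182D0BDD0B8D185
print(wrong.hex().upper())       # 3330C390C2BBC390C2B5C391E2809AC390C2BDC390C2B8C391E280A6
print(repair(wrong) == correct)  # True
```

Note that Python's cp1252 codec rejects a few bytes (for example 0x81) that MySQL's latin1 passes through, so this sketch would raise an error for words containing such bytes. Re-indexing with SetNames=utf8, as suggested above, avoids the problem rather than repairing it afterwards.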



[General] Webboard: mnogosearch-3.4.1 on FreeBSD 10.3

2016-06-01 Thread bar
Author: Dmitriy Kulikov
Email: 
Message:
The results are the same for both bases.

mysql> use mnogosearch_new;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> SELECT word, hex(word) FROM bdict WHERE word NOT RLIKE '^[a-z0-9?#_]*$' LIMIT 30;
+----------+----------------------------------------------------------+
| word     | hex(word)                                                |
+----------+----------------------------------------------------------+
| 000в     | 303030C390C2B2                                           |
| 099в     | 303939C390C2B2                                           |
| 107рѕ    | 313037C391E282ACC391E280A2                               |
| 10млн    | 3130C390C2BCC390C2BBC390C2BD                             |
| 11в      | 3131C390C2B2                                             |
| 18в      | 3138C390C2B2                                             |
| 1970Ñ…   | 31393730C391E280A6                                       |
| 1980г    | 31393830C390C2B3                                         |
| 1в       | 31C390C2B2                                               |
| 1Ñ€      | 31C391E282AC                                             |
| 2001г    | 32303031C390C2B3                                         |
| 2002рі   | 32303032C391E282ACC391E28093                             |
| 2004г    | 32303034C390C2B3                                         |
| 2006г    | 32303036C390C2B3                                         |
| 2008г    | 32303038C390C2B3                                         |
| 2009г    | 32303039C390C2B3                                         |
| 2009рі   | 32303039C391E282ACC391E28093                             |
| 2011г    | 32303131C390C2B3                                         |
| 2012рі   | 32303132C391E282ACC391E28093                             |
| 20Ñ      | 3230C391C281                                             |
| 30летних | 3330C390C2BBC390C2B5C391E2809AC390C2BDC390C2B8C391E280A6 |
| 3летний  | 33C390C2BBC390C2B5C391E2809AC390C2BDC390C2B8C390C2B9     |
| 40в      | 3430C390C2B2                                             |
| 41в      | 3431C390C2B2                                             |
| 48в      | 3438C390C2B2                                             |
| 599в     | 353939C390C2B2                                           |
| 59в      | 3539C390C2B2                                             |
| 600в     | 363030C390C2B2                                           |
| 60в      | 3630C390C2B2                                             |
| 90Ñ…     | 3930C391E280A6                                           |
+----------+----------------------------------------------------------+
30 rows in set (0,00 sec)



mysql> use mnogosearch;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> SELECT word, hex(word) FROM bdict WHERE word NOT RLIKE '^[a-z0-9?#_]*$' LIMIT 30;
+--------+------------------+
| word   | hex(word)        |
+--------+------------------+
| 000в   | 303030D0B2       |
| 099в   | 303939D0B2       |
| 107рѕ  | 313037D180D195   |
| 10млн  | 3130D0BCD0BBD0BD |
| 11в    | 3131D0B2         |
| 18в    | 3138D0B2         |
| 1970Ñ… | 31393730D185     |
| 1980г  | 31393830D0B3     |
| 1в     | 31D0B2           |
| 1Ñ€    | 31D180           |
| 2001г  | 32303031D0B3     |
| 2002рі | 32303032D180D196 |
| 2004г  | 32303034D0B3     |
| 2006г  | 32303036D0B3     |
| 2008г  | 32303038D0B3     |
| 2009г  | 32303039D0B3     |

[General] Webboard: mnogosearch-3.4.1 on FreeBSD 10.3

2016-06-01 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
Can you try this one:

SELECT word, hex(word) FROM bdict WHERE word NOT RLIKE '^[a-z0-9?#_]*$' LIMIT 30;

The idea is to get words with Cyrillic letters and see
their HEX representation.
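
For comparison, the earlier pattern '^[^a-z]$' can only match a word that is exactly one character long, which would explain the "Empty set" result. The sketch below mimics both patterns with Python's re module on a few sample words taken from this thread. It is only an illustration under that assumption: MySQL's RLIKE is not identical to Python's regex engine (older MySQL versions match byte-wise), and the variable names are made up.

```python
import re

# Sample words: two plain ASCII words and two words with Cyrillic letters.
words = ["navigator", "2009", "10млн", "30летних"]

# First attempt: '^[^a-z]$' matches only a ONE-character string that is not a-z,
# so none of the multi-character words can match.
single_non_latin = [w for w in words if re.search(r"^[^a-z]$", w)]

# Suggested query: NOT RLIKE '^[a-z0-9?#_]*$' keeps every word that contains
# at least one character outside the listed ASCII set.
has_other_chars = [w for w in words if not re.search(r"^[a-z0-9?#_]*$", w)]

print(single_non_latin)  # []
print(has_other_chars)   # ['10млн', '30летних']
```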



> I got "Empty set" for both databases.
> 
> mysql> use mnogosearch_new;
> Reading table information for completion of table and column names
> You can turn off this feature to get a quicker startup with -A
> Database changed
> mysql> SELECT word, hex(word) FROM bdict WHERE word RLIKE '^[^a-z]$' LIMIT 30;
> Empty set (0,02 sec)
> 
> 
> mysql> use mnogosearch;
> Reading table information for completion of table and column names
> You can turn off this feature to get a quicker startup with -A
> Database changed
> mysql> SELECT word, hex(word) FROM bdict WHERE word RLIKE '^[^a-z]$' LIMIT 30;
> Empty set (0,02 sec)
> 



[General] Webboard: mnogosearch-3.4.1 on FreeBSD 10.3

2016-06-01 Thread bar
Author: Dmitriy Kulikov
Email: 
Message:
I got "Empty set" for both databases.

mysql> use mnogosearch_new;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> SELECT word, hex(word) FROM bdict WHERE word RLIKE '^[^a-z]$' LIMIT 30;
Empty set (0,02 sec)


mysql> use mnogosearch;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> SELECT word, hex(word) FROM bdict WHERE word RLIKE '^[^a-z]$' LIMIT 30;
Empty set (0,02 sec)




[General] Webboard: mnogosearch-3.4.1 on FreeBSD 10.3

2016-06-01 Thread bar
Author: Dmitriy Kulikov
Email: 
Message:
Thank you!

The problem with search.cgi was indeed caused by the changed format of search.htm.
But I still have problems with encodings (Cyrillic in windows-1251 or UTF-8).
I installed both versions of mnogosearch with separate databases, but with the 
same settings.
The old version works fine, but the new one has problems.

Encoding settings:
indexer.conf
  RemoteCharset windows-1251
  LocalCharset UTF-8

search.htm
  string BrowserCharset= "windows-1251";
  string LocalCharset= "UTF-8";


1) The new version requires the default encoding of the database to match 
LocalCharset:
ALTER DATABASE `mnogosearch_new` DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Otherwise, this message appears on stderr:
An error occurred!
DB: MySQL driver: #1267: Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='


2) With the same settings in indexer.conf and search.htm, searching in 
Cyrillic does not work in the new version of mnogosearch.
Setting BrowserCharset= "UTF-8" does not change anything.

Your search - "агент" - did not match any documents.

Debug log:
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start UdmFind
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start Prepare
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Stop  Prepare    0.00
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start FindWords
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start FindWordsDB for mysql://mnogosearch_new:***@localhost/mnogosearch_new/?dbmode=blob=UTF-8
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start loading limits
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} WHERE limit loaded. 149 URLs found
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Stop  loading limits    0.01 (149 URLs found)
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start fetching words
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start search for 'агенСM-^B'
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start fetching
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Stop  FindWordsDB:    0.01
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start UdmQueryConvert
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Stop  UdmQueryConvert:    0.00
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start Excerpts
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Stop  Excerpts:    0.00
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Start WordInfo
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Stop  WordInfo:    0.00
May 31 22:31:10 *** search.cgi[79240]: [79240]{--} Stop  UdmFind:    0.01


3) When searching for Latin words, the database returns the text fragments in 
correct Cyrillic, but the title of each retrieved document is always shown in 
the wrong encoding:
navigator : 405 
Results 1-10 of 99 ( 0.021 seconds)
?“?»?°?°??   [ 15.095% Popularity: 0.89705 ]
... сети Интернет по адресу: http://navigator***.ru Прежде чем приобрести ...



I would be very grateful for help with solving the last two problems.

Generally, programs can print various warning messages when they are 
installed.
It would be nice if a new version of mnogosearch warned about serious changes 
like these.
I am setting up our old CMS on a new server, so experiments are possible 
there. But if a new version of mnogosearch were installed as one of the 
routine updates on a production server under load, it would be a complete 
disaster.



Regarding the long hang during mnogosearch indexing.
I found that it is caused by very slow network retrieval of large PDF 
documents.
I tried to set minimal timeouts, but it does not help.
MaxNetErrors 10
ReadTimeOut 10s
DocTimeOut 30s

For example, I set an indexing time limit of 300s, but indexing took 1360s. 
Moreover, the document was not indexed.
/usr/local/bin/indexer -ob -v6 -N 1 -c 300 /usr/local/etc/mnogosearch/indexer.conf 2> /var/log/mnogosearch.log
--
Done (1360 seconds, 1 documents, 11049522 bytes,  7.93 Kbytes/sec.)
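
As a quick sanity check on the summary line above, the reported rate can be recomputed from the byte count and the elapsed time. This is only a back-of-the-envelope sketch in Python. It assumes that "Kbytes" means 1024 bytes and that the transfer rate was roughly constant; the variable names are made up.

```python
# Figures taken from the indexer summary line and the -c option.
total_bytes = 11049522   # size of the downloaded document
elapsed_s = 1360         # total run time reported by indexer
limit_s = 300            # time limit requested with -c 300

# Average transfer rate over the whole run, in bytes per second.
rate = total_bytes / elapsed_s
print(round(rate / 1024, 2))  # 7.93, matches the reported Kbytes/sec

# Share of the document that fits into the requested limit at this rate.
fetched_within_limit = rate * limit_s
print(round(fetched_within_limit / total_bytes, 2))  # 0.22
```

In other words, the reported rate corresponds to the whole 1360 seconds being spent on this single download, and at that rate only about a fifth of the file would arrive within the 300 second limit.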

I sent you the log of the attempt to index this one document.

When I set: 
Disallow *.pdf
indexing is fast.

Why doesn't setting the time limits help? How can I avoid such lockups of the 
indexing process?




[General] Webboard: mnogosearch-3.4.1 on FreeBSD 10.3

2016-05-31 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> I just tried unsuccessfully to install and configure the mnogosearch-3.4.1 on 
> FreeBSD 10.3
> I lost a lot of time, because it turned out that "search.cgi" fundamentally 
> does not work, and without any diagnostic information.
> I many times check it out by different ways. The search base is created 
> successfully, but it is impossible to use it. 
> There is no difference when building a program from the ports or from the 
> archive on your website.
> 
> 
> The test script, recommended by you, gives an empty output when run in the 
> console. 
> --
> #!/bin/sh
> 
> echo Content-Type: text/plain
> echo
> /usr/local/bin/search.cgi navigator 2>&1
> --

What does your search.htm look like?

It should start with a processing instruction, like this:


 
> 
> The cgi script log contains only:
> --
> %% [Fri May 27 18:54:09 2016] GET /cgi-bin/test_search.cgi HTTP/1.1
> %% 500 /data/sites/cgi-bin/test_search.cgi
> %request
> Host: www.***
> User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:46.0) Gecko/20100101 Firefox/46.0
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.5,en;q=0.3
> Accept-Encoding: gzip, deflate
> DNT: 1
> Cookie: user_city=1
> X-Compress: 1
> Proxy-Authorization: 
> d05a8777e173c2b13a81d919589dd9b2b9bf9911f681b2c82b1d3c9db748cfb33b300d2021dac648
> Connection: keep-alive
> %response
> 
> 
> When I try to use the search, the log contains this:
> --
> %% [Fri May 27 18:54:07 2016] GET 
> /cgi-bin/search.cgi?ul=http://www.***/=%EA%EE%EC%EF%E0%ED%E8%FF=10=2221=all=0=1=1=wrd
>  HTTP/1.1
> %% 500 /data/sites/cgi-bin/search.cgi
> %request
> Host: www.***
> Accept: */*
> %response
> 
> 
> Apache24 error log contains only:
> End of script output before headers: search.cgi
> 
> 
> I was forced to install and use the old version of the program from your 
> website.
> Can You report the problem to the package maintainer of this FreeBSD port or 
> I must to do this?
> 
> 
> 
> Additional question.
> I noticed that the program hangs for a very long time without consuming 
> system resources.
> When you start indexing, the system load is slightly increased, but it 
> decreases rapidly to zero, although the indexing process lasts a long time.
> For example, I place a limit of indexing 10min, but the program runs about 
> 12min, moreover, without consuming system resources.
> --
> #!/bin/sh
> 
> /usr/local/mnogosearch/sbin/indexer -l -Cw 
> /usr/local/mnogosearch/etc/indexer.conf > /dev/null 2>&1
> /usr/local/mnogosearch/sbin/indexer -ob -v5 -N 1 -c 600 
> /usr/local/mnogosearch/etc/indexer.conf 2> /var/log/mnogosearch.log
> /usr/local/mnogosearch/sbin/indexer -l --index
> --
> 
> What is the reason of this apparent anomaly?

Are you crawling some public site? Which URL does it get stuck on?

Can you please send mnogosearch.log to b...@mnogosearch.org?

Thanks.




[General] Webboard: mnogosearch-3.4.1 on FreeBSD 10.3

2016-05-31 Thread bar
Author: Dmitriy Kulikov
Email: 
Message:
I just tried, unsuccessfully, to install and configure mnogosearch-3.4.1 on 
FreeBSD 10.3.
I lost a lot of time because it turned out that "search.cgi" does not work at 
all and gives no diagnostic information.
I checked it many times in different ways. The search database is created 
successfully, but it is impossible to use it. 
It makes no difference whether the program is built from ports or from the 
archive on your website.


The test script you recommended gives empty output when run in the console. 
--
#!/bin/sh

echo Content-Type: text/plain
echo
/usr/local/bin/search.cgi navigator 2>&1
--


The cgi script log contains only:
--
%% [Fri May 27 18:54:09 2016] GET /cgi-bin/test_search.cgi HTTP/1.1
%% 500 /data/sites/cgi-bin/test_search.cgi
%request
Host: www.***
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:46.0) Gecko/20100101 Firefox/46.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
DNT: 1
Cookie: user_city=1
X-Compress: 1
Proxy-Authorization: 
d05a8777e173c2b13a81d919589dd9b2b9bf9911f681b2c82b1d3c9db748cfb33b300d2021dac648
Connection: keep-alive
%response


When I try to use the search, the log contains this:
--
%% [Fri May 27 18:54:07 2016] GET /cgi-bin/search.cgi?ul=http://www.***/=%EA%EE%EC%EF%E0%ED%E8%FF=10=2221=all=0=1=1=wrd HTTP/1.1
%% 500 /data/sites/cgi-bin/search.cgi
%request
Host: www.***
Accept: */*
%response
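
The percent-encoded bytes in the request above look like a search word sent by the browser in windows-1251. The parameter names seem to have been lost in this archived copy of the URL, so treating these bytes as the query word is an assumption. A minimal Python sketch of decoding them and converting the word to UTF-8:

```python
from urllib.parse import quote, unquote

# Percent-encoded bytes copied from the GET line in the log.
raw = "%EA%EE%EC%EF%E0%ED%E8%FF"

# Decode the bytes as windows-1251 (cp1251), the charset the browser used.
query = unquote(raw, encoding="cp1251")
print(query)  # компания

# The same word as UTF-8 bytes and as a UTF-8 percent-encoded string.
as_utf8 = query.encode("utf-8")
print(as_utf8.hex().upper())
print(quote(query, encoding="utf-8"))
```

The same eight letters take 8 bytes in windows-1251 and 16 bytes in UTF-8, so the search front end has to know which charset the browser used before it can compare the word with a UTF-8 index.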


Apache24 error log contains only:
End of script output before headers: search.cgi


I was forced to install and use the old version of the program from your 
website.
Can you report the problem to the maintainer of this FreeBSD port, or should 
I do it?



Additional question.
I noticed that the program hangs for a very long time without consuming system 
resources.
When indexing starts, the system load increases slightly, but then quickly 
drops to zero, although the indexing process keeps running for a long time.
For example, I set an indexing limit of 10 min, but the program ran for about 
12 min, again without consuming system resources.
--
#!/bin/sh

/usr/local/mnogosearch/sbin/indexer -l -Cw /usr/local/mnogosearch/etc/indexer.conf > /dev/null 2>&1
/usr/local/mnogosearch/sbin/indexer -ob -v5 -N 1 -c 600 /usr/local/mnogosearch/etc/indexer.conf 2> /var/log/mnogosearch.log
/usr/local/mnogosearch/sbin/indexer -l --index
--

What is the reason for this apparent anomaly?


