Re: [htdig] Does htmerge remove URL from database ?

2000-11-30 Thread Olivier Korn

At 22:07 25/11/2000 -0600, Geoff Hutchison wrote:
> At 2:21 PM +0100 11/23/00, Olivier Korn wrote:
> [snip]
> > Some of the web hosts are case-sensitive and some are not. Could it be
> > the source of my problem?
>
> I wouldn't think so. But you have to be pretty careful that the URL
> encodings are shared between your site.conf files. Personally, I make up a
> "main.conf," include that in the other files and only set the start_url
> and a minimal number of things in the individual site.conf files. In
> particular, it makes it easy to change something in all config files at once.

I'm not sure what you mean by "to be careful that the URL
encodings are shared between your site.conf files".

Each of my site#.conf files contains this "minimal number of things":
database_base:      ${database_dir}/site#
start_url:          http://www.site#.fr/somepath/
limit_urls_to:      ${start_url}   # or something else (it depends on the site #)
case_sensitive:     true           # or false (it depends on the site #)
remove_default_doc: default.htm    # or something else, it depends on...
                                   # ... the site # ! (you guessed ;-)
include:            ${config_dir}/_commun_include

And that's all (everything else is in _commun_include and is the same for 
each site #)
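
To see what each per-site file sets for a given attribute, a small shell helper along these lines could be used. It is only a sketch: the function name "show_attribute" is made up for the example, it only reads "name: value" lines in the files it is given, and it does not follow include: directives, so anything set in _commun_include will not show up. Which attributes control the URL encoding is not stated in this thread; common_url_parts and url_part_aliases would be the natural candidates, but that is an assumption.

```shell
#!/bin/sh
# Print "file: value" for one attribute in each config file given, so that
# differing values stand out. Hypothetical helper, not part of ht://Dig.
show_attribute() {
    attr=$1
    shift
    for f in "$@"; do
        # Keep whatever follows "attr:" on the line, including any trailing
        # remark such as "# or false".
        value=$(sed -n "s/^${attr}:[[:space:]]*//p" "$f")
        echo "$f: $value"
    done
}
```

For example, "show_attribute case_sensitive site*.conf" would list the case_sensitive setting of every per-site file side by side.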

Well... How can I be sure that "the URL encodings are shared between my
site#.conf files"?

Regards,
Olivier Korn.
Strasbourg, France.



To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives: http://www.htdig.org/mail/menu.html
FAQ:           http://www.htdig.org/FAQ.html




Re: [htdig] Does htmerge remove URL from database ?

2000-11-30 Thread Olivier Korn

At 09:30 27/11/2000 +, David Adams wrote:
> I found that the extra runs of htmerge were necessary when I was merging two
> runs of htdig.  Unless I ran both databases through htmerge before merging
> them I was getting
>
> Deleted, invalid:

I never had this problem.

> against some pages in the htmerge run.  Compared to the time required to run
> htdig, the extra htmerge runs are trivial, so you have little to lose by
> including them.

And this is what I've done, but with no success.

> Use the -v option with both htdig and htmerge and see if you get any message
> re the pages that don't appear in the final index.

I've got to try this out...


Olivier Korn.
Strasbourg, France.







Re: [htdig] Does htmerge remove URL from database ?

2000-11-27 Thread David Adams

I found that the extra runs of htmerge were necessary when I was merging two
runs of htdig.  Unless I ran both databases through htmerge before merging
them I was getting

Deleted, invalid:

against some pages in the htmerge run.  Compared to the time required to run
htdig, the extra htmerge runs are trivial, so you have little to lose by
including them.

Use the -v option with both htdig and htmerge and see if you get any message
re the pages that don't appear in the final index.
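
As a rough sketch of how the verbose output could be searched, the helper below keeps only the lines that mention "Deleted". The function name "deleted_lines" is made up, and the only thing assumed about the log format is that dropped pages are reported with the "Deleted, invalid:" text quoted above.

```shell
#!/bin/sh
# Print only the log lines that report a deleted document.
# Reads the files named as arguments, or standard input if there are none.
# Assumption: such lines contain the word "Deleted".
deleted_lines() {
    grep 'Deleted' "$@"
}
```

It could be used as "htmerge -v -c site1.conf -m site5.conf 2>&1 | deleted_lines", where 2>&1 is there so that it does not matter which stream the messages go to.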


----- Original Message -----
From: "Geoff Hutchison" [EMAIL PROTECTED]
To: "Olivier Korn" [EMAIL PROTECTED]
Cc: "Gilles Detillieux" [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Sunday, November 26, 2000 4:07 AM
Subject: Re: [htdig] Does htmerge remove URL from database ?


> At 2:21 PM +0100 11/23/00, Olivier Korn wrote:
> > I tried it and it didn't solve the problem. BTW, I don't think that
> > these extra merges are necessary either.
>
> No, they should not be at all necessary unless there's truly
> something horrifically wrong with the merging code--it only uses the
> files directly output from htdig. (My idea was that it would be
> faster if you didn't need to run htmerge on the intermediate DBs.)
>
> > Now, I run:
> > htmerge -c site#.conf
> > then
> > htmerge -c site1.conf -m site#.conf (with # > 1)
> >
> > If I then run
> > htsearch -c site5.conf with words="rénovation tourisme", it finds
> > the document (in first place.)
> > But if I do
> > htsearch -c site1.conf with the same words, it returns the "nomatch"
> > document.
> >
> > Some of the web hosts are case-sensitive and some are not. Could it
> > be the source of my problem?
>
> I wouldn't think so. But you have to be pretty careful that the URL
> encodings are shared between your site.conf files. Personally, I make
> up a "main.conf," include that in the other files and only set the
> start_url and a minimal number of things in the individual site.conf
> files. In particular, it makes it easy to change something in all
> config files at once.
>
> --
> -Geoff Hutchison
> Williams Students Online
> http://wso.williams.edu/

 









Re: [htdig] Does htmerge remove URL from database ?

2000-11-25 Thread Geoff Hutchison

At 2:21 PM +0100 11/23/00, Olivier Korn wrote:
> I tried it and it didn't solve the problem. BTW, I don't think that
> these extra merges are necessary either.

No, they should not be at all necessary unless there's truly
something horrifically wrong with the merging code--it only uses the
files directly output from htdig. (My idea was that it would be
faster if you didn't need to run htmerge on the intermediate DBs.)

> Now, I run:
> htmerge -c site#.conf
> then
> htmerge -c site1.conf -m site#.conf (with # > 1)
>
> If I then run
> htsearch -c site5.conf with words="rénovation tourisme", it finds
> the document (in first place.)
> But if I do
> htsearch -c site1.conf with the same words, it returns the "nomatch" document.
>
> Some of the web hosts are case-sensitive and some are not. Could it
> be the source of my problem?

I wouldn't think so. But you have to be pretty careful that the URL
encodings are shared between your site.conf files. Personally, I make
up a "main.conf," include that in the other files and only set the
start_url and a minimal number of things in the individual site.conf
files. In particular, it makes it easy to change something in all
config files at once.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/






Re: [htdig] Does htmerge remove URL from database ?

2000-11-23 Thread Olivier Korn

At 12:35 22/11/2000 -0600, Gilles Detillieux wrote:
> > 4. After all the sites have been htdigged, I run htmerge in sequence in
> > order to merge all the small databases into one.
> > The first call is "htmerge -c site1.conf"; subsequent calls are "htmerge -c
> > site1.conf -m site2.conf", "htmerge -c site1.conf -m site3.conf", and
> > so on.
> ...
> > 2. Now for the amazing part of my story. If I run "htmerge -c
> > site5.conf" (notice there is no -m this time) and then htsearch -c
> > site5.conf with "rénovation tourisme", my document is found!
> > Put another way, the document was indexed but must have been ripped out
> > when merging with another database.
>
> I think after each separate htdig -i -c site#.conf you should run a
> separate htmerge -c site#.conf, not just on the first site, before you
> merge everything together.  Try that and see if it solves the problem.
> I think the intention was that these extra merges should not have been
> necessary, but this has come up before, and I think there's a problem
> with merging multiple DBs when they haven't already been cleaned up by
> a simple htmerge.

I tried it and it didn't solve the problem. BTW, I don't think that these
extra merges are necessary either.

Now, I run:
htmerge -c site#.conf
then
htmerge -c site1.conf -m site#.conf (with # > 1)

If I then run
htsearch -c site5.conf with words="rénovation tourisme", it finds the
document (in first place.)
But if I do
htsearch -c site1.conf with the same words, it returns the "nomatch" document.

Some of the web hosts are case-sensitive and some are not. Could it be the
source of my problem?

What are the rules for htmerge? When does it actually remove URLs from the
database?

--
Olivier Korn
Strasbourg, France.







[htdig] Does htmerge remove URL from database ?

2000-11-22 Thread Olivier Korn

Hi,

We have been using ht://Dig for many months now and have had nothing to
complain about, but there is something strange that I don't understand.

The way we use ht://Dig is described here:

1. We have 20 or so web sites named, say, http://www.site1.fr/a-path/, 
http://www.site2.fr/a-path-which-does-not-read-the-same-as-site1/, and so 
on. Some are MS-IIS, some are Linux/Apache hosted.

2. For each of these sites, I made up a site1.conf, site2.conf, and so on,
containing start_url, the URL restrictions, and so on. Each of these .conf
files includes a file named "_commun_include". Of course, I changed the
database prefix for each of the sites.

3. Once a week, htdig is called on each site with "htdig -i -c site1.conf" 
then "htdig -i -c site2.conf", (and so on.)

4. After all the sites have been htdigged, I run htmerge in sequence in
order to merge all the small databases into one.
The first call is "htmerge -c site1.conf"; subsequent calls are "htmerge -c
site1.conf -m site2.conf", "htmerge -c site1.conf -m site3.conf", and so on.

5. Everything seems to work perfectly. Using htsearch, I can find documents 
which are on any of the sites. Let's note for later that my locale is 
correctly set so I don't have any problem with accents (I also use the 
accents patch which works fine.) (I say all this because of the example I 
give below.) ("htfuzzy accents" is run after all the htmerge.)
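
Steps 3 and 4 above can be sketched as a small shell function. This is only an illustration of the order of the calls as described, not the actual weekly script: the function name "weekly_commands" and the site count passed to it are made up for the example, and the config files are assumed to be named site1.conf, site2.conf, and so on. It prints the commands rather than running them.

```shell
#!/bin/sh
# Print the weekly command sequence for a given number of sites.
# Hypothetical helper; only flags mentioned in this message are used.
weekly_commands() {
    count=$1

    # Step 3: a fresh dig (-i) of every site into its own database.
    i=1
    while [ "$i" -le "$count" ]; do
        echo "htdig -i -c site${i}.conf"
        i=$((i + 1))
    done

    # Step 4: htmerge on site1 alone, then merge each other site into it.
    echo "htmerge -c site1.conf"
    i=2
    while [ "$i" -le "$count" ]; do
        echo "htmerge -c site1.conf -m site${i}.conf"
        i=$((i + 1))
    done
}

# Show the sequence for three sites.
weekly_commands 3
```

Piping the output to sh ("weekly_commands 20 | sh") would run the commands in that order; the "htfuzzy accents" run that follows the merges is left out of the sketch.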


Here is the problem:

1. On site5, there is an HTML document named "Rénovation du BTS tourisme".
When searching for "rénovation tourisme" (method=and) the document is not 
found (ht://Dig even says there is no document containing these words.) 
Using the "restrict=http://www.site5.fr/site5-path-to-docs/" parameter 
doesn't change anything (this is not a surprise but... I wanted to be sure.)

2. Now for the amazing part of my story. If I run "htmerge -c
site5.conf" (notice there is no -m this time) and then htsearch -c
site5.conf with "rénovation tourisme", my document is found!
Put another way, the document was indexed but must have been ripped out
when merging with another database.


Well, I'd like to know whether somebody has already run into this particular
problem or whether it is a "feature" of htmerge (deleting entries when merging
two databases together.) What can I do about it?

I'm really confused about all of this (this state of mind doesn't help me
to write correct English. Sorry about that.)

--
Olivier Korn
Strasbourg, France.







Re: [htdig] Does htmerge remove URL from database ?

2000-11-22 Thread Gilles Detillieux

According to Olivier Korn:
> 3. Once a week, htdig is called on each site with "htdig -i -c site1.conf"
> then "htdig -i -c site2.conf", (and so on.)
>
> 4. After all the sites have been htdigged, I run htmerge in sequence in
> order to merge all the small databases into one.
> The first call is "htmerge -c site1.conf"; subsequent calls are "htmerge -c
> site1.conf -m site2.conf", "htmerge -c site1.conf -m site3.conf", and so on.
...
> 2. Now for the amazing part of my story. If I run "htmerge -c
> site5.conf" (notice there is no -m this time) and then htsearch -c
> site5.conf with "rénovation tourisme", my document is found!
> Put another way, the document was indexed but must have been ripped out
> when merging with another database.

I think after each separate htdig -i -c site#.conf you should run a
separate htmerge -c site#.conf, not just on the first site, before you
merge everything together.  Try that and see if it solves the problem.
I think the intention was that these extra merges should not have been
necessary, but this has come up before, and I think there's a problem
with merging multiple DBs when they haven't already been cleaned up by
a simple htmerge.
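
The suggested order can be sketched as follows. This is an illustration only: the function name "cleaned_merge_commands" and the site count are made up, the config files are assumed to be named site1.conf, site2.conf, and so on, and the commands are printed rather than run.

```shell
#!/bin/sh
# Print the suggested sequence: dig and clean up each site on its own,
# then merge the cleaned databases into site1. Hypothetical helper.
cleaned_merge_commands() {
    count=$1

    # A separate htmerge right after each separate htdig.
    i=1
    while [ "$i" -le "$count" ]; do
        echo "htdig -i -c site${i}.conf"
        echo "htmerge -c site${i}.conf"
        i=$((i + 1))
    done

    # Only then merge everything together.
    i=2
    while [ "$i" -le "$count" ]; do
        echo "htmerge -c site1.conf -m site${i}.conf"
        i=$((i + 1))
    done
}

# Show the sequence for two sites.
cleaned_merge_commands 2
```

The only difference from running a single "htmerge -c site1.conf" is that every site database gets its own plain htmerge before any -m merge touches it.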

-- 
Gilles R. Detillieux              E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

