[htdig] Rundig

1999-11-25 Thread Jason Carvalho

When I run 'rundig', it crawls my web site then when it comes to the
merge stage, it outputs:

Deleted, no excerpt :2156 http://ww...etc.   for loads of my pages.

All in all, it found about 9500 pages but only merged 7500, giving the
above message for the rest.

What does this mean?

-- 
--
Jason Carvalho
Web Analyst
Cranfield University
[EMAIL PROTECTED]
--


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You'll receive a message confirming the unsubscription.



Re: [htdig] Reducing the importance of pages.

1999-11-25 Thread David Adams

 
 Is it possible to reduce the importance of certain pages?  We have
 some pages on our site that are directories and contain thousands of
 entries.  As a result they always seem to come up as top results
 whenever we search for anything.  I don't really want to remove these
 pages from a search but I would like them to appear lower down the
 list.  Is this at all possible (perhaps by using negative weighting or
 similar?)?
 
 Thanks!
 
 -- 
 --
 Jason Carvalho
 Web Analyst
 Cranfield University
 [EMAIL PROTECTED]

You could increase the weighting of other pages by encouraging
the use of

<META NAME="keywords" CONTENT="...list of keywords...">

and

<META NAME="description" CONTENT="...relevant text...">

in their headers.  On our site we have increased the weighting
of keywords to 200.

You might consider not indexing the directory pages at all by placing

<META NAME="robots" CONTENT="noindex">

in their headers.  Links in them will still be followed, but htdig
will not index the words in them.

-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton





AW: [htdig] irrelevant pages in search

1999-11-25 Thread Hartmut Steffin

Thanks for the answer,

  htmerge does not seem to honour the TMPDIR variable which
 IS properly set
This seems to be an individual problem on my machine. There is even a
difference between running rundig from the command line (ok) and via
cron/batch (erroneous).

  in ANY case,
  1. htmerge should do a better error message (I even used -v)

 We're open to suggestions, but if the problem is the sort
 program that fails
 silently, there isn't much that htmerge can do to guess at why.
Hmm, maybe this was me yelling out too loud without thinking. I think you
cannot do more than supply the stderr of sort, plus maybe errno or the exit
value as a hint.

  2. htsearch should be able to identify a corrupt db
 I too would like to see more error checking to detect such
 problems, but
 I wouldn't know where to begin in adding code, and what to
 look for in terms
 of database problems.  Anyone else have any ideas?
IMHO this is the most important part. I have not looked at the sources so
far, but isn't it possible to have a flag "under_construction" somewhere (as
part of the db itself) that is set as long as the different files of the db
do not reflect the status quo? I am not familiar with the internals, but I
gather you can even get bad results between running htdig and htmerge? So the
flag could even state "ok", "htdig running", "sorting", "merging" (and
possibly take the presence of the -i flag into account if necessary).
htsearch could read this flag and tell if a search might be unreliable right
now (or even give this wonderful message "contact the webmaster" :( ).
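
A minimal C++ sketch of the status-flag idea, using the four states named
above. All names here (DbStatus, search_is_reliable, status_message) are
hypothetical; nothing in this thread says such a flag exists in ht://Dig, and
where the flag would be stored is left open.

```cpp
// Hypothetical status values, taken from the states named in the message.
enum class DbStatus { Ok, HtdigRunning, Sorting, Merging };

// True only when the database files are expected to be consistent.
constexpr bool search_is_reliable(DbStatus s) {
    return s == DbStatus::Ok;
}

// Text a search front end might report for each state (wording is illustrative).
inline const char* status_message(DbStatus s) {
    switch (s) {
        case DbStatus::Ok:           return "ok";
        case DbStatus::HtdigRunning: return "htdig running";
        case DbStatus::Sorting:      return "sorting";
        case DbStatus::Merging:      return "merging";
    }
    return "unknown";
}
```

A search program could call search_is_reliable() before querying and fall
back to status_message() when the answer is false.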

Just ideas, I don't know how practicable.
Hardy







[htdig] Foreign chars (Swedish)

1999-11-25 Thread Philippe Ramkvist-Henry


Hello!

I'm having problems with some foreign chars when using htdig to index and
search a Swedish site. The locale is set right (sv) and is working in
other applications. The problem I have is somewhat weird, maybe it has
something to do with "uppercase" "lowercase"?

Well, I can search words like "Åsa,åsa,Öl,öl" and get the same matches.
But when I try to search "bäst" I get no hits. With "bÄst" I get several
hits...

I asked a guy here at the University and he said that there might be
complications with "unsigned char" and "char". He gave me the example
below. Please answer at a novice level, my C++ and Unix knowledge is very
limited.  

Thanks
Philippe Ramkvist-Henry



 htlib/StringMatch.cc
 
 while ((unsigned char)string[pos])
 {
     new_state = table[trans[string[pos]]][state];
 
Should be? or? 
 
 while (string[pos])
 {
     new_state = table[trans[(unsigned char)string[pos]]][state];
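
To see why the position of the cast can matter, here is a small C++ sketch.
It assumes a platform where plain char is signed (modelled here with signed
char) and an ISO-8859-1 encoding, where 'ä' is byte 0xE4. The function names
are hypothetical and this is not the ht://Dig code itself; whether this is
the actual cause of the search problem is not established here.

```cpp
// Subscript that results when a signed character is used directly.
// For bytes above 0x7F the value is negative, i.e. outside a 256-entry table.
constexpr int index_without_cast(signed char c) {
    return c;
}

// Subscript that results when the character is first cast to unsigned char.
// Every byte value then maps into the range 0..255.
constexpr int index_with_cast(signed char c) {
    return static_cast<unsigned char>(c);
}

// 'ä' in ISO-8859-1 is byte 0xE4, which a signed 8-bit char stores as -28.
constexpr signed char a_umlaut = -28;

static_assert(index_without_cast(a_umlaut) == -28, "negative subscript");
static_assert(index_with_cast(a_umlaut) == 228, "subscript within 0..255");
```

The cast in the while condition does not change which table entry is read;
only the cast inside the subscript does.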
  
   






[htdig] pure numbers as search words

1999-11-25 Thread florian . nill

From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Date: Thu, 25 Nov 1999 15:37:10 +0100
Subject: pure numbers as search words

Hi everybody,

as a new user of htdig I have the following problem:

Although search strings combining letters and digits are properly found,
a string consisting of digits only is completely disregarded.
Is there a way to reconfigure this?

Thanks in advance

Florian Nill




Re: [htdig] pure numbers as search words

1999-11-25 Thread Geoff Hutchison

At 3:37 PM +0100 11/25/99, [EMAIL PROTECTED] wrote:
a string consisting of digits only is completely disregarded.
Is there a way to reconfigure this?

See http://www.htdig.org/attrs.html#allow_numbers

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/





[htdig] i need help on htdig database format

1999-11-25 Thread ronald

When htdig exports an index in text format, it generates two files. The
files look like this:

file1:
0   u:http://www.htdig.org/ t:ht://Dig -- Internet search engine software
a:0 m:936027636 s:373   h:  h:  l:940510479 L:2 I:373   
d:http://www.htdig.org/
www.htdig.org ht://Dig Search Software (yes, the developers use it)
ht://Dig Parent Directory   A:
1   u:http://www.htdig.org/contents.html   t:ht://Dig Table of Contents   a:0
m:936027636 s:3539  h: Contents General ht://Dig Features and Requirements
Where to get it Installation Configuration FAQ Mailing list Uses of
ht://Dig License information Reference htdig htmerge htnotify htfuzzy
htsearch Configuration file META tags Other How it works Contributors
Release notes ChangeLog TODO Bug Reporting Contributed Work Website stats
Developer Site Quick Search:   h:  l:940510479 L:25   I:3539  
d:/contents.html   A:
2   u:http://www.htdig.org/main.html   t:ht://Dig: Overview   a:0 
m:940044123
s:3717  h: WWW Search Engine Software ht://Dig Copyright (c) 1995-1999 The
ht://Dig Group Please see the file COPYING for license information. Recent
News * 22 Sep 1999: A new stable release of ht://Dig, htdig-3.1.3, is
released. This release is recommended for all production systems. It solves
most of the outstanding bugs in the 3.1.x releases. See the release notes
or download it. * 1 June 1999: Unfortunately, due to lack of interest from
key developers, the ht://Dig Conference from Aug 19-20 will be cancelled.
We hope h:  l:940510480 L:10   I:3717  d:ht://Dig /main.html   A:
3  and so on.


file2:
01oct99     i:115   l:0     w:100998  c:2
01oct99     i:116   l:0     w:100998  c:2
01oct99     i:45    l:6     w:100381  c:2
01oct99     i:46    l:0     w:100998  c:2
02aug1999   i:48    l:361   w:639     a:2
02jun1999   i:50    l:262   w:1382    c:2   a:2
02mar1999   i:53    l:378   w:622     a:2
02may1999   i:51    l:280   w:1349    c:2   a:2
and so on
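
The second file is regular enough to be split mechanically even without
knowing what each field stands for. Here is a minimal C++ sketch that breaks
one line into the word and its key:value fields. It assumes the fields were
originally separated by tabs or spaces and that every value is an integer.
The names WordEntry and parse_word_line are made up for illustration and are
not part of ht://Dig, and the sketch says nothing about the meaning of the
fields.

```cpp
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// One parsed line of the word file: the word plus its key:value fields.
struct WordEntry {
    std::string word;
    std::map<char, long> fields;  // e.g. 'i' -> 50, 'l' -> 262
};

// Split a line such as "02jun1999 i:50 l:262 w:1382 c:2 a:2".
// Tokens must look like <letter>:<integer>; anything else is skipped.
WordEntry parse_word_line(const std::string& line) {
    WordEntry entry;
    std::istringstream in(line);
    in >> entry.word;  // first whitespace-delimited token is the word

    std::string token;
    while (in >> token) {
        if (token.size() < 3 || token[1] != ':') {
            continue;  // not in key:value form
        }
        try {
            entry.fields[token[0]] = std::stol(token.substr(2));
        } catch (const std::exception&) {
            // No leading integer in the value: skip this token.
        }
    }
    return entry;
}
```

For example, parse_word_line("02jun1999 i:50 l:262 w:1382 c:2 a:2") yields
the word 02jun1999 and five fields, with fields['i'] equal to 50.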


Can anyone please tell me exactly what these fields mean?

Ronald





_
Ronald Tournier
Stichting De Digitale Stad
1011 TD Amsterdam
tel. 020 6257493
fax. 020 6382817
tel direkt: 020 5205335
e-mail: [EMAIL PROTECTED]






Re: [htdig] WordPerfect parser?

1999-11-25 Thread Gilles Detillieux

According to David Adams:
 I have downloaded the parse_doc.pl script, and the xpdf and catdoc
 utilities, and I am now using them to extend our search index to include
 Word and PDF files.  It all works well and with a bit of alteration to
 the Perl script does exactly what I want.  My thanks to the developers!
 
 We also have a need to index WordPerfect documents, including those
 produced by WP 6.1 and later.  Can anyone recommend a utility that will
 run under IRIX 6.5 ?

I haven't come across any open source/freeware WP to text converters.
The reason I put the WP hooks in there originally was because some sites
had .doc files that were WP rather than Word documents, and the WP documents
caused catdoc to blow chunks.  Same story for .doc files in RTF format.
I then realised there are all sorts of .doc files that aren't MS-Word,
so I put in explicit checks for MS-Word magic numbers rather than using
catdoc by default, but still kept the WP and RTF hooks in by way of
example.

If WordPerfect for UNIX is available for IRIX, and it contains the cvt
utility as WP for Linux does, you could write a script that uses that,
or adapt the parse_doc.pl script to use it directly.  Its usage is:

/usr/local/wplinux/shbin10/cvt -l file.wpd file.txt asci  /dev/null

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930





Re: [htdig] parse_doc.pl alterations

1999-11-25 Thread Gilles Detillieux

According to David Adams:
 I have downloaded the parse_doc.pl script, and the xpdf and catdoc
 utilities, and I am now using them to extend our search index to include
 Word and PDF files.  It all works well and with a bit of alteration to
 the Perl script does exactly what I want.  My thanks to the developers!

I forgot to ask before, what were your alterations?  Something very
specific to your needs, or something worth sharing with others?

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930





Re: [htdig] Rundig

1999-11-25 Thread Gilles Detillieux

According to Jason Carvalho:
 When I run 'rundig', it crawls my web site then when it comes to the
 merge stage, it outputs:
 
 Deleted, no excerpt :2156 http://ww...etc.   for loads of my pages.
 
 All in all, it found about 9500 pages but only merged 7500, giving the
 above message for the rest.
 
 What does this mean?

The two most common causes are:  a) the document contained no text, or
the text was excluded by noindex meta tags, or b) the document was
disallowed by the server's robots.txt file.  If you ran htdig or rundig
with -vvv, then htdig's output should give you more of an indication of
which situation arose with these pages.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930





[htdig] word_list columns

1999-11-25 Thread Aaron Turner


There are 6 columns in the wordlist file.  Obviously col1 is the word.
What are the others (i, l, w, c, a)?

--
Aaron Turner, Core Developer   http://vodka.linuxkb.org/~aturner/
Linux Knowledge Base Organization  http://linuxkb.org/
Because world domination requires quality open documentation.
aka: [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED]






Re: AW: [htdig] irrelevant pages in search

1999-11-25 Thread Doug Barton

Hartmut Steffin wrote:
 
 Thanks for the answer,
 
   htmerge does not seem to honour the TMPDIR variable which
  IS properly set
 this seems to be an individual problem on my machine. there is even a
 difference in running rundig from commandline (ok) and via cron/batch
 (erroneous)

It's not a plot against you, honest. :) If you get different results from
the command line and from cron it simply means that cron's environment is
different from the shell's. You might try setting the TMPDIR environment
variable explicitly in the crontab file and see if that improves things. 

Good luck,

Doug
-- 
"Welcome to the desert of the real." 

- Laurence Fishburne as Morpheus, "The Matrix"

