[htdig] Rundig
When I run 'rundig', it crawls my web site, then when it comes to the merge stage, it outputs:

Deleted, no excerpt :2156 http://ww...etc.

for loads of my pages. All in all, it found about 9500 pages but only merged 7500, giving the above message for the rest. What does this mean?

--
Jason Carvalho
Web Analyst
Cranfield University
[EMAIL PROTECTED]

--
To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You'll receive a message confirming the unsubscription.
Re: [htdig] Reducing the importance of pages.
Is it possible to reduce the importance of certain pages? We have some pages on our site that are directories and contain thousands of entries. As a result, they always seem to come up as top results whenever we search for anything. I don't really want to remove these pages from a search, but I would like them to appear lower down the list. Is this at all possible (perhaps by using negative weighting or similar)? Thanks!

--
Jason Carvalho
Web Analyst
Cranfield University
[EMAIL PROTECTED]

You could increase the weighting of other pages by encouraging the use of META NAME="keywords" CONTENT="...list of keywords..." and META NAME="description" CONTENT="...relevant text..." in their headers. On our site we have increased the weighting of keywords to 200.

You might consider not indexing the directory pages at all by placing META NAME="robots" CONTENT="noindex" in their headers. Links in them will still be followed, but htdig will not index the words in them.

--
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton
AW: [htdig] irrelevant pages in search
Thanks for the answer,

htmerge does not seem to honour the TMPDIR variable, which IS properly set. This seems to be an individual problem on my machine. There is even a difference between running rundig from the command line (ok) and via cron/batch (erroneous).

In ANY case,

1. htmerge should do a better error message (I even used -v)

We're open to suggestions, but if the problem is the sort program that fails silently, there isn't much that htmerge can do to guess at why.

Hmm, maybe this was me yelling out too loud without thinking. I think you cannot do more than supply the stderr of sort, plus maybe errno or the exit value as a hint.

2. htsearch should be able to identify a corrupt db

I too would like to see more error checking to detect such problems, but I wouldn't know where to begin in adding code, and what to look for in terms of database problems. Anyone else have any ideas?

IMHO this is the most important part. I have not looked at the sources so far, but isn't it possible to have a flag "under_construction" somewhere (as part of the db itself) that is set as long as the different files of the db do not reflect the status quo? I am not familiar with the internals, but I feel you can even get bad results between running htdig and htmerge? So the flag could even state "ok", "htdig running", "sorting", "merging" (and possibly take the presence of the -i flag into account if necessary). htsearch could read this flag and tell if a search might be unreliable right now (or even give this wonderful message "contact the webmaster" :( ).

Just ideas, I don't know how practicable.

Hardy
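The status flag proposed above could be sketched roughly as follows. This is only an illustration of the idea, not anything that exists in ht://Dig: the type IndexState and the functions parse_state, to_flag and search_is_reliable are hypothetical names, and where the flag would be stored (inside the db or next to it) is left open.

```cpp
#include <cassert>
#include <string>

// Hypothetical index states, taken from the wording of the proposal.
enum class IndexState { Ok, Digging, Sorting, Merging, Unknown };

// Map the textual flag to a state; anything unrecognised is Unknown.
inline IndexState parse_state(const std::string& flag) {
    if (flag == "ok") return IndexState::Ok;
    if (flag == "htdig running") return IndexState::Digging;
    if (flag == "sorting") return IndexState::Sorting;
    if (flag == "merging") return IndexState::Merging;
    return IndexState::Unknown;
}

// Text that an indexing tool would write before starting each stage.
inline std::string to_flag(IndexState s) {
    assert(s != IndexState::Unknown);  // Unknown is only ever a parse result
    switch (s) {
        case IndexState::Digging: return "htdig running";
        case IndexState::Sorting: return "sorting";
        case IndexState::Merging: return "merging";
        default:                  return "ok";
    }
}

// A search front end would trust its results only in the Ok state.
inline bool search_is_reliable(IndexState s) {
    return s == IndexState::Ok;
}
```

Treating an unreadable or unknown flag as unreliable is a deliberate choice in this sketch: a search front end would then warn the user rather than silently return results from a half-built database.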
[htdig] Foreign chars (Swedish)
Hello!

I'm having problems with some foreign chars when using htdig to index and search a Swedish site. The locale is set right (sv) and is working in other applications. The problem I have is somewhat weird; maybe it has something to do with "uppercase" and "lowercase"? I can search for words like "Åsa,åsa,Öl,öl" and get the same matches. But when I try to search for "bäst" I get no hits. With "bÄst" I get several hits...

I asked a guy here at the University and he said that there might be complications with "unsigned char" and "char". He gave me the example below.

Please answer at a novice level, my C++ and Unix knowledge is very limited.

Thanks
Philippe Ramkvist-Henry

htlib/StringMatch.cc

    while ((unsigned char)string[pos]) {
        new_state = table[trans[string[pos]]][state];

Should be? or?

    while (string[pos]) {
        new_state = table[trans[(unsigned char)string[pos]]][state];
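In general C++ terms, the cast matters where the character is used as an array index: on platforms where plain char is signed, a byte such as 0xE4 ('ä' in ISO-8859-1) has a negative value, so indexing a 256-entry table with it reads outside the table. The sketch below illustrates that point only. FoldTable, table_index and fold are hypothetical names, and the table is a toy stand-in for something like `trans`, not ht://Dig's actual data or code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical 256-entry folding table, standing in for a table like
// `trans` in the fragment above. It is NOT ht://Dig's real table: it
// lowercases ASCII letters and maps ISO-8859-1 0xC4 to 0xE4.
struct FoldTable {
    unsigned char map[256];
    FoldTable() {
        for (int i = 0; i < 256; ++i) map[i] = static_cast<unsigned char>(i);
        for (int c = 'A'; c <= 'Z'; ++c)
            map[c] = static_cast<unsigned char>(c - 'A' + 'a');
        map[0xC4] = 0xE4;
    }
};

// Turn a char into a table index that is always in the range 0..255.
// Where plain char is signed, the byte 0xE4 has the value -28, so using
// it directly as an index would read outside the table.
inline std::size_t table_index(char c) {
    return static_cast<unsigned char>(c);
}

// Fold every byte of `in` through the table.
inline std::string fold(const FoldTable& t, const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        const std::size_t i = table_index(c);
        assert(i < sizeof t.map);  // holds for every possible byte value
        out.push_back(static_cast<char>(t.map[i]));
    }
    return out;
}
```

With the cast in place, the upper-case and lower-case spellings of a word with an accented letter fold to the same byte string, which is the behaviour the search in the question expects.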
[htdig] pure numbers as search words
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Date: Thu, 25 Nov 1999 15:37:10 +0100
Subject: pure numbers as search words

Hi everybody,

as a new user of htdig I have the following problem: although search strings combining letters and digits are properly found, a string consisting of digits only is completely disregarded. Is there a way to reconfigure this?

Thanks in advance
Florian Nill
Re: [htdig] pure numbers as search words
At 3:37 PM +0100 11/25/99, [EMAIL PROTECTED] wrote:

a string consisting of digits only is completely disregarded. Is there a way to reconfigure this?

See http://www.htdig.org/attrs.html#allow_numbers

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
[htdig] i need help on htdig database format
When htdig exports results from an index in text format, it generates two files. The files look like this:

file1:

0 u:http://www.htdig.org/ t:ht://Dig -- Internet search engine software a:0 m:936027636 s:373 h: h: l:940510479 L:2 I:373 d:http://www.htdig.org/ www.htdig.org ht://Dig Search Software (yes, the developers use it) ht://Dig Parent Directory A:
1 u:http://www.htdig.org/contents.html t:ht://Dig Table of Contents a:0 m:936027636 s:3539 h: Contents General ht://Dig Features and Requirements Where to get it Installation Configuration FAQ Mailing list Uses of ht://Dig License information Reference htdig htmerge htnotify htfuzzy htsearch Configuration file META tags Other How it works Contributors Release notes ChangeLog TODO Bug Reporting Contributed Work Website stats Developer Site Quick Search: h: l:940510479 L:25 I:3539 d:/contents.html A:
2 u:http://www.htdig.org/main.html t:ht://Dig: Overview a:0 m:940044123 s:3717 h: WWW Search Engine Software ht://Dig Copyright (c) 1995-1999 The ht://Dig Group Please see the file COPYING for license information. Recent News * 22 Sep 1999: A new stable release of ht://Dig, htdig-3.1.3, is released. This release is recommended for all production systems. It solves most of the outstanding bugs in the 3.1.x releases. See the release notes or download it. * 1 June 1999: Unfortunately, due to lack of interest from key developers, the ht://Dig Conference from Aug 19-20 will be cancelled. We hope h: l:940510480 L:10 I:3717 d:ht://Dig /main.html A:
3 and so on.

file2:

01oct99 i:115 l:0 w:100998 c:2
01oct99 i:116 l:0 w:100998 c:2
01oct99 i:45 l:6 w:100381 c:2
01oct99 i:46 l:0 w:100998 c:2
02aug1999 i:48 l:361 w:639 a:2
02jun1999 i:50 l:262 w:1382 c:2 a:2
02mar1999 i:53 l:378 w:622 a:2
02may1999 i:51 l:280 w:1349 c:2 a:2
and so on

Can anyone please tell me exactly what these fields mean?

Ronald

Ronald Tournier
Stichting De Digitale Stad
1011 TD Amsterdam
tel. 020 6257493
fax. 020 6382817
tel direkt: 020 5205335
e-mail: [EMAIL PROTECTED]
Re: [htdig] WordPerfect parser?
According to David Adams:

I have downloaded the parse_doc.pl script, and the xpdf and catdoc utilities, and I am now using them to extend our search index to include Word and PDF files. It all works well and with a bit of alteration to the Perl script does exactly what I want. My thanks to the developers! We also have a need to index WordPerfect documents, including those produced by WP 6.1 and later. Can anyone recommend a utility that will run under IRIX 6.5 ?

I haven't come across any open source/freeware WP to text converters. The reason I put the WP hooks in there originally was that some sites had .doc files that were WP rather than Word documents, and the WP documents caused catdoc to blow chunks. Same story for .doc files in RTF format. I then realised there are all sorts of .doc files that aren't MS-Word, so I put in explicit checks for MS-Word magic numbers rather than using catdoc by default, but still kept the WP and RTF hooks in by way of example.

If WordPerfect for UNIX is available for IRIX, and it contains the cvt utility as WP for Linux does, you could write a script that uses that, or adapt the parse_doc.pl script to use it directly. Its usage is:

/usr/local/wplinux/shbin10/cvt -l file.wpd file.txt asci /dev/null

--
Gilles R. Detillieux
E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre
WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba
Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada)
Fax: (204)789-3930
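The magic-number check described above, deciding what a .doc file really is from its first bytes instead of its extension, can be sketched like this. parse_doc.pl itself is a Perl script, so this is only an illustration of the technique in C++. The names DocType, starts_with and detect_doc_type are hypothetical, and the signatures used are the commonly documented ones for each format, which is an assumption here rather than a copy of what the script tests.

```cpp
#include <cassert>
#include <string>

enum class DocType { MsWord, WordPerfect, Rtf, Pdf, Unknown };

// True if `data` begins with the byte sequence `magic`.
inline bool starts_with(const std::string& data, const std::string& magic) {
    assert(!magic.empty());  // an empty signature would match every file
    return data.compare(0, magic.size(), magic) == 0;
}

// Classify a file from its first bytes. The signatures are the commonly
// documented ones and are an assumption, not taken from parse_doc.pl.
inline DocType detect_doc_type(const std::string& head) {
    // OLE2 compound file header: Word .doc files from Word 6 onward use
    // it, but so do other OLE2 documents, so this is only a first filter.
    static const std::string ole2("\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8);
    static const std::string wpc("\xFFWPC", 4);  // WordPerfect
    static const std::string rtf("{\\rtf");      // Rich Text Format
    static const std::string pdf("%PDF-");       // PDF

    if (starts_with(head, ole2)) return DocType::MsWord;
    if (starts_with(head, wpc))  return DocType::WordPerfect;
    if (starts_with(head, rtf))  return DocType::Rtf;
    if (starts_with(head, pdf))  return DocType::Pdf;
    return DocType::Unknown;
}
```

A caller would read the first few bytes of the file into a string, pass them to detect_doc_type, and pick the converter (catdoc, a WordPerfect converter, an RTF converter, or a PDF tool) from the result.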
Re: [htdig] parse_doc.pl alterations
According to David Adams:

I have downloaded the parse_doc.pl script, and the xpdf and catdoc utilities, and I am now using them to extend our search index to include Word and PDF files. It all works well and with a bit of alteration to the Perl script does exactly what I want. My thanks to the developers!

I forgot to ask before, what were your alterations? Something very specific to your needs, or something worth sharing with others?

--
Gilles R. Detillieux
E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre
WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba
Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada)
Fax: (204)789-3930
Re: [htdig] Rundig
According to Jason Carvalho:

When I run 'rundig', it crawls my web site then when it comes to the merge stage, it outputs: Deleted, no excerpt :2156 http://ww...etc. for loads of my pages. All in all, it found about 9500 pages but only merged 7500, giving the above message for the rest. What does this mean?

The two most common causes are:

a) the document contained no text, or the text was excluded by noindex meta tags, or
b) the document was disallowed by the server's robots.txt file.

If you ran htdig or rundig with -vvv, then htdig's output should give you more of an indication of which situation arose with these pages.

--
Gilles R. Detillieux
E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre
WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba
Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada)
Fax: (204)789-3930
[htdig] word_list columns
There are 6 columns in the wordlist file. Obviously col1 is the word. What are the others (i, l, w, c, a)?

--
Aaron Turner, Core Developer
http://vodka.linuxkb.org/~aturner/
Linux Knowledge Base Organization
http://linuxkb.org/
Because world domination requires quality open documentation.
aka: [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED]
Re: AW: [htdig] irrelevant pages in search
Hartmut Steffin wrote:

Thanks for the answer, htmerge does not seem to honour the TMPDIR variable which IS properly set this seems to be an individual problem on my machine. there is even a difference in running rundig from commandline (ok) and via cron/batch (erroneous)

It's not a plot against you, honest. :) If you get different results from the command line and from cron, it simply means that cron's environment is different from the shell's. You might try setting the TMPDIR environment variable explicitly in the crontab file and see if that improves things.

Good luck,
Doug

--
"Welcome to the desert of the real." - Laurence Fishburne as Morpheus, "The Matrix"
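To see why the environment matters, here is a minimal sketch of how a Unix program commonly picks its temporary directory: it asks the environment for TMPDIR and falls back to a default when the variable is missing or empty. This is an assumption about common practice, not the actual code of htmerge or sort. The names temp_dir_from and temp_dir, and the "/tmp" fallback, are chosen for illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Choose a temp directory from an environment value. A missing or empty
// value falls back to "/tmp" (an assumed default for this sketch).
inline std::string temp_dir_from(const char* env_value) {
    if (env_value == nullptr || *env_value == '\0') return "/tmp";
    return env_value;
}

// What the running process actually sees. Under cron, variables set in
// shell startup files are typically absent, so this can differ from the
// result in an interactive shell.
inline std::string temp_dir() {
    const std::string dir = temp_dir_from(std::getenv("TMPDIR"));
    assert(!dir.empty());  // the fallback guarantees a usable path
    return dir;
}
```

Printing the value returned by temp_dir() once from a shell and once from a cron job would show whether the two environments really agree on TMPDIR.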