[htdig] Fw: doc2html

2001-01-22 Thread David Adams

When I wrote doc2html I copied this without change from conv_doc, and I
think it is the same in the original parse_doc parser script.  Is Leong
correct?
--
David Adams
Computing Services
Southampton University


- Original Message -
From: "Leong Peck Yoke" [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Sunday, January 21, 2001 1:18 PM
Subject: doc2html


 Hi,

 I am looking at your code doc2html.pl for a project. I noticed that in
 the function try_text, at line 366, the following code

 s/\255/-/g; # replace dashes with hyphens

 seems to be wrong. Shouldn't it be "s/\055/-/g" instead?
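 (For reference: in a Perl regex, \255 is an octal escape, i.e. it matches
 chr(0255) == chr(0xAD), the Latin-1 soft hyphen, while \055 is octal for
 chr(0x2D), the ASCII hyphen itself.  A quick check from the shell:

 perl -e 'printf("%02x %02x\n", ord("\255"), ord("\055"))'   # prints: ad 2d

 So s/\055/-/g would replace a hyphen with itself, while the existing
 substitution replaces soft hyphens.)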



 Regards,
 Peck Yoke










[htdig] Spelling Help

2001-01-18 Thread David Adams

I am trying to do what I can to help those with spelling difficulties
perform searches on our web pages.
This was triggered by seeing in the htsearch log that attempts to find
"accomodation" were finding some pages, but not the important ones (where it
is spelt correctly)!

Also this University has a commitment to supporting disabled students,
including those with dyslexia.

I would like to ask:

1)  What have other sites done to address this problem?  (Spell-checking
    and correcting our own pages is not possible at present, and may
    never be.)

2)  Can anybody recommend a _good_ (UK English) spell checker for IRIX
    6.5?  (The IRIX spell command does not know a lot of important words,
    such as "midwifery", and I can't figure out how to add to the
    dictionary.)  A spell checker that could suggest words (as do the
    spell checkers in word processors, etc.) would be wonderful.

--
David Adams
Computing Services
Southampton University








Re: [htdig] Unable to contact server - revisited

2001-01-12 Thread David Adams

Is this server in your local network or remote?  It might be worth trying to
index it via a proxy cache.  I found that this cured the problem for us,
though it hasn't helped everybody.

Take a look at the http_proxy configuration file attribute.
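
For example (the proxy host and port here are placeholders for your own
cache, not a recommendation):

http_proxy:     http://proxy.example.ac.uk:3128

All of htdig's requests then go through the cache rather than directly
to the target servers.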

--
David Adams
Computing Services
Southampton University


- Original Message -
From: "Roger Weiss" [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, January 11, 2001 4:18 PM
Subject: [htdig] Unable to contact server - revisited


 Hi,

 I'm running htdig v3.1.5 and my digging seems to be running out of steam
 after it runs for anywhere from 20 minutes to an hour or so. The initial
 msg was "Unable to connect to server". So, I ran it again with -vvv to get
 the error message below.

 pick: ponderingjudd.xxx.com, # servers = 550
 3213:3622:2:http://ponderingjudd.xxx.com/ponderingjudd/id6.html: Unable to
 build
  connection with ponderingjudd.xxx.com:80
  no server running

 I've replaced part of the URL with xxx to protect the innocent. The server
 certainly is running and I had no trouble finding the mentioned url. Is
 there some parm I need to set or limit I need to raise?
 We're running an apache server with startservers =25 and minspace=10.

 Thanks for your help,
 Roger

 Roger Weiss
 [EMAIL PROTECTED]
 (978) 318-7301
 http://www.trellix.com


 









Re: [htdig] External converters - two questions

2001-01-11 Thread David Adams

Thanks Gilles, that is useful information.  I had thought that perhaps
pdftotext actually *added* hyphenation to a document.  If the problem is
removing the hyphenation that is actually written into the document then I
can see that not everybody will wish to do this.  It is easily switched off
in doc2html.pl, but only if you know where to look.  The next version will
definitely be better in this respect.

As for magic numbers, I'll wait and see if anybody else can offer some
additional observations.

--
David Adams
Computing Services
Southampton University


- Original Message -
From: "Gilles Detillieux" [EMAIL PROTECTED]
To: "David Adams" [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Thursday, January 11, 2001 1:12 AM
Subject: Re: [htdig] External converters - two questions


 According to David Adams:
  I hope to find time for a further revision of the external converter
  script doc2html.pl and possibly simplify it a little.
 
  The existing code includes de-hyphenation (which is buggy) taken
  originally from parsedoc.pl.  The question is: is this necessary, does
  pdftotext (or any other utility) actually break up words across lines
  with the addition of hyphens?  Is the hyphenation code of any use?
  Information and opinions are requested.

 I added this code for dealing with a lot of the PDFs I needed to index
 on my site, and for the Manitoba Unix User Group web site as well (for
 their newsletters).  Unlike HTML documents, I've found a lot of PDF files
 make pretty heavy use of hyphenation.  Without the dehyphenation code,
 hyphenated words appeared as two separate words in the resulting text.
 E.g. "conv- erter" was taken as "conv" and "erter", so a search for
 "converter" may not turn up this document if the word didn't appear
 unbroken elsewhere in the document.

 Sorry about the EOF bug in this code.  It was a quick hack, and I don't
 know Perl all that well.  There was a patch to fix this, though.  Are
 there any other bugs?

 In any case, in parse_doc.pl and conv_doc.pl, I wrote it to be optional,
 enabled by this line:

 $dehyphenate = 1;   # PDFs often have hyphenated lines

 which only applied to PDFs.  The ps2ascii utility already does its own
 dehyphenation, but pdftotext doesn't.  Other document types are less
 likely to need this.  If dehyphenation of PDFs is not desired, it's easy
 enough to change the 1 to a 0 above when configuring the script.  I don't
 recall if your doc2html.pl has the same sort of option.

  Also inherited from parsedoc.pl is extra code to cope with files which
  may be an "HP Print job" or contain a "MacBinary header".  Are such
  files really encountered?  If so what type of files are they, Word,
  PDF or what?  Does the magic number code need to take account of them?

 Another hack of mine.  The MUUG web site had some pretty odd-ball
 PostScript files on it that were causing error messages while indexing
 their site.  Instead of simple and pure PS in these files, some had a
 MacBinary wrapper or HP PJL codes in them, which ps2ascii happily would
 skip over, but the Perl code wasn't accepting these files.  These hacks
 were to allow these files through.  Dunno if anyone else has found they
 help or hurt them, but I'm keeping them in my own copies of the scripts.
 I know they're kind of ugly, so if you want to get rid of them in your
 code for the sake of simplicity, I'd certainly understand.

 --
 Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
 Spinal Cord Research Centre   WWW:    http://www.scrc.umanitoba.ca/~grdetil
 Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
 Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930

 









Re: [htdig] PDFs, numbers, and percent signs

2001-01-10 Thread David Adams

At this stage it is not so much you being given ideas as you supplying
enough information.

What parser are you using?  Are you using it directly or via a script such
as parsedoc or doc2html?

You say "25%" occurs in the parser O/P.  Is that the output direct from
pdftotext, or from doc2html, or what?

--
David Adams
Computing Services
Southampton University

- Original Message -
From: "Philip E. Varner" [EMAIL PROTECTED]
To: "Geoff Hutchison" [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Wednesday, January 10, 2001 3:07 PM
Subject: Re: [htdig] PDFs, numbers, and percent signs


 Yes, "25%" shows up in the output of the parser.  I searched for a word
 near an instance of it in a document, and the long results print out
 the "25%" too.  Any other ideas?

 Phil

 On Tue, 9 Jan 2001, Geoff Hutchison wrote:

 : At 1:52 PM -0500 1/9/01, Philip E. Varner wrote:
 : So, I'm guessing this is either a problem with the percent sign
 : (25%, etc), or not having _all_ words indexed.
 :
 : I'd run a PDF through your external parser/converter and take a look
 : at the output. Are you seeing 25% (or whatever) showing up there?
 :
 : --
 : -Geoff Hutchison
 : Williams Students Online
 : http://wso.williams.edu/
 :
 : 

 --

 A distributed system is one in which the failure of a computer you
 didn't even know existed can render your own computer unusable.
 -- Leslie Lamport


 









[htdig] IRIX compile fix

2001-01-09 Thread David Adams

This may help the query about compiling htdig under IRIX:

Forwarded message:
 From [EMAIL PROTECTED] Thu Aug 31 14:13:11 2000
 Mailing-List: contact [EMAIL PROTECTED]; run by ezmlm
 Precedence: bulk
 Delivered-To: mailing list [EMAIL PROTECTED]
 Date: Thu, 31 Aug 2000 14:09:47 +0100 (BST)
 From: Bob MacCallum [EMAIL PROTECTED]
 Message-Id: [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Subject: [htdig] IRIX compile fix
 Content-Length: 999
 X-Status: 
 X-Keywords:
 X-UID: 252
 
 
 Hello,
 
 I just managed to compile htdig for IRIX 6.5 without the o32/n32
 error and also without too many warnings,
 using cc (not gcc).  For the record, here is what I had to do,
 it differs a little from what it says in:
 http://www.mail-archive.com/htdig@htdig.org/msg00832.html
 
 ./configure 
 
 edit Makefile.config to change these two lines
 # LIBDIRS=-L../htlib -L../htcommon -L../db/dist -L/usr/lib
 LIBDIRS=  -L../htlib -L../htcommon -L../db/dist
 # LIBS=   $(HTLIBS) -lz -lnsl -lsocket 
 LIBS= $(HTLIBS) -lz -lsocket 
 
 make
 
 that's it.  it works, and as usual, I don't really know why... ;-)
 our /etc/compiler.defaults are: -DEFAULT:abi=n32:isa=mips3
 I'm not subscribed to the list (yet), so any replies to me please.
 
 Bob.
 
 
 
 


-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton






[htdig] Re: ht://dig on IRIX 6.5 (fwd)

2001-01-09 Thread David Adams

Here is part of an old email describing another way of compiling htdig
under IRIX, it works for us. 

  
  I used the following script to run the configure command. I try to 
  always run configure from a script rather than by hand - that way I don't have 
  to remember what options I had to use to get it working.
  
  #!/bin/sh
  CFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CFLAGS
  CPPFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CPPFLAGS
  LDFLAGS="-mips4"; export LDFLAGS
  ./configure --prefix=/opt/local/htdig-3.1.2 \
  --with-cgi-bin-dir=/opt/local/htdig-3.1.2/cgi-bin \
  --with-image-dir=/opt/local/htdig-3.1.2/graphics \
  --with-search-dir=/opt/local/htdig-3.1.2/htdocs/sample
  
  Clearly the important bits are the FLAG settings for C, C++ and the linker. We 
  are using MipsPro 7.3 compilers both for the C and C++ compilers. 
  
 

-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton






[htdig] External converters - two questions

2001-01-09 Thread David Adams

I hope to find time for a further revision of the external converter script
doc2html.pl and possibly simplify it a little.

The existing code includes de-hyphenation (which is buggy) taken originally
from parsedoc.pl.  The question is:
is this necessary, does pdftotext (or any other utility) actually break up
words across lines with the addition of hyphens?  Is the hyphenation code of
any use?  Information and opinions are requested.

Also inherited from parsedoc.pl is extra code to cope with files which may
be an "HP Print job" or contain a "MacBinary header".  Are such files really
encountered?  If so what type of files are they, Word, PDF or what?
Does the magic number code need to take account of them?

--
David Adams
Computing Services
Southampton University







[htdig] Fw: [htdig] - Question for start_url and exclude_urls

2001-01-05 Thread David Adams

Mohai,

Your colleague Aditya got into the habit of emailing his Ht://Dig
problems to me rather than to the htdig mailing list.
As this latest query is not something I can immediately answer, I am
forwarding it to the list.

For authoritative answers to all queries please always email to
[EMAIL PROTECTED] and not to me personally.

- Original Message -
From: "Mohai Wang" [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, January 04, 2001 7:39 PM
Subject: [htdig] - Question for start_url and exclude_urls


 David,

 Aditya has been taking 3 weeks' vacation from yesterday.  I am going to
 take over this "htdig" search engine project.

 Question:
 1. start_url:
    With start_url = "http://stagsite.coreon.com/download/", when I run
 "rundig -vvv log" I get the error message "DB2 problem...: missing
 or empty key value specified" on the screen.  I have also attached the
 debug-mode "log" and "htdig.conf" files, please take a look.  Did I set
 a wrong option?  With start_url = "http://stagsite.coreon.com/" it goes
 through and writes the index, but I only need to index everything under
 "download", nothing else.

 2. exclude_urls:
    I tried to do something different: start_url = "http://stagsite.coreon.com/",
 then I added exclude_urls = "/cgi-bin/ /calendar/ /coreonlib/".  When I run
 "rundig -vvv log3", it reads /coreonlib/ first and then stops.  After I took
 "coreonlib" off exclude_urls and reran "rundig -vvv log2", everything was
 indexed, with "cgi-bin" and "calendar" rejected.  Could you tell me why?
 Please take a look at the log3 file.



 Mohai Wang
 Coreon Inc.,

--
David Adams
Computing Services
University of Southampton

 log.dat
 htdig.conf
 log3.dat




Re: [htdig] doc2html hangs while parsing PDFs

2001-01-03 Thread David Adams


On Wed, 03 Jan 2001 15:59:57 +0100 Berthold Cogel 
[EMAIL PROTECTED] wrote:

 Hello!
 
 I just tried to index our site with htdig-3.1.5 on a Sun UltraSparc with
 SunOS 5.7.
 To parse PDF documents I used doc2html and pdftotext. My first mistake
 was to leave max_doc_size at the default value. But I don't think that
 this was the reason for my problem:
 
 Sometimes doc2html hangs, eats resources, and produces an unknown child
 process with a defunct signature in the top list (perhaps pdftotext?). 
 

There is a known bug in the hyphenation code in doc2html.pl 
which causes it to loop indefinitely when parsing a .PDF 
file whose last character is a hyphen.  This 
seems unlikely, but I have seen it.

In sub try_text change:

  while (<CAT>) {
    while ( m/[A-Za-z\300-\377]-\s*$/ && $set->{'hyph'}) {
      ($_ .= <CAT>) || last;
      s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s;
    }
    s/\255/-/g; # replace dashes with hyphens

To:

  while (<CAT>) {
    while ( m/[A-Za-z\300-\377]-\s*$/ && $set->{'hyph'}) {
      $_ .= <CAT>;
      last if eof;
      s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s;
    }
    s/\255/-/g; # replace dashes with hyphens

 I don't think that the document size is a reason for this effect,
 because some of the files that caused the trouble (last line in
 htdig.log) had a size of only 10 to 40 KByte. Some bigger files (up to
 34 MByte) didn't stop doc2html. 
 
 By the way: Where do I have to set $Verbose?

sub init {

  # set = 1 for O/P on stderr if successful
  $Verbose = 1;

 Is it possible to write the
 messages of pdftotext and doc2html in a separate logfile?
 

Perhaps in the next version of doc2html.

 Why doesn't htdig/doc2html take the complete document for parsing?  You
 only have to take max_doc_size into account when you take the parsed
 documents for indexing.  This might reduce the problems with doctypes
 other than html or plain text.

max_doc_size affects all documents fetched by htdig.  It is 
a safety device to prevent the downloading of extremely 
large (or infinitely long!) documents.
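
For example (the value is purely illustrative), the limit can be raised
in the configuration file:

max_doc_size:   2000000

Anything beyond the limit is simply truncated, so an external converter
may be handed an incomplete document if the limit is set too low.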

 
 Thanks in advance
 
 Berthold Cogel
 
 

--
David Adams
[EMAIL PROTECTED]







Re: [htdig] Question about parsing word, pdf, ppt etc.

2000-12-19 Thread David Adams

Try executing the parsers at the command line to see what happens.

I don't know, but it seems quite possible that the current version of
ppt2html is not able to cope with the Powerpoint 2000 format.  If that is
the case you could try contacting the author directly.  I have noticed that
ppt2html can require a lot of memory (several hundred megabytes) to convert
some .ppt files, could you have a problem with a shortage of memory?

Are you using catdoc or wp2html to convert Word files?  Wp2html extracts the
'subject' from the document summary and puts it in the header, which might
be the problem.  Catdoc does often include gibberish in its output, and you
could find removing the -b option in the call of catdoc an improvement.

Doc2html.pl uses pdfinfo to extract the title of the .PDF file, and I have
seen .PDF documents where the title is 'þÿ ' for some reason.  You might
need to modify doc2html.pl to suppress such titles.
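
(The 'þÿ' pair is the UTF-16 byte-order mark, bytes 0xFE 0xFF, which
pdfinfo emits at the start of Unicode title strings.  A minimal sketch of
such a suppression, assuming the title string is held in $title:

    # discard titles beginning with a UTF-16 byte-order mark
    # (\376\377 is octal for 0xFE 0xFF)
    $title = '' if $title =~ /^\376\377/;

)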

- Original Message -
From: "Aditya Shah" [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Tuesday, December 19, 2000 2:28 AM
Subject: [htdig] Question about parsing word, pdf, ppt etc.


 Hello,

 We are evaluating the use of htDig for an intranet site. Our users publish
a
 lot of Word, Excel, Powerpoint and PDF Documents that we want to be able
to
 search through.

 We have been able to get all the external parsers required. We have run
into
 the following issues:

 1) Unable to parse powerpoint Documents. The documents are MS- Powerpoint
 2000 Documents. We got the ppt2html parser from www.xlHtml.org . The
 statements in htdig.conf are something like this:

 application/msexcel->text/html /app/doc2html/doc2html.pl \
   application/mspowerpoint->text/html /app/doc2html/doc2html.pl \
   application/vnd.ms-excel->text/html /app/doc2html/doc2html.pl \
   application/vnd.ms-powerpoint->text/html /app/doc2html/doc2html.pl

 Excel works great, but for powerpoint, when I run the 'rundig' program, it
 just kind of hangs there.

 2) Getting gibberish in the headers for some word and pdf documents. For
 example, for a word document:

 In doc 2 html ; ; ; ; ; ; ; ;   Fax Fax Please Recycle Comments: `Þ"Û?
 gP?]...øu-OwPÄ?+`É?0|?(ÜÐ oè?UYÆìÌ{èO?ãôrsÊ-| ?]ç* ú! Ý^mÀB?t
 5?z+¿Hc-Ð#*ÄgÔ"C?ò,mÎ?Púss (_ûÛ~$Û+-V Sö?ýô?_+ywì?lt;?-? ?\...Y ...

 when the search results are returned. This does not happen for all word
 documents, only for some of them.

 And for a PDF document, we always get the 'þÿ ' character before any file
 name in the search results section.

 Also, do you know if there is a parser for MS-Visio?

 Any help would be appreciated.

 Thanks.

 Aditya Shah


 









Re: [htdig] Problem with virtual server-names

2000-11-28 Thread David Adams


- Original Message - 
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, November 28, 2000 1:21 PM
Subject: [htdig] Problem with virtual server-names


 Hi,
 we have 2 different aliases and 1 IP-adress on one webserver:
 1.) http://www.abc.de
 2.) http://www2.abc.de
 
 in the config-file we set allow_virtual_hosts: true
 the start_url is set to http://www2.abc.de/map1
 limit_urls is the same as start_url.
 
 When we run htdig with -c configfile, there is only one message:
 
 New server: www2.abc.de, 80 
 
 What to do?
 
 Greetings
 Uli
 
 
 

If you want to index both servers then you should have:

limit_urls_to:  http://www2.abc.de/  http://www.abc.de/

You have set it to index one page on one server.

--
David Adams
University of Southampton







[htdig] Antwort: Re: [htdig] Problem with virtual server-names

2000-11-28 Thread David Adams


On Tue, 28 Nov 2000 14:43:35 +0100 
[EMAIL PROTECTED] wrote:

 Hi David,
 
 it is a little bit different:
 
 We have 2 aliases but I want to index only one. So I write in the configfile
 start_url: www2.abc.de/map1.
 But htdig said: unable to build connection with www2.abc.de, 80.
 
 greetings
 
 uli
 
 

Then you need

limit_urls_to: www2.abc.de/

otherwise htdig will only index the page www2.abc.de/map1

You did not mention the "unable to build connection" error 
message in your first email.  Perhaps your server is down 
or you have a network problem?

------
David Adams
[EMAIL PROTECTED]







Re: [htdig] Does htmerge remove URL from database ?

2000-11-27 Thread David Adams

I found that the extra runs of htmerge were necessary when I was merging two
runs of htdig.  Unless I ran both databases through htmerge before merging
them I was getting

Deleted, invalid:

against some pages in the htmerge run.  Compared to the time required to run
htdig, the extra htmerge runs are trivial, so you have little to lose by
including them.

Use the -v option with both htdig and htmerge and see if you get any message
re the pages that don't appear in the final index.
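
For example (the configuration file name is a placeholder):

htdig -v -c site.conf
htmerge -v -c site.conf

Repeating the option (e.g. -vvv) increases the level of detail.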


- Original Message -
From: "Geoff Hutchison" [EMAIL PROTECTED]
To: "Olivier Korn" [EMAIL PROTECTED]
Cc: "Gilles Detillieux" [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Sunday, November 26, 2000 4:07 AM
Subject: Re: [htdig] Does htmerge remove URL from database ?


 At 2:21 PM +0100 11/23/00, Olivier Korn wrote:
 I tried it and it didn't solve the problem. BTW, I don't think that
 these extra merges are necessary either.

 No, they should not be at all necessary unless there's truly
 something horrific wrong with the merging code--it only uses the
 files directly output from htdig. (My idea was that it would be
 faster if you didn't need to run htmerge on intermediate DB.)

 Now, I run :
 htmerge -c site#.conf
 then
  htmerge -c site1.conf -m site#.conf (with # > 1)
 
 If I then run
 htsearch -c site5.conf with words="rénovation tourisme", it finds
 the document (in first place.)
 But if I do
 htsearch -c site1.conf with the same words, it returns the "nomatch"
document.
 
  Some of the web hosts are case sensitive and some are not. Could it
  be the source of my problem?

 I wouldn't think so. But you have to be pretty careful that the URL
 encodings are shared between your site.conf files. Personally, I make
 up a "main.conf," include that in the other files and only set the
 start_url and a minimal number of things in the individual site.conf
 files. In particular, it makes it easy to change something in all
 config files at once.

 --
 -Geoff Hutchison
 Williams Students Online
 http://wso.williams.edu/

 









Re: [htdig] different search results

2000-11-20 Thread David Adams

Gilles R. Detillieux wrote:
 
 According to gkalter:
  Hope this mailing-list is the right one..;-)
  
  Today I got htdig to work pretty well on a site containing many
  PDF-Files.
  
  • Cobalt Raq2 microserver (mips) with RedHat-based Linux
  
  After updating the C++ Compiler (see mailing list) I got rid of the
  segmentation error messages and htdig worked well.
  
  Cryptic output from the search form was fixed by adding a ".cgi"
  extension to htsearch in the local cgi-bin folder. Solution also found
  in the list - thanks to all those helpful people!
 
 I think the FAQ also has some pointers on getting the CGI to work.
 
  Because I wanted to get direct links to single PDF pages out of the
  found excerpts, I got the pdftodig.py script for external parsing of
  PDF files. (Do I have to mention that python IS NOT installed on
  Cobalt Raqs?) O.K. this problem could also be solved.
 
 It would also be a fairly trivial change to the perl scripts conv_doc.pl
 or doc2html.pl to make it replace form feeds in pdftotext output with
 the correct HTML <a name="..."> tags for the anchors.  You'd then be
 using an external converter, rather than an external parser, and possibly
 avoiding parser-related problems.
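 
 A sketch of that change (untested, and the anchor naming scheme is only
 an assumption; $text is assumed to hold the converted document):
 pdftotext separates pages with form-feed characters, so a single
 substitution over the converted text would do, e.g.
 
     # turn each form feed from pdftotext into a numbered page anchor;
     # the first form feed marks the start of page 2
     my $page = 1;
     $text =~ s{\f}{ '<a name="page' . ++$page . '"></a>' }ge;
 
 The search results could then link to url#page2, url#page3, and so on.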
 
  Now everything works pretty well, with one little exception.
  
  Using a complete search string e.g. "Sensor" lists all matching
  documents, and the text contains the search word (bold typeface) with a
  link to the specific single page of the found PDF file.  (Great!)

I think I may be missing something here, perhaps somebody can explain
for me.  Am I right in thinking that the whole and only point of this is
to produce, in the lists produced by htsearch, excerpts from the first
page of .PDF documents containing a search word?

Or does one really get a link which when followed brings up the .PDF
document open at the relevant page?  If so, that would be quite something,
especially if it worked for a range of browsers.  What would be the correct
HTML <a name="..."> tags for the anchors?


-- 
 
David Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton






Re: [htdig] htdig and MSWord

2000-11-14 Thread David Adams

 
 Hello,
 
 I have been operating htdig for a short time and now have the following
 question: 
 
 Is it possible to search the content of MSWORD documents
 (Version 6.0, 7.0, WinWord 2000) using HTDIG? 
 
 Or is there another search mechanism which could do it? 
 
 Markus Fabritius
 
 -- 
 Sent through GMX FreeMail - http://www.gmx.net
 

Yes, using an external parser, specified by an

external_parsers:

statement in the configuration file.
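
A typical statement (the converter path here is only an example) looks like:

external_parsers:   application/msword->text/html /usr/local/bin/doc2html.pl \
                    application/pdf->text/html /usr/local/bin/doc2html.pl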

On the htdig web site click on "Contributed work" and then "External Parsers".
You should use either doc2html.pl or conv_doc.pl, they are both Perl scripts
which call various utility programs to do the actual conversion.  Do not
use the old parse_doc script.

Doc2html.pl gives you a choice of either wp2html (very cheap commercial
product) or catdoc (public domain) to convert Word files. 

-- 
 
David Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton






[htdig] Antw: Re: [htdig] Problems with parse_doc.pl and German Umlaute

2000-10-27 Thread David Adams

I'm glad that doc2html works OK for you.

In Perl 
$WP2HTML = "";
and
$WP2HTML = '';
are equivalent.

On Thu, 26 Oct 2000 9:23:37 +0200 [EMAIL PROTECTED] wrote:

 Thanks for your help!
 Your tool works perfectly, especially with German Umlaute. The description in the 
Details file was very helpful, so it was no problem for one who has no experience 
with perl to use doc2html.
 But there is one little annotation for the Details file. In the install description 
you write: If you don't have a particular utility then set its location as a null 
string.  For example:
 $WP2HTML = '';
 
 I don't know but I think you mean $WP2HTML = ""; or?
 
 
 Christian Huhn
 
  [EMAIL PROTECTED] 25.10.2000  15.41 Uhr 
   Hi,
   I want to index PDF-Files with German Umlaute (ä, ö, ü, ß). Some tests had shown 
me that htdig (v. 3.1.5) and xpdf (v. 0.91) are working pretty good with German 
Umlaute, but the external parser parse_doc.pl has problems with them. It splits words 
with Umlaute in two words without the Umlaut.
  For example:
  w   beim    41  0
  w   diesj   45  0
  w   hrigen  50  0
  w   den     58  0
  w   Platz   62  0
   In this case the German word "diesjährigen" is split in "diesj" and "hrigen" and 
I can find both with htsearch.
   Does anyone know how to solve this problem for example with a modified version 
of parse_doc.pl?
   Thanks,
   Christian Huhn
 
 
 You could try the doc2html parser.  I think that the latest version,
 available from the Ht://Dig web site, will not split words this way, but
 I have not tested it thoroughly.
 
 If doc2html does not parse your .PDF files properly, then email an
 example to me personally, and I'll make sure that the next version of
 doc2html works correctly.
 
 --  David J Adams
 [EMAIL PROTECTED]
 Computing Services
 University of Southampton
 
 
 

--
David Adams
[EMAIL PROTECTED]







Re: [htdig] Problems with parse_doc.pl and German Umlaute

2000-10-25 Thread David Adams

 
 Hi,
 
 I want to index PDF-Files with German Umlaute (ä, ö, ü, ß). Some tests had shown me 
that htdig (v. 3.1.5) and xpdf (v. 0.91) are working pretty good with German Umlaute, 
but the external parser parse_doc.pl has problems with them. It splits words with 
Umlaute in two words without the Umlaut.
 For example:
 
 w   beim    41  0
 w   diesj   45  0
 w   hrigen  50  0
 w   den     58  0
 w   Platz   62  0
 
 In this case the German word "diesjährigen" is split in "diesj" and "hrigen" and I 
can find both with htsearch.
 
 Does anyone know how to solve this problem for example with a modified version of 
parse_doc.pl?
 
 Thanks,
 
 Christian Huhn
 

You could try the doc2html parser.  I think that the latest version,
available from the Ht://Dig web site, will not split words this way, but
I have not tested it thoroughly. 

If doc2html does not parse your .PDF files properly, then email an
example to me personally, and I'll make sure that the next version of
doc2html works correctly. 

-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton






Re: [htdig] Can't get my search to update correctly..

2000-10-11 Thread David Adams

 
 According to Rivera, Tony:
  However, that's not working...about 5 days ago I added a new
  directory /www/itss and have made numerous links to it from my index page
  and various other pages on the server and it is still not getting picked up
  when I do a search for it.
 
 I assume these are HTML links and not JavaScript ones.  One possibility is
 that the pages you modified are actually dynamic content (SSI, PHP, etc.)
 and so the server isn't returning a Last-Modified header.  If this is the
 case, htdig won't realize the pages have been modified.  You can set the
 modification_time_is_now attribute to true, but then htdig will reindex
 all dynamic pages every time it runs.

This seems to depend on the server.  The Southampton University server
(Apache) does SSI and allows
<!--#include virtual="..." -->
and 
<!--#echo var="LAST_MODIFIED" -->
but little else.

It does not put Last-Modified into the header when serving an SSI page. 
However, even with modification_time_is_now true it still returns "not
changed" unless the file *has* been modified since the last run of
htdig. 

But a departmental server we also index only gives "not changed"
on its .ps and .pdf files.

-- 
 
David Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton






Re: [htdig] ... but not changed

2000-10-05 Thread David Adams

 
 According to Geoff Hutchison:
  On Wed, 4 Oct 2000, David Adams wrote:
    It had not occurred to me that an SSI file was "dynamic", I live and learn!
  
  Yes, if you think about it for a second, you can realize that there's no
  way for the server to be entirely sure of the modification date for SSI
  files. It *could* send the date of the file itself, but what if it
  includes a file that has changed, or an actual CGI?
 
 Well, if the server were REALLY smart about it, it could keep track of
 the most recently modified include file or main file, and use that as
 the last modified date.  It would only need to suppress the header if
 CGI output is included in the mix.  Of course, the latter case would
 probably account for about 90% of SSI usage.  :)
 
 -- 
 Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
 Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
 Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
 Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930
 

We certainly don't allow CGI output in SSI on our server, but I've no way
of knowing if that is unusual.

-- 
 
David Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton






[htdig] ... but not changed

2000-10-03 Thread David Adams

A simple query (I hope).

When, during an update run, htdig says of a page: "retrieved but not
changed", how does htdig decide that the page is the same as the last time?

An author is maintaining that she added a link to a page and that an update
run of htdig failed to follow the new link(s) she had added.

-- 
 
David Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton






Re: [htdig] Different domains?

2000-07-28 Thread David Adams

Quoting Ken Convery [EMAIL PROTECTED]:

 ht://dig looks like a great tool for maintainers of intranets and/or several
 internet web servers.  I have a question about its application to something
 we are trying to do here.  We are developing relationships with a few other
 online companies and want to make content from their sites available by link
 on our site.  We are thinking we can use ht://dig to index those other sites
 so we can search out and display the pertinent information on our site in
 summary form and provide the link to a specific page on their site for more
 information.
 
 In a nutshell: can ht://dig index other web servers specified outside my
 domain or network?

Yes.  I maintain a "local community" index which now covers almost a thousand 
servers (real and virtual), most of them commercial.

I would recommend that you access them through a proxy, specified by the 
http_proxy: statement in the configuration file.

 
 If so would we need other than http to these other servers or any special
 access such as file system privileges?
 

No, but https servers are a special case, I can't answer for them.

 secondly are there any problems with sites that generate content
 dynamically?  Or will ht://dig simply look at static HTML pages or other
 static documents?
 

There are usually no difficulties with dynamic pages, but problems can occur.  
The exclude_urls: statement is intended to trap them.  In my case I only have

exclude_urls:   referer=

I suggest caution, adding sites one by one to your search list, and keeping
max_hop_count and server_max_docs low at first.
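
For example (the values are purely illustrative):

max_hop_count:      3
server_max_docs:    200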

 
 Thank you very much
 Ken Convery
 Avian Pilot Systems Inc.
 


David Adams
[EMAIL PROTECTED]
Computing Services
Southampton University






Re: [htdig] same page, different ranking?

2000-07-28 Thread David Adams

Quoting Mike Lewis [EMAIL PROTECTED]:

 Hi,
 
 I've installed Htdig to test it for use on our site (we currently run
 Netscape Enterprise server and I don't like the built-in search).
 
 I have a problem. If I search for the word 'john' (boss' name) at
 http://kmi.open.ac.uk/search/ the top two pages found are boss' home page -
 but one gets 4 'stars' while the other gets only 1. The same result for a
 considerable number of other searches ('marc', 'paul', 'simon').
 
 I've had a look through the list archive but can't find an answer. Can
 anyone suggest why this might be happening?
 
 Thanks,
 Mike
 
 -- 
 Systems Administrator, Knowledge Media Institute (KMi)
 The Open University, Walton Hall, Milton Keynes MK7 6AA  UK
 http://kmi.open.ac.uk/
 Work: +44 (0) 1908 652832   Mobile: +44 (0) 7990 536490
 

The one page with two URLs, yes?

Then the answer must be in the "description_factor".  To quote the manual:

"Plain old "descriptions" are the text of a link pointing to a document. This 
factor gives weight to the words of these descriptions of the document. Not 
surprisingly, these can be pretty accurate summaries of a document's content."

The word "john" probably occurs in links to 
http://kmi.open.ac.uk/people/domingue/, but not in links to 
http://kmi.open.ac.uk/people/domingue/john.html 

To test this theory add to your configuration file:

description_factor: 0

and rebuild your index from scratch.


You might wish to consider whether to keep

description_factor: 0

permanently.  It's what we do.

Also I would suggest you attempt to sort out the mess of having one page with 
two URLs, though perhaps that is easier said than done.



David Adams
[EMAIL PROTECTED]
Computing Services
Southampton University






Re: [htdig] Htmerge: Deleted, invalid

2000-07-24 Thread David Adams

Quoting Gilles Detillieux [EMAIL PROTECTED]:

 According to David Adams:
  I use the standard MIPSpro compiler.  The script I use (thanks to my
  former colleague James Hammick) to setup the Makefile is:
  
  #!/bin/sh
  CFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CFLAGS
  CPPFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CPPFLAGS
  LDFLAGS="-mips4 -L/usr/lib32 -rpath /opt/local/htdig-3.1.5/lib";
  export LDFLAGS
  ./configure --prefix=/opt/local/htdig-3.1.5 \
--with-cgi-bin-dir=/opt/local/htdig-3.1.5/cgi-bin \
--with-image-dir=/opt/local/htdig-3.1.5/graphics \
--with-search-dir=/opt/local/htdig-3.1.5/htdocs/sample
  
  A lot of that is site-specific, and the "-rpath directory" option is only
  needed because the compression library is not in a standard place on the 
  machine on which htdig is run.
  
  The "-woff all" option suppresses most warning messages.  I will remove
 it,
  recompile htdig and send the result directly to Gilles, it might contain a
 clue.
 
 As Sinclair mentioned, 'you need to have the 2.95.2 gcc and the latest
 gnu "make".'  I don't know that anyone has ever gotten ht://Dig to work
 with SGI's own compiler.  If fact, we got a lot of reports from folks
 who couldn't even get it to compile.
 
 If you're really determined to get to the bottom of this and make it work
 with the SGI compiler, I wish you well, but I doubt I can help much.
 I looked at the output you sent me, and didn't really see any red
 flags pointing to an obvious problem.  I know that the Serialize and
 Deserialize functions for the db.docdb records can be a tad finicky, so
 that would probably be a place to look.  There could also be problems
 with incorrect assumptions about word sizes, e.g. if the SGI compiler
 has 64-bit long ints.  I'd also look at the db.wordlist records (they're
 ASCII text) before and after htmerge, to see if htdig is actually telling
 htmerge to remove some of these documents, or if htmerge is deciding to
 do so on its own.
 
 For the time being, the ht://Dig code hasn't had much of a workout on
 non-GNU compilers, so it doesn't seem to do well on them.  If you can
 help remedy that, great.  If you want to get the package working as
 quickly and easily as possible, I'd suggest trying the GNU C and C++
 compilers.
 
 -- 
 Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
 Spinal Cord Research Centre   WWW:    http://www.scrc.umanitoba.ca/~grdetil
 Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
 Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930
 

I have been using htdig (3.1.2 and then 3.1.5) on an IRIX system for about a 
year and I have been very pleased with it.  I would say that we've given it a 
good workout here.  The problem with the "Deleted, invalid" messages only 
occurs with a second, relatively new search index.

The first index is made from a single run of htdig covering 33 servers, all in 
the local domain, and on this week's initial dig htmerge reports 49,233 
documents and not a single "Deleted, invalid".

The second index is made from two runs of htdig covering a total 969 (yes 969 
!) servers using a proxy.  Htmerge reports a mere 3,096 documents and 86 
"Deleted, invalid".

I have looked at the db.wordlist files (which are written to only by htdig - is 
that right?) and it would appear that htdig is flagging the pages for htmerge 
to delete and is not finding any words in them.

I can advance these theories:

It is not a bug, but is due to the use of a proxy. (I use a proxy 
because without one, a portion of the sites on any run of htdig were 
found to be not responding or even unknown.  With a proxy, htdig appears
to have no such problems.)

It is a bug due to the use of a proxy.

It is a bug which only shows when compiled under IRIX.

It is a bug which only occurs when there many different servers.

I intend to re-build the second index using htdig -vvv and perhaps learn 
something.

--
David Adams
[EMAIL PROTECTED]
Computing Services
Southampton University






Re: [htdig] Htmerge: Deleted, invalid

2000-07-18 Thread David Adams

Quoting Gilles Detillieux [EMAIL PROTECTED]:

  
  IRIX 6.5, Htdig 3.1.5
  
  One of the symptoms is that there is no consistency.  Today's re-index
  reported 84 pages to be invalid.  Of these only one was from the
  http://www.tregalic.co.uk/sacred-heart/ site, and this time it was
  churchpage7.html.  And that page is *NOT* found by any search on my index,
  though I can follow links to it from other pages and browse it.
  
  I don't see how you can investigate this yet, but unless people put in
  reports like mine you will always be able to claim the "no-one else
  is having this problem". 
  
  I will continue to look for a pattern which might give a clue. 
 
 I'm inclined to think this is a platform-specific problem.  Most of
 the trouble reports we've seen about IRIX systems are from users who
 can't even get htdig compiled, let alone running, so I don't think the
 package has had a thorough workout under IRIX.  Which compier did you
 use to build it?
 
 -- 
 Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
 Spinal Cord Research Centre   WWW:    http://www.scrc.umanitoba.ca/~grdetil
 Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
 Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930
 

That is a possibility worth pursuing.

I use the standard MIPSpro compiler.  The script I use (thanks to my former 
colleague James Hammick) to setup the Makefile is:

#!/bin/sh
CFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CFLAGS
CPPFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CPPFLAGS
LDFLAGS="-mips4 -L/usr/lib32 -rpath /opt/local/htdig-3.1.5/lib";
export LDFLAGS
./configure --prefix=/opt/local/htdig-3.1.5 \
  --with-cgi-bin-dir=/opt/local/htdig-3.1.5/cgi-bin \
  --with-image-dir=/opt/local/htdig-3.1.5/graphics \
  --with-search-dir=/opt/local/htdig-3.1.5/htdocs/sample

A lot of that is site-specific, and the "-rpath directory" option is only
needed because the compression library is not in a standard place on the 
machine on which htdig is run.

The "-woff all" option suppresses most warning messages.  I will remove it,
recompile htdig and send the result directly to Gilles, it might contain a clue.

--
David Adams
[EMAIL PROTECTED]
Computing Services
Southampton University






[htdig] Htmerge: Deleted, invalid

2000-07-12 Thread David Adams

Why does htmerge 3.1.5 flag some pages, which look OK to me, as 
"Deleted, invalid" and not index them?

This is happening not just with .html pages but also .doc and .pdf files.

It happens with a simple merge following a run of htdig -i -a
and also when two htdig runs are merged using the htdig -m option.

David Adams
[EMAIL PROTECTED]
Computing Services
Southampton University






Re: [htdig] .pdf and .doc-files

2000-06-08 Thread David Adams


On Thu, 8 Jun 2000 09:12:32 -0500 (CDT) Gilles Detillieux 
[EMAIL PROTECTED] wrote:

 According to Andre Reuber:
  I am a beginner in operating htdig.  Is there any possibility
  to make an index of .doc, .pdf, .xls, ... files? Do I need any extra
  source? Where can I get this source?
 
 See http://www.htdig.org/FAQ.html#q4.8
 and http://www.htdig.org/FAQ.html#q4.9
 
 The .xls files may be a bit more of a challenge.  I'd recommend using
 doc2html for .doc & .pdf, and if you find and install the Excel to HTML
 converter, xlHtml, you could probably add it to doc2html as an extra
 converter fairly easily (if you have at least a minor understanding
 of Perl).
 

I don't think it is quite so simple: doc2html.pl (and 
parse_doc and conv_doc) only use the "magic number" of the 
file to decide which utility to use for conversion.

MS Word and Excel files can have the same magic number.

The easy solution is a separate conversion script for Excel 
files.  The sophisticated solution is a more advanced 
script which uses the information on MIME type passed to it.

I hadn't heard of xlHTML and would like to know more.  
As an alternative, there is a simple .xls to .csv 
conversion program available from the same site as catdoc.
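
To illustrate the point (a sketch only, not the actual script code; $file
is assumed to hold the path being tested): .doc and .xls files are both
OLE2 compound documents and begin with the same eight bytes, so the magic
number alone cannot choose between Word and Excel:

    # read the file's first eight bytes and compare with the
    # OLE2 compound-document signature D0 CF 11 E0 A1 B1 1A E1
    open(my $fh, '<', $file) or die "cannot open $file: $!";
    binmode($fh);
    read($fh, my $magic, 8);
    close($fh);
    if ($magic eq "\320\317\021\340\241\261\032\341") {
        # could be Word *or* Excel; only the MIME type passed by
        # htdig can decide which converter to run
    }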


 -- 
 Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
 Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
 Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
 Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930


------
David Adams
[EMAIL PROTECTED]







Re: [htdig] local_user_urls not working

2000-05-25 Thread David Adams


On Thu, 25 May 2000 19:04:06 +0800 [EMAIL PROTECTED] 
wrote:

 Dear Sir,
 
 I installed the htdig on a Red Hat Linux 6.1 system.  Basically the htdig is working 
fine with the Apache server, but the local_user_urls setting never works.  As the 
attrs.html suggests, I have the following directive specified in my 
 /etc/htdig/htdig.conf:
 local_user_urls:http://host.mydomain/=/home/,/public_html/
 I ran rundig several times but those files under /home/user/public_html never got 
indexed.  It seems that htdig just skipped that part, since I could not find anything 
related in the output of "rundig -vvv".  Any suggestions/comments?  Could you help? 
 
 Thanks in advance.
 
 Kind regards,  
 
 Brian Chiangmailto:[EMAIL PROTECTED]
 Philips Research East Asia - Taipei 
 24FA, 66, Sec. 1 Chung Hsiao W. RdTel: +886 2 2382 4593
 PO Box 22978, Taipei 100, TaiwanFax: +886 2 2382 4598

I suggest that you run htdig -i again with the 
local_user_urls:
statement commented out.  That should reveal whether the 
problem really is with local file access or is somewhere 
else. 

------
David Adams
[EMAIL PROTECTED]







[htdig] Ampersand in URL

2000-01-17 Thread David Adams

I have found two problems with using htdig where the URL contains an
'&'.  I am using htdig version 3.1.2, so perhaps these are fixed problems?

The first is when the URL contains a bare '&' and has to be passed to an
external parser.  For example, I get in the htdig log:

6:6:2:http://www.soton.ac.uk/~dja/time&line.ps: sh: line.ps:  not found
 size = 70146
 
The problem does not appear to be in parse_doc.pl.


The second is when a page's author has mistakenly marked up a bare '&' in
the URL as '&amp;'.   This is - of course - wrong, and htdig does not
find the page.  For example:

11:11:2:http://www.soton.ac.uk/~dja/test&amp;test2.html:  not found

However, Netscape Navigator 4.05 (and probably other browsers) fixes
this up and presents a link to http://www.soton.ac.uk/~dja/test&test2.html

I tried setting

translate_amp: true

in the configuration file in the vain hope that this would produce a
similar fix.

Is there an alternative to trying to persuade authors that their URLs
are wrong even though they work with the usual browsers? 

Thanks.

-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton





Re: [htdig] Custom factors?

1999-12-16 Thread David Adams

 
 On Thu, 16 Dec 1999, Simon Blake wrote:
 
   <select name=url> and </select>.  Is this a straightforward way to achieve
  this?  Looking at the factor system, it struck me that a neat way to do
 
 This is a reasonable way to do this. Try the noindex_start attribute. See
 http://www.htdig.org/attrs.html#noindex_start
 
  this would be with a custom factor - you define the start and end tags,
  maybe with a regexp, and everything in between gets the relevant weight. 
 
 Right now custom factors aren't supported, but we're looking at how to do
 this sort of thing in the 3.2 code.
 
 -Geoff Hutchison
 Williams Students Online
 http://wso.williams.edu/
 

Is it possible to have more than one pair of noindex_start and noindex_end?

If so, what is the syntax?

-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton





Re: [htdig] parse_doc.pl alterations

1999-11-26 Thread David Adams

 According to David Adams:
  I have downloaded the parse_doc.pl script, and the xpdf and catdoc
  utilities, and I am now using them to extend our search index to include
  Word and PDF files.  It all works well and with a bit of alteration to
  the Perl script does exactly what I want.  My thanks to the developers!
 
 I forgot to ask before, what were your alterations?  Something very
  specific to your needs, or something worth sharing with others?
 
 -- 
 Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
 Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
 Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
 Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930

Well, since you ask, I noticed two problems with PDF files on our site:

1.  the titles were often meaningless, having no connection with
the contents.

2.  pdftotext outputs some spurious non-ascii gibberish that is 
then indexed.

I modified the code which outputs the title to always include the
type, and to put any extracted title in double quotes or the filename
in square brackets:

# if no title use filename from URL
if (not length($title)) {
    $title = $ARGV[2];
    $title =~ s#^.*/##;
    $title = '[' . $title . ']';
} else {  
    $title = '"' . $title . '"';
}
print "t\t$title ($type Document)\n";


To throw away the spurious "words" I simplified the code to replace
all non-alphanumerics with spaces.  I appreciate that many people would
think that too drastic:


while (<CAT>) {
    while (/[A-Za-z\300-\377]-\s*$/ && $dehyphenate) {
        $_ .= <CAT> || break;
        s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/
    }
    $head .= " " . $_;
#   s/\s+[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+|[\(\)\[\]\\\/\^\;\:\"\'\`\.\$
#   s/[\255]/-/g;   # replace dashes with hyphens
    s/\W/ /g;       # replace non-alphanumeric characters with spaces
    s/\s+/ /g;      # replace multiple spaces, etc. with a single space
    @fields = split;        # split up line
    next if (@fields == 0); # skip if no fields
    for ($x=0; $x<@fields; $x++) {  # check each field
        if (length($fields[$x]) >= $minimum_word_length) {
            push @allwords, $fields[$x];    # add to list
        }
    }
}

The spurious output is no longer indexed, but it does remain in the head,
so there is further room for improvement.
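
One further tweak that would seem to follow (untested): append to $head
only after the clean-up substitutions, so that the head record gets the
same treatment as the word list, e.g.

    s/\W/ /g;           # replace non-alphanumeric characters with spaces
    s/\s+/ /g;          # replace multiple spaces, etc. with a single space
    $head .= " " . $_;  # append the cleaned-up line to the head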

-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton





Re: [htdig] Reducing the importance of pages.

1999-11-25 Thread David Adams

 
 Is it possible to reduce the importance of certain pages?  We have
 some pages on our site that are directories and contain thousands of
 entries.  As a result they always seem to come up as top results
 whenever we search for anything.  I don't really want to remove these
 pages from a search but I would like them to appear lower down the
 list.  Is this at all possible (perhaps by using negative weighting or
 similar?)?
 
 Thanks!
 
 -- 
 --
 Jason Carvalho
 Web Analyst
 Cranfield University
 [EMAIL PROTECTED]

You could increase the weighting of other pages by encouraging
the use of

<META NAME="keywords" CONTENT="...list of keywords...">

and

<META NAME="description" CONTENT="...relevant text...">

in their headers.  On our site we have increased the weighting
of keywords to 200.
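
The attribute concerned is keywords_factor, so that is a one-line change
in the configuration file:

    keywords_factor:    200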

You might consider not indexing the directory pages at all by placing

<META NAME="robots" CONTENT="noindex">

in their headers.  Links in them will still be followed, but htdig
will not index the words in them.

-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton





[htdig] META name=robots

1999-11-09 Thread David Adams


I am using ht://Dig version 3.1.2 and I have been trying to prevent
some sets of pages from being indexed by inserting in the head of
certain index pages:

<META NAME="robots" CONTENT="noindex, nofollow">

This only seemed to work in one case and not in others.  In the one case
where it _did_ work I found I had written:

<META NAME="robots" CONTENTS="noindex, nofollow">

Can anyone independently confirm or deny this?  I would be happy
to learn I am mistaken and that ht://Dig does conform to the HTML
standard for META.

-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton





Re: [htdig] Comparisons with SWISH++

1999-10-29 Thread David Adams


 
 
 Does anyone know of any benchmark tests comparing performance and
 functionality of SWISH++ vs ht://Dig?
 
 I only ask because a client has mentioned SWISH++ as a possible alternative.
 I guess that from the Web stats (I have heard very little mention of SWISH++
 in the outside world) there is the suggestion that it's a tool of choice. I
 had a look at it and it didn't seem to be terribly well documented (yeah,
 source-code documentation, but I want to USE the product, not mess around
 inside its code...). Just wondered really whether anyone had any OBJECTIVE
 comments they could add, or perhaps knew of somewhere I could maybe find out
 some of this?
 
 Regards,
 
 Phil Coates.

I've no experience of SWISH++ but we have been using SWISH, and more
recently SWISH-E, at this site for a few years. 

The original SWISH was only capable of indexing filestore, and given a
top directory would index all files, descending into all subdirectories. 
We use SWISH-E in this mode, which is complementary to ht://Dig and most
other search engines which follow links. 

SWISH-E has one facility which ht://Dig does not have and which is very
attractive for some specialist applications:  it allows a search which
will _only_
find pages which contain a given search word or words in a particular
META tag.  Thus you can have, for example, pages which contain METAs such as:

<META NAME="Author" CONTENT="Ransome, A.">
<META NAME="Title" CONTENT="Swallows and Amazons">

and then allow a search on Author or on Title, etc.
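
To give a flavour of it, the setup is along these lines (directive and
flag names quoted from memory, so treat them as illustrative only):

    # in the SWISH-E configuration file: declare the searchable META names
    MetaNames Author Title

after which a search can be confined to one field from the command line:

    swish-e -f books.index -w 'Title=(swallows amazons)'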

Librarians and others doing systematic cataloguing find this a welcome
feature.

-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton





Re: [htdig] Leading reasons for htdig not finding known matches?

1999-10-26 Thread David Adams


 
 Does anyone know what the leading reasons are for htdig not returning
 results for known matches?
 
 For Example:
 
 If I query the database and then get a result that has "champion" in the
 title and then try to search on "champion" it returns a "not found" result.
 This is a new database that just completed "rundig" so I wouldn't think
 there is a problem. Any ideas?
 
 Charlie
 

Four possibilities:

1)  "champion" is in the "bad word" list.

2)  The score for "Title" has been set to zero.

3)  The page has more than one <title>...</title>.

4)  You have hit a bug in htdig 3.1.2 which results in punctuation
in the page head not being stripped out.  If you have, for example:

<title>"Champion" says Ray!</title>

This may cause the "words":

"champion" 
says
ray!

to be indexed.


Can anyone positively confirm that that bug is fixed in 3.1.3?

-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton





[htdig] Bad Words in the Search String

1999-09-30 Thread David Adams


The University of Southampton's main web server is now using ht://Dig to
provide a search facility, http://www.search.soton.ac.uk/soton/, and we
are very pleased with it.  It is much better than the Harvest search
engine we had before. 

Looking through the search log I've noticed a small problem which I
would like advice on.  It appears that an "All" search on a search
string containing a word in the bad_word_list will fail.  For example, a
search on "The Staff Club" finds nothing, while a search on "Staff Club"
finds the Staff Club page.  More seriously for us, a search for 
"New College" fails, while a search on "College" finds the pages on
The University of Southampton New College.

Does this mean I have to prune the bad_word_list right down?  Is there
something else I could do?

It is interesting that a two-letter word in a search string does not
cause failure, even though the minimum word length is set at three.
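
For reference, the relevant lines in our configuration are simply (the
bad_word_list path is the usual default):

    bad_word_list:          ${common_dir}/bad_words
    minimum_word_length:    3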

-- 
 
David Adams
Computing Services
University of Southampton





[htdig] Logo design

1999-09-03 Thread David Adams


I gather that work is still going on to produce a new logo for ht://dig. 

Could I appeal to designers to include a very small image as well as a
main logo?

I have been asked to remove the current ht://dig logo from our new
search pages as it "dominates the page".  I think I could get away with
a thumbnail-sized image provided it was an official ht://dig graphic.

-- 
 
David Adams
Computing Services
University of Southampton





Re: [htdig] What is a word?

1999-09-03 Thread David Adams


 
 According to David Adams:
  I am using htdig 3.1.2, and my config file includes:
  
  extra_word_characters:  _
  valid_punctuation:  !@#$%^*()-+|~=`{}[]:";'?,./
  
  I find that the word database built by htdig includes many words that
  contain or end in a comma or other punctuation. For example:
  
  arts,   i:2514  l:1 w:49950
  assessed,   i:2523  l:1 w:49950
  atmospheric,i:2529  l:1 w:49950
  b.sc,   i:120   l:1 w:49950
  b.sc,   i:16406 l:1 w:49950
  b.sc,   i:16409 l:1 w:49950
  b.sc,   i:3039  l:1 w:49950
  b.sc,   i:3040  l:1 w:49950
  b.sc,   i:3041  l:1 w:49950
  ba,     i:17    l:1 w:49950
 
 I believe part of the problem may be the left quote (`) character
 in the list above, which is taken as the start of a file expansion
 (e.g. `filename`).  As there's no file called "{}[]:";'?,./", the left
 quote and everything after it is lost from the valid_punctuation list.
 You'd need to escape the left quote with a backslash (\).  The same
 thing goes for the dollar sign ($), only in this case it's just that
 one character that's lost.
 
 Still, that wouldn't explain why the comma and period get entered into
 the database.  This would suggest that those characters were in the
 extra_word_characters list, or were erroneously treated as alphanumeric
 by your locale's LC_CTYPE tables.
 
  Am I misunderstanding the documentation on "valid_punctuation"?
  
  I can't figure out how the configuration file attributes 
  
  extra_word_characters 
  and
  valid_punctuation 
  
  work together.  What happens when the same character is in both?
 
 The lists should not overlap, but if they do, I believe valid_punctuation
 overrides, so the overlapping characters do get stripped out of the word.
 
 Essentially, both lists indicate which punctuation marks or other
 characters can be used within a word, but the valid_punctuation characters
 get stripped out before the word is put in the database.  E.g. words like
 post-doctoral and nuts&bolts go into the database as postdoctoral and
 nutsbolts, unless you move the hyphen or ampersand from valid_punctuation
 to extra_word_characters, in which case the characters stay in the word.
 
 Additionally, with the compound word patch I posted last week, and which
 will be in future releases, the word will be split up at places that
 have a non-alphanumeric character that's in valid_punctuation, but not
 in extra_word_characters.  Thus, a word like post-doctoral will go into
 the database as postdoctoral, post and doctoral.
 
  Why doesn't the documented list of default characters for
  valid_punctuation include the question mark (?) and the double quote (")?
 
 This is because these characters aren't commonly used within words,
 unlike apostrophes, ampersands, hyphens and slashes.  Also, when you set
 allow_numbers to index numbers as words, these numbers may contain some of
 these characters:  .-/#$% , and that's why they're in the default list.
 I don't know why _!^ are in the default list, but I suspect they may be
 used for indexing source code.  If a given punctuation mark should ALWAYS
 separate words, it should not be added to this list.
 
  What separates words, is it whitespace only?
 
 White space or any punctuation character (actually, any non-alphanumeric
 character) not listed in extra_word_characters or valid_punctuation.
 
 -- 
 Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
 Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
 Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
 Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930
 

Thanks for your swift and full response, Gilles; I was certainly mistaken
as to the use of valid_punctuation.

I am left with four points:


1)  Documentation

The ht://dig documentation is excellent, but could I suggest the
following text to replace the "description" of valid_punctuation in the
online documentation:

Any punctuation character (that is, any non-alphanumeric character,
see allow_numbers) in neither extra_word_characters nor
valid_punctuation is treated the same as a space - it merely
acts as a word separator.

However, when a valid_punctuation character occurs within a word,
it is removed, leaving a single word.

For example, if the minus sign is in valid_punctuation, then the
word "post-war" will be indexed as "postwar", and a search for
either "post-war" or "postwar" will find it.  However, if the minus
sign is not in valid_punctuation then "post-war" will result in
"post" and "war" being indexed instead.


2)  Characters in valid_punctuation

Not only should I have had \` and \$ in valid_punctuation but I should
not have included the star (*) at all.  This is the default
prefix_match_character.

[htdig] What is a word?

1999-09-02 Thread David Adams


I am using htdig 3.1.2, and my config file includes:

extra_word_characters:  _
valid_punctuation:  !@#$%^*()-+|~=`{}[]:";'?,./

I find that the word database built by htdig includes many words that
contain or end in a comma or other punctuation. For example:

arts,   i:2514  l:1 w:49950
assessed,   i:2523  l:1 w:49950
atmospheric,i:2529  l:1 w:49950
b.sc,   i:120   l:1 w:49950
b.sc,   i:16406 l:1 w:49950
b.sc,   i:16409 l:1 w:49950
b.sc,   i:3039  l:1 w:49950
b.sc,   i:3040  l:1 w:49950
b.sc,   i:3041  l:1 w:49950
ba,     i:17    l:1 w:49950

Am I misunderstanding the documentation on "valid_punctuation"?

I can't figure out how the configuration file attributes 

extra_word_characters 
and
valid_punctuation 

work together.  What happens when the same character is in both?

Why doesn't the documented list of default characters for
valid_punctuation include the question mark (?) and the double quote (")?

What separates words?  Is it whitespace only?

Thanks

-- 
 
David Adams
Computing Services
University of Southampton





[htdig] htmerge -v output

1999-08-18 Thread David Adams


I have just started using ht://Dig and I have lots of questions,
but here is just one that is bothering me.

When I run htmerge with the -v option I get lots of lines beginning:

Deleted, no excerpt: 

What does this mean?  Is that page not indexed?  For what reason(s)
would there be no excerpt for a page, and should it bother me?

-- 
 
David Adams
Computing Services
University of Southampton

