Re: [htdig] Question about parsing word, pdf, ppt etc.

2000-12-19 Thread David Adams

Try executing the parsers at the command line to see what happens.

I don't know, but it seems quite possible that the current version of
ppt2html is not able to cope with the PowerPoint 2000 format.  If that is
the case, you could try contacting the author directly.  I have noticed that
ppt2html can require a lot of memory (several hundred megabytes) to convert
some .ppt files; could you have a problem with a shortage of memory?
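
For example, on a copy of one of the problem files (the file name below is
just a placeholder), you could run the converter by hand, something like:

    ppt2html /tmp/slides.ppt > /tmp/slides.html

while watching memory use with top.  If ppt2html hangs or its memory use
climbs into the hundreds of megabytes, that would explain why rundig
appears to stall on those documents.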

Are you using catdoc or wp2html to convert Word files?  Wp2html extracts the
'subject' from the document summary and puts it in the header, which might
be the problem.  Catdoc often includes gibberish in its output, and you
may find that removing the -b option from the call of catdoc is an improvement.
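
To see whether -b is the culprit, you could compare the two forms by hand
on one of the offending documents (the file name here is just an example):

    catdoc -b /tmp/problem.doc | head    # roughly as your setup may call it
    catdoc /tmp/problem.doc | head       # without -b, for comparison

If the second run loses the gibberish, edit the catdoc call in doc2html.pl
accordingly.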

Doc2html.pl uses pdfinfo to extract the title of the .PDF file, and I have
seen .PDF documents where the title is 'þÿ ' for some reason.  You might
need to modify doc2html.pl to suppress such titles.
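
A minimal sketch of such a change, assuming the title ends up in a Perl
variable called $title inside doc2html.pl (the variable name is a guess):

    # Drop a title that is just a UTF-16 byte-order mark (0xFE 0xFF,
    # which shows up as 'þÿ' when treated as Latin-1).
    $title = '' if defined($title) && $title =~ /^\xFE\xFF/;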

----- Original Message -----
From: "Aditya Shah" [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Tuesday, December 19, 2000 2:28 AM
Subject: [htdig] Question about parsing word, pdf, ppt etc.


 Hello,

 We are evaluating the use of ht://Dig for an intranet site. Our users
 publish a lot of Word, Excel, PowerPoint and PDF documents that we want
 to be able to search through.

 We have been able to get all the external parsers required. We have run
 into the following issues:

 1) Unable to parse PowerPoint documents. The documents are MS PowerPoint
 2000 documents. We got the ppt2html parser from www.xlHtml.org. The
 statements in htdig.conf are something like this:

 application/msexcel->text/html /app/doc2html/doc2html.pl \
   application/mspowerpoint->text/html /app/doc2html/doc2html.pl \
   application/vnd.ms-excel->text/html /app/doc2html/doc2html.pl \
   application/vnd.ms-powerpoint->text/html /app/doc2html/doc2html.pl

 Excel works great, but for PowerPoint, when I run the 'rundig' program, it
 just kind of hangs there.

 2) Getting gibberish in the headers for some Word and PDF documents. For
 example, for a Word document:

 In doc 2 html ; ; ; ; ; ; ; ;   Fax Fax Please Recycle Comments: `Þ"Û?
 gP?]...øu-OwPÄ?+`É?0|?(ÜÐ oè?UYÆìÌ{èO?ãôrsÊ-| ?]ç* ú! Ý^mÀB?t
 5?z+¿Hc-Ð#*ÄgÔ"C?ò,mÎ?Púss (_ûÛ~$Û+-V Sö?ýô?_+ywì?lt;?-? ?\...Y ...

 when the search results are returned. This does not happen for all Word
 documents, only for some of them.

 And for a PDF document, we always get the 'þÿ ' character before any file
 name in the search results section.

 Also, do you know if there is a parser for MS-Visio?

 Any help would be appreciated.

 Thanks.

 Aditya Shah


 









RE: [htdig] Hi, need help with searching database.

2000-12-19 Thread Akshay Guleria



-----Original Message-----
From: Gilles Detillieux [mailto:[EMAIL PROTECTED]]
Sent: Monday, December 18, 2000 10:10 PM
To: [EMAIL PROTECTED]
Cc: ht://Dig mailing list
Subject: Re: [htdig] Hi, need help with searching database.


According to Akshay Guleria:
 thanx Gilles,
 However my problem is now fixed. I am using the following.
 htdig-3.2.0-0.b2.i386.rpm

Ah, I wasn't aware that this version was out in RPM.  There are many
known bugs in 3.2.0b2, so don't be surprised if other problems occur.
The scoring bug in htsearch is very likely to turn up, unless this
RPM included a patch for this bug.

 There were 2 problems (in case you are interested):
 1. .htaccess files not allowing rundig to connect to the server.

|There's not much htdig can do about this.  If the .htaccess file sets
|up Basic authorization, then you can use the -u option to provide the
|user name and password to the server.  If the .htaccess file set up
|some other restrictions, you're out of luck, but then these pages would
|also be inaccessible from a standard web browser coming in from the same
|address.

In fact I added "Allow from localhost" lines in my .htaccess and it works.
Just a workaround.
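
For reference, one common pattern (assuming the directory is under Basic
authentication; directives per Apache 1.3, and the htpasswd path is
hypothetical) is along these lines:

    AuthType Basic
    AuthName "restricted"
    AuthUserFile /path/to/htpasswd
    Require valid-user
    Order deny,allow
    Deny from all
    Allow from localhost
    Satisfy any

With "Satisfy any", requests from localhost bypass the password while
outside users are still asked for one.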

 2. This file in /var/lib/htdig needed to be owned by the web-server user:
 db.words.db_weakcmpr

 As soon as I made "apache" its owner, it worked. I don't know, but I think
 the RPM packager should have taken care of this.
 Anyway, it works now.

 Thanx a lot for getting back.

|This point was covered in FAQ 5.17.  I didn't realize you were running
|a 3.2 beta before.  It's very important to mention which version you're
|running, because many, if not most, of the bugs and problems that come
|up are version-specific.  There's not much the RPM packager can do
|about this particular bug in htdig, because the db.words.db_weakcmpr
|file is not normally part of the RPM distribution - it's only created
|after installation, when you run the rundig script.

Oops! I had gone through the FAQ, and this particular point too, but gosh! I
had missed the second paragraph just because I was in a hurry to get
through the complete FAQ.

Thanx a lot for pointing that out.

 -----Original Message-----
 From: Gilles Detillieux [mailto:[EMAIL PROTECTED]]
 Sent: Wednesday, December 13, 2000 11:20 PM
 To: [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]
 Subject: Re: [htdig] Hi, need help with searching database.


 According to Akshay Guleria:
  I just installed Red Hat 7.0 on my machine, and then installed the htdig
  RPM. I can see the page http://myhost/htdig/, which is the search page.

 Which htdig rpm did you install?  For Red Hat 7.0, you should use the RPM
 for htdig-3.1.5-6 that comes with the 7.0 PowerTools.

  I make a search and for any search I make, it returns a page saying
  "No matches found for ... "

  Now, I ran rundig and it increased the file sizes in /var/lib/htdig. So, I
  presume the database was created. And then I ran htmerge. But I still get
  the "No matches found .." page.

 If you run rundig, you don't need to run htmerge separately.  The rundig
 script will run htdig followed by htmerge.  You should try running
 your /var/www/cgi-bin/htsearch program right from the command line
 first, to see if that works.  If it does, it may be an Apache server
 configuration problem, or a problem with your search form.  Did you
 make any changes to the /var/www/html/htdig/search.html search form?
 If so, see http://www.htdig.org/FAQ.html#q5.17
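
 For the first test, that can be as simple as (run as, or with the same
 permissions as, the web-server user):

     cd /var/www/cgi-bin
     ./htsearch

 htsearch should then prompt for the search words and print an HTML result
 page to the terminal; if that works, the database itself is fine and the
 problem is in the server or form setup.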

 --
 Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
 Spinal Cord Research Centre   WWW:    http://www.scrc.umanitoba.ca/~grdetil
 Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
 Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

 



--
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930









Re: [htdig] Going for the big dig

2000-12-19 Thread Gilles Detillieux

According to Terry Collins:
 Geoff Hutchison wrote:
  
  At 10:14 AM +1100 12/19/00, Terry Collins wrote:
  And make sure you don't ignore robots.txt
  
  Yes, though someone would need to alter the code to do this.
 
 If you are doing an external site, it shouldn't be too much effort to
 just read this and set the excludes.
 
 Courtesy thing.

I think you misunderstood.  htdig already does read the robots.txt file
and skips all disallowed documents.  You don't need to do this manually.
Geoff was saying you'd need to alter the code in order to ignore robots.txt,
which definitely would be a bad thing if you then use the hacked htdig to
index sites that are not your own.

Actually, on my site I don't bother with exclude_urls at all, and use the
robots.txt file instead.  This way, anything that I don't want indexed by
htdig won't be indexed by any other search engine either.
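
For example, a robots.txt at the document root along these lines (the
paths are placeholders) keeps htdig and any other well-behaved crawler
away from those paths alike:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /private/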

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930






Re: [htdig] Going for the big dig

2000-12-19 Thread Geoff Hutchison

On Tue, 19 Dec 2000, Gilles Detillieux wrote:

 Geoff was saying you'd need to alter the code in order to ignore robots.txt,
 which definitely would be a bad thing if you then use the hacked htdig to
 index sites that are not your own.

Yes, and while it may or may not be easy to do, it will never be an option
to ignore the robots.txt file. (And it will never be an option to ignore
robots META tags either.)

 Actually, on my site I don't bother with exclude_urls at all, and use the
 robots.txt file instead.  This way, anything that I don't want indexed by
 htdig won't be indexed by any other search engine either.

True, though other search engines usually also ignore certain patterns
(e.g. cgi-bin). I also make heavy use of the robots META tag, though it is
not as old a standard and is sometimes still ignored.
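
For reference, the tag goes in a page's <head> and looks like:

    <META NAME="robots" CONTENT="noindex,nofollow">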

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/







Re: [htdig] Going for the big dig

2000-12-19 Thread Geoff Hutchison

On Tue, 19 Dec 2000, David Gewirtz wrote:

 on something. I attempted to index a remote site, in this case Lotus.com. 
 Now, I have no idea how many pages that is. But I let the index process run 

If you have no idea how many pages will be on a server, I'd start with a
set max_hop_count or server_max_docs limit and go from there. These
attributes are meant to keep the dig from spiralling out of control (or in
this case, out of the limits of your server).

http://www.htdig.org/attrs.html#max_hop_count
http://www.htdig.org/attrs.html#server_max_docs
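
For instance, a first pass could be capped with something like this in the
config file (the numbers are only starting points to tune):

    max_hop_count: 4
    server_max_docs: 5000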

 handle it. Right now, I'm thinking the process is too big. Can htdig and/or 
 htmerge running on a 258MB or 384MB machine handle indexing/merging sites 

This question is a bit hard to answer. From what you said, the answer is
"no," but I can't give a better answer unless there's at least an estimate
of the number of URLs, as I mentioned earlier.

There are also simple "link checker" scripts which can give you a count of
the number of URLs on a site.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/







Re: [htdig] Going for the big dig

2000-12-19 Thread Terry Collins

Gilles Detillieux wrote:

...snip...

 I think you misunderstood.  htdig already does read the robots.txt file
 and skips all disallowed documents.

Whoops, my apologies for that gaffe; my brain has started the holiday
season without me {:-).
Actually, I've given up remembering how you do/I did anything under Linux -
with new versions every three months, it is all different every time I look
at something.

You are correct about that. I now remember having to look at this in
detail, as my robots.txt excludes all the lists I archive on site from
indexing bots, and htdig very obediently acted on this. I wanted htdig to
actually index the contents of these lists, but exclude everything else,
which it now does quite nicely.


 Actually, on my site I don't bother with exclude_urls at all, and use the
 robots.txt file instead.  This way, anything that I don't want indexed by
 htdig won't be indexed by any other search engine either.

I wish all search engines did obey robots.txt.

Thanks for the development effort with htdig. Very useful app.

--
   Terry Collins {:-)}}} Ph(02) 4627 2186 Fax(02) 4628 7861  
   email: [EMAIL PROTECTED]  www: http://www.woa.com.au  
   WOA Computer Services lan/wan, linux/unix, novell

 "People without trees are like fish without clean water"






[htdig] Data mining on Htdig DB

2000-12-19 Thread Laurent


Hi everyone.

I'm looking for a way to do some data analysis on the htdig DB.
Unfortunately I have no good idea of how to do that. My aim is to track
changes over time in the language used on a series of specialized web sites.

At first I thought that using htdig with MySQL instead of the built-in
Berkeley DB would be the solution, but I saw that there is no support for
that patch anymore. Then I figured that a text dump would do the trick, but
I'm still missing some information like the URL, document title, etc.

Is there a better way to do it? Has someone already done it?

Thank you,
Laurent







Re: [htdig] Data mining on Htdig DB

2000-12-19 Thread Geoff Hutchison

On Tue, 19 Dec 2000, Laurent wrote:

 I'm looking for a way to do some data analysis on the htdig DB.
 Unfortunately I have no good idea of how to do that. My aim is to track
 changes over time in the language used on a series of specialized web sites.
[snip]
 some information like the URL, document title, etc.

In the 3.1.x code, the db.wordlist file is already text. The -t flag to
htdig will dump an ASCII version of the document DB in a specified format:

http://www.htdig.org/htdig.html

In the 3.2 code, you can use the htdump program to generate these files.
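
For example (the config path here is just illustrative):

    htdig -t -c /etc/htdig/htdig.conf    # 3.1.x: ASCII dump of the doc DB
    htdump -c /etc/htdig/htdig.conf      # 3.2: separate dump program

If I remember right, the 3.1.x dump lands in the file named by the
doc_list attribute; either way it is line-oriented text, so URL, title,
and so on can be pulled out with standard text tools.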

Is this sufficient for your needs?

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/


