Re: [htdig] questions about the search engine

2001-01-04 Thread Geoff Hutchison

On Thu, 4 Jan 2001, Edward Lu wrote:

> We need to know more information about the search engine. 
> 1. Can it crawl JSP pages?

Sure.

> 2. Is it running by its own web server?

No.

> 3. Any customization document available?

You can customize to your heart's content. See the documentation, esp. the
FAQ.


> 4. Is it free?

In all meanings of the word. The code is covered under the GNU General
Public License (GPL).

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/



To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  
FAQ:




[htdig] questions about the search engine

2001-01-04 Thread Edward Lu



Hi htdig,

We need to know more information about the search engine.
1. Can it crawl JSP pages?
2. Is it running by its own web server?
3. Any customization document available?
4. Is it free?

Looking forward to your early reply.

Thanks!

-Ed

Edward Lu
Consultant
Fort Point Partners Inc.
Builders of Internet Solutions that Sell Harder
111 Sutter St, 22nd Floor, San Francisco, CA 94104
tel (415) 762-3751
fax (415) 395-4783
[EMAIL PROTECTED]
http://www.fortpoint.com
 


Re: [htdig] questions

2001-01-04 Thread Geoff Hutchison

On Thu, 4 Jan 2001, Gilles Detillieux wrote:

> some people they install quite smoothly.  There may be database format
> changes coming down the road, so upgrading htdig may mean having to
> reindex from scratch.

I can't think of many changes that would truly require reindexing from
scratch. AFAICT, the worst that would be needed is to use htdump, upgrade
to the new version, run htload, and you're off and running. A bit more
convenient. ;-)

> htdocs is commonly used as the name for the DocumentRoot, but I don't
> think there's any standard involved here.

It's the precedent from NCSA httpd. That's what they called it and Apache
started as a patched httpd, so it's stuck around.

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/







Re: [htdig] questions

2001-01-04 Thread Gilles Detillieux

According to John Lunstroth:
> 
> Hi - I am a beginner here and have some more questions. Apologies for
> taking up time on some of this.
> 
> 1. Ultimately I am interested in phrase and proximity search
> capabilities. I have been working at installing 3.1.5. I have read
> the release notes and see that the beta of 3.2 should be installed
> separately etc., since it uses different protocols, etc. I am wondering
> if I should just go ahead and start working with 3.2 - will it be
> difficult to upgrade between fixes?

That depends on how important phrase searching is to you.  There are
still a lot of bugs in the query parser in 3.2, and a rewrite of it
is slated for 3.2.0b4, so I don't know if the current phrase searching
will be adequate.  As far as upgrading between fixes, there's commonly
a bit more effort involved in working with beta releases, although for
some people they install quite smoothly.  There may be database format
changes coming down the road, so upgrading htdig may mean having to
reindex from scratch.  Switching between 3.1.5 and 3.2.0bX will certainly
require reindexing.

> 2. Being new to the Linux environment, I am just getting acquainted with
> the File Hierarchy Standards ideas, and am uncertain how important it is
> to follow based on the following. I have noticed that the FHS applies
> to administrators setting up systems, but I also notice that my web
> host has other protocol in place for the section of the server I have
> access to. For example, the root of my server (I only have telnet/ftp
> access), has a /www directory that contains all of the websites, and a
> /home directory that is my home directory and contains all other home
> directories. I have real space in each subdirectory under my domain
> name - /www/myname/ and /home/myname. There is a link from /home to
> /www. This at first caused me some difficulty, but I got it figured out.
> 
> The htdig configuration program/file assume that the website will be
> located under the server's /opt subdirectory - so configure by default
> produces files with this location: /opt/www. "opt" is the name of
> subdirectories in which the user should put non-system programs -
> their applications, if I understand correctly. There are also the
> "var" and "bin" subdirectories.
> 
> Is there a recommended file hierarchy I should use in the
> directory I have available? I am building my site in the
> /www/myname/ subdirectory. That is where cgi-bin is located
> (/www/myname/cgi-bin). Will it be easier, in the long run, to use a
> certain file system - I assume it will be, since htdig, and probably
> other apps, use a common base of standards I am unfamiliar with.

Don't worry too much about the FHS.  It's meant for people putting
together packages for distribution.  End users are not bound by it,
and individual system installations may go with something very different
in many cases.  Go with what works.  If your web hosting company imposes
a different hierarchy, the easiest thing in the long run is to go with
their setup as much as possible.  It's easy enough to configure htdig
to use any set of directories you want.

I think the whole /opt thing is a Sun-ism that may have been adopted
by some (but certainly not all) Linux distributions.  On Red Hat systems,
I go with something more FHS-like.

> I am asking a narrow question - what subdirectories would it be best
> to use in setting up htdig:
> 
> /www/myname/opt (put htdig here as separate subdirectory)
>  /cgi-bin (htdig automatically puts stuff here)
>  /var (? not sure what to put here)
>  /bin (? not sure how to use this one - or
>  even where it should be vis-a-vis htdig)
>  /htdocs (? is the name important or standard?)
> 
> the subdirectory "htdocs" - I assume this means hypertext docs -
> and should be where the content is - is "htdocs" a standard name,
> or an abbreviation used by the htdocs people?

htdocs is commonly used as the name for the DocumentRoot, but I don't
think there's any standard involved here.  Go with whatever your web host
uses as its document root directory, and put the "htdig" subdirectory that
contains the image files right in that directory.  Put htsearch in your
web host's cgi-bin directory if at all possible (and if one is provided),
to avoid having to specify a new ScriptAlias directory for CGI programs.
The rest of the files (executables, common/* files, database directory)
can go wherever you see fit, but make sure the common and database
directories are accessible by the web server's user ID.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



[htdig] questions

2000-12-18 Thread John Lunstroth

Hi - I am a beginner here and have some more questions. Apologies for taking up time 
on some of this.

1. Ultimately I am interested in phrase and proximity search capabilities. I have been 
working at installing 3.1.5. I have read the release notes and see that the beta of 
3.2 should be installed separately etc., since it uses different protocols, etc. I am 
wondering if I should just go ahead and start working with 3.2 - will it be difficult 
to upgrade between fixes?

2. Being new to the Linux environment, I am just getting acquainted with the File
Hierarchy Standards ideas, and am uncertain how important it is to follow based on the 
following. I have noticed that the FHS applies to administrators setting up systems, 
but I also notice that my web host has other protocol in place for the section of the
server I have access to. For example, the root of my server (I only have telnet/ftp 
access), has a /www directory that contains all of the websites, and a /home directory 
that is my home directory and contains all other home directories. I have real space 
in each subdirectory under my domain name - /www/myname/ and /home/myname. There is a 
link from /home to /www. This at first caused me some difficulty, but I got it figured 
out. 

The htdig configuration program/file assume that the website will be located under the 
server's /opt subdirectory - so configure by default produces files with this 
location: /opt/www. "opt" is the name of subdirectories in which the user should put 
non-system programs - their applications, if I understand correctly. There are also
the "var" and "bin" subdirectories. 

Is there a recommended file hierarchy I should use in the directory I have available? 
I am building my site in the /www/myname/ subdirectory. That is where cgi-bin is 
located (/www/myname/cgi-bin). Will it be easier, in the long run, to use a certain 
file system - I assume it will be, since htdig, and probably other apps, use a common 
base of standards I am unfamiliar with. 

I am asking a narrow question - what subdirectories would it be best to use in setting 
up htdig:

/www/myname/opt (put htdig here as separate subdirectory)
 /cgi-bin (htdig automatically puts stuff here)
 /var (? not sure what to put here)
 /bin (? not sure how to use this one - or even where it should be vis-a-vis htdig)
 /htdocs (? is the name important or standard?)

the subdirectory "htdocs" - I assume this means hypertext docs - and should be where 
the content is - is "htdocs" a standard name, or an abbreviation used by the htdocs 
people? 

Thanks

John









Re: [htdig] Questions....

2000-12-16 Thread Geoff Hutchison

At 6:01 PM +0100 12/15/00, [EMAIL PROTECTED] wrote:
>I wonder how the search engine works. Is it possible to truncate words
>(e.g.techn*), and how then?

Yes, using the prefix or substring fuzzy algorithms.


>Can you search only for endings of words, like (e.g. *echnical)?

Yes, with the substring fuzzy algorithm.
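
As a rough illustration of what prefix and substring matching mean, here is a small Python sketch. It is not ht://Dig's own code; the word list and the function names are made up for the example, and the "*" handling is only an assumption about how a user might type such a query.

```python
def prefix_matches(pattern, words):
    """Return the words that start with the text before a trailing '*'."""
    stem = pattern.rstrip("*")
    return [w for w in words if w.startswith(stem)]


def substring_matches(pattern, words):
    """Return the words that contain the pattern once any '*' is removed."""
    core = pattern.strip("*")
    return [w for w in words if core in w]


# Hypothetical index vocabulary, for illustration only.
WORDS = ["technical", "technology", "mechanical", "polytechnical", "road"]

print(prefix_matches("techn*", WORDS))        # words beginning with "techn"
print(substring_matches("*echnical", WORDS))  # words containing "echnical"
```

A substring match is the broader of the two, which is why it can also answer the "endings of words" case.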

>And how do you search with boolean expressions?

Set method to "boolean" in either the form or the config file, then 
you can do whatever you want, e.g.:

(dog and cat) not (snake or turtle)
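
For a manual, it may help to think of such an expression as set arithmetic over the documents that contain each word. The Python sketch below only illustrates that idea; the tiny index and the document numbers are invented, and it does not show how htsearch evaluates queries internally.

```python
# Hypothetical inverted index: word -> set of document numbers containing it.
INDEX = {
    "dog": {1, 2, 3},
    "cat": {2, 3, 4},
    "snake": {3},
    "turtle": {5},
}


def docs(word):
    """Documents containing the word (empty set if the word is unknown)."""
    return INDEX.get(word, set())


def evaluate():
    """Evaluate (dog and cat) not (snake or turtle) with set operations."""
    wanted = docs("dog") & docs("cat")         # "and" -> intersection
    unwanted = docs("snake") | docs("turtle")  # "or"  -> union
    return wanted - unwanted                   # "not" -> difference


print(evaluate())
```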

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/






[htdig] Questions....

2000-12-15 Thread anders . stroemberg

Hi,
I wonder how the search engine works. Is it possible to truncate words
(e.g. techn*), and if so, how?
Can you search only for endings of words (e.g. *echnical)? And how do
you search with boolean expressions? How do you combine boolean expressions?
We are using htdig at the Swedish National Road Administration (SNRA, in
Swedish called VÄGVERKET), and I'm about to write a manual for the search
engine so the visitors to our website can use it.
http://www.vv.se
Thank you very much, and I am looking forward to some answers now.
Good bye from:

Anders Strömberg
Vägverket HK
Avdelning Vägutformning och trafik
Kontoret för vägutformning
781 87  BORLÄNGE
Tel. nr. 0243-753 01, Fax nr. 0243-758 34
e-post: [EMAIL PROTECTED]







Re: [htdig] questions about htdig

2000-07-27 Thread Gilles Detillieux

According to ti980247:
> Hi. I'm a newcomer to this searching stuff.  I have installed htdig on my
> Mandrake 7.0 system (PHP 4.0, Apache 1.3.12), and everything went fine until
> I tried to index my server.
> 
> I changed htdig.conf and set the URL to my website.
> I ran ./htdig -h 5 -s but it returns
> htdig:  my.web.server:80 1 document
> 
> Then I checked the wordlist file. It's very short, so I think something went
> wrong when it indexed my site, because my site contains 3975 HTML files.

Try adding "-i -vvv" to the above htdig command, and look for clues in the
verbose output.  For some reason, it's not going beyond the start_url.
My guess is that your limit_urls_to is too restrictive.  It defaults
to the same value as start_url, so if you set the latter to the URL
of a single page, rather than the main URL for a site or subdirectory,
that's all you get unless you set limit_urls_to more liberally.
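
To see why a start_url that names a single page can stop the dig after one document, here is a minimal Python sketch. It assumes that limit_urls_to acts as a list of substrings, at least one of which a URL must contain; the function name and the URLs are hypothetical, and this is not ht://Dig's actual code.

```python
def within_limits(url, limit_patterns):
    """Accept a URL only if it contains at least one of the limit patterns."""
    return any(pattern in url for pattern in limit_patterns)


# A link found on the start page (hypothetical).
found_link = "http://my.web.server/docs/h.html"

# Limit left equal to a single-page start_url: every other page is rejected.
too_strict = ["http://my.web.server/index.html"]
# Limit set to the site's main URL: pages below it are accepted.
liberal = ["http://my.web.server/"]

print(within_limits(found_link, too_strict))
print(within_limits(found_link, liberal))
```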

> My HTML files always use dynamic rather than static links (e.g. a href="../h.html"
> instead of a href="http://my.web.com/h.html")

These are both static links.  The first is relative, the second is
absolute.  htdig can properly handle both.  Dynamic links are those
constructed by the browser software, e.g. from JavaScript code, which
htdig will not handle.
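
The difference between the two forms can be shown with Python's standard urllib.parse.urljoin, which resolves a link against the page it appears on. The page URL below is an assumption made up for the example (a page one directory below the site root); this only illustrates URL resolution in general, not htdig's own parser.

```python
from urllib.parse import urljoin

# Hypothetical page on which both links appear.
page = "http://my.web.com/docs/index.html"

# A relative link is resolved against the URL of the page containing it.
relative = urljoin(page, "../h.html")
# An absolute link already names its target and is returned unchanged.
absolute = urljoin(page, "http://my.web.com/h.html")

print(relative)
print(absolute)
```

Under that assumption both links end up pointing at the same document, so neither form hides pages from a crawler.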

> Any ideas? Does the htdig index depend on links in files, or on the files in my
> web directory?

Links in files, exclusively.  htdig will NOT look at directories, unless
the web server feeds them to htdig as HTML documents containing links
to files (which web servers like Apache commonly do when there's no
index.html).
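
A link-following crawl can be sketched in a few lines of Python. This is a generic breadth-first traversal with a hop limit, not htdig's implementation; the site map, the page names and the function name are invented for the example.

```python
from collections import deque


def crawl(start, links, max_hops):
    """Return every page reachable from start within max_hops link hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, hops = queue.popleft()
        if hops == max_hops:
            continue  # hop limit reached: do not follow this page's links
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, hops + 1))
    return seen


# Hypothetical site: page -> pages it links to.
SITE = {
    "/index.html": ["/a.html", "/b.html"],
    "/a.html": ["/c.html"],
    "/orphan.html": ["/index.html"],  # exists on disk, but nothing links to it
}

print(sorted(crawl("/index.html", SITE, 5)))
```

The orphan page never shows up in the result, because no page reachable from the start links to it.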

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930






[htdig] questions about htdig

2000-07-27 Thread ti980247

Hi. I'm a newcomer to this searching stuff.  I have installed htdig on my
Mandrake 7.0 system (PHP 4.0, Apache 1.3.12), and everything went fine until
I tried to index my server.

I changed htdig.conf and set the URL to my website.
I ran ./htdig -h 5 -s but it returns
htdig:  my.web.server:80 1 document

Then I checked the wordlist file. It's very short, so I think something went
wrong when it indexed my site, because my site contains 3975 HTML files.
My HTML files always use dynamic rather than static links (e.g. a href="../h.html"
instead of a href="http://my.web.com/h.html")

Any ideas? Does the htdig index depend on links in files, or on the files in my
web directory?

Gee... I'm very confused about this...

solutions and ideas would be appreciated...

thanks,

Samuel







Re: [htdig] Questions please

2000-03-14 Thread Geoff Hutchison

On Tue, 14 Mar 2000, SiberSpace International Marketing wrote:

> We need a search engine that we will use on our site for about 30,000
> web sites relating to a certain field.

Let me get this straight. You want to index all the pages of these
30,000 or so websites. People will come to your page and search for them
to get individual pages that are a part of these sites. Do I have that
right, or do you want a catalogue of these 30,000 sites along the lines of
Yahoo? In other words, do you want full-text searching (which is what
ht://Dig provides)?

> Can the software also work with Hebrew and Hebrew web sites?

I haven't tried it myself, but if there's an 8-bit encoding for Hebrew and
a valid locale, it might work. Others may know from experience.

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/






[htdig] Questions please

2000-03-14 Thread SiberSpace International Marketing

Hello

We need a search engine that we will use on our site for about 30,000
web sites relating to a certain field.

We will run an ISP; people will connect through us and use our search
engine over a list of web sites whose URLs we will provide. Others
can also reach the web site from outside our ISP by going directly to
our web site.

Can this software help us?

Can the software also work with Hebrew and Hebrew web sites?

Thank you in advance!

Brad Fogel









[htdig] questions about indexing

2000-02-22 Thread Walter Addison March

Hi there...

We are running htdig 3.1.3 and we wanted a little clarification about when
new items are added to the databases and old stuff is removed.

It is my understanding that new stuff (as long as it is linked somewhere!)
is added during any update but that stuff that has been deleted is only
removed during an initialization of the databases.  Is this correct?

Thanks!
--
Walter Addison March
Web Administrator/Programmer
Academic Computing, Haverford College






Re: [htdig] questions about htdig

1999-09-28 Thread David Robley


On 28 Sep, Andy Malato wrote:
> 
> Hello,
> 
> I've installed htdig 3.1.3 on my BSDI 3.1 system.  I ran htdig and it
> doesn't seem to index my entire site.  I am unsure of why this is.  I've played
> around with all settings in the config file, especially max_head_length and
> max_head_size, i have these values set to 15 and 30 respectively.  I
> however cannot get it to index more than three documents of my site.
> 
> Does anyone have any ideas?
> 
> ---Andy

Check your start_url and limit_urls_to entries to ensure there is no
conflict. Try running htdig with some degree of verbosity (-vvv) and see
what the output tells you.

Cheers
-- 
David Robley

WEBMASTER                           | Phone +61 8 8374 0970
RESEARCH CENTRE FOR INJURY STUDIES  | http://www.nisu.flinders.edu.au/
AusEinet                            | http://auseinet.flinders.edu.au/
Flinders University, ADELAIDE, SOUTH AUSTRALIA
Visit the PHP mirror at http://au.php.net:81/

< WARNING * END OF TEXT * STOP READING HERE >>



To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED] containing the single word unsubscribe in
the SUBJECT of the message.



[htdig] questions about htdig

1999-09-28 Thread Andy Malato


Hello,

I've installed htdig 3.1.3 on my BSDI 3.1 system.  I ran htdig and it
doesn't seem to index my entire site.  I am unsure of why this is.  I've played
around with all settings in the config file, especially max_head_length and
max_head_size, i have these values set to 15 and 30 respectively.  I
however cannot get it to index more than three documents of my site.

Does anyone have any ideas?

---Andy





Re: [htdig] questions re: restrict

1999-09-02 Thread Torsten Neuer


According to Ronaye:
>Hi there,
>
>I hope I am posting to the correct listing. I am new to HtDig, let alone
>to the HtDig mail listing. I have a question though...
>
>I would like to have the "restrict" expand to an HTML menu of all the
>available URLs that I identified in my original search page and have it
>default to the last one selected. I don't want to use the "hidden" tag.
>
>For example, the initial search page may provide a choice like:
>
><select name="restrict">
><option value="">--- All of BCIT ---
><option value="http://www.bcit.bc.ca/Programs/PT_programs/">Part-time studies
><option value="http://www.bcit.bc.ca/Programs/FT_programs/">Full-time studies
><option value="http://www.bcit.bc.ca/~bookstor/">Bookstore
><option value="http://www.bcit.bc.ca/~housing/">Housing
><option value="http://www.lib.bcit.bc.ca/">Library
></select>
>
>On the search result page I would like to show this same drop down menu,
>but displaying the user's last selected option. Is this possible and if
>so how?
>
>Can anyone steer me in the right direction?
>
>Many thanks
>Ronaye Ireland

Displaying the user's last selected option on the result page requires
modifying the position of the "selected" attribute.  This can only be
achieved through wrapping the htsearch program with SSI, some Perl CGI
or a server-parsed result page (e.g. through PHP).
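
As a sketch of what such a wrapper has to do, the following Python function rebuilds the menu and puts "selected" on whichever option matches the restrict value from the previous search. Python is used here only for illustration in place of the SSI, Perl or PHP options named above; the function name and the shortened choice list are hypothetical, and calling htsearch itself is left out.

```python
from html import escape


def render_restrict_menu(choices, selected_value):
    """Build a restrict menu with 'selected' on the user's last choice."""
    lines = ['<select name="restrict">']
    for value, label in choices:
        marker = " selected" if value == selected_value else ""
        lines.append(
            f'<option value="{escape(value, quote=True)}"{marker}>'
            f"{escape(label)}</option>"
        )
    lines.append("</select>")
    return "\n".join(lines)


# Shortened, illustrative list of (restrict value, label) pairs.
CHOICES = [
    ("", "--- All of BCIT ---"),
    ("http://www.bcit.bc.ca/~bookstor/", "Bookstore"),
    ("http://www.bcit.bc.ca/~housing/", "Housing"),
]

print(render_restrict_menu(CHOICES, "http://www.bcit.bc.ca/~housing/"))
```

The wrapper would read the submitted restrict value from the request and pass it in as selected_value when it prints the result page.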


hth,
  Torsten

--
InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14           Tel: +49-4101-403605
D-25474 Ellerbek           Fax: +49-4101-403606
E-Mail: [EMAIL PROTECTED]  Internet: http://www.inwise.de





[htdig] questions re: restrict

1999-09-02 Thread Ronaye


Hi there,

I hope I am posting to the correct listing. I am new to HtDig, let alone
to the HtDig mail listing. I have a question though...

I would like to have the "restrict" expand to an HTML menu of all the
available URLs that I identified in my original search page and have it
default to the last one selected. I don't want to use the "hidden" tag.

For example, the initial search page may provide a choice like:


<select name="restrict">
<option value="">--- All of BCIT ---
<option value="http://www.bcit.bc.ca/Programs/PT_programs/">Part-time studies
<option value="http://www.bcit.bc.ca/Programs/FT_programs/">Full-time studies
<option value="http://www.bcit.bc.ca/~bookstor/">Bookstore
<option value="http://www.bcit.bc.ca/~housing/">Housing
<option value="http://www.lib.bcit.bc.ca/">Library
</select>


On the search result page I would like to show this same drop down menu,
but displaying the user's last selected option. Is this possible and if
so how?

Can anyone steer me in the right direction?

Many thanks
Ronaye Ireland






Re: [htdig] Questions about what's possible with ht://Dig...

1999-07-06 Thread Geoff Hutchison


Albert Lunde wrote:
> From what I've read so far ht://Dig seems like a pretty flexible spider;
> which could be configured to spider remote systems, or to access the local
> server directly.

Yes on both counts.

> It sounds like http://www.htdig.org/files/contrib/scripts/multidig.tar.gz
> might be useful for running a series of indexes on various servers.

That is its intent. It also makes merging indexes fairly easy.

> (1) Is the only way to deal with queries across multiple indexes to combine
> the indexes with htmerge, or is there a way to query more than one index
> and aggregate the results?

There is currently no way to aggregate indices. You must merge
them.

> (2) Can your data files be copied between systems (e.g. doing local
> indexing on one server, then copying with ftp or scp to another server for
> merging or searching)? I can think of several sorts of issues:
>   - absolute path names
>   - byte order or floating point across architectures

Pathnames are never used in the databases. Byte order is a snag,
however, but you can use the standard Berkeley DB tools to dump the
database to a text file, then reload it on another machine. Not elegant,
but it works.
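
The byte-order snag itself is easy to demonstrate with Python's standard struct module. This sketch says nothing about the actual Berkeley DB file layout; it only shows why a binary integer written on one architecture can be misread on another, and why a text dump avoids the problem. The value used is arbitrary.

```python
import struct

value = 1999  # arbitrary example integer

# The same 32-bit integer as stored by big-endian and little-endian machines.
big = struct.pack(">I", value)
little = struct.pack("<I", value)
print(big.hex(), little.hex())

# Reading big-endian bytes as if they were little-endian gives a wrong number.
misread = struct.unpack("<I", big)[0]
print(misread == value)

# A text dump carries no byte order, so it reloads correctly anywhere.
dumped = str(value)
print(int(dumped) == value)
```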

> (2) Is there a way to index all the HTML files in a directory tree,
> regardless of how they are linked, (or some other arbitrary list of files
> on the local system)?

Nope. It follows links, so if there isn't a link to it, it won't find
it. This way you can actually have "private" directories that aren't
indexed...

> (3) Is it feasible to use the ht://Dig spider with some different search
> and index software?

I don't see how you'd use it with a different indexer. The spider is the
indexer. Searching might be possible, but you're probably much better
off writing a wrapper in Perl, PHP or something else--I don't know of
anyone else who reads the ht://Dig database format.

> I guess the last two questions depend on what the interface is between the
> spider and indexing software: to what extent it is exported in a form that
> external software could be added or to what extent the whole package is too
> interconnected to pick apart.

You can add external parsers for different file formats. In 3.2, you'll
be able to add external transport protocol helpers, and hopefully
external "decoders" for decompression, decryption, etc. There are also
any number of wrappers in Perl and PHP, several of which may be found on
the contrib/ section of the website.

> If you'd care to comment on the pros or cons of any of this, I'd be interested.

There are obviously a number of search tools sites comparing these
products, including http://www.searchtools.com/ 

-- 
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/




[htdig] Questions about what's possible with ht://Dig...

1999-07-06 Thread Albert Lunde


I'm looking at several freeware software packages to see what would be more
useful to do campus-wide indexing at our University.

(We'd originally planned to use the commercial OpenText software, but
current versions of that software seem to be too tightly integrated with
their document management system.)

What we'd like to do is remotely spider a number of small servers in
rotation, (say once every week or two) while indexing some larger servers
thru the file system (say nightly), then some how do a single query to
search all those indexes.

We _don't_ want to mirror all the HTML for all the servers, all the time,
just store the indexes. (Total disk space is a limiting resource.)

From what I've read so far ht://Dig seems like a pretty flexible spider;
which could be configured to spider remote systems, or to access the local
server directly.

It sounds like http://www.htdig.org/files/contrib/scripts/multidig.tar.gz
might be useful for running a series of indexes on various servers.

I have a few questions about what is possible:

(1) Is the only way to deal with queries across multiple indexes to combine
the indexes with htmerge, or is there a way to query more than one index
and aggregate the results?

(2) Can your data files be copied between systems (e.g. doing local
indexing on one server, then copying with ftp or scp to another server for
merging or searching)? I can think of several sorts of issues:
  - absolute path names
  - byte order or floating point across architectures

(In our environment, most of our Unix web servers are running HP-UX, so
processor architecture isn't a big problem, but I'd like to know if it's an
issue, for future reference, anyway.)

(2) Is there a way to index all the HTML files in a directory tree,
regardless of how they are linked, (or some other arbitrary list of files
on the local system)?

(3) Is it feasible to use the ht://Dig spider with some different search
and index software?

I guess the last two questions depend on what the interface is between the
spider and indexing software: to what extent it is exported in a form that
external software could be added or to what extent the whole package is too
interconnected to pick apart.

Other software I'm looking at with the same concerns in mind are:

SWISH-E:
http://sunsite.berkeley.edu/SWISH-E/

The revived "Harvest" software from:
http://www.tardis.ed.ac.uk/harvest/

"Combine" together with "Zebra":
http://www.lub.lu.se/combine/
http://www.indexdata.dk/zebra/


If you'd care to comment on the pros or cons of any of this, I'd be interested.

Direct replies to me or the list, as you think appropriate.
---
Albert Lunde  [EMAIL PROTECTED]




Re: htdig: HTDig questions

1998-09-17 Thread Jacques Le Mouel

lws wrote:
> 
> Hi,
> I also have HPUX.  I am running 10.20 and I wondered what
> versions of things you used when you built htdig.  I have tried the
> last 2 versions(3.0.8b2 & 3.1.0b1) and they both blow up horribly.
> I am using gcc 2.7.2.3, libg++ 2.7.2, binutils 2.7 and sed 2.05.  Do
> you mind telling me what versions you used?  Also, what directories
> and paths did you use?  Thanks for your time.
> 
> Larry Sturtz
> 
> Date:      Wed, 16 Sep 1998 16:52:20 -0400
> From:      Jacques Le Mouel <[EMAIL PROTECTED]>
> To:        HTDig mailing list <[EMAIL PROTECTED]>
> Subject:   htdig: HTDig questions
> Reply-to:  Jacques Le Mouel <[EMAIL PROTECTED]>
> 
> First, thanks and congratulation to all involved for such a great tool.
etc...


We run a HP-UX B.10.01 A 9000/847
I had nothing to begin with, so I had to get and install all the needed
packages from scratch:

from Adobe
- acroread_hpux_301_tar.gz

from gnu
- gcc-2_8_1_tar.gz
- make-3_77_tar.gz

and
- libstdc++-2_8_1_tar(1).gz
- libg++-2_7_2_tar.gz
- libstdc++-2_8_1_1_tar(1).gz

Of course, only one of these 3 libs is needed, but I was struggling
with a totally unrelated problem (the wrong make) that complained about
the library, so I gradually downloaded more and more. I can't say
which one exactly is the right one; I believe it is libstdc++
2.8.1.1.

from htdig
- htdig-pdf_tar.gz
- htdig-3_0_8b2_tar.gz

and I added the patches described by Sylvain for PDF and the patch for
htcommon/defaults.cc.old (I can't remember why).

Took me a while to do all this, but it worked pretty easily.

Hope this helps.

PS: indeed, it is a long install process; I have posted a porting
request on http://hpux.csc.liv.ac.uk/, hoping a good soul will create a
HP executable install package
--
To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED] containing the single word "unsubscribe" in
the body of the message.



Re: htdig: HTDig questions

1998-09-17 Thread Andrew Scherpbier

Geoff Hutchison wrote:
> 
> >- Is it possible to specify at search time whether to use endings or not
> 
> You can have a "conf" field with a pop-up menu. Then one conf file can have
> the endings and another won't. Otherwise the conf files would be identical.

In addition to that, you can specify the search algorithm to be used at search
time by prefixing the words with the algorithms.
For example, regardless of how the config file is setup, you can always use
something like:

exact:computer

which will search for 'computer' using the 'exact' algorithm.
You can specify multiple algorithms as well as specifying the weight for each
algorithm.
Check the docs/sources on the details of this.
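
For anyone building a search form or wrapper around this, here is a small Python sketch of splitting such a prefixed term into its two parts. The single-colon rule and the function name are assumptions made for illustration; the weight syntax is not given here, so it is left out rather than guessed.

```python
def split_algorithm(term):
    """Split 'algorithm:word' into (algorithm, word); algorithm is None if absent."""
    algorithm, colon, word = term.partition(":")
    if not colon:
        return None, term  # plain word: leave the choice to the config file
    return algorithm, word


print(split_algorithm("exact:computer"))
print(split_algorithm("computer"))
```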

BTW, I realize that this is probably not very well documented.  In addition to
the regular search words, you can use the algorithm prefix stuff in the
'keywords' field as well.  This is how one user of ht://Dig restricts where to
search; he put a pulldown list of general search terms on the search form so
that a search can be performed on a subset of his site that otherwise could
not be divided up by looking at URLs.
(Not sure how to explain all this...)
-- 
Andrew Scherpbier <[EMAIL PROTECTED]>
Contigo Software 
--



Re: htdig: HTDig questions

1998-09-16 Thread Geoff Hutchison


>- Is there a freely available list of synonyms, in English; we would be
>most interested by one focusing on the Telecom industry jargon

Well a new list of synonyms has been compiled by John Banbury
<[EMAIL PROTECTED]>. If you'd like a copy, I can mail it to the list or
put it up somewhere.

>- Is there a way to set-up a list of "anti-bad words", i.e. force the
>index words smaller than 3 letters"), still pick-up all instances of AT
>as it can be meaningful to us

Hmm. Right now the best way to do this is to change minimum_word_length to 2
and try to eliminate all non-useful short words in the bad_words
dictionary. I'm working on a patch to allow factoring in the word frequency,
so words like "the" wouldn't need to be explicitly in bad_words since
they'd have such a high frequency.
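
The frequency idea can be sketched in Python as follows. This is only an illustration of the principle, not the patch mentioned above; the sample documents, the threshold and the function name are all invented.

```python
from collections import Counter


def frequent_words(documents, max_fraction):
    """Return words that occur in more than max_fraction of the documents."""
    doc_count = Counter()
    for text in documents:
        # Count each word once per document, ignoring case.
        doc_count.update(set(text.lower().split()))
    limit = max_fraction * len(documents)
    return {word for word, n in doc_count.items() if n > limit}


# Hypothetical document texts.
DOCS = [
    "the switch at the exchange",
    "the AT command set",
    "the modem and the line",
    "the trunk line",
]

print(frequent_words(DOCS, 0.75))
```

With a threshold like this, a short but meaningful word such as "AT" stays indexed because it is rare, while "the" is dropped without being listed by hand.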

>- Is there an easy way to extract the list of all the documents in the
>library; conceptually, it could be done by searching with the words
>field empty, but this fails

Yup. Run htdig with the -t option ("Create an ASCII version of the document
database."). Of course this won't be just a list of the documents...

>- Is it possible to specify at search time whether to use endings or not

You can have a "conf" field with a pop-up menu. Then one conf file can have
the endings and another won't. Otherwise the conf files would be identical.

>- Is it possible to search for phrases

Not yet.

>- What about the new version with DB2 instead of GDBM? Why the change? Is
>it quicker? It seems it is still in Beta; when is it supposed to be
>released in final form?

DB2 is faster, for one. The htmerge program is usually 2-3 times faster for
me and searches are faster as well. As for "beta," well... The version most
people had recently was 3.0.8b2 which was "beta." The new version is also
"beta" since I haven't had a chance to test it on a lot of platforms. It
should be more stable than 3.0.8b2. As for a "final release," that depends
on how many bugs are found... :-)

>- Is there a way to dig several documents at the same time in parallel
>(i.e. convert and read through several PDFs at the same time)? Would
>this speed up the indexing process?

This cannot be done at the moment. It might speed up the indexing depending
on the speed of your CPU and all that.

>- Is it possible to have multiple restricts at search time, like:
>restrict to URLs that include both /myserver/docs/subject1 AND .pdf

I believe this should work: 

>- Is it possible to index multiple servers? Does this require multiple
>.conf files or can it be done using only one .conf file?

No, you can use one conf file. Just change the limit_urls_to conf option.
limit_urls_to: wso.williams.edu ethel.williams.edu
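
To illustrate what that option does, here is a small sketch under the
assumption that limit_urls_to is treated as a list of substrings and a URL is
kept if it contains at least one of them. The function name is made up; the
real check lives inside htdig. Presumably each server also has to be reachable
from start_url for the dig to find it.

```python
def within_limits(url, limit_urls_to):
    """True if the URL contains at least one of the limit patterns."""
    return any(pattern in url for pattern in limit_urls_to)


limits = ["wso.williams.edu", "ethel.williams.edu"]
candidates = [
    "http://wso.williams.edu/",
    "http://ethel.williams.edu/docs/",
    "http://www.example.com/",  # hypothetical outside server
]
for url in candidates:
    print(url, within_limits(url, limits))
```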

>- Are there tools to "massage" the database once it is created; for
>instance, to remove some docs from it, ... in order to avoid a complete
>rebuild (think of my 16 hours, and we are barely half way through
>loading the site...)

No idea. It wouldn't be hard to write them, but I don't think they exist at
the moment. There are a few programs in contrib/ that may do similar things.

>- Are bad words excluded at the dig/merge time, or at search time (which
>would increase the size of the database for nothing)
>- Is Proximity searching supported?

See phrase searching. If we had proximity (near) searching, wouldn't we
have phrase searching too? :-)

>- Can Htdig support hit highlighting within PDF documents by using byte
>serving and (is it) XML (?)?


-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/





htdig: HTDig questions

1998-09-16 Thread Jacques Le Mouel

First, thanks and congratulations to all involved for such a great tool.

Then, I have many questions regarding HTDig.
I have recently set up HTDig 3.0.8b2 on a HP-UX 9000/800 G30 running
HP-UX 10.01 and Netscape Enterprise server 2.0.
This is to provide a search facility for an Intranet site of a few thousand
documents in PDF.

The questions are (in no particular order, as they say):
- Is there a freely available list of synonyms, in English; we would be
most interested by one focusing on the Telecom industry jargon
- Is there a way to set up a list of "anti-bad words", i.e. force the
inclusion of a word when it appears in a document, even if it is smaller
than the threshold; for instance, if we cut at 3 letters (i.e. "do not
index words smaller than 3 letters"), still pick up all instances of AT
as it can be meaningful to us
- If I run htdig -i and later change the .conf file, the changes don't
seem to be taken into consideration (especially start page, excludes...).
Am I doing something wrong?
- Is there an easy way to extract the list of all the documents in the
library; conceptually, it could be done by searching with the words
field empty, but this fails
- Is it possible to specify at search time whether to use endings or not
- Is it possible to search for phrases
- What about the new version with DB2 instead of GDBM? Why the change? Is
it quicker? It seems it is still in Beta; when is it supposed to be
released in final form? Considering it requires rebuilding the index
from scratch, and it takes us a very long time to do that (16 hours on
our server, for 160MB of PDF; the really slow part is the "acroread
-toPostScript" for all the files), is it worth moving to this version?
- Is there a way to dig several documents at the same time in parallel
(i.e. convert and read through several PDFs at the same time)? Would
this speed up the indexing process?
- Is it possible to have multiple restricts at search time, like:
restrict to URLs that include both /myserver/docs/subject1 AND .pdf
- Is it possible to index multiple servers? Does this require multiple
.conf files or can it be done using only one .conf file?
- Are there tools to "massage" the database once it is created; for
instance, to remove some docs from it, ... in order to avoid a complete
rebuild (think of my 16 hours, and we are barely half way through
loading the site...)
- Are bad words excluded at the dig/merge time, or at search time (which
would increase the size of the database for nothing)
- Is Proximity searching supported?
- Can Htdig support hit highlighting within PDF documents by using byte
serving and (is it) XML (?)?

OK, enough questions for now maybe.
Any help would be greatly appreciated.

Jacques Le Mouel



Re: htdig: Questions!!

1998-07-11 Thread Lars Appel

At 12:47 09.07.98 -0500, you wrote:
>I've just decompressed your software and there is a question I have.
>Is this software made exclusively to run on Solaris?
>If so, it won't work because I need a search engine for an HP with HP-UX

I can confirm that it's not just for Solaris as I have been able
to port it to MPE/iX, the HP 3000 operating system (which is not
just another Unix like HP-UX etc) quite easily, so it should also
be possible to build ht://dig on HP 9000 systems with HP-UX.

A public thanks to Andrew for this nice piece of (portable) software!

Lars Appel, HP Germany (speaking only for myself here)

(also see ftp://ftp.3k.com/POSIX/htdig.htm re ht://dig on MPE/iX)



Re: htdig: Questions!!

1998-07-09 Thread Andrew Scherpbier

Jesus Valadez Sanchez wrote:
> 
> I've just decompressed your software and there is a question I have.
> Is this software made exclusively to run on Solaris?
> If so, it won't work because I need a search engine for an HP with HP-UX
> 
> Please answer me!
> Thanks.

ht://Dig will compile and run on many different flavors of Unix, including
Solaris, HP-UX, Linux, etc.
-- 
Andrew Scherpbier <[EMAIL PROTECTED]>
Contigo Software 



htdig: Questions!!

1998-07-09 Thread Jesus Valadez Sanchez

I've just decompressed your software and there is a question I have.
Is this software made exclusively to run on Solaris?
If so, it won't work because I need a search engine for an HP with HP-UX

Please answer me!
Thanks.

