Re: [htdig-dev] Logical Error in Indexer???

2003-10-02 Thread Jessica Biola
--- Neal Richter <[EMAIL PROTECTED]> wrote:
> Hey all,
>   I've got a question for all of you about how the
> htdig 'indexer'
> should function.
> I've tested this fix and it works.
> 
> Eh?

I felt like I was sharing a beer with you at the pub,
and you just got done "schematicizing" the problem and
fix on a napkin-coaster and ended it with, "Eh?"

Sounds like a good fix to a problem that I think
(subconsciously) I knew existed.

How about this one -- does your patch help with the
check_unique_md5 problem?  Even when I use a "-i"
option (or without), if the start_url's MD5 hash-sig
matches the one from my previous index, it just says
that it detected an MD5 duplicate and exits.

Deleting db.md5hash.db seems to do the trick.  But
would that be sacrilege removing the db.md5hash.db
before a refresh?
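
To make the symptom concrete, here is a toy model of it, not ht://Dig's
actual code.  It assumes the hash store (db.md5hash.db) survives between
runs and is not cleared by "-i", which is only what the behavior above
suggests.  SeenHashes and runDig are hypothetical names, and std::hash
stands in for MD5 just to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

// Hypothetical stand-in for a persistent content-hash store such as
// db.md5hash.db. std::hash replaces MD5 purely for brevity.
class SeenHashes {
public:
    // Returns true if the content is new, false if identical content
    // was already recorded (a "duplicate").
    bool insertIfNew(const std::string &content) {
        return seen_.insert(std::hash<std::string>{}(content)).second;
    }

    // Models deleting the hash database before a dig.
    void clear() { seen_.clear(); }

private:
    std::unordered_set<std::size_t> seen_;
};

// One indexing run over a single start page. Returns true if the page
// was indexed, false if it was rejected as a duplicate.
bool runDig(SeenHashes &hashes, const std::string &startPageContent) {
    return hashes.insertIfNew(startPageContent);
}
```

In this model the first runDig() on a page succeeds, a second run with the
same store and unchanged content is rejected, and calling clear() first
(the equivalent of deleting the file) lets the page through again.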

-Jes




---
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
___
ht://Dig Developer mailing list:
[EMAIL PROTECTED]
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-dev


[htdig-dev] Logical Error in Indexer???

2003-10-02 Thread Neal Richter
Hey all,
I've got a question for all of you about how the htdig 'indexer'
should function.

htdig.cc
337 List*list = docs.URLs();
338 retriever.Initial(*list);
339 delete list;
340
341 // Add start_url to the initial list of the retriever.
342 // Don't check a URL twice!
343 // Beware order is important, if this bugs you could change
344 // previous line retriever.Initial(*list, 0) to Initial(*list,1)
345 retriever.Initial(config->Find("start_url"), 1);

Note lines 337-339.  This code loads the entire list of documents
currently in the index and feeds this to the retriever object for
retrieval and processing.

The effect of this is that we potentially are visiting and keeping
webpages that we would no longer find via a link, and we will keep
revisiting a website even if we remove it from the 'start_url' in
htdig.conf.

The workaround is to use 'htdig -i'.  This is a disadvantage, as we will
revisit and reindex pages even if they haven't changed since the last run of
htdig.

Here's the Fix:

1) At the start of htdig, after we've opened the DBs, we 'walk' the docDB
and mark EVERY document as Reference_obsolete.  I wrote code to do this;
it's very short.

2) Comment out htdig.cc 337-339

3) When the indexer fires up and spiders a site, documents that are in
the tree and marked as Reference_obsolete are re-marked as
Reference_normal.

4) When htpurge is run, the obsoleted docs are flushed.

Documents that aren't revisited (since a link isn't found) are flushed.
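
The steps above amount to a mark-and-sweep pass over the document
database.  Here is a minimal standalone sketch of that idea, not the
actual patch: DocDB, LinkGraph, markAllObsolete, crawl and purgeObsolete
are all hypothetical names, the "site" is an in-memory link map instead
of real HTTP fetches, and every URL reached is assumed to be retrieved
successfully.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Simplified mirror of the Reference_normal / Reference_obsolete states.
enum class RefState { Normal, Obsolete };

using DocDB = std::map<std::string, RefState>;                      // URL -> state
using LinkGraph = std::map<std::string, std::vector<std::string>>;  // URL -> links on that page

// Step 1: walk the document DB and mark every entry obsolete.
void markAllObsolete(DocDB &db) {
    for (auto &entry : db) entry.second = RefState::Obsolete;
}

// Step 3: spider from the start URLs only (the old document list is not
// fed to the retriever, as in step 2) and re-mark what is reached.
void crawl(DocDB &db, const LinkGraph &site,
           const std::vector<std::string> &startUrls) {
    std::queue<std::string> pending;
    std::set<std::string> visited;
    for (const auto &url : startUrls) pending.push(url);

    while (!pending.empty()) {
        std::string url = pending.front();
        pending.pop();
        if (!visited.insert(url).second) continue;  // already handled

        db[url] = RefState::Normal;  // new or revisited document

        auto page = site.find(url);
        if (page == site.end()) continue;
        for (const auto &link : page->second) pending.push(link);
    }
}

// Step 4: flush everything still marked obsolete (the htpurge role).
// Returns the number of documents removed.
std::size_t purgeObsolete(DocDB &db) {
    std::size_t removed = 0;
    for (auto it = db.begin(); it != db.end();) {
        if (it->second == RefState::Obsolete) {
            it = db.erase(it);
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```

With a DB holding a linked page, an orphaned page, and a page from a
removed start URL, calling markAllObsolete(), then crawl(), then
purgeObsolete() leaves only the pages reachable from the current start
URLs.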

This fix addresses two flaws:

1) Changing 'start_url' by removing a starting URL: the documents are
still in the index after the next run of htdig (unless you use -i).

2) Pages that still exist on a webserver at a given URL but are no longer
linked to by any other pages on the site.

I've tested this fix and it works.

Eh?

Thanks.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485






Re: [htdig-dev] Cygwin words.db Compression

2003-10-02 Thread Neal Richter

Hey,
I have produced a set of makefiles for native Windows binaries.
You do need Cygwin to run 'make' (the makefiles are for GNU make).  The
makefiles use the Microsoft compiler.

Could you get a copy of the latest snapshot and try and do the
build?  I'll work with you to get it fixed if it's still broken.

We've tested older snapshots of HtDig compiled as native Win32 binaries
and run nearly a million documents through them.

If this doesn't satisfy your needs, I'd be willing to put in some
time looking at the cygwin build.

Neal Richter.

On Thu, 2 Oct 2003, Steve Eidemiller wrote:

> I'm compiling htdig-3.2.0b4-20090928 under Cygwin 1.5.5 using gcc 3.3.1, on both 
> Windows XP Pro SP1 and Windows 2000 Server SP4. Compiling and installation is not a 
> problem. But db.words.db is always a zero length file after running htdig with the 
> compression flags at their default values. After some profiling I also noticed that 
> it wasn't creating the "db.words.db.work_weakcmpr" file during the dig. When 
> compiled under Cygwin 1.3.22 using "gcc-3.2 20020927 (prerelease)", the work file 
> *is* created during the dig and db.words.db has size to it afterwards. However, I am 
> not able to htdb_dump that file or use htsearch against it. It's corrupt or 
> something. The other db files seem to get created fine under both sets of binaries, 
> although I didn't try to dump them. And the same version related behavior occurs 
> under both XP and 2000 OS's.
>
> After reading all the SF posts about compression and db issues, I decided to disable 
> compression and see what happens:
>
> wordlist_compress: false
> wordlist_compress_zlib: false
> compression_level: 0
>
> With those settings, everything appears to work fine for both sets of binaries: I 
> can dig pages and run htsearch. I haven't modified any of the code to try and 
> address the problem yet, but it looks like others are having similar issues on other 
> platforms? Is anybody else having trouble with db compression on Windows? I have 
> tried different settings for compression_level with no success.
>
> Also, my initial attempts at changing the compression flag values failed with error 
> messages from htdig while trying to read the configuration file. It seems that the 
> htdig.conf parser doesn't like CR (ASCII=13) characters. Notepad and Wordpad are 
> obvious choices for editing this file on Windows, but those don't work because both 
> insert CRLF pairs to terminate lines in the file (e.g. DOS format). And then the 
> parser apparently won't see flags at the bottom of the CRLF file. The solution was a 
> simple JavaScript program to modify htdig.conf by removing all CR characters 
> *before* running htdig. Is anybody else seeing this on Cygwin builds?
>
> Sorry for the long post :)
>
> PS - I'm running 3.1.6 in production on Windows at 
> http://www.childrenshc.org/Search/ and it rocks!!
>
> Thanx
> -Steve

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485






[htdig-dev] Cygwin words.db Compression

2003-10-02 Thread Steve Eidemiller
I'm compiling htdig-3.2.0b4-20090928 under Cygwin 1.5.5 using gcc 3.3.1, on both 
Windows XP Pro SP1 and Windows 2000 Server SP4. Compiling and installation is not a 
problem. But db.words.db is always a zero length file after running htdig with the 
compression flags at their default values. After some profiling I also noticed that it 
wasn't creating the "db.words.db.work_weakcmpr" file during the dig. When compiled 
under Cygwin 1.3.22 using "gcc-3.2 20020927 (prerelease)", the work file *is* created 
during the dig and db.words.db has size to it afterwards. However, I am not able to 
htdb_dump that file or use htsearch against it. It's corrupt or something. The other 
db files seem to get created fine under both sets of binaries, although I didn't try 
to dump them. And the same version related behavior occurs under both XP and 2000 OS's.

After reading all the SF posts about compression and db issues, I decided to disable 
compression and see what happens:

wordlist_compress: false
wordlist_compress_zlib: false
compression_level: 0

With those settings, everything appears to work fine for both sets of binaries: I can 
dig pages and run htsearch. I haven't modified any of the code to try and address the 
problem yet, but it looks like others are having similar issues on other platforms? Is 
anybody else having trouble with db compression on Windows? I have tried different 
settings for compression_level with no success.

Also, my initial attempts at changing the compression flag values failed with error 
messages from htdig while trying to read the configuration file. It seems that the 
htdig.conf parser doesn't like CR (ASCII=13) characters. Notepad and Wordpad are 
obvious choices for editing this file on Windows, but those don't work because both 
insert CRLF pairs to terminate lines in the file (e.g. DOS format). And then the 
parser apparently won't see flags at the bottom of the CRLF file. The solution was a 
simple JavaScript program to modify htdig.conf by removing all CR characters *before* 
running htdig. Is anybody else seeing this on Cygwin builds?
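
For anyone who wants the same workaround without a script host, here is a 
rough C++ equivalent of the CR-stripping step. It is only a sketch: 
stripCarriageReturns and convertFileToUnixLineEndings are names made up for this 
example, and the file is opened in binary mode on the assumption that a text-mode 
stream could translate line endings and put the CRs back.

```cpp
#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>
#include <utility>

// Returns a copy of the text with every CR (ASCII 13) removed, so CRLF
// line endings become plain LF.
std::string stripCarriageReturns(std::string text) {
    text.erase(std::remove(text.begin(), text.end(), '\r'), text.end());
    return text;
}

// Rewrites the file at `path` in place without CR characters.
// Returns false if the file cannot be read or written.
bool convertFileToUnixLineEndings(const std::string &path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    std::string contents((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());
    in.close();

    // Binary mode again, so the runtime does not translate LF on output.
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out << stripCarriageReturns(std::move(contents));
    return static_cast<bool>(out);
}
```

Running convertFileToUnixLineEndings() on htdig.conf after each edit in Notepad or 
Wordpad, and before running htdig, would match the order described above.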

Sorry for the long post :)

PS - I'm running 3.1.6 in production on Windows at http://www.childrenshc.org/Search/ 
and it rocks!!

Thanx
-Steve
__

Confidentiality Statement:
This email/fax, including attachments, may include confidential and/or proprietary 
information and may be used only by the person or entity to which it is addressed. If 
the reader of this email/fax is not the intended recipient or his or her agent, the 
reader is hereby notified that any dissemination, distribution or copying of this 
email/fax is prohibited. If you have received this email/fax in error, please notify 
the sender by replying to this message and deleting this email or destroying this 
facsimile immediately.

