Re: [htdig-dev] Logical Error in Indexer???
--- Neal Richter <[EMAIL PROTECTED]> wrote:
> Hey all,
> I've got a question for all of you about how the htdig 'indexer'
> should function.
> I've tested this fix and it works.
>
> Eh?

I felt like I was sharing a beer with you at the pub, and you just got done "schematicizing" the problem and fix on a napkin-coaster and ended it with, "Eh?"

Sounds like a good fix to a problem that I think (subconsciously) I knew existed.

How about this one: does your patch help with the check_unique_md5 problem? With or without the "-i" option, if the start_url's MD5 hash-sig matches the one from my previous index, htdig just says that it detected an MD5 duplicate and exits. Deleting db.md5hash.db seems to do the trick. But would it be sacrilege to remove db.md5hash.db before a refresh?

-Jes

---
This sf.net email is sponsored by: ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
___
ht://Dig Developer mailing list: [EMAIL PROTECTED]
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-dev
[htdig-dev] Logical Error in Indexer???
Hey all,

I've got a question for all of you about how the htdig 'indexer' should function.

htdig.cc

    337  List *list = docs.URLs();
    338  retriever.Initial(*list);
    339  delete list;
    340
    341  // Add start_url to the initial list of the retriever.
    342  // Don't check a URL twice!
    343  // Beware order is important, if this bugs you could change
    344  // previous line retriever.Initial(*list, 0) to Initial(*list,1)
    345  retriever.Initial(config->Find("start_url"), 1);

Note lines 337-339. This code loads the entire list of documents currently in the index and feeds it to the retriever object for retrieval and processing.

The effect is that we potentially visit and keep webpages that we can no longer find via a link, and we keep revisiting a website even after it has been removed from 'start_url' in htdig.conf.

The workaround is to use 'htdig -i'. The disadvantage is that we then revisit and reindex pages even if they haven't changed since the last run of htdig.

Here's the fix:

1) At the start of htdig, after we've opened the DBs, we 'walk' the docDB and mark EVERY document as Reference_obsolete. I wrote code to do this; it is very short.
2) Comment out htdig.cc lines 337-339.
3) When the indexer fires up and spiders a site, documents that are in the tree and marked as Reference_obsolete are re-marked as Reference_normal.
4) When htpurge is run, the obsoleted docs are flushed. Documents that weren't revisited (because no link to them was found) are flushed.

This fix addresses two flaws:

1) Changing 'start_url' and removing a starting URL: the documents are still in the index after the next run of htdig (unless you use -i).
2) Pages that still exist on a webserver at a given URL, but are no longer linked to by any other page on the site, stay in the index.

I've tested this fix and it works.

Eh?

Thanks.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
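The four steps Neal lists amount to a mark-and-sweep pass over the document database. The following is only a toy sketch of that idea, not Neal's patch and not ht://Dig code: `DocState`, `DocDB`, `LinkGraph`, and the three functions are hypothetical names invented for illustration, and the "web" is an in-memory map rather than real HTTP retrieval.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins for the Reference_obsolete / Reference_normal
// states mentioned in the post; these are not the real ht://Dig types.
enum class DocState { Normal, Obsolete };

// Toy document database: URL -> state.
using DocDB = std::map<std::string, DocState>;

// Toy web: URL -> URLs that page links to. A missing key means the page is gone.
using LinkGraph = std::map<std::string, std::vector<std::string>>;

// Step 1: before the dig, mark every known document obsolete.
void mark_all_obsolete(DocDB& db) {
    for (auto& entry : db) entry.second = DocState::Obsolete;
}

// Steps 2 and 3: seed the crawl from start_urls only (not from the whole
// database) and mark every page actually reached as normal again.
void crawl(DocDB& db, const LinkGraph& web,
           const std::vector<std::string>& start_urls) {
    std::vector<std::string> pending(start_urls.begin(), start_urls.end());
    std::set<std::string> seen;
    while (!pending.empty()) {
        std::string url = pending.back();
        pending.pop_back();
        if (!seen.insert(url).second) continue;  // already visited
        auto page = web.find(url);
        if (page == web.end()) continue;         // page no longer exists
        db[url] = DocState::Normal;              // new or revisited document
        for (const auto& link : page->second) pending.push_back(link);
    }
}

// Step 4: purge whatever is still marked obsolete; returns the number removed.
std::size_t purge_obsolete(DocDB& db) {
    std::size_t removed = 0;
    for (auto it = db.begin(); it != db.end();) {
        if (it->second == DocState::Obsolete) {
            it = db.erase(it);
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```

Calling `mark_all_obsolete`, then `crawl`, then `purge_obsolete` drops both kinds of stale entry described above: documents under a start URL that was removed, and pages that still exist but are no longer linked from anywhere the crawl reaches.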
Re: [htdig-dev] Cygwin words.db Compression
Hey,

I have produced a set of makefiles for native Windows binaries. You do need Cygwin to run 'make' (the makefiles are for GNU make). The makefiles use the Microsoft compiler.

Could you get a copy of the latest snapshot and try to do the build? I'll work with you to get it fixed if it's still broken.

We've tested older snapshots of HtDig compiled as native Win32 and run nearly a million documents through it.

If this doesn't satisfy your needs, I'd be willing to put in some time looking at the Cygwin build.

Neal Richter.

On Thu, 2 Oct 2003, Steve Eidemiller wrote:

> I'm compiling htdig-3.2.0b4-20030928 under Cygwin 1.5.5 using gcc 3.3.1, on both
> Windows XP Pro SP1 and Windows 2000 Server SP4. Compiling and installation are not a
> problem. But db.words.db is always a zero length file after running htdig with the
> compression flags at their default values. After some profiling I also noticed that
> it wasn't creating the "db.words.db.work_weakcmpr" file during the dig. When
> compiled under Cygwin 1.3.22 using "gcc-3.2 20020927 (prerelease)", the work file
> *is* created during the dig and db.words.db has size to it afterwards. However, I am
> not able to htdb_dump that file or use htsearch against it. It's corrupt or
> something. The other db files seem to get created fine under both sets of binaries,
> although I didn't try to dump them. And the same version related behavior occurs
> under both XP and 2000 OS's.
>
> After reading all the SF posts about compression and db issues, I decided to disable
> compression and see what happens:
>
> wordlist_compress: false
> wordlist_compress_zlib: false
> compression_level: 0
>
> With those settings, everything appears to work fine for both sets of binaries: I
> can dig pages and run htsearch. I haven't modified any of the code to try and
> address the problem yet, but it looks like others are having similar issues on other
> platforms? Is anybody else having trouble with db compression on Windows? I have
> tried different settings for compression_level with no success.
>
> Also, my initial attempts at changing the compression flag values failed with error
> messages from htdig while trying to read the configuration file. It seems that the
> htdig.conf parser doesn't like CR (ASCII=13) characters. Notepad and Wordpad are
> obvious choices for editing this file on Windows, but those don't work because both
> insert CRLF pairs to terminate lines in the file (i.e. DOS format). And then the
> parser apparently won't see flags at the bottom of the CRLF file. The solution was a
> simple JavaScript program to modify htdig.conf by removing all CR characters
> *before* running htdig. Is anybody else seeing this on Cygwin builds?
>
> Sorry for the long post :)
>
> PS - I'm running 3.1.6 in production on Windows at
> http://www.childrenshc.org/Search/ and it rocks!!
>
> Thanx
> -Steve

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
[htdig-dev] Cygwin words.db Compression
I'm compiling htdig-3.2.0b4-20030928 under Cygwin 1.5.5 using gcc 3.3.1, on both Windows XP Pro SP1 and Windows 2000 Server SP4. Compiling and installation are not a problem. But db.words.db is always a zero length file after running htdig with the compression flags at their default values. After some profiling I also noticed that it wasn't creating the "db.words.db.work_weakcmpr" file during the dig.

When compiled under Cygwin 1.3.22 using "gcc-3.2 20020927 (prerelease)", the work file *is* created during the dig and db.words.db has size to it afterwards. However, I am not able to htdb_dump that file or use htsearch against it. It's corrupt or something. The other db files seem to get created fine under both sets of binaries, although I didn't try to dump them. And the same version related behavior occurs under both XP and 2000 OS's.

After reading all the SF posts about compression and db issues, I decided to disable compression and see what happens:

wordlist_compress: false
wordlist_compress_zlib: false
compression_level: 0

With those settings, everything appears to work fine for both sets of binaries: I can dig pages and run htsearch. I haven't modified any of the code to try and address the problem yet, but it looks like others are having similar issues on other platforms? Is anybody else having trouble with db compression on Windows? I have tried different settings for compression_level with no success.

Also, my initial attempts at changing the compression flag values failed with error messages from htdig while trying to read the configuration file. It seems that the htdig.conf parser doesn't like CR (ASCII=13) characters. Notepad and Wordpad are obvious choices for editing this file on Windows, but those don't work because both insert CRLF pairs to terminate lines in the file (i.e. DOS format). And then the parser apparently won't see flags at the bottom of the CRLF file. The solution was a simple JavaScript program to modify htdig.conf by removing all CR characters *before* running htdig. Is anybody else seeing this on Cygwin builds?

Sorry for the long post :)

PS - I'm running 3.1.6 in production on Windows at http://www.childrenshc.org/Search/ and it rocks!!

Thanx
-Steve

__
Confidentiality Statement:
This email/fax, including attachments, may include confidential and/or proprietary information and may be used only by the person or entity to which it is addressed. If the reader of this email/fax is not the intended recipient or his or her agent, the reader is hereby notified that any dissemination, distribution or copying of this email/fax is prohibited. If you have received this email/fax in error, please notify the sender by replying to this message and deleting this email or destroying this facsimile immediately.
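Steve's CR-stripping workaround was a JavaScript program that is not shown in the post. As a rough equivalent, here is a minimal C++ sketch of the same idea: remove every carriage return from the configuration file before htdig reads it. The function names `strip_cr` and `strip_cr_in_file` are hypothetical, and the sketch assumes the file is small enough to hold in memory.

```cpp
#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>

// Return a copy of `text` with every carriage return (ASCII 13) removed,
// so CRLF line endings become plain LF.
std::string strip_cr(std::string text) {
    text.erase(std::remove(text.begin(), text.end(), '\r'), text.end());
    return text;
}

// Rewrite the file at `path` in place without CR characters.
// Binary mode is used so the runtime does not translate line endings itself.
// Returns false if the file cannot be read or written.
bool strip_cr_in_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    std::string text((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());
    in.close();

    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out << strip_cr(text);
    return static_cast<bool>(out);
}
```

A call such as `strip_cr_in_file("htdig.conf")` before each dig would stand in for the JavaScript step; the path here is only an example, and an editor that saves LF line endings avoids the problem entirely.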