Jim wrote:

The problem is in the db code. As I recall a few people tried to work around the problem some time back, but were never able to get the db code to compile properly with large file support under Linux. I am not aware of any easy fix or workaround for 3.1.6. I believe the 3.2 branch supports files larger than 2GB under Linux, however it is still beta and currently provides less than stellar indexing performance, which might be an issue if you are working with large document collections.

Jim

Migrating to 3.2 is not an option for us at the moment because we made some patches to htdig for our environment. You can find some hints about what we did in my postings from the last few years. We would need some time for patching and testing.
Because we are in the process of migrating our web services from Sun to Dell/Linux systems, we need htdig 3.1.6 so that we can keep our configuration (for the user interface) for now.


To me, wordlist.work looks like an ASCII file. This file was the cause of a crash at least twice. With my last version of htdig the program reached 2 GB with wordlist.work after about 45000 documents.

Some statistics:
With HTML, ASCII, PDF and PS documents we have over 230000 documents on our webservers indexed with htdig on Solaris. If we start indexing other document types as well, this number will increase significantly. And we get about 40000 new documents on our webservers each year.
On our old Solaris machine htdig needs about a week under high load for a full run. On our Dell PowerEdge 2650/RedHat 7.3 htdig needs only 25 hours (if it doesn't crash x-( ), and the machine can still be used for other services.


I took a look at the sources. One relevant part of the code can be found in htcommon/WordList.cc, in WordList::Flush().


//*****************************************************************************
// void WordList::Flush()
//   Dump the current list of words to the temporary word file. After
//   the words have been dumped, the list will be destroyed to make
//   room for the words of the next document.
//
void WordList::Flush()
{
    FILE            *fl = fopen(tempfile, "a");
    WordReference   *wordRef;

    words->Start_Get();
    while ((wordRef = (WordReference *) words->Get_NextElement()))
    {

        fprintf(fl, "%s",wordRef->Word.get());
        fprintf(fl, "\ti:%d\tl:%d\tw:%d",
                wordRef->DocumentID,
                wordRef->Location,
                wordRef->Weight);
        if (wordRef->WordCount != 1)
        {
           fprintf(fl, "\tc:%d",wordRef->WordCount);
        }
        if (wordRef->Anchor != 0)
        {
           fprintf(fl, "\ta:%d",wordRef->Anchor);
        }
        putc('\n', fl);
    }
    words->Destroy();
    fclose(fl);
}



It's fopen instead of fopen64 that is used to open the file (and plain fprintf and fclose ...). And as far as I understand it, there is nothing in the Makefile that makes the compiler use LFS (large file support).
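
For comparison, here is a minimal sketch of what an explicit 64-bit open at this one call site could look like, instead of switching the whole build to LFS. fopen64 is glibc-specific, and the NULL check is my own addition, not something the original code does:

// Hypothetical variant of the open in WordList::Flush(), glibc only.
// _LARGEFILE64_SOURCE must be defined before <stdio.h> is included
// (normally via the compiler command line, e.g. -D_LARGEFILE64_SOURCE).
FILE *fl = fopen64(tempfile, "a");   // stream may grow past 2 GB
if (fl == NULL)
    return;                          // don't write to a NULL stream
// fprintf(), putc() and fclose() then work unchanged on this stream.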

I've built a version of htdig with '-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'. I hope I can test it in the next few days. If it works as expected, only a patch to the Makefile will be needed to solve this problem on Linux systems.
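
A quick way to check that the flags actually take effect with gcc/glibc on the Linux box is a tiny test program (purely a sanity check, nothing htdig-specific; the file name is just an example):

// lfs_check.cc -- compile twice and compare the output:
//   g++ lfs_check.cc -o lfs_check
//   g++ -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 lfs_check.cc -o lfs_check
// With the LFS flags, off_t should come out as 8 bytes, and glibc maps
// fopen()/fseeko()/ftello() to their 64-bit variants behind the scenes.
#include <stdio.h>
#include <sys/types.h>

int main()
{
    printf("sizeof(off_t) = %u bytes\n", (unsigned) sizeof(off_t));
    return 0;
}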

If there is also a 2 GB problem in the db sources, it might be worth a try to replace the old db2 version 2.6.4 code with a newer release. Has anybody thought about it?

Berthold

--
Dr. rer. nat. Berthold Cogel                   University of Cologne
E-Mail: [EMAIL PROTECTED]                 ZAIK-US (RRZK)
Tel.:   +49(0)221/478-5572                     Robert-Koch-Str. 10
FAX:    +49(0)221/478-5590                     D-50931 Cologne - Germany
WWW: http://www.uni-koeln.de/rrzk/


