According to Geoff Hutchison:
> Marjolein noted a bug in the Document code. If you do a search on
> htdig.org, you can see it in action. Search for any attribute, say
> pdf_parser and look at the results for attrs.html. The document's size is
> reported as max_doc_size when the document has been trimmed. In this case,
> attrs.html is reported as 100K, when it's 155+K.
> 
> I'm not sure this is the best fix, but it seems to work. The document size
> is now reported as the size sent by the server (if available) or by stat()
> when retrieving locally. In particular, I don't know much about the library
> calls -- is st_size a field of all stat types?
> 
> -Geoff
> 
> Index: Document.cc
> ===================================================================
> RCS file: /opt/htdig/cvs/htdig3/htdig/Document.cc,v
> retrieving revision 1.34
> diff -r1.34 Document.cc
> 447a448,450
> >
> >     if (document_length < contentLength)
> >       document_length = contentLength;
> 598,599c601,602
> <     document_length = contents.length();
> <     contentLength = document_length;
> ---
> >     document_length = stat_buf.st_size;
> >     contentLength = contents.length();

For the sake of consistency and cleanliness, I'd make the patch to
RetrieveLocal more like the one to RetrieveHTTP, i.e. set
contentLength = stat_buf.st_size and document_length = contents.length(),
then, after the total number of bytes read is reported, add the same
if statement as above to increase document_length if necessary.
As it stands with your patch, RetrieveLocal will incorrectly report
the whole file size as the number of bytes read, even when the file
is larger than max_doc_size.
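
To make the ordering concrete, here is a minimal standalone sketch of the
bookkeeping described above. It is not the actual Document.cc code: the
DocSizes struct, the sizesAfterLocalRetrieve function and the bytes_read
field are hypothetical names used only for illustration, and truncation
to max_doc_size is simulated with substr().

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the size fields kept by the Document class.
struct DocSizes {
    std::size_t contentLength;    // size the source claims (st_size for local files)
    std::size_t document_length;  // size reported for the document
    std::size_t bytes_read;       // what the "bytes read" report would show
};

// fileSize   : value of stat_buf.st_size (assumed already obtained)
// contents   : full file body, truncated here to simulate max_doc_size
// maxDocSize : the max_doc_size limit
DocSizes sizesAfterLocalRetrieve(std::size_t fileSize,
                                 const std::string &contents,
                                 std::size_t maxDocSize)
{
    // Keep at most maxDocSize bytes, as a trimmed retrieval would.
    std::string kept = contents.substr(0, maxDocSize);
    assert(kept.length() <= maxDocSize);

    DocSizes s;
    s.contentLength = fileSize;
    s.document_length = kept.length();

    // Report the bytes actually read BEFORE adjusting document_length.
    s.bytes_read = s.document_length;

    // Same if statement as in the RetrieveHTTP part of the patch.
    if (s.document_length < s.contentLength)
        s.document_length = s.contentLength;
    return s;
}
```

With a 155000-byte file and a 100000-byte limit, this reports 100000
bytes read while document_length ends up as 155000.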

The setting of contentLength isn't critical at this point.  That's
something I added to RetrieveHTTP, so that it wouldn't try to read
more than the length specified by the Content-Length header - this
was a problem before when digging a particular MIT server.  I set
contentLength in RetrieveLocal, just so that future additions may
be able to make use of it, but right now nothing does.  It might
actually make sense to rewrite the reading loop in RetrieveLocal to
look just like the one in RetrieveHTTP, with the whole BytesToGo
business, again for consistency.
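
As a rough illustration of that kind of loop, here is a sketch of a
bounded read that counts down a bytesToGo value. It is an assumption
about the general shape, not a copy of RetrieveHTTP: the readBounded
name, the std::istream source and the 8192-byte buffer are all
hypothetical choices for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying the loop on in-memory data
#include <string>

// Hypothetical helper: read at most min(contentLength, maxDocSize) bytes.
std::string readBounded(std::istream &in,
                        std::size_t contentLength,
                        std::size_t maxDocSize)
{
    std::string contents;
    char buffer[8192];

    // Never ask for more than the declared length or the configured limit.
    std::size_t bytesToGo = std::min(contentLength, maxDocSize);

    while (bytesToGo > 0) {
        std::size_t want = std::min(bytesToGo, sizeof(buffer));
        in.read(buffer, static_cast<std::streamsize>(want));
        std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0)
            break;  // end of input or read error: stop early
        assert(got <= bytesToGo);
        contents.append(buffer, got);
        bytesToGo -= got;
    }
    return contents;
}
```

The same function would serve a local file (via std::ifstream) or any
other stream, which is the consistency argument in a nutshell: one loop,
one place where the limits are enforced.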

Yes, st_size should be guaranteed to be a part of struct stat - it's
been there since the very early days of UNIX.
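
For reference, a minimal sketch of reading st_size through POSIX stat().
The localFileSize name and the -1 error convention are hypothetical,
chosen just for this example.

```cpp
#include <cassert>
#include <sys/stat.h>

// Hypothetical helper: size in bytes of the file at `path`,
// or -1 if stat() fails (e.g. the file does not exist).
long long localFileSize(const char *path)
{
    assert(path != nullptr);
    struct stat stat_buf;
    if (stat(path, &stat_buf) != 0)
        return -1;
    // st_size has type off_t; widen it for a portable return type.
    return static_cast<long long>(stat_buf.st_size);
}
```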

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930