(I've rearranged your post slightly so that the answers are ordered more
conveniently for me.)

Siddhant Goel wrote:
> Hi.
> I couldn't gather much from the wget source code; so want to put up some
> questions.
> 1. Does this necessarily have to be a change in the code of wget? Or
> something separate?

I don't see how a separate program could get all the necessary
information to put this together, particularly things like content-type,
etc. Perhaps some of the top use cases might be implemented with some
sort of wrapper around log files, but even there, without access to
internal Wget structures, I don't see how even file-existence checks
could be done (since the external program would have to know what Wget
was considering downloading).

> One thing I had in mind. Does grepping the "database" file, each time a
> download has to be made, sound good? This way you could check easily for
> every issue mentioned here -
> http://wget.addictivecode.org/FeatureSpecifications/MetaDataBase .

Using grep each time Wget has to decide whether to download would be
absolutely hideous, in terms of efficiency. Wget would have to fork a
new process for grep, which would then read the entire database file on
each download. Other operations might require Wget to invoke grep
multiple times.

OTOH, if Wget builds appropriate data structures (hash tables and the
like), it can determine in constant time whether a given URL was already
downloaded in a previous session.
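
To make that concrete, here is a rough sketch of the in-memory side. It
reuses Wget's existing hash-table module (hash.c/hash.h); I'm quoting
its calls from memory, so the exact signatures are worth double-checking
against hash.h:

  /* Sketch only: meant to live inside Wget's tree, reusing hash.c.
     (Signatures quoted from memory; verify against hash.h.) */
  #include "hash.h"

  static struct hash_table *downloaded_set;  /* URI -> local file name */

  /* Record one entry while reading the session database at startup. */
  static void
  remember_downloaded (const char *uri, const char *local_name)
  {
    if (!downloaded_set)
      downloaded_set = make_string_hash_table (0);
    /* Real code would xstrdup both strings so the table owns them. */
    hash_table_put (downloaded_set, uri, (void *) local_name);
  }

  /* Per-URL decision: one constant-time probe, no fork, no grep. */
  static int
  already_downloaded (const char *uri)
  {
    return downloaded_set && hash_table_contains (downloaded_set, uri);
  }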

No, grep is not a viable option. Besides, wget is intended to run on
non-Unixen (notably Windows, and to some degree VMS); requiring grep
(or other external utilities) is not desirable.

> 2. If it does, could you please tell me the files I should look into, to
> get started. I have the general solution in mind, but to implement I
> need to know where to start. :)

It would be wholly new code, so it'd be in a new module. You'd need to
interface with it from existing code, naturally; but to start, one might
wish to design and code the module completely apart from Wget, and write
little test drivers to demonstrate how it might be used from within Wget.

For instance, you might start by writing functions with prototypes like:

  /** SIDB Writer Facilities **/
  sidb_writer *
  sidb_write_start(const char *filename);

  sidb_writer_entry *
  sidb_writer_entry_new(sidb_writer *w, const char *uri);

  void
  sidb_writer_entry_redirect(sidb_writer_entry *rw, const char *uri,
    int redirect_http_status_code);

  void
  sidb_writer_entry_local_path(sidb_writer_entry *rw,
    const char *fname);

  void
  sidb_entry_finish(sidb_writer_entry *rw);

  void
  sidb_write_end(sidb_writer *w);

  /** SIDB Reader Facilities **/
  sidb_reader *
  sidb_read(const char *filename);

  sidb_entry *
  sidb_lookup_uri(sidb_reader *r, const char *uri);

  const char *
  sidb_entry_get_local_name(sidb_entry *e);

"sidb" for "Session Info DataBase". I've left out error code returns; it
seems to me that wget will not normally want to terminate just because
an error occurred in writing the session info database; we can ask the
sidb modules to spew warnings automatically when errors occur by giving
it flags, or supply it with an error callback function, etc. For
situations where we do want wget to immediately abort for sidb errors
(for instance, continuing a session), it could check a sidb_error
function or some such.
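
Just to illustrate that last point, the hooks might look something like
the following; every name here (apart from sidb_error, mentioned above)
is hypothetical and not part of the prototype list:

  /** Hypothetical error-reporting hooks -- nothing here is settled. **/

  /* A flag that could be passed to sidb_write_start (or a variant of it)
     to make the module print its own warnings when a write fails. */
  #define SIDB_WARN_ON_ERROR 0x01

  typedef void (*sidb_error_callback)(const char *message, void *data);

  void
  sidb_writer_set_error_callback(sidb_writer *w,
    sidb_error_callback cb, void *data);

  /* Nonzero if any error has occurred on this writer so far; wget would
     poll this only when it must abort (e.g. when continuing a session). */
  int
  sidb_error(sidb_writer *w);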

The intent is that all the writer operations would take virtually no
time at all. The sidb_read function should take at most O(N log N) time
in the size of the SIDB file, and under normal circumstances, on typical
machines, it should take less than a second for a file with entries for
a thousand web resources. Thereafter, the other SIDB reader operations
should take virtually no time at all.

The sidb_lookup_uri function should be able to find an entry based
either on the URI that was specified in the corresponding call to
sidb_writer_entry_new, _or_ on any URI that was added to the resource
entry via sidb_writer_entry_redirect.

Interleaving writes to different entries should be allowed and explicitly
tested (to prepare the way for multiple simultaneous downloads in the
future). That is, the following should be valid:

  sidb_writer *sw = sidb_write_start(".wget-sidb");
  sidb_writer_entry *foo, *bar;
  foo = sidb_writer_entry_new(sw, "http://example.com/foo");
  bar = sidb_writer_entry_new(sw, "http://example.com/bar");
  /* Add info to foo entry. */
  sidb_writer_entry_redirect(foo,
    "http://newsite.example.com/news.html", 301);
  /* Add to bar entry. */
  sidb_writer_entry_local_path(bar, "example.com/bar");
  /* Add to foo entry again. */
  sidb_writer_entry_local_path(foo, "newsite.example.com/news.html");
  sidb_entry_finish(foo);
  sidb_entry_finish(bar);
  sidb_write_end(sw);

On reading back the information, calling sidb_lookup_uri with either
"http://example.com/foo" or "http://newsite.example.com/news.html"
should find the same entry, whose local name is
"newsite.example.com/news.html".
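
In code, using only the reader prototypes above (error checks omitted):

  sidb_reader *sr = sidb_read(".wget-sidb");
  sidb_entry *a = sidb_lookup_uri(sr, "http://example.com/foo");
  sidb_entry *b = sidb_lookup_uri(sr, "http://newsite.example.com/news.html");
  /* a and b refer to the same resource entry, so both of these calls
     return "newsite.example.com/news.html": */
  const char *name_a = sidb_entry_get_local_name(a);
  const char *name_b = sidb_entry_get_local_name(b);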

(Note: failure to allocate appropriate resources in the call to
sidb_write_start should nevertheless return a valid sidb_writer, and
calls to other writer operations using that handle remain valid, even if
they don't actually do anything.)
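
One simple way to get that behavior (a sketch of possible internals, not
a prescription) is to have a failed sidb_write_start hand back an inert
handle that every later operation recognizes:

  #include <stdio.h>
  #include <stdlib.h>

  typedef struct sidb_writer {
    FILE *fp;   /* NULL means "inert": all later operations do nothing */
  } sidb_writer;

  static sidb_writer inert_writer = { NULL };

  sidb_writer *
  sidb_write_start(const char *filename)
  {
    sidb_writer *w = malloc(sizeof *w);
    if (!w)
      return &inert_writer;         /* still a valid, if useless, handle */
    w->fp = fopen(filename, "w");   /* fp stays NULL on failure: inert */
    return w;
  }

  void
  sidb_write_end(sidb_writer *w)
  {
    if (w->fp)                      /* inert handles are simply skipped */
      fclose(w->fp);
    if (w != &inert_writer)
      free(w);
  }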

The following should also result in a well-formed SIDB file:

  sidb_writer *sw = sidb_write_start(".wget-sidb");
  sidb_writer_entry *foo
    = sidb_writer_entry_new(sw, "http://example.com/foo");
  sidb_writer_entry_local_path(foo, "example.com/foo.html");
  raise(SIGKILL); /* Die with uncatchable signal. */

The only difference between the above and the same code with appropriate
cleanup (especially sidb_write_end(sw)) is that wget will handle the
former under the assumption that the session never finished downloading
foo.html; when support for continued sessions is added, wget would
automatically attempt to resume the download of foo.html.
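
That implies the reader side would eventually need a way to ask whether
an entry (and the file as a whole) was finished cleanly. Something along
these lines, where both new functions and the resume_download helper are
hypothetical:

  /** Hypothetical reader additions for continued sessions. **/
  int
  sidb_entry_is_finished(sidb_entry *e);     /* saw sidb_entry_finish? */

  int
  sidb_reader_is_complete(sidb_reader *r);   /* saw sidb_write_end? */

  /* Roughly how wget might use them when resuming: */
  sidb_reader *sr = sidb_read(".wget-sidb");
  sidb_entry *e = sidb_lookup_uri(sr, "http://example.com/foo");
  if (e && !sidb_entry_is_finished(e))
    resume_download(sidb_entry_get_local_name(e));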

Alright, that should be more than enough information for someone to
start coding it.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/