Re: Session Info Database API concepts [regarding wget and gsoc]

2008-03-26 Thread Siddhant Goel


 Thanks, Siddhant; I think this version is much improved over the
 previous attempt.


Thank *you*! I have revised it again and it's waiting for you! :)



 I'd probably try not to get into the details of what the session db file
 might look like; that's probably one of the things we'd need to define
 as part of our initial discussions. I'm guessing that the list of
 entries in your abstract are probably intended as only examples of the
 _kind_ of data that would go in there, rather than the particulars of
 how that data should look, but if I were you I'd say that explicitly.
 (Also, that sort of information is probably not so appropriate in the
 Abstract, which is usually intended to be a very high-level
 description of what the feature is).

I have now mentioned that the list I am providing only gives a fair
*idea* of what needs to be implemented.
Also, I agree that the list belongs in the How/Deliverables section
rather than in the Abstract. Corrected.



 To comment on some of the specific items you have in that list: I'm not
 sure that Download # will be a useful piece of information, except
 perhaps for use in identifying a resource across interleaved entries.

I forgot to mention this in the last proposal. I will use a download number
mainly to address the issue of redirects. Suppose a file is being
downloaded and a redirect happens: the current entry will be written to
the file, and a new entry will be started. The download # of this new
entry will be the value of chain_of_redirects for the previous entry.



 The local_filename and localpath entries seem redundant to me.

Me too. I do not know what got into my head when I made such a stupid
assumption. :( Corrected now.



 A boolean download_status strikes me as not informative enough for
 what we will want it for. It would be very useful to distinguish between
 a download that failed due to a 404 Not Found and one that failed due to
 a 403 Forbidden; as well as distinguishing between loss of connections,
 failure to connect, proxy-specific failures, and name resolution failures.

For this issue, I have made two sections in an entry. A download_status
value (which should be 1 if the download was successful, and 0 if it was
not), and a status_reason value (which, as I see it, will contain specific
codes for specific failure reasons. 1 could be for a 404, 2 could be for a
403, 3 could be for a name resolution error, and so on).
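
As a rough illustration of the two-field scheme described above, a sketch in C might look like the following. The enum name, its members and the two helper functions are hypothetical and not part of any agreed design; only the codes 1, 2 and 3 follow the examples given in the text.

```c
/* Hypothetical failure codes. The names and the values beyond 1..3 are
   assumptions made for illustration, not an agreed SIDB format. */
enum sidb_status_reason {
    SIDB_REASON_NONE = 0,        /* download succeeded */
    SIDB_REASON_HTTP_404,        /* 1: 404 Not Found */
    SIDB_REASON_HTTP_403,        /* 2: 403 Forbidden */
    SIDB_REASON_DNS_FAILURE,     /* 3: name resolution failed */
    SIDB_REASON_CONNECT_FAILED,  /* could not connect to the host */
    SIDB_REASON_CONNECTION_LOST, /* connection dropped mid-transfer */
    SIDB_REASON_PROXY_ERROR      /* proxy-specific failure */
};

/* Return a short human-readable label for a reason code. */
const char *sidb_status_reason_name(enum sidb_status_reason r)
{
    switch (r) {
    case SIDB_REASON_NONE:            return "ok";
    case SIDB_REASON_HTTP_404:        return "404 Not Found";
    case SIDB_REASON_HTTP_403:        return "403 Forbidden";
    case SIDB_REASON_DNS_FAILURE:     return "name resolution failure";
    case SIDB_REASON_CONNECT_FAILED:  return "failure to connect";
    case SIDB_REASON_CONNECTION_LOST: return "connection lost";
    case SIDB_REASON_PROXY_ERROR:     return "proxy failure";
    }
    return "unknown";
}

/* The boolean download_status can be derived from the reason code,
   so it would not strictly need to be stored as a separate field. */
int sidb_download_ok(enum sidb_status_reason r)
{
    return r == SIDB_REASON_NONE;
}
```

One design point this makes visible: once a reason code exists, the 1/0 download_status carries no extra information.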



 All in all, though, it's best not to focus on the details of what
 entries might _look_ like, but how they will enable Wget to _use_ them.
 That is, rather than saying that the session db will have localpath or
 chain_of_redirects entries, it's more productive to say that Wget will
 be able to find the path of the locally-downloaded file
 corresponding to any URI from the single URI or chain of redirected URIs
 that; or that Wget will determine what files remain to be downloaded
 and/or parsed to continue where a previous session left off.

This information should now be clear, as I have explicitly mentioned each
and every use for each and every section in an entry, along with providing a
simple procedure for a lookup at the end of the How/Deliverables section.



 It'd be really nice to touch on a little bit of _how_ Wget will handle
 looking up a local pathname from a chain of redirects (i.e., if you're
 not going to do it by using grep for each download, how _will_ you
 accomplish it?).

Yes. It would surely be nice. I have added a simple procedure in the
How/Deliverables section, to depict what's going on in my mind when I am
thinking of a solution to this feature.



 It would also be good to talk a little about compatibility issues, such
 as how Wget could handle session dbs that were generated from newer
 versions of Wget, that might specify information that Wget doesn't know
 how to process.

I didn't quite get that. My proposal describes a small code addition to
wget in its future releases, so older versions wouldn't have a session info
feature. Please correct me if I'm wrong.



 Any sample code from anything you've worked on in the past, even if it's
 incomplete or now embarrasses you (that's not a problem: just explain
 how you would improve it) would be very helpful; something to help
 determine whether you possess the appropriate skillset to handle the
 task you're proposing to implement.

I had submitted a small project back in school, 3 years ago, that
implemented a school management system. It was written entirely in C++, it
was around 1700 lines of code, and now I am not able to find it. :( :( :(
I am still searching for it though; it might also be there at my school. I
will update you on it as soon as I find it. Other than that, I am afraid
there isn't much *code* I could present. Though if required, I can mention
the exact details of what all I intend to fill into sidb.h and sidb.c files
I have mentioned. Should I include that in the 

Re: Session Info Database API concepts [regarding wget and gsoc]

2008-03-26 Thread Siddhant Goel
Sorry I forgot to provide the link of my revised proposal, in the reply.
You can see it here - http://docs.google.com/Doc?id=ddkwv4gn_3dgvxf5c6
Regards,
Siddhant


Re: Session Info Database API concepts [regarding wget and gsoc]

2008-03-25 Thread Siddhant Goel
Hi.
I guess my concepts were twisted. I have made a new proposal, incorporating
what all you have suggested me, and what all I have been able to grasp.
I have proposed my solution here -
http://siddhantgoel.googlepages.com/gsoc_gnu.pdf .
Please tell me if this one is fine, or if something needs to be
added/removed/edited.
Thanks for your time.
Regards,
Siddhant


Re: Session Info Database API concepts [regarding wget and gsoc]

2008-03-25 Thread Micah Cowan
Siddhant Goel wrote:
 Hi.
 I guess my concepts were twisted. I have made a new proposal,
 incorporating what all you have suggested me, and what all I have been
 able to grasp.
 I have proposed my solution here -
 http://siddhantgoel.googlepages.com/gsoc_gnu.pdf .
 Please tell me if this one is fine, or if something needs to be
 added/removed/edited.

Thanks, Siddhant; I think this version is much improved over the
previous attempt.

I'd probably try not to get into the details of what the session db file
might look like; that's probably one of the things we'd need to define
as part of our initial discussions. I'm guessing that the list of
entries in your abstract are probably intended as only examples of the
_kind_ of data that would go in there, rather than the particulars of
how that data should look, but if I were you I'd say that explicitly.
(Also, that sort of information is probably not so appropriate in the
Abstract, which is usually intended to be a very high-level
description of what the feature is).

To comment on some of the specific items you have in that list: I'm not
sure that Download # will be a useful piece of information, except
perhaps for use in identifying a resource across interleaved entries.
The local_filename and localpath entries seem redundant to me. And
chain_of_redirects doesn't map very well to the expectation that Wget
write available information as soon as it can, so that as much
information is available as possible when it is read back, even if Wget
was interrupted. The chain of redirects will probably be better
implemented as separate entries for each redirect.
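
To make the "write it as soon as you know it" idea concrete, here is a minimal sketch in C. The function name and the one-line "id kind value" record layout are assumptions for illustration only; the actual file format is still to be defined.

```c
#include <stdio.h>

/* Append one record and flush it immediately, so that everything known
   so far is already in the stream's file if the process is interrupted.
   The "id kind value" layout is hypothetical. The id lets a reader tie
   interleaved records back to the same resource.
   Returns 0 on success, -1 on a write or flush error. */
int sidb_sketch_append(FILE *out, unsigned id, const char *kind,
                       const char *value)
{
    if (fprintf(out, "%u %s %s\n", id, kind, value) < 0)
        return -1;
    return fflush(out) == 0 ? 0 : -1;
}
```

Under this sketch each redirect would simply be one more call, for example sidb_sketch_append(f, 1, "redirect", uri), rather than a field that can only be written once the whole chain is known.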

A boolean download_status strikes me as not informative enough for
what we will want it for. It would be very useful to distinguish between
a download that failed due to a 404 Not Found and one that failed due to
a 403 Forbidden; as well as distinguishing between loss of connections,
failure to connect, proxy-specific failures, and name resolution failures.

All in all, though, it's best not to focus on the details of what
entries might _look_ like, but how they will enable Wget to _use_ them.
That is, rather than saying that the session db will have localpath or
chain_of_redirects entries, it's more productive to say that Wget will
be able to find the path of the locally-downloaded file
corresponding to any URI from the single URI or chain of redirected URIs
that; or that Wget will determine what files remain to be downloaded
and/or parsed to continue where a previous session left off.

It'd be really nice to touch on a little bit of _how_ Wget will handle
looking up a local pathname from a chain of redirects (i.e., if you're
not going to do it by using grep for each download, how _will_ you
accomplish it?).

It would also be good to talk a little about compatibility issues, such
as how Wget could handle session dbs that were generated from newer
versions of Wget, that might specify information that Wget doesn't know
how to process.
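
One common way to get that kind of forward compatibility is for the reader to skip fields it does not recognize instead of failing. The sketch below assumes, purely for illustration, a "key: value" line format and a made-up set of known keys; neither comes from an agreed spec.

```c
#include <string.h>

/* Keys this hypothetical reader version understands. */
static const char *const sketch_known_keys[] = {
    "uri", "redirect", "local-path"
};

/* Classify one "key: value" line.
   Returns 1 if the key is known, 0 if it is unknown (the caller would
   skip the line, since it may come from a newer Wget), and -1 if the
   line has no key before a colon at all. */
int sidb_sketch_classify_line(const char *line)
{
    const char *colon = strchr(line, ':');
    size_t i, len;

    if (colon == NULL || colon == line)
        return -1;
    len = (size_t)(colon - line);
    for (i = 0; i < sizeof sketch_known_keys / sizeof sketch_known_keys[0]; i++) {
        if (strlen(sketch_known_keys[i]) == len
            && strncmp(line, sketch_known_keys[i], len) == 0)
            return 1;
    }
    return 0;
}
```

The point of returning 0 rather than an error for an unknown key is that an older reader can still use the parts of the file it does understand.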

Any sample code from anything you've worked on in the past, even if it's
incomplete or now embarrasses you (that's not a problem: just explain
how you would improve it) would be very helpful; something to help
determine whether you possess the appropriate skillset to handle the
task you're proposing to implement.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: Session Info Database API concepts [regarding wget and gsoc]

2008-03-23 Thread Siddhant Goel
Hi!
Thanks for such a detailed explanation of the task involved. It surely did
help in creating a solid base. I have already started working on it.
I have also prepared the following proposal for GSoC -
http://siddhantgoel.googlepages.com/gsoc_gnu.pdf
It would be great if you could check it and tell me what all I need to
edit/remove/add in it, to make it better.
Thanks again!
Siddhant


Re: Session Info Database API concepts [regarding wget and gsoc]

2008-03-23 Thread Micah Cowan
Siddhant Goel wrote:
 Hi!
 Thanks for such a detailed explanation of the task involved. It surely
 did help in creating a solid base. I have already started working on it.
 I have also prepared the following proposal for GSoC -
 http://siddhantgoel.googlepages.com/gsoc_gnu.pdf
 It would be great if you could check it and tell me what all I need to
 edit/remove/add in it, to make it better.

Alright, here are some of my thoughts.

The abstract, in describing what it is you wish to build, includes a
lengthy quote from the wiki's description, describing what sorts of
things could go in it, but not particularly what it is or how it is
useful. I'd focus on those rather than the list you have; particularly
since they describe a number of things you don't plan to implement in
this version.

The How/Deliverables section is awfully vague, which is a problem
because this is the section we'd want to be specific enough to determine
whether you're on-target at your midterm and final evaluations. Talking
about creating a structure for an entry, and functions to create new
entries or write them to files, provides virtually no information about
what facilities you're actually trying to build. It's also not at all
clear to me what difference you intend for sidb.c versus sidb_functions.c.

Similarly, in the Timeline section, you have:

- Up to May 26 : Get more comfortable with the source code, discuss things
out with mentor, and agree with him on the exact code to be written.
Happy mentor, happy student, nice code. :)
- Start!
- May 26 - Jun 15 : Finish the sidb.h and sidb_functions.c parts
- Jun 15 - July 5 : Write sidb.c
- July 5 - August 1 : Interface sidb.c with the rest of the Wget
source code
- August 1 - August 15 : Check code, fix bugs, improve documentation
- Finish!

This, too, says virtually nothing. Heck, I could hack up a sidb.h,
sidb_functions.c and sidb.c in about 30 seconds; they wouldn't _do_
anything, but then, there isn't really a clear idea what they _should_
do, from your proposal. Rather than talk about .h files and .c files,
and entry structures, which are implementation details, you should try
to focus on what your work will have actually accomplished by those
dates. Useful milestones might be:

 - Basic session info writing implemented (starts and ends of downloads,
redirects and local filenames recorded).
 - Basic session info reading implemented (small program reads back
data, maps URIs to local paths, adhering to execution time constraints
as outlined in spec).
 - Wget writes session databases (configurable via command-line
options), and is able to read them back in to determine what filenames
to check for timestamping or --continue.
 - Wget is able to continue aborted sessions using configuration and
last-known state from a session info db file.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: Session Info Database API concepts [regarding wget and gsoc]

2008-03-21 Thread Micah Cowan
Charles wrote:
 On Fri, Mar 21, 2008 at 2:33 AM, Micah Cowan [EMAIL PROTECTED] wrote:
  The intent is that all the writer operations would take virtually no
  time at all. The sidb_read function should take at most O(N log N) time
  on the size of the SIDB file, and should take less than a second under
  normal circumstances on typical machines
 
 What encoding will this session database file use?

I was thinking ASCII, probably; with filenames and URIs percent-encoded
as necessary. Other values should probably follow suit (or else use
MIME's mechanisms for character encoding within headers, where that's
appropriate).

Probably direct byte values for non-ASCII values would be accepted (but
not generated), and assumed where possible to be appropriate for their
context, except when there's a danger of Wget sending malformed data.
NUL should be disallowed, though, because it's probably too much work to
support those in our current code (they would be necessary to some
character encodings if represented directly).
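
For illustration, percent-encoding a filename or URI into plain ASCII could be sketched in C as below. The function name is hypothetical, and the choice of which bytes stay literal (the RFC 3986 unreserved characters plus '/') is an assumption; a real implementation would have to decide that per field.

```c
#include <stddef.h>

/* Bytes written as-is: RFC 3986 "unreserved" characters plus '/'.
   Keeping '/' literal is an assumption made for readability of paths. */
static int sketch_is_literal(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
        || (c >= '0' && c <= '9')
        || c == '-' || c == '.' || c == '_' || c == '~' || c == '/';
}

/* Percent-encode the NUL-terminated string src into dst, which has room
   for dstsize bytes including the terminating NUL.
   Returns the encoded length, or (size_t)-1 if dst is too small. */
size_t sidb_sketch_pct_encode(const char *src, char *dst, size_t dstsize)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t n = 0;

    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        size_t need = sketch_is_literal(c) ? 1 : 3;

        if (n + need + 1 > dstsize)
            return (size_t)-1;
        if (need == 1) {
            dst[n++] = (char)c;
        } else {
            dst[n++] = '%';
            dst[n++] = hex[c >> 4];
            dst[n++] = hex[c & 0x0F];
        }
    }
    if (n + 1 > dstsize)
        return (size_t)-1;
    dst[n] = '\0';
    return n;
}
```

With this sketch the output contains only ASCII bytes and never a raw NUL, which matches the "easy to edit by hand" goal.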

The idea is that it should be as absolutely easy as possible to manually
edit.

YAML uses UTF-8; I'm beginning to think YAML may not be what we want,
though, given that the definition for a given entry may be interposed
with defining content for other entries; I don't want to kludge that by
suffixing the names or something. I'll have to look more into YAML to
see how doable that is.

If we did end up using YAML, then obviously we wouldn't accept arbitrary
byte values: it'd be UTF-8 only.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/


Re: Session Info Database API concepts [regarding wget and gsoc]

2008-03-21 Thread Charles
On Sat, Mar 22, 2008 at 12:14 AM, Micah Cowan [EMAIL PROTECTED] wrote:
  YAML uses UTF-8; I'm beginning to think YAML may not be what we want,
  though, given that the definition for a given entry may be interposed
  with defining content for other entries; I don't want to kludge that by
  suffixing the names or something. I'll have to look more into YAML to
  see how doable that is.

YAML handles hierarchical data, for example:
Session:
  - date: 10102008
  - files:
    -
      url: http://...
  - headers:
    - status: 200

  If we did end up using YAML, then obviously we wouldn't accept arbitrary
  byte values: it'd be UTF-8 only.

Doesn't UTF-8 write the same byte values as ASCII encoding for ASCII
characters? So we can pretend when reading the session file that it is
in UTF-8 while using ASCII when writing to it.
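
A small C sketch of that observation: any buffer made up only of bytes 0x01 to 0x7F is at once valid ASCII and valid UTF-8, so a writer that passes this check produces a file either kind of reader can accept. The function name is hypothetical, and rejecting NUL is an extra assumption borrowed from the earlier remark in this thread that NUL should be disallowed.

```c
#include <stddef.h>

/* Return 1 if every byte in buf is a non-NUL 7-bit ASCII byte, else 0.
   Such a buffer is byte-for-byte identical in ASCII and in UTF-8. */
int sidb_sketch_is_plain_ascii(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] == 0 || buf[i] > 0x7F)
            return 0;
    }
    return 1;
}
```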


Session Info Database API concepts [regarding wget and gsoc]

2008-03-20 Thread Micah Cowan
(I've rearranged your post slightly so that the answers are ordered more
conveniently for me.)

Siddhant Goel wrote:
 Hi.
 I couldn't gather much from the wget source code; so want to put up some
 questions.
 1. Does this necessarily have to be a change in the code of wget? Or
 something separate?

I don't see how a separate program could get all the necessary
information to put this together; particularly things like content-type,
etc. Perhaps some of the top use cases might be implemented with some
sort of wrapper around log files, but even there, without access to
internal Wget structures, I don't see how even file-existence checks
could be done (since the external program would have to know what Wget
was considering downloading).

 One thing I had in mind. Does grepping the database file, each time a
 download has to be made, sound good? This way you could check easily for
 every issue mentioned here -
 http://wget.addictivecode.org/FeatureSpecifications/MetaDataBase .

Using grep each time Wget has to decide whether to download would be
absolutely hideous, in terms of efficiency. Wget would have to fork a
new process for grep, which would then read the entire database file on
each download. Other operations might require Wget to invoke grep
multiple times.

OTOH, if Wget builds appropriate data structures (hash tables and the
like), it can determine in constant time whether a given URL was already
downloaded in a previous session.
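
As a minimal sketch of that approach, here is a small chained hash set of URLs in C. All names are hypothetical and the fixed bucket count is an arbitrary assumption; this is not Wget's own hash table code, only an illustration of why the lookup is constant time on average.

```c
#include <stdlib.h>
#include <string.h>

#define SKETCH_BUCKETS 1024   /* arbitrary size chosen for the sketch */

struct sketch_node {
    char *url;
    struct sketch_node *next;
};

/* A zero-initialized set is an empty set. */
struct sketch_set {
    struct sketch_node *buckets[SKETCH_BUCKETS];
};

/* djb2 string hash. */
static unsigned long sketch_hash(const char *s)
{
    unsigned long h = 5381;

    while (*s != '\0')
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Return 1 if url is in the set, else 0. Average cost is independent
   of how many URLs are stored, as long as the chains stay short. */
int sketch_set_contains(const struct sketch_set *set, const char *url)
{
    const struct sketch_node *n = set->buckets[sketch_hash(url) % SKETCH_BUCKETS];

    for (; n != NULL; n = n->next) {
        if (strcmp(n->url, url) == 0)
            return 1;
    }
    return 0;
}

/* Add url to the set. Returns 1 if added, 0 if it was already present,
   and -1 if memory could not be allocated. */
int sketch_set_add(struct sketch_set *set, const char *url)
{
    struct sketch_node *n;
    unsigned long b;
    size_t len;

    if (sketch_set_contains(set, url))
        return 0;
    n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    len = strlen(url) + 1;
    n->url = malloc(len);
    if (n->url == NULL) {
        free(n);
        return -1;
    }
    memcpy(n->url, url, len);
    b = sketch_hash(url) % SKETCH_BUCKETS;
    n->next = set->buckets[b];
    set->buckets[b] = n;
    return 1;
}
```

The file would be read once to fill the set; each later "was this URL already downloaded?" question is then a single in-memory lookup, with no external process involved.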

No, grep is not a viable option. Besides, wget is intended to run on
non-Unixen (notably, Windows, and VMS too, at least to some degree);
requiring grep (or other external utilities) is not desirable.

 2. If it does, could you please tell me the files I should look into, to
 get started. I have the general solution in mind, but to implement I
 need to know where to start. :)

It would be wholly new code, so it'd be in a new module. You'd need to
interface with it from existing code, naturally; but to start, one might
wish to design and code the module completely apart from Wget, and write
little test drivers to demonstrate how it might be used from within Wget.

For instance, you might start by writing functions with prototypes like:

  /** SIDB Writer Facilities **/
  sidb_writer *
  sidb_write_start(const char *filename);

  sidb_writer_entry *
  sidb_writer_entry_new(sidb_writer *w, const char *uri);

  void
  sidb_writer_entry_redirect(sidb_writer_entry *rw, const char *uri,
                             int redirect_http_status_code);

  void
  sidb_writer_entry_local_path(sidb_writer_entry *rw, const char *fname);

  void
  sidb_entry_finish(sidb_writer_entry *rw);

  void
  sidb_write_end(sidb_writer *w);

  /** SIDB Reader Facilities **/
  sidb_reader *
  sidb_read(const char *filename);

  sidb_entry *
  sidb_lookup_uri(struct sidb_reader *, const char *uri);

  const char *
  sidb_entry_get_local_name(sidb_entry *e);

sidb for Session Info DataBase. I've left out error code returns; it
seems to me that wget will not normally want to terminate just because
an error occurred in writing the session info database; we can ask the
sidb modules to spew warnings automatically when errors occur by giving
it flags, or supply it with an error callback function, etc. For
situations where we do want wget to immediately abort for sidb errors
(for instance, continuing a session), it could check a sidb_error
function or some such.
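
The error-handling idea in the paragraph above could be sketched in C roughly as follows. Every name here is hypothetical (including sidb_sketch_error, which only stands in for the "sidb_error function or some such"); it shows one possible shape, not a decided API.

```c
#include <stddef.h>

/* Optional callback the caller may register to be told about errors. */
typedef void (*sidb_sketch_error_fn)(const char *message, void *userdata);

struct sidb_sketch_errors {
    sidb_sketch_error_fn on_error; /* may be NULL: errors stay silent */
    void *userdata;                /* passed back to on_error unchanged */
    int last_error;                /* 0 means no error recorded */
};

/* Called by the (hypothetical) sidb module when something goes wrong:
   remember the code, notify the callback if any, and never abort. */
void sidb_sketch_report(struct sidb_sketch_errors *e, int code,
                        const char *message)
{
    e->last_error = code;
    if (e->on_error != NULL)
        e->on_error(message, e->userdata);
}

/* Lets callers that do care, such as when continuing a session,
   check whether an error has occurred. */
int sidb_sketch_error(const struct sidb_sketch_errors *e)
{
    return e->last_error;
}

/* Example callback: counts errors in the int that userdata points to. */
void sidb_sketch_count_errors(const char *message, void *userdata)
{
    (void)message;
    ++*(int *)userdata;
}
```

This keeps the writer functions returning void, as in the prototypes, while still giving the caller two ways to find out about failures.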

The intent is that all the writer operations would take virtually no
time at all. The sidb_read function should take at most O(N log N) time
on the size of the SIDB file, and should take less than a second under
normal circumstances on typical machines, for a file with entries for a
thousand web resources. Thereafter, the other SIDB reader operations
should take virtually no time at all.

The sidb_lookup_uri should be able to find an entry based on either the
URI that was specified to the corresponding call to
sidb_writer_entry_new, _or_ by any URI that was added to a resource
entry via sidb_writer_entry_redirect.
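
One way to get both properties (lookup by any URI of a resource, and a reader that builds its index in roughly O(N log N)) is a sorted array of URI aliases that all point at the same entry. The C sketch below uses hypothetical names and is not a proposed implementation; note also that the C standard does not promise qsort's complexity, although typical implementations are O(N log N).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory entry: just the local file name here. */
struct sketch_entry {
    const char *local_name;
};

/* One alias per URI. The original URI and every redirect target of a
   resource each get an alias pointing at the same entry. */
struct sketch_alias {
    const char *uri;
    struct sketch_entry *entry;
};

static int sketch_alias_cmp(const void *a, const void *b)
{
    const struct sketch_alias *x = a;
    const struct sketch_alias *y = b;

    return strcmp(x->uri, y->uri);
}

/* Sort the aliases by URI once, after the file has been read. */
void sketch_index_sort(struct sketch_alias *aliases, size_t n)
{
    qsort(aliases, n, sizeof *aliases, sketch_alias_cmp);
}

/* Binary search over the sorted aliases: O(log N) per lookup.
   Returns the entry, or NULL if the URI is not in the index. */
struct sketch_entry *sketch_lookup_uri(const struct sketch_alias *aliases,
                                       size_t n, const char *uri)
{
    struct sketch_alias key;
    const struct sketch_alias *hit;

    key.uri = uri;
    key.entry = NULL;
    hit = bsearch(&key, aliases, n, sizeof *aliases, sketch_alias_cmp);
    return hit != NULL ? hit->entry : NULL;
}
```

A hash table would make each lookup constant time on average instead; the sorted array is shown only because it maps directly onto the stated O(N log N) bound.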

Interposing writes to different entries should be allowed and explicitly
tested (to prepare the way for multiple simultaneous downloads in the
future). That is, the following should be valid:

  sidb_writer *sw = sidb_write_start(".wget-sidb");
  sidb_writer_entry *foo, *bar;
  foo = sidb_writer_entry_new(sw, "http://example.com/foo");
  bar = sidb_writer_entry_new(sw, "http://example.com/bar");
  /* Add info to foo entry (301 is just an example status code). */
  sidb_writer_entry_redirect(foo,
    "http://newsite.example.com/news.html", 301);
  /* Add to bar entry. */
  sidb_writer_entry_local_path(bar, "example.com/bar");
  /* Add to foo entry again. */
  sidb_writer_entry_local_path(foo, "newsite.example.com/news.html");
  sidb_entry_finish(foo);
  sidb_entry_finish(bar);
  sidb_write_end(sw);

On reading back the information, calling sidb_lookup_uri for either
"http://example.com/foo" or "http://newsite.example.com/news.html"
should both give 

Re: Session Info Database API concepts [regarding wget and gsoc]

2008-03-20 Thread Charles
On Fri, Mar 21, 2008 at 2:33 AM, Micah Cowan [EMAIL PROTECTED] wrote:
  The intent is that all the writer operations would take virtually no
  time at all. The sidb_read function should take at most O(N log N) time
  on the size of the SIDB file, and should take less than a second under
  normal circumstances on typical machines

What encoding will this session database file use?