Re: Session Info Database API concepts [regarding wget and gsoc]
> Thanks, Siddhant; I think this version is much improved over the previous attempt.

Thank *you*! I have revised it again and it's waiting for you! :)

> I'd probably try not to get into the details of what the session db file might look like; that's probably one of the things we'd need to define as part of our initial discussions. I'm guessing that the list of entries in your abstract is probably intended as only examples of the _kind_ of data that would go in there, rather than the particulars of how that data should look, but if I were you I'd say that explicitly. (Also, that sort of information is probably not so appropriate in the Abstract, which is usually intended to be a very high-level description of what the feature is).

I have now mentioned that the list I am providing gives a fair enough *idea* of what needs to be implemented. Also, I agree that the list should be in the How/Deliverables section rather than in the Abstract. Corrected.

> To comment on some of the specific items you have in that list: I'm not sure that Download # will be a useful piece of information, except perhaps for use in identifying a resource across interleaved entries.

I forgot to mention this in the last proposal. I will use a download number mainly to address the issue of redirects. Suppose a file is being downloaded and a redirect happens: the current entry will be written to the file, and a new entry will start being written. The download # of this new entry will be the value of chain_of_redirects for the previous entry.

> The local_filename and localpath entries seem redundant to me.

Me too. I do not know what got into my head when I made such a stupid assumption. :( Corrected now.

> A boolean download_status strikes me as not informative enough for what we will want it for. It would be very useful to distinguish between a download that failed due to a 404 Not Found and one that failed due to a 403 Forbidden; as well as distinguishing between loss of connections, failure to connect, proxy-specific failures, and name resolution failures.

For this issue, I have made two sections in an entry: a download_status value (which should be 1 if the download was successful, and 0 if it was not), and a status_reason value (which, as I see it, will contain specific codes for specific failure reasons: 1 could be for a 404, 2 could be for a 403, 3 could be for a name resolution error, and so on).

> All in all, though, it's best not to focus on the details of what entries might _look_ like, but how they will enable Wget to _use_ them. That is, rather than saying that the session db will have localpath or chain_of_redirects entries, it's more productive to say that Wget will be able to find the path of the locally-downloaded file corresponding to any URI from the single URI or chain of redirected URIs that led to it; or that Wget will determine what files remain to be downloaded and/or parsed to continue where a previous session left off.

This information should now be clear, as I have explicitly mentioned the use of every section in an entry, and have provided a simple lookup procedure at the end of the How/Deliverables section.

> It'd be really nice to touch on a little bit of _how_ Wget will handle looking up a local pathname from a chain of redirects (i.e., if you're not going to do it by using grep for each download, how _will_ you accomplish it?).

Yes, it would surely be nice. I have added a simple procedure in the How/Deliverables section to show what I have in mind as a solution for this feature.
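As a minimal sketch of the download_status / status_reason split described above: the thread only suggests the values 1 (404), 2 (403) and 3 (name resolution error), so every identifier below and every other code value is a hypothetical placeholder, not something taken from the proposal or from Wget.

```c
#include <stddef.h>

/* Hypothetical reason codes. Only the values 1, 2 and 3 are suggested in
   the thread; the remaining values are placeholders for the other failure
   classes that were mentioned. */
enum sidb_status_reason {
    SIDB_REASON_NONE = 0,            /* download succeeded */
    SIDB_REASON_HTTP_404 = 1,
    SIDB_REASON_HTTP_403 = 2,
    SIDB_REASON_DNS_FAILURE = 3,
    SIDB_REASON_CONNECT_FAILED = 4,
    SIDB_REASON_CONNECTION_LOST = 5,
    SIDB_REASON_PROXY_FAILURE = 6,
    SIDB_REASON_OTHER = 7
};

struct sidb_status {
    int download_status;             /* 1 = success, 0 = failure */
    enum sidb_status_reason reason;
};

/* Maps an HTTP status code to the two-field status. Network-level
   failures (DNS, connect, proxy) would be set directly by the caller. */
struct sidb_status sidb_status_from_http(int http_code)
{
    struct sidb_status s;

    s.download_status = (http_code >= 200 && http_code < 300) ? 1 : 0;
    if (s.download_status)
        s.reason = SIDB_REASON_NONE;
    else if (http_code == 404)
        s.reason = SIDB_REASON_HTTP_404;
    else if (http_code == 403)
        s.reason = SIDB_REASON_HTTP_403;
    else
        s.reason = SIDB_REASON_OTHER;
    return s;
}

/* Human-readable text for a reason code, e.g. for warnings. */
const char *sidb_reason_text(enum sidb_status_reason r)
{
    switch (r) {
    case SIDB_REASON_NONE:            return "success";
    case SIDB_REASON_HTTP_404:        return "404 Not Found";
    case SIDB_REASON_HTTP_403:        return "403 Forbidden";
    case SIDB_REASON_DNS_FAILURE:     return "name resolution failure";
    case SIDB_REASON_CONNECT_FAILED:  return "failure to connect";
    case SIDB_REASON_CONNECTION_LOST: return "connection lost";
    case SIDB_REASON_PROXY_FAILURE:   return "proxy failure";
    default:                          return "other failure";
    }
}
```

Keeping the boolean and the reason in one struct means a reader can still test success cheaply while the reason stays available for deciding whether a retry makes sense.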
> It would also be good to talk a little about compatibility issues, such as how Wget could handle session dbs that were generated from newer versions of Wget, that might specify information that Wget doesn't know how to process.

I didn't quite get that. My proposal is about a small code addition to Wget in its future releases, so older versions wouldn't have a session info feature. Please correct me if I'm wrong.

> Any sample code from anything you've worked on in the past, even if it's incomplete or now embarrasses you (that's not a problem: just explain how you would improve it) would be very helpful; something to help determine whether you possess the appropriate skillset to handle the task you're proposing to implement.

I submitted a small project back in school, 3 years ago, that implemented a school management system. It was written entirely in C++ and was around 1700 lines of code, but I am not able to find it now. :( :( :( I am still searching for it, though; it might also be at my school. I will update you as soon as I find it. Other than that, I am afraid there isn't much *code* I could present. If required, though, I can give the exact details of everything I intend to put into the sidb.h and sidb.c files I have mentioned. Should I include that in the
Re: Session Info Database API concepts [regarding wget and gsoc]
Sorry, I forgot to provide the link to my revised proposal in my reply. You can see it here - http://docs.google.com/Doc?id=ddkwv4gn_3dgvxf5c6

Regards,
Siddhant
Re: Session Info Database API concepts [regarding wget and gsoc]
Hi. I guess my concepts were twisted. I have made a new proposal, incorporating what you have suggested and what I have been able to grasp. I have proposed my solution here - http://siddhantgoel.googlepages.com/gsoc_gnu.pdf . Please tell me if this one is fine, or if something needs to be added/removed/edited. Thanks for your time.

Regards,
Siddhant
Re: Session Info Database API concepts [regarding wget and gsoc]
Siddhant Goel wrote:
> Hi. I guess my concepts were twisted. I have made a new proposal, incorporating what you have suggested and what I have been able to grasp. I have proposed my solution here - http://siddhantgoel.googlepages.com/gsoc_gnu.pdf . Please tell me if this one is fine, or if something needs to be added/removed/edited.

Thanks, Siddhant; I think this version is much improved over the previous attempt.

I'd probably try not to get into the details of what the session db file might look like; that's probably one of the things we'd need to define as part of our initial discussions. I'm guessing that the list of entries in your abstract is probably intended as only examples of the _kind_ of data that would go in there, rather than the particulars of how that data should look, but if I were you I'd say that explicitly. (Also, that sort of information is probably not so appropriate in the Abstract, which is usually intended to be a very high-level description of what the feature is).

To comment on some of the specific items you have in that list: I'm not sure that Download # will be a useful piece of information, except perhaps for use in identifying a resource across interleaved entries. The local_filename and localpath entries seem redundant to me. And chain_of_redirects doesn't map very well to the expectation that Wget write available information as soon as it can, so that as much information is available as possible when it is read back, even if Wget was interrupted. The chain of redirects will probably be better implemented as separate entries for each redirect.

A boolean download_status strikes me as not informative enough for what we will want it for. It would be very useful to distinguish between a download that failed due to a 404 Not Found and one that failed due to a 403 Forbidden; as well as distinguishing between loss of connections, failure to connect, proxy-specific failures, and name resolution failures.

All in all, though, it's best not to focus on the details of what entries might _look_ like, but how they will enable Wget to _use_ them. That is, rather than saying that the session db will have localpath or chain_of_redirects entries, it's more productive to say that Wget will be able to find the path of the locally-downloaded file corresponding to any URI from the single URI or chain of redirected URIs that led to it; or that Wget will determine what files remain to be downloaded and/or parsed to continue where a previous session left off.

It'd be really nice to touch on a little bit of _how_ Wget will handle looking up a local pathname from a chain of redirects (i.e., if you're not going to do it by using grep for each download, how _will_ you accomplish it?).

It would also be good to talk a little about compatibility issues, such as how Wget could handle session dbs that were generated from newer versions of Wget, that might specify information that Wget doesn't know how to process.

Any sample code from anything you've worked on in the past, even if it's incomplete or now embarrasses you (that's not a problem: just explain how you would improve it) would be very helpful; something to help determine whether you possess the appropriate skillset to handle the task you're proposing to implement.

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer, and GNU Wget Project Maintainer.
http://micah.cowan.name/
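One possible way to approach the compatibility question above is for the reader to skip fields it does not recognize instead of rejecting the file. The sketch below assumes a purely hypothetical "key: value" line format and made-up key names ("uri", "local-path"); the thread explicitly leaves the real file format undefined, so this only illustrates the skipping behaviour.

```c
#include <stdio.h>
#include <string.h>

/* Fields this (hypothetical) version of the reader understands. */
struct sidb_fields {
    char uri[256];
    char local_path[256];
};

/* Parses one "key: value" line (without trailing newline).
   Returns 1 if the key was known and stored, 0 if the key was unknown
   and ignored (e.g. written by a newer version), -1 if malformed. */
int sidb_parse_field(const char *line, struct sidb_fields *out)
{
    const char *colon = strchr(line, ':');
    const char *value;
    size_t keylen;

    if (colon == NULL || colon == line)
        return -1;                  /* no key/value separator */

    keylen = (size_t)(colon - line);
    value = colon + 1;
    while (*value == ' ')
        value++;                    /* skip spaces after the colon */

    if (keylen == 3 && strncmp(line, "uri", 3) == 0) {
        snprintf(out->uri, sizeof out->uri, "%s", value);
        return 1;
    }
    if (keylen == 10 && strncmp(line, "local-path", 10) == 0) {
        snprintf(out->local_path, sizeof out->local_path, "%s", value);
        return 1;
    }
    return 0;                       /* unknown key: skip, do not fail */
}
```

A caller could count the 0 results and print a single warning, so that an older Wget still uses the parts of a newer file it understands.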
Re: Session Info Database API concepts [regarding wget and gsoc]
Hi! Thanks for such a detailed explanation of the task involved. It surely did help in creating a solid base. I have already started working on it. I have also prepared the following proposal for GSoC - http://siddhantgoel.googlepages.com/gsoc_gnu.pdf

It would be great if you could check it and tell me what I need to edit/remove/add in it to make it better.

Thanks again!
Siddhant
Re: Session Info Database API concepts [regarding wget and gsoc]
Siddhant Goel wrote:
> Hi! Thanks for such a detailed explanation of the task involved. It surely did help in creating a solid base. I have already started working on it. I have also prepared the following proposal for GSoC - http://siddhantgoel.googlepages.com/gsoc_gnu.pdf It would be great if you could check it and tell me what I need to edit/remove/add in it to make it better.

Alright, here are some of my thoughts.

The abstract, in describing what it is you wish to build, includes a lengthy quote from the wiki's description, describing what sorts of things could go in it, but not particularly what it is or how it is useful. I'd focus on those rather than the list you have; particularly since they describe a number of things you don't plan to implement in this version.

The How/Deliverables section is awfully vague, which is a problem because this is the section we'd want to be specific enough to determine whether you're on-target at your midterm and final evaluations. Talking about creating a structure for an entry, and functions to create new entries or write them to files, provides virtually no information about what facilities you're actually trying to build. It's also not at all clear to me what difference you intend for sidb.c versus sidb_functions.c.

Similarly, in the Timeline section, you have:

- Up to May 26 : Get more comfortable with the source code, discuss things out with mentor, and agree with him on the exact code to be written. Happy mentor, happy student, nice code. :)
- Start!
- May 26 - Jun 15 : Finish the sidb.h and sidb_functions.c parts
- Jun 15 - July 5 : Write sidb.c
- July 5 - August 1 : Interface sidb.c with the rest of the Wget source code
- August 1 - August 15 : Check code, fix bugs, improve documentation
- Finish!

This, too, says virtually nothing. Heck, I could hack up a sidb.h, sidb_functions.c and sidb.c in about 30 seconds; they wouldn't _do_ anything, but then, there isn't really a clear idea what they _should_ do, from your proposal.

Rather than talk about .h files and .c files, and entry structures, which are implementation details, you should try to focus on what your work will have actually accomplished by those dates. Useful milestones might be:

- Basic session info writing implemented (starts and ends of downloads, redirects and local filenames recorded).
- Basic session info reading implemented (small program reads back data, maps URIs to local paths, adhering to execution time constraints as outlined in spec).
- Wget writes session databases (configurable via command-line options), and is able to read them back in to determine what filenames to check for timestamping or --continue.
- Wget is able to continue aborted sessions using configuration and last-known state from a session info db file.

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer, and GNU Wget Project Maintainer.
http://micah.cowan.name/
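To make the first milestone more concrete, here is a rough sketch of what "basic session info writing" could look like: each event is appended as one line and flushed immediately, so that whatever was recorded can still be read if the process is interrupted. The `demo_` names and the space-separated line layout are assumptions made for illustration only; they are not part of Wget or of the proposal.

```c
#include <stdio.h>

/* Hypothetical stand-in for a session info writer handle. */
typedef struct {
    FILE *fp;
} demo_writer;

/* Opens the session file in append mode. Returns 0 on success, -1 on error. */
int demo_write_open(demo_writer *w, const char *filename)
{
    w->fp = fopen(filename, "a");
    return w->fp != NULL ? 0 : -1;
}

/* Appends one event line ("event uri detail") and flushes stdio's buffer,
   so the line reaches the operating system right away. Note that fflush
   does not force the data onto the disk itself. */
int demo_write_event(demo_writer *w, const char *event,
                     const char *uri, const char *detail)
{
    if (fprintf(w->fp, "%s %s %s\n", event, uri, detail) < 0)
        return -1;
    return fflush(w->fp) == 0 ? 0 : -1;
}

/* Closes the session file if it is open. */
void demo_write_close(demo_writer *w)
{
    if (w->fp != NULL) {
        fclose(w->fp);
        w->fp = NULL;
    }
}
```

Writing one self-contained line per event is also what allows a redirect to be recorded as its own entry, rather than as a chain that can only be written once the download has finished.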
Re: Session Info Database API concepts [regarding wget and gsoc]
Charles wrote:
> On Fri, Mar 21, 2008 at 2:33 AM, Micah Cowan [EMAIL PROTECTED] wrote:
>> The intent is that all the writer operations would take virtually no time at all. The sidb_read function should take at most O(N log N) time on the size of the SIDB file, and should take less than a second under normal circumstances on typical machines
>
> What encoding will this session database file use?

I was thinking ASCII, probably; with filenames and URIs percent-encoded as necessary. Other values should probably follow suit (or else use MIME's mechanisms for character encoding within headers, where that's appropriate). Probably direct byte values for non-ASCII values would be accepted (but not generated), and assumed where possible to be appropriate for their context, except when there's a danger of Wget sending malformed data. NUL should be disallowed, though, because it's probably too much work to support those in our current code (they would be necessary to some character encodings if represented directly).

The idea is that it should be as absolutely easy as possible to manually edit.

YAML uses UTF-8; I'm beginning to think YAML may not be what we want, though, given that the definition for a given entry may be interposed with defining content for other entries; I don't want to kludge that by suffixing the names or something. I'll have to look more into YAML to see how doable that is.

If we did end up using YAML, then obviously we wouldn't accept arbitrary byte values: it'd be UTF-8 only.

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
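The "ASCII with percent-encoding" idea can be sketched roughly as follows. The function name and the exact set of bytes that get escaped (anything outside printable ASCII, plus space and '%') are assumptions chosen for the example; the thread does not fix which bytes must be encoded.

```c
#include <stddef.h>

/* Percent-encodes srclen bytes from src into dst as a NUL-terminated
   string. Bytes outside printable ASCII, as well as space and '%', are
   written as %XX, so the output never contains a raw NUL or a non-ASCII
   byte. Returns the number of characters written (excluding the
   terminator), or (size_t)-1 if dst is too small. */
size_t sidb_percent_encode(const unsigned char *src, size_t srclen,
                           char *dst, size_t dstsize)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t i, o = 0;

    if (dstsize == 0)
        return (size_t)-1;

    for (i = 0; i < srclen; i++) {
        unsigned char c = src[i];
        int safe = (c > 0x20 && c < 0x7F && c != '%');
        size_t need = safe ? 1 : 3;

        if (o + need >= dstsize)    /* keep room for the terminator */
            return (size_t)-1;

        if (safe) {
            dst[o++] = (char)c;
        } else {
            dst[o++] = '%';
            dst[o++] = hex[c >> 4];
            dst[o++] = hex[c & 0x0F];
        }
    }
    dst[o] = '\0';
    return o;
}
```

Because '%' itself is escaped, the encoding stays reversible, and the result remains easy to read and edit by hand for ordinary URIs.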
Re: Session Info Database API concepts [regarding wget and gsoc]
On Sat, Mar 22, 2008 at 12:14 AM, Micah Cowan [EMAIL PROTECTED] wrote:
> YAML uses UTF-8; I'm beginning to think YAML may not be what we want, though, given that the definition for a given entry may be interposed with defining content for other entries; I don't want to kludge that by suffixing the names or something. I'll have to look more into YAML to see how doable that is.

YAML handles hierarchical data, for example:

    Session:
      - date: 10102008
      - files:
        - url: http://...
        - headers:
          - status: 200

> If we did end up using YAML, then obviously we wouldn't accept arbitrary byte values: it'd be UTF-8 only.

Doesn't UTF-8 use the same byte values as ASCII for ASCII characters? If so, we can treat the session file as UTF-8 when reading it, while writing only ASCII to it.
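The point about ASCII and UTF-8 holds because every ASCII character is encoded in UTF-8 as the same single byte (values 0x00 to 0x7F), so a file that contains only ASCII bytes is also valid UTF-8. A small check along those lines, with a hypothetical function name, might look like this:

```c
#include <stddef.h>

/* Returns 1 if every byte in buf is in the ASCII range (below 0x80),
   0 otherwise. A buffer that passes this check is also valid UTF-8. */
int sidb_is_ascii(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] >= 0x80)
            return 0;
    }
    return 1;
}
```

A writer that only ever emits bytes passing this check produces output that a UTF-8-only reader can accept unchanged.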
Session Info Database API concepts [regarding wget and gsoc]
(I've rearranged your post slightly so that the answers are ordered more conveniently for me.)

Siddhant Goel wrote:
> Hi. I couldn't gather much from the wget source code; so want to put up some questions.
> 1. Does this necessarily have to be a change in the code of wget? Or something separate?

I don't see how a separate program could get all the necessary information to put this together; particularly things like content-type, etc. Perhaps some of the top use cases might be implemented with some sort of wrapper around log files, but even there, without access to internal Wget structures, I don't see how even file-existence checks could be done (since the external program would have to know what Wget was considering downloading).

> One thing I had in mind. Does grepping the database file, each time a download has to be made, sound good? This way you could check easily for every issue mentioned here - http://wget.addictivecode.org/FeatureSpecifications/MetaDataBase .

Using grep each time Wget has to decide whether to download would be absolutely hideous, in terms of efficiency. Wget would have to fork a new process for grep, which would then read the entire database file on each download. Other operations might require Wget to invoke grep multiple times. OTOH, if Wget builds appropriate data structures (hash tables and the like), it can determine in constant time whether a given URL was already downloaded in a previous session. No, grep is not a viable option.

Besides, wget is intended to run on non-Unixen (notably, Windows, and VMS too, at least to some degree); requiring grep (or other external utilities) is not desirable.

> 2. If it does, could you please tell me the files I should look into, to get started. I have the general solution in mind, but to implement I need to know where to start. :)

It would be wholly new code, so it'd be in a new module. You'd need to interface with it from existing code, naturally; but to start, one might wish to design and code the module completely apart from Wget, and write little test drivers to demonstrate how it might be used from within Wget.

For instance, you might start by writing functions with prototypes like:

    /** SIDB Writer Facilities **/
    sidb_writer *
    sidb_write_start(const char *filename);

    sidb_writer_entry *
    sidb_writer_entry_new(sidb_writer *w, const char *uri);

    void
    sidb_writer_entry_redirect(sidb_writer_entry *rw, const char *uri,
                               int redirect_http_status_code);

    void
    sidb_writer_entry_local_path(sidb_writer_entry *rw, const char *fname);

    void
    sidb_entry_finish(sidb_writer_entry *rw);

    void
    sidb_write_end(sidb_writer *w);

    /** SIDB Reader Facilities **/
    sidb_reader *
    sidb_read(const char *filename);

    sidb_entry *
    sidb_lookup_uri(sidb_reader *r, const char *uri);

    const char *
    sidb_entry_get_local_name(sidb_entry *e);

"sidb" stands for Session Info DataBase.

I've left out error code returns; it seems to me that wget will not normally want to terminate just because an error occurred in writing the session info database; we can ask the sidb modules to spew warnings automatically when errors occur by giving it flags, or supply it with an error callback function, etc. For situations where we do want wget to immediately abort for sidb errors (for instance, continuing a session), it could check a sidb_error function or some such.

The intent is that all the writer operations would take virtually no time at all. The sidb_read function should take at most O(N log N) time on the size of the SIDB file, and should take less than a second under normal circumstances on typical machines, for a file with entries for a thousand web resources. Thereafter, the other SIDB reader operations should take virtually no time at all.
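One way a reader could stay within an O(N log N) budget is to sort an index of URIs once after loading and then answer each lookup with a binary search. This is a swapped-in alternative to the hash tables mentioned earlier, not the module's actual design: lookups here are O(log N) rather than constant time, and the `demo_` names are hypothetical. The C standard also does not promise a particular complexity for qsort, although common implementations are O(N log N).

```c
#include <stdlib.h>
#include <string.h>

/* One index row: a URI and the id of the entry it belongs to. Several
   URIs may carry the same entry id. */
struct demo_index_item {
    const char *uri;
    int entry_id;
};

/* Orders index rows by URI; shared by qsort and bsearch. */
static int demo_cmp(const void *a, const void *b)
{
    const struct demo_index_item *x = a;
    const struct demo_index_item *y = b;
    return strcmp(x->uri, y->uri);
}

/* Sorts the index once, after the whole file has been read. */
void demo_index_build(struct demo_index_item *items, size_t n)
{
    qsort(items, n, sizeof *items, demo_cmp);
}

/* Returns the entry id recorded for uri, or -1 if it is not in the index. */
int demo_index_lookup(const struct demo_index_item *items, size_t n,
                      const char *uri)
{
    struct demo_index_item key;
    const struct demo_index_item *hit;

    key.uri = uri;
    key.entry_id = -1;
    hit = bsearch(&key, items, n, sizeof *items, demo_cmp);
    return hit != NULL ? hit->entry_id : -1;
}
```

Giving the original URI and each redirect target their own row with a shared entry id is one simple way to let any of those URIs resolve to the same entry.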
The sidb_lookup_uri function should be able to find an entry based on either the URI that was specified to the corresponding call to sidb_writer_entry_new, _or_ by any URI that was added to a resource entry via sidb_writer_entry_redirect.

Interposing writes to different entries should be allowed and explicitly tested (to prepare the way for multiple simultaneous downloads in the future). That is, the following should be valid:

    sidb_writer *sw = sidb_write_start(".wget-sidb");
    sidb_writer_entry *foo, *bar;

    foo = sidb_writer_entry_new(sw, "http://example.com/foo");
    bar = sidb_writer_entry_new(sw, "http://example.com/bar");

    /* Add info to foo entry (302 is only an illustrative status code). */
    sidb_writer_entry_redirect(foo, "http://newsite.example.com/news.html", 302);

    /* Add to bar entry. */
    sidb_writer_entry_local_path(bar, "example.com/bar");

    /* Add to foo entry again. */
    sidb_writer_entry_local_path(foo, "newsite.example.com/news.html");

    sidb_entry_finish(foo);
    sidb_entry_finish(bar);
    sidb_write_end(sw);

On reading back the information, calling sidb_lookup_uri for either "http://example.com/foo" or "http://newsite.example.com/news.html" should both give
Re: Session Info Database API concepts [regarding wget and gsoc]
On Fri, Mar 21, 2008 at 2:33 AM, Micah Cowan [EMAIL PROTECTED] wrote:
> The intent is that all the writer operations would take virtually no time at all. The sidb_read function should take at most O(N log N) time on the size of the SIDB file, and should take less than a second under normal circumstances on typical machines

What encoding will this session database file use?