Re: [Bug-wget] Google Summer of Code 2016
On 03/02, Kushagra Singh wrote:
> Hi,
>
> Thanks for the quick reply. I went through the repository and the issues,
> and found a couple of things I would like to work on.
>
> I have a couple of questions about Wget2. Is it a complete rewrite of the
> Wget project, available at git://git.savannah.gnu.org/wget.git, or are we
> using existing code and extending functionality? I guess it is the second
> one because I saw `libwget` in the repo. However if such is the case, then
> how do we change existing functions in wget? For example, implementing [2]
> would require making changes to the file cookies.c, which is present in
> /src in the wget repo, but not in /src in the wget2 repo.

Wget2 is a complete rewrite of GNU Wget. It is also available on the
savannah server as its own repository at [1]. Wget2 is meant to be a modern
(almost) drop-in replacement for Wget. It strives to maintain backward
compatible command line options and behaviour as far as it makes sense.

The codebases of the two projects have diverged significantly, and hence
new features need to be implemented separately for each.

> I was looking at #43 [1], and have already submitted a patch for
> consideration for the first suggestion [2]. The second suggestion
> mentioned [3] is one of the things I'd like to work on, however this is
> not something which will take three months :)

You submitted a patch for Wget. This is the Wget2 repository. Anyways, I
already have a working patch for most of that issue; I got sidetracked when
writing the tests and eventually forgot about it. I think I'll spend some
time on it this week and have that patch merged. Don't spend time on that
part.

Another thing to remember is that not all GitHub issues are valid GSoC
projects. Since the number of issues is small, it is easy to scout out the
larger ones. Some issues are pretty tiny and just need someone willing to
spend time working on them.

> Another project I am interested in, is implementing FTPS. I saw this
> listed under one of the ideas of GSoC 2015, but I'm not sure whether it
> was implemented, as I didn't see it under 'Development Status' in the
> wget2 readme on Github.

Wget2, as far as I'm aware, is still lacking FTPS support. Remember that
Wget and Wget2 are two different projects.

> Also, in #67 [4], we are talking about adhering to some specific parts of
> RFC 7230. I'm not sure which all parts would be right, as the discussion
> thread mentions that it won't be good to stick to each point of the RFC.
> WDYT?

This is a minor grievance I raised. We stick to most of it anyways. As Tim
points out, being completely RFC compliant may make the tool unusable
thanks to the number of bad servers out there. If anything, that issue
needs to be split into multiple smaller issues about specific parts of the
RFC that we want to adhere to.

Open projects I currently see are:

1. FTP / FTPS support
2. SOCKS5 Proxy support (This may be too small.)
3. Progress Bar implementation (Looks deceptively simple, isn't)
4. WARC support and tests
5. Brotli compression (May be too small)

The README file also has more pointers on features not implemented in
Wget2. You may get some ideas from there. Request pipelining and DNSSEC are
two features I'd be interested in seeing implemented. Moreover, you are
always welcome to submit your own ideas for either Wget or Wget2.

Tim can add more details or comment on whether something is too small to
work on for a GSoC project.

[1]: git://git.savannah.gnu.org/wget/wget2.git

> [1] https://github.com/rockdaboot/wget2/issues/43
> [2] https://tools.ietf.org/html/draft-west-leave-secure-cookies-alone-04
> [3] https://tools.ietf.org/html/draft-west-cookie-prefixes-05
> [4] https://github.com/rockdaboot/wget2/issues/67
>
> On Tue, Mar 1, 2016 at 9:57 PM, Giuseppe Scrivano wrote:
>
> > Kushagra Singh writes:
> >
> > > Hi,
> > >
> > > Will we be taking part in GSoC this year? I would really like to work
> > > on a project related to Wget this summer. Any specific ideas that are
> > > of importance to the community presently?
> >
> > yes, we will take part in GSoC. I think we would like to see more
> > work happening on wget2, at the moment there is a list of issues on
> > github that can be useful to you to pick some ideas to work on:
> >
> > https://github.com/rockdaboot/wget2/issues
> >
> > Could you take a look at it? Do you see anything interesting that you
> > would like to work on?
> >
> > Regards,
> > Giuseppe

--
Thanking You,
Darshit Shah
Re: [Bug-wget] Google Summer of Code 2016
Hi,

Thanks for the quick reply. I went through the repository and the issues,
and found a couple of things I would like to work on.

I have a couple of questions about Wget2. Is it a complete rewrite of the
Wget project, available at git://git.savannah.gnu.org/wget.git, or are we
using existing code and extending functionality? I guess it is the second
one because I saw `libwget` in the repo. However if such is the case, then
how do we change existing functions in wget? For example, implementing [2]
would require making changes to the file cookies.c, which is present in
/src in the wget repo, but not in /src in the wget2 repo.

I was looking at #43 [1], and have already submitted a patch for
consideration for the first suggestion [2]. The second suggestion mentioned
[3] is one of the things I'd like to work on, however this is not something
which will take three months :)

Another project I am interested in, is implementing FTPS. I saw this listed
under one of the ideas of GSoC 2015, but I'm not sure whether it was
implemented, as I didn't see it under 'Development Status' in the wget2
readme on Github.

Also, in #67 [4], we are talking about adhering to some specific parts of
RFC 7230. I'm not sure which all parts would be right, as the discussion
thread mentions that it won't be good to stick to each point of the RFC.
WDYT?

[1] https://github.com/rockdaboot/wget2/issues/43
[2] https://tools.ietf.org/html/draft-west-leave-secure-cookies-alone-04
[3] https://tools.ietf.org/html/draft-west-cookie-prefixes-05
[4] https://github.com/rockdaboot/wget2/issues/67

On Tue, Mar 1, 2016 at 9:57 PM, Giuseppe Scrivano wrote:

> Kushagra Singh writes:
>
> > Hi,
> >
> > Will we be taking part in GSoC this year? I would really like to work
> > on a project related to Wget this summer. Any specific ideas that are
> > of importance to the community presently?
>
> yes, we will take part in GSoC. I think we would like to see more
> work happening on wget2, at the moment there is a list of issues on
> github that can be useful to you to pick some ideas to work on:
>
> https://github.com/rockdaboot/wget2/issues
>
> Could you take a look at it? Do you see anything interesting that you
> would like to work on?
>
> Regards,
> Giuseppe
Re: [Bug-wget] Google Summer of Code 2016
Kushagra Singh writes:

> Hi,
>
> Will we be taking part in GSoC this year? I would really like to work on a
> project related to Wget this summer. Any specific ideas that are of
> importance to the community presently?

yes, we will take part in GSoC. I think we would like to see more
work happening on wget2. At the moment there is a list of issues on
github that you can use to pick some ideas to work on:

https://github.com/rockdaboot/wget2/issues

Could you take a look at it? Do you see anything interesting that you
would like to work on?

Regards,
Giuseppe
Re: [Bug-wget] Patch for understanding srcset= on img tags.
> thanks for your patch! I have some comments. Please amend this:
>
> diff --git a/src/html-url.c b/src/html-url.c
> index dff8d57..2f205c7 100644
> --- a/src/html-url.c
> +++ b/src/html-url.c
> @@ -692,8 +692,8 @@ tag_handle_img (int tagid, struct taginfo *tag, struct map_context *ctx) {
>    if (srcset)
>      {
>        /* These are relative to the input text. */
> -      int base_ind = ATTR_POS(tag,attrind,ctx);
> -      int size = strlen(srcset);
> +      int base_ind = ATTR_POS (tag, attrind, ctx);
> +      int size = strlen (srcset);

Done.

> should the condition be (c == ')' && in_paren) ?

Indeed.

Thanks,
Maks

From 49933c84012536388e1f9d0bc4070e377d824309 Mon Sep 17 00:00:00 2001
From: Maks Orlovich
Date: Tue, 1 Mar 2016 09:43:56 -0500
Subject: Parse attributes, they have image URLs.

* src/convert.h: Add link_noquote_html_p to permit rewriting URLs deep
  inside attributes without adding extraneous quoting
* src/convert.c (convert_links): Honor link_noquote_html_p
* src/html-url.c (tag_handle_img): New function. Add srcset parsing.

diff --git a/src/convert.c b/src/convert.c
index df8d58d..509923e 100644
--- a/src/convert.c
+++ b/src/convert.c
@@ -308,7 +308,7 @@ convert_links (const char *file, struct urlpos *links)
     char *quoted_newname = local_quote_string (newname, link->link_css_p);
 
-    if (link->link_css_p)
+    if (link->link_css_p || link->link_noquote_html_p)
       p = replace_plain (p, link->size, fp, quoted_newname);
     else if (!link->link_refresh_p)
       p = replace_attr (p, link->size, fp, quoted_newname);
@@ -329,7 +329,7 @@ convert_links (const char *file, struct urlpos *links)
     char *newname = convert_basename (p, link);
     char *quoted_newname = local_quote_string (newname, link->link_css_p);
 
-    if (link->link_css_p)
+    if (link->link_css_p || link->link_noquote_html_p)
       p = replace_plain (p, link->size, fp, quoted_newname);
     else if (!link->link_refresh_p)
       p = replace_attr (p, link->size, fp, quoted_newname);
@@ -352,7 +352,7 @@ convert_links (const char *file, struct urlpos *links)
     char *newlink = link->url->url;
     char *quoted_newlink = html_quote_string (newlink);
 
-    if (link->link_css_p)
+    if (link->link_css_p || link->link_noquote_html_p)
       p = replace_plain (p, link->size, fp, newlink);
     else if (!link->link_refresh_p)
       p = replace_attr (p, link->size, fp, quoted_newlink);
diff --git a/src/convert.h b/src/convert.h
index b3cd196..e3ff6f0 100644
--- a/src/convert.h
+++ b/src/convert.h
@@ -69,6 +69,7 @@ struct urlpos {
   unsigned int link_base_p :1; /* the url came from */
   unsigned int link_inline_p:1; /* needed to render the page */
   unsigned int link_css_p :1; /* the url came from CSS */
+  unsigned int link_noquote_html_p :1; /* from HTML, but doesn't need " */
   unsigned int link_expect_html :1; /* expected to contain HTML */
   unsigned int link_expect_css :1; /* expected to contain CSS */
 
diff --git a/src/html-url.c b/src/html-url.c
index 0743587..ab04204 100644
--- a/src/html-url.c
+++ b/src/html-url.c
@@ -56,6 +56,7 @@ typedef void (*tag_handler_t) (int, struct taginfo *, struct map_context *);
 DECLARE_TAG_HANDLER (tag_find_urls);
 DECLARE_TAG_HANDLER (tag_handle_base);
 DECLARE_TAG_HANDLER (tag_handle_form);
+DECLARE_TAG_HANDLER (tag_handle_img);
 DECLARE_TAG_HANDLER (tag_handle_link);
 DECLARE_TAG_HANDLER (tag_handle_meta);
 
@@ -105,7 +106,7 @@ static struct known_tag {
   { TAG_FORM,"form",tag_handle_form },
   { TAG_FRAME, "frame", tag_find_urls },
   { TAG_IFRAME, "iframe", tag_find_urls },
-  { TAG_IMG, "img", tag_find_urls },
+  { TAG_IMG, "img", tag_handle_img },
   { TAG_INPUT, "input", tag_find_urls },
   { TAG_LAYER, "layer", tag_find_urls },
   { TAG_LINK,"link",tag_handle_link },
@@ -183,7 +184,8 @@ static const char *additional_attributes[] = {
   "name", /* used by tag_handle_meta */
   "content",/* used by tag_handle_meta */
   "action", /* used by tag_handle_form */
-  "style" /* used by check_style_attr */
+  "style", /* used by check_style_attr */
+  "srcset", /* used by tag_handle_img */
 };
 
 static struct hash_table *interesting_tags;
@@ -674,6 +676,88 @@ tag_handle_meta (int tagid _GL_UNUSED, struct taginfo *tag, struct map_context *
     }
 }
 
+/* Handle the IMG tag. This requires special handling for the srcset attr,
+   while the traditional src/lowsrc/href attributes can be handled generically.
+*/
+
+static void
+tag_handle_img (int tagid, struct taginfo *tag, struct
Re: [Bug-wget] Patch for understanding srcset= on img tags.
Hi Maksim,

Maksim Orlovich writes:

> Hi... wget currently doesn't understand HTML5's srcset= attribute on
> images. The attached adds support for it.
> This is under Google copyright, so should be covered by the company's
> copyright assignment with the FSF.
>
> If you might be interested in incorporating this in some form, please
> let me know if you want any changes (e.g. tests, etc.), ---
> not really familiar with how you folks do things.
>
> Hoping this may be of some use to someone else,
> Maks

thanks for your patch! I have some comments. Please amend this:

diff --git a/src/html-url.c b/src/html-url.c
index dff8d57..2f205c7 100644
--- a/src/html-url.c
+++ b/src/html-url.c
@@ -692,8 +692,8 @@ tag_handle_img (int tagid, struct taginfo *tag, struct map_context *ctx) {
   if (srcset)
     {
       /* These are relative to the input text. */
-      int base_ind = ATTR_POS(tag,attrind,ctx);
-      int size = strlen(srcset);
+      int base_ind = ATTR_POS (tag, attrind, ctx);
+      int size = strlen (srcset);
       /* These are relative to srcset. */
       int offset, url_start, url_end;

> +      /* If the URL wasn't terminated by a , there may also be a descriptor
> +         which we just skip. */
> +      if (has_descriptor)
> +        {
> +          /* This is comma-terminated, except there may be one level of
> +             parentheses escaping that. */
> +          bool in_paren = false;
> +          for (offset = url_end; offset < size; ++offset)
> +            {
> +              char c = srcset[offset];
> +              if (c == '(')
> +                in_paren = true;
> +              else if (c == '(' && in_paren)
> +                in_paren = false;

should the condition be (c == ')' && in_paren) ?

Thanks,
Giuseppe
[Bug-wget] Google Summer of Code 2016
Hi,

Will we be taking part in GSoC this year? I would really like to work on a
project related to Wget this summer. Any specific ideas that are of
importance to the community presently?

A quick introduction: I'm Kushagra Singh, a second year student at IIIT
Delhi, India. My major is Computer Science. I have gone through a
particular chunk of wget's source code and understand it well, and have
submitted a patch for consideration. I successfully completed GSoC last
summer, working with lmonade.

Looking forward to coding on this project this summer!

Thank you,
Kushagra Singh