Re: GSoC : Web page scraper plugin
Hi Aamir,

Please excuse me for not getting back to you off-list; the message is in my drafts and I got distracted yesterday. At this stage, if you intend on applying for the issue then I would advise you to get registered with GSoC and begin writing up a publicly viewable draft submission. You have until the 6th to do so, so plenty of time.

On Tue, Apr 3, 2012 at 5:45 AM, Aamir Khan syst3m.w...@gmail.com wrote:

> The project of web scraping at https://issues.apache.org/jira/browse/NUTCH-978 looks good to me. I understood the basic concept of the project but as I'm new to Nutch it will take some time to understand it fully in the context of Nutch.

Well, you have the summer to get up to speed with Nutch, right? So I wouldn't necessarily worry too much about this just now. Just get your submission ready and we will take it from there.

> I'm looking forward to guidance from your side on how I should go about submitting a proposal for GSoC.

If you feel you need help with any aspect of the issue or the submission then please get on to user@ and we will try to help out as much as we can over there. In the meantime please see here [0] for guidance on your application submission. There is plenty of documentation and guidance over there.

Thanks and again apologies for not getting back to you yesterday.

Lewis

[0] http://community.apache.org/gsoc.html

> Thanks in advance!
> --
> Aamir Khan | 3rd Year | Computer Science Engineering | IIT Roorkee

--
*Lewis*
Re: GSoC : Web page scraper plugin
On Tue, Apr 3, 2012 at 4:31 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

> Hi Aamir,
>
> Please excuse me for not getting back to you off-list; the message is in my drafts and I got distracted yesterday.

No problem.

> At this stage, if you intend on applying for the issue then I would advise you to get registered with GSoC and begin writing up a publicly viewable draft submission. You have until the 6th to do so, so plenty of time.
>
> On Tue, Apr 3, 2012 at 5:45 AM, Aamir Khan syst3m.w...@gmail.com wrote:
>
>> The project of web scraping at https://issues.apache.org/jira/browse/NUTCH-978 looks good to me. I understood the basic concept of the project but as I'm new to Nutch it will take some time to understand it fully in the context of Nutch.
>
> Well, you have the summer to get up to speed with Nutch, right? So I wouldn't necessarily worry too much about this just now. Just get your submission ready and we will take it from there.

Exactly, I will have the full summer to understand and get up to speed. But since my knowledge is very limited my proposal won't be too good.. :)

>> I'm looking forward to guidance from your side on how I should go about submitting a proposal for GSoC.
>
> If you feel you need help with any aspect of the issue or the submission then please get on to user@ and we will try to help out as much as we can over there. In the meantime please see here [0] for guidance on your application submission. There is plenty of documentation and guidance over there.

Sure.

> Thanks and again apologies for not getting back to you yesterday.

No problem.. :)

> Lewis
>
> [0] http://community.apache.org/gsoc.html
>
>> Thanks in advance!
>> --
>> Aamir Khan | 3rd Year | Computer Science Engineering | IIT Roorkee
>
> --
> *Lewis*

--
Aamir Khan | 3rd Year | Computer Science Engineering | IIT Roorkee
Re: GSoC : Web page scraper plugin
Hi Aamir,

On Tue, Apr 3, 2012 at 12:05 PM, Aamir Khan syst3m.w...@gmail.com wrote:

> Exactly, I will have the full summer to understand and get up to speed. But since my knowledge is very limited my proposal won't be too good.. :)

This doesn't need to be the case. In fact it is crucial that the submission is of a reasonable quality. The original issue was pretty well discussed iirc, and additionally there is also some code uploaded by the original author, so you could have a look at that over the next few days before making a crack at the submission.

I can say one thing for sure though, this issue might need to be branded more generically... just now Nutch would benefit more from a generically oriented plugin for scraping various parts of HTML. The original author had a use-case-driven approach to this issue, which meant he had to extract very specific content from news sites... this may not suit you, and certainly isn't absolutely everyone's cup of tea within the community.

It would be great if you could discuss both in your application and on the Jira thread how the issue could be opened up, subsequently enabling more Nutch users to benefit... as you are stepping up to apply here, how you wish to do this is entirely your own choice, so I would take the positives from the flexibility you have here and focus on them within your submission. Does this sound reasonable?

I look forward to seeing any progress you have and will seriously consider stepping up to be a potential mentor, as it was me that added the issue to the GSoC list of projects.

Thank you
Lewis
Re: GSoC : Web page scraper plugin
On Tue, Apr 3, 2012 at 4:45 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

> Hi Aamir,
>
> On Tue, Apr 3, 2012 at 12:05 PM, Aamir Khan syst3m.w...@gmail.com wrote:
>
>> Exactly, I will have the full summer to understand and get up to speed. But since my knowledge is very limited my proposal won't be too good.. :)
>
> This doesn't need to be the case. In fact it is crucial that the submission is of a reasonable quality. The original issue was pretty well discussed iirc, and additionally there is also some code uploaded by the original author, so you could have a look at that over the next few days before making a crack at the submission.
>
> I can say one thing for sure though, this issue might need to be branded more generically... just now Nutch would benefit more from a generically oriented plugin for scraping various parts of HTML. The original author had a use-case-driven approach to this issue, which meant he had to extract very specific content from news sites... this may not suit you, and certainly isn't absolutely everyone's cup of tea within the community.
>
> It would be great if you could discuss both in your application and on the Jira thread how the issue could be opened up, subsequently enabling more Nutch users to benefit... as you are stepping up to apply here, how you wish to do this is entirely your own choice, so I would take the positives from the flexibility you have here and focus on them within your submission. Does this sound reasonable?

Sounds good to me. Where can I chat with the Nutch developers? Not many people are present on the IRC channel #nutch.

BTW, I created a rough draft with my personal bio and other necessary information and submitted it to google-melange [1]. I will update the project schedule soon, preferably after having some discussions.

[1] http://google-melange.appspot.com/gsoc/proposal/review/google/gsoc2012/syst3mw0rm/9001

> I look forward to seeing any progress you have and will seriously consider stepping up to be a potential mentor, as it was me that added the issue to the GSoC list of projects.

That would be great!!

> Thank you
> Lewis

--
Aamir Khan | 3rd Year | Computer Science Engineering | IIT Roorkee