Re: GSoC : Web page scraper plugin

2012-04-03 Thread Lewis John Mcgibbney
Hi Aamir,

Please excuse me for not getting back to you off-list; the message is in my
drafts and I got distracted yesterday.

At this stage, if you intend to apply for the issue, I would advise
you to get registered with GSoC and begin writing up a publicly viewable
draft submission. You have until the 6th to do so, so plenty of time.

On Tue, Apr 3, 2012 at 5:45 AM, Aamir Khan syst3m.w...@gmail.com wrote:


 The project of web scraping at
 https://issues.apache.org/jira/browse/NUTCH-978 looks good to me. I
 understood the basic concept of the project, but as I'm new to Nutch it will
 take some time to understand it fully in the context of Nutch.


Well, you have the summer to get up to speed with Nutch, right? So I wouldn't
necessarily worry too much about this just now. Just get your submission
ready and we will take it from there.


 I'm looking forward to guidance from your side on how I should go about
 submitting a proposal for GSoC.


If you feel you need help with any aspect of the issue or the submission,
then please get on to user@ and we will try to help out as much as we can
over there. In the meantime, please see here [0] for guidance on your
application submission. There is plenty of documentation over there.

Thanks and again apologies for not getting back to you yesterday.

Lewis

[0] http://community.apache.org/gsoc.html



 Thanks in advance!





 --
 Aamir Khan | 3rd Year | Computer Science & Engineering | IIT Roorkee






-- 
*Lewis*


Re: GSoC : Web page scraper plugin

2012-04-03 Thread Aamir Khan
On Tue, Apr 3, 2012 at 4:31 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi Aamir,

 Please excuse me for not getting back to you off-list; the message is in my
 drafts and I got distracted yesterday.


No problem.


 At this stage, if you intend to apply for the issue, I would advise
 you to get registered with GSoC and begin writing up a publicly viewable
 draft submission. You have until the 6th to do so, so plenty of time.

 On Tue, Apr 3, 2012 at 5:45 AM, Aamir Khan syst3m.w...@gmail.com wrote:


 The project of web scraping at
 https://issues.apache.org/jira/browse/NUTCH-978 looks good to me. I
 understood the basic concept of the project, but as I'm new to Nutch it will
 take some time to understand it fully in the context of Nutch.


 Well, you have the summer to get up to speed with Nutch, right? So I
 wouldn't necessarily worry too much about this just now. Just get your
 submission ready and we will take it from there.


Exactly, I will have the full summer to understand and get up to speed. But
since my knowledge is very limited, my proposal won't be too good.. :)


 I'm looking forward to guidance from your side on how I should go about
 submitting a proposal for GSoC.


 If you feel you need help with any aspect of the issue or the submission,
 then please get on to user@ and we will try to help out as much as we can
 over there. In the meantime, please see here [0] for guidance on your
 application submission. There is plenty of documentation over there.


Sure.


 Thanks and again apologies for not getting back to you yesterday.


No problem.. :)


 Lewis

 [0] http://community.apache.org/gsoc.html



 Thanks in advance!





 --
 Aamir Khan | 3rd Year | Computer Science & Engineering | IIT Roorkee






 --
 *Lewis*




-- 
Aamir Khan | 3rd Year | Computer Science & Engineering | IIT Roorkee


Re: GSoC : Web page scraper plugin

2012-04-03 Thread Lewis John Mcgibbney
Hi Aamir,

On Tue, Apr 3, 2012 at 12:05 PM, Aamir Khan syst3m.w...@gmail.com wrote:


 Exactly, I will have the full summer to understand and get up to speed. But
 since my knowledge is very limited, my proposal won't be too good.. :)


This doesn't need to be the case. In fact, it is crucial that the
submission is of a reasonable quality. The original issue was pretty well
discussed iirc, and there is also some code uploaded by the
original author, so you could have a look at that over the next few days
before taking a crack at the submission. I can say one thing for sure
though: this issue might need to be branded more generically... just now
Nutch would benefit more from a generically oriented plugin for scraping
various parts of HTML. The original author had a use-case-driven approach
to this issue, which meant he had to extract very specific content from news
sites... this may not suit you, and certainly isn't absolutely everyone's
cup of tea within the community. It would be great if you could discuss,
both in your application and on the Jira thread, how the issue could be
opened up, subsequently enabling more Nutch users to benefit... as you are
stepping up to apply here, how you wish to do this is entirely your own
choice, so I would take the positives from the flexibility you have here and
focus on them within your submission. Does this sound reasonable?

I look forward to seeing any progress you have and will seriously consider
stepping up to be a potential mentor, as it was me that added the issue to
the GSoC list of projects.

Thank you

Lewis


Re: GSoC : Web page scraper plugin

2012-04-03 Thread Aamir Khan
On Tue, Apr 3, 2012 at 4:45 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi Aamir,


 On Tue, Apr 3, 2012 at 12:05 PM, Aamir Khan syst3m.w...@gmail.com wrote:


 Exactly, I will have the full summer to understand and get up to speed. But
 since my knowledge is very limited, my proposal won't be too good.. :)


 This doesn't need to be the case. In fact, it is crucial that the
 submission is of a reasonable quality. The original issue was pretty well
 discussed iirc, and there is also some code uploaded by the
 original author, so you could have a look at that over the next few days
 before taking a crack at the submission. I can say one thing for sure
 though: this issue might need to be branded more generically... just now
 Nutch would benefit more from a generically oriented plugin for scraping
 various parts of HTML. The original author had a use-case-driven approach
 to this issue, which meant he had to extract very specific content from news
 sites... this may not suit you, and certainly isn't absolutely everyone's
 cup of tea within the community. It would be great if you could discuss,
 both in your application and on the Jira thread, how the issue could be
 opened up, subsequently enabling more Nutch users to benefit... as you are
 stepping up to apply here, how you wish to do this is entirely your own
 choice, so I would take the positives from the flexibility you have here and
 focus on them within your submission. Does this sound reasonable?


Sounds good to me. Where can I chat with Nutch developers? Not many people
are present on the IRC channel #nutch. BTW, I created a rough draft with all my
personal bio and other necessary information and submitted it to
google-melange [1]. I will update the project schedule soon, preferably
after having some discussions.

[1] http://google-melange.appspot.com/gsoc/proposal/review/google/gsoc2012/syst3mw0rm/9001


 I look forward to seeing any progress you have and will seriously consider
 stepping up to be a potential mentor, as it was me that added the issue to
 the GSoC list of projects.


That would be great!!


 Thank you

 Lewis





-- 
Aamir Khan | 3rd Year | Computer Science & Engineering | IIT Roorkee