Hi Halil,

Thanks for your response. I don't know how much contact you have had with Talat. If you've had a lot then I apologize for the following.

On Thu, Jun 18, 2015 at 2:24 AM, Halil Ibrahim Simsek <[email protected]> wrote:

> Hi Lewis,
>
> Now I am working on implementing jsoup to nutch 2.x. Probably I will
> finish it untill end of the tomorrow at the latest. I forked nutch 2.x to
> my personal github[1]. I will commit the changes which I made on my local.

OK, there are no commits on the branch. Am I missing something here? Have you pushed no code to your remote repos yet? I am not trying to put you on the spot here but as far as I can see there is no coding as of yet. Frankly that is worrying considering GSoC has been 'active' for a number of months.

> Also I will add reports(including this week's) to my wiki page at this
> weekend.

At the very beginning it was stated that this should happen every week. This way we actually MANAGE the project. Right now as far as I can see there is no direction and this has been a direct result of no reporting taking place.

> And next week(untill 26th June) I will write tests to newly implemented
> parser.

Here's a better idea, let's get some reporting done. In parallel let's please push your code to your repository. We will take it from there.

> About Tika,
> I made some research on Tika codebase. As far as I see, Tika does not have
> a structure which you can choose parser to parse html depending on an
> option as nutch has "parser.html.imp".

This is not correct. All you do is write your parser then register the parser here
https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser

I think you maybe missed my point about involving Tika here. If your parser is implemented in Tika then guess what... every other project which consumes Tika as a dependency will also get to use your HTML5 compliant parser. This is literally thousands if not tens of thousands of software projects all over the entire world. Guess what, Nutch 1.X will also be able to use the parser as well.

> It uses one and only tagsoup.
> I may implement jsoup to Tika besides
> tagsoup with a similar structure of Nutch has. But I think implementing
> jsoup to Tika will be harder than implementing to Nutch since Nutch already
> has a flexible structure on implementing a new html parser.

Can I please make something absolutely clear... your Google Summer of Code effort is not meant to be the easiest thing possible. It is meant to be a project which you do and are mentored through by your mentors. What you are doing (or what you describe above), I would suggest is the wrong way for it to be done. I don't know why you have chosen this direction without consultation with Talat, myself and the community at large. If you have consulted others then again I apologize, but I am sincerely confused right now by the lack of understanding as to what the direction is here.

> My plan on implementing jsoup is, untill first review of Gsoc I will have
> implemented jsoup to Nutch and all tests for newly implemented parser will
> have written.

What plan? There is no plan within your proposal
https://wiki.apache.org/nutch/GoogleSummerOfCode/Giving%20HTML5%20support%20for%20Apache%20Nutch%202.x
I asked you to add this and you have failed to update the plan. Therefore there is no plan! If it is somewhere else then please show me. I have seen no plan from you.

> And if I pass the first review then I will start working on implementing
> jsoup to Tika.

As I said above, please just do the following: "...let's get some reporting done. In parallel let's please push your code to your repository. We will take it from there."

> I am not quite sure if I will succeed on it but I will try.
>
> To sum up,
> - I will finish jsoup implementation to Nutch and commit it to [1] untill
> end of the tomorrow at the latest

Would be great. But it is the wrong way for it to be done.

> - I will add needed reports to my wiki page until end of this weekend
> (21st of June)

I am puzzled as to why this can't be done within the next hour or so. It should only take about 15 minutes per report. 4 weeks x 15 minutes is an hour's work. It should not take you 4 days to have this done. You are meant to be working 40 hours a week on this project.

> - I will write tests for newly implemented parser in next week until 26th
> of June

I would state that tests are important but that the other things are ultimately more important. Please take the above suggestions into serious consideration if you are serious about this project going forward.

> By the way I also investigated the licence issue we discussed before,
> there is no problem using jsoup library(MIT licence) in Apache projects[2]

Thanks, we've been using JSoup for a long time and yes it is MIT licensed. This is a 2 minute job to find this out. Thanks for the update anyway.

Please consider my comments above as supportive of this project moving forward, but I am pretty disappointed in the current state of the project. You need to realize that as an Engineer and mentor here I would like to see code. I've not seen a thing and we are over 2 months in.

Thanks
Lewis

