Excellent, this looks like a thorough proposal...please submit! Jon
On Mon, 2007-03-26 at 03:28 -0300, Alan Kelon Oliveira de Moraes wrote:
> Hi, folks!
>
> Here goes my draft proposal.
>
> Including RDFa support in Nutch: Updating the CCNutch plug-in
> =============================================================
>
> Proponent: Alan Kelon Oliveira de Moraes <akom at cin.ufpe.br>
>
> Summary
> Project proposal
> Development process
> Deliverables
> Major milestones
> Past open source projects
> Resume
> Final self-advocacy
>
> Project proposal
> ================
>
> RDFa is a syntax that expresses semantics in structured data using a set
> of elements and attributes that embed RDF in HTML, such as a license on
> a document or a photo's creator name, camera setting information and its
> resolution.
>
> Nutch is an open-source search engine that uses Lucene for searching the
> Web or, in a customized form, an intranet or subset of the Web.
> CCNutch is a plug-in for Nutch to search Creative Commons content.
> Currently, CCNutch indexes only text documents and does not support RDFa
> very well. The inclusion of RDFa could be a great improvement because
> we could easily index image, audio and video through their RDFa meta-data
> and search them, increasing our range of searchable artifacts available
> under creative licenses. Enabling RDFa parsing and indexing in (CC)Nutch
> should therefore enable better structured search and sharing of artifacts.
>
> I have played with the Nutch and CCNutch code bases for only a few days.
> I admit that it was not enough to become an expert on them, but they do
> not seem to be very hard to extend. The first step is to add RDFa content
> into Nutch indexes. I think I could use – and extend whenever necessary –
> the RDFa extractor (http://sw-app.org/dev/RDFaExtractorCore.java), a Java
> implementation to extract RDFa information, and drop the current parsing
> implementation of CCNutch. I need more research to better evaluate this
> choice.
> Elias Torres developed an RDFa parser and its test suite
> (http://dev.torrez.us/public/2006/rdfa/) for the Python language, which
> may help too. These two implementations will speed things up and make
> learning the RDFa technology easier.
>
> The second step is to add specific query facilities to the web interface
> to enable searching of multimedia content by kind (images, audio and/or
> video) with different licensing restrictions.
>
> Development process
> -------------------
>
> I will follow the core of the Hukarz Software Development Process, a
> process for Open Source Software Factories (see my resume below) proposed
> in my M.Sc. degree. Hukarz is based on Scrum, so it is iterative and
> incremental like most OSS projects, and there are some additions to pure
> Scrum, such as the use of Software Configuration Management and Project
> Management disciplines. Each sprint – an iteration in Scrum jargon – is a
> small increment of the software that contains a little planning at its
> beginning, the development itself (design, coding and testing), and a
> release at its end. Each sprint will last 15 days to better track
> progress, and to share and discuss the software evolution with the
> community. Of course, I will host the project on [EMAIL PROTECTED] from
> the beginning, thus the code will be readily available for review.
>
> To further track my progress, I plan to post a short project status to
> the cc-devel mailing list on a weekly basis, as well as to contact the
> assigned mentor more frequently. I will also set up a blog to write about
> random thoughts, progress, and difficulties faced during the development.
>
> Deliverables
> ------------
>
> These are the artifacts I will deliver during the summer:
>   * Requirements document: details, at a high level of abstraction,
>     the features to be implemented;
>   * Architecture document: contains the technical solution to implement
>     the defined requirements;
>   * Project plan: a high-level plan of the development, including a
>     schedule (see Major milestones below) and risk assessment;
>   * Sprint plan: for each iteration, a 15-day plan identifying the
>     weekly milestones will be made to guide the implementation;
>   * Update the CCNutch wiki page (http://wiki.creativecommons.org/CcNutch);
>   * A patch to the Nutch community with RDFa support in the CCNutch plugin;
>   * A patch to the RDFa extractor if necessary.
>
> Major milestones
> ----------------
>
>   * March 26: Project proposal;
>   * April 11: Google announces the list of accepted student projects;
>
>   * April 19: Draft version of requirements document (this document will
>     be updated when necessary);
>   * April 27: Draft version of Architecture document (this document will
>     be updated when necessary);
>
>   * May 28: Project and mentoring start;
>   * June 11: 1st release:
>       * Requirements document;
>       * Architecture document;
>       * Prototype for proof of concept;
>   * June 25: 2nd release:
>       * Alpha release 1: indexing RDFa documents;
>   * July 9: 3rd release:
>       * Alpha release 2: finish of RDFa indexing and start of RDFa
>         searching;
>       * Google’s mid-term checkpoint;
>
>   * July 16:
>       * Mid-term mentor evaluation deadline;
>       * Revised documentation (really close to final I expect);
>   * July 23: 4th release:
>       * Beta release 1: Search RDFa indexes;
>   * August 6: 5th release:
>       * Beta release 2;
>   * August 20: 6th (final) release;
>       * Google’s final checkpoint;
>
>   * August 31: Final mentor evaluation deadline.
>
>
> ...
> To see my experience with Open Source, please check
> http://www.cin.ufpe.br/~akom/soc07-proposal/soc07-proposal.txt (or
> http://www.cin.ufpe.br/~akom/soc07-proposal/soc07-proposal.pdf for a fancy
> version ;)
>
>
> Best regards,

-- 
Jon Phillips

San Francisco, CA
USA
PH 510.499.0894
[EMAIL PROTECTED]
http://www.rejon.org

MSN, AIM, Yahoo Chat: kidproto
Jabber Chat: [EMAIL PROTECTED]
IRC: [EMAIL PROTECTED]
_______________________________________________
cc-devel mailing list
[email protected]
http://lists.ibiblio.org/mailman/listinfo/cc-devel
