Hi Remzi,

Sure!
Check out this 5-minute guide to writing a parser in Tika:

https://tika.apache.org/1.7/parser_guide.html

OK, so then check out Any23:

http://any23.apache.org/

It has support for parsing RDF Microformats. So, you may want to
create a MicroformatsParser in Tika; then if it’s supported in Tika,
it will in turn be available in Nutch and its parse-tika plugin if
you upgrade it to the latest version of Tika. You can see how to do
this here:

http://s.apache.org/fsY

Cheers and best of luck - hope that’s enough to get your proposal
kicked off.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Remzi Düzağaç <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, March 27, 2015 at 7:22 AM
To: dev <[email protected]>
Cc: "[email protected]" <[email protected]>, "[email protected]" <[email protected]>
Subject: Re: GSOC RDF Microformats Support

>Hi Chris,
>
>Thanks for your feedback.
>I was planning to use any23 and tika but I don't have a detailed grasp of
>both projects. I guess I'm gonna need to dive into both.
>
>I would appreciate it if you could guide me
>
>thanks
>
>On Fri, Mar 27, 2015 at 4:07 PM, Mattmann, Chris A (3980)
><[email protected]> wrote:
>
>Hi Remzi - thanks! You may want to consider this as a Tika or
>Any23 project since Nutch delegates its parsing to Tika (and
>Any23 uses Tika [and vice versa] to handle micro formats).
>
>Cheers,
>Chris
>
>-----Original Message-----
>From: Remzi Düzağaç <[email protected]>
>Reply-To: "[email protected]" <[email protected]>
>Date: Friday, March 27, 2015 at 5:07 AM
>To: "[email protected]" <[email protected]>
>Subject: GSOC RDF Microformats Support
>
>>Hi Guys,
>>
>>I have sent a proposal to gsoc. I would like to add rdf microformat
>>support to nutch. I kindly ask for your support. Is there anyone
>>willing to volunteer to be my mentor on this topic?
>>
>>Thank you very much
