Hi Ian,

Thanks for sharing your work and experience. Do you use a fixed set of sites and data formats or extensions for data extraction, or can you also discover new data casts on the web?
Cheers,

-----Original message-----
> From: Ian Truslove <[email protected]>
> Sent: Wed 18-Jul-2012 17:03
> To: Mattmann, Chris A (388J) <[email protected]>;
> <[email protected]> <[email protected]>
> Cc: Ruth Duerr <[email protected]>
> Subject: Re: Apache Nutch being used at National Snow and Ice Data Center:
> ESIP Federation
>
> Chris: message received - I signed up :)
>
> As part of Ruth's Libre project (http://nsidc.org/libre/) we are using
> Nutch to find various types of XML data. We're targeting our search at
> geospatial data, and more specifically cryospheric data, but the tools
> will remain more broadly applicable. Specifically we are looking for ESIP
> data casts, collection casts, service casts, and ESIP Discovery OpenSearch
> services (all the specs are in
> http://wiki.esipfed.org/index.php/Discovery_Cluster). These XML documents
> and services are characterizable through fairly simple means such as XML
> namespaces.
>
> We are currently developing against the Nutch 1.4 tarball distribution
> (SVN HEAD was moving quicker than our configuration could keep up with)
> and plugging into a standalone Solr instance.
>
> What we have done to date is do some basic configuration work, set the
> code up to play nice(-ish) with Eclipse, our internal SVN, and our
> CI/deployment system, and write some plugins to help us find our various
> XML docs. We wrote a pair to extract and index the full raw XML content
> of the source document, extending the HtmlParseFilter and IndexingFilter
> respectively. XML (and of course HTML too) is just wrapped within a
> CDATA section (and CDATA sections within the document are just removed),
> and indexed as a big text blob in Solr. We can do naive text matching and
> are having success extracting the URLs of the data feeds we're after.
>
> We also wrote a pair of plugins to keep track of the original index date
> of a document (the overarching use case is to determine documents that are
> newly found). We used the ScoringFilter and IndexingFilter for those.
>
> Planned work includes extracting data from the XML before indexing and
> using Solr fields more effectively, indexing GCMD keywords, simple spatial
> subsetting, and tweaking the ranking algorithms to do a broad search to
> identify good sites for deep data searches.
>
> Thanks for the interest - it's been a fun project to work on so far, and
> I'm sure we'd be happy to talk more or provide more details.
>
> -Ian.
>
> --
> Ian Truslove
> Senior Software Engineer
> National Snow and Ice Data Center
> University of Colorado
> 449 UCB, Boulder, CO 80309
>
> On 7/17/12 9:38 PM, "Mattmann, Chris A (388J)"
> <[email protected]> wrote:
>
> >Hi Markus,
> >
> >Great question. I am CC'ing Ruth Duerr and Ian Truslove at
> >NSIDC -- maybe they can provide more information?
> >
> >Ruth, Ian, please consider subscribing to [email protected] and/or
> >[email protected]
> >by sending blank emails to:
> >
> >[email protected]
> >[email protected]
> >
> >To follow along in the conversation.
> >
> >Thanks all!
> >
> >Cheers,
> >Chris
> >
> >On Jul 17, 2012, at 5:27 PM, Markus Jelsma wrote:
> >
> >> Cool!
> >>
> >> What are they exactly doing with Apache Nutch? And, more interesting,
> >>what non-standard stuff do they use?
> >>
> >> Cheers
> >>
> >> -----Original message-----
> >>> From: Mattmann, Chris A (388J) <[email protected]>
> >>> Sent: Tue 17-Jul-2012 21:29
> >>> To: [email protected]
> >>> Subject: Apache Nutch being used at National Snow and Ice Data Center:
> >>>ESIP Federation
> >>>
> >>> Hey Folks,
> >>>
> >>> Ruth Duerr is presenting at today's ESIP Federation and Discovery
> >>>Hackathon:
> >>>
> >>> http://commons.esipfed.org/node/424
> >>>
> >>> The U.S. National Snow and Ice Data Center (NSIDC) is deploying Apache
> >>>Nutch and
> >>> Solr to support discovery of datasets (called "casting").
> >>>
> >>> Really interesting stuff, and worth contacting Ruth and NSIDC if
> >>>you're interested.
> >>> I'm highly suggesting to the NSIDC folks to try and contribute any
> >>>updates or plugins
> >>> they are making to the software upstream here to the ASF.
> >>>
> >>> Thanks!
> >>>
> >>> Cheers,
> >>> Chris
> >>>
> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>> Chris Mattmann, Ph.D.
> >>> Senior Computer Scientist
> >>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >>> Office: 171-266B, Mailstop: 171-246
> >>> Email: [email protected]
> >>> WWW: http://sunset.usc.edu/~mattmann/
> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>> Adjunct Assistant Professor, Computer Science Department
> >>> University of Southern California, Los Angeles, CA 90089 USA
> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
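
For anyone following along, here is a rough sketch of the two ideas in Ian's message: recognising candidate documents by their XML namespaces, and wrapping the raw document in a single CDATA section after removing any CDATA sections it already contains. This is not NSIDC's plugin code. A real Nutch plugin would be written in Java against the HtmlParseFilter and IndexingFilter extension points; the Python below only illustrates the logic. The function names and the namespace list are placeholders chosen for illustration, and "removed" is read here as dropping the whole inner CDATA section, which is one possible interpretation.

```python
import re
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumption): Atom and OpenSearch 1.1.
# The namespaces that actually identify ESIP casts are defined in the ESIP specs.
TARGET_NAMESPACES = {
    "http://www.w3.org/2005/Atom",
    "http://a9.com/-/spec/opensearch/1.1/",
}

# Matches one complete CDATA section, including its content.
CDATA_SECTION = re.compile(r"<!\[CDATA\[.*?\]\]>", re.DOTALL)


def namespaces_in(raw):
    """Return the namespace URIs used by element tags in raw.

    Returns an empty set when raw is not well-formed XML.
    """
    try:
        root = ET.fromstring(raw)
    except ET.ParseError:
        return set()
    found = set()
    for element in root.iter():
        tag = element.tag
        # Namespaced tags are reported by ElementTree as "{uri}localname".
        if isinstance(tag, str) and tag.startswith("{"):
            found.add(tag[1:tag.index("}")])
    return found


def looks_like_cast(raw):
    """True if the document uses at least one of the target namespaces."""
    return bool(namespaces_in(raw) & TARGET_NAMESPACES)


def wrap_in_cdata(raw):
    """Drop CDATA sections inside raw, then wrap the rest in one CDATA section."""
    stripped = CDATA_SECTION.sub("", raw)
    return "<![CDATA[" + stripped + "]]>"
```

Removing the inner sections first matters because CDATA sections cannot nest: an inner `]]>` would otherwise close the outer wrapper early. The wrapped string can then be stored as a single text field and searched with plain text matching, as Ian describes.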

