Very interesting, thanks!  I'm still working out the best way to do
this for my project, and a multi-pass solution might make sense.
Currently I'm using Python to fetch (urllib), parse (BeautifulSoup),
and post to Solr (solr.py).
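
A rough sketch of that pipeline, for the archives (2008-era Python 2 /
BeautifulSoup 3; the URL, the content div, and the solr.py calls are
assumptions from memory rather than tested code):

    # Fetch with urllib, parse with BeautifulSoup, post to Solr with solr.py.
    # The URL, field names, and solr.py API calls below are assumptions.
    import urllib
    from BeautifulSoup import BeautifulSoup
    import solr

    def fetch(url):
        # Grab the raw HTML
        return urllib.urlopen(url).read()

    def parse(html):
        # Keep just the parts we care about, skipping navigation chrome
        soup = BeautifulSoup(html)
        title = soup.find('title')
        content = soup.find('div', {'id': 'content'})  # hypothetical content div
        return {
            'title': title.string if title else '',
            'text': ''.join(content.findAll(text=True)) if content else '',
        }

    def post(url, doc, conn):
        # Send one document to Solr (add/commit usage assumed)
        conn.add(id=url, title=doc['title'], text=doc['text'])

    if __name__ == '__main__':
        conn = solr.SolrConnection('http://localhost:8983/solr')
        page = 'http://www.example.com/some-page.html'
        post(page, parse(fetch(page)), conn)
        conn.commit()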

On Wed, Jun 4, 2008 at 7:01 PM, James Moore <[EMAIL PROTECTED]> wrote:
> On Thu, May 29, 2008 at 11:53 PM,  <[EMAIL PROTECTED]> wrote:
>> We've successfully used:
>>
>> 1) Nutch to fetch + parse pages
>> 2) Custom Nutch2Solr indexer
>
> That's what I started out doing, but I hit a couple of issues.
>
> First, it turned out I needed to parse the pages myself.  The Nutch
> parse_text field is useful, but it tends to contain lots of navigation
> text from the pages I've been crawling.  I needed a subset of the
> page, plus more information from external sources.
>
> Right now it's a several-pass system: Nutch to fetch the data, then a
> pass to parse the raw HTML, then a pass to turn the parsed data into
> XML suitable for feeding into Solr.
>
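(A rough Python sketch of what that XML-building pass might produce; the
record fields here are made up, but the <add><doc><field name="..."> layout
is Solr's standard update format:)

    # Turn parsed records into Solr update XML.
    # The record keys and values are illustrative only.
    from xml.sax.saxutils import escape

    def to_solr_xml(records):
        parts = ['<add>']
        for rec in records:
            parts.append('  <doc>')
            for name, value in rec.items():
                parts.append('    <field name="%s">%s</field>'
                             % (name, escape(unicode(value))))
            parts.append('  </doc>')
        parts.append('</add>')
        return '\n'.join(parts)

    if __name__ == '__main__':
        docs = [{'id': 'http://www.example.com/some-page.html',
                 'title': 'Example page',
                 'text': 'Body text extracted in the parse pass'}]
        print to_solr_xml(docs)
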
> I don't send anything straight to Solr from Nutch; instead, I build
> the XML files in Hadoop, and once the Hadoop job is finished I post
> them to Solr using something simple like curl.
>
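(Posting a pre-built XML file to Solr's update handler with curl typically
looks like the following; the URL and file name are placeholders, and the
second request issues the commit that makes the new documents searchable:)

    curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary @docs.xml
    curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary "<commit/>"
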
> --
> James Moore | [EMAIL PROTECTED]
> Ruby and Ruby on Rails consulting
> blog.restphone.com
>
