Hi, John,

This is an old message, but I just saw it, so I will respond ;-)

My personal opinion is that you would be better off rolling your own crawler,
because it sounds like you want a lot of customization beyond what standard
Nutch, or even Heritrix, provides. I am not sure it is worth the effort to
hack either of those two crawlers.

I rolled my own crawler so that I could crawl multiple sites at the same
time, using MySQL as the backend to store the content. It has worked pretty
well for me. You can take a look at www.coolposting.com for a demo of the
content search built on my own crawler.

The nice thing about using MySQL as the backend is that you have a single
point of entry for storing the content, and you can distribute the crawlers
across multiple machines, all pointing to the same MySQL database via JDBC
calls. It is really easy to set up.
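
For a rough picture of what that single entry point can look like, here is a
minimal sketch. It uses Python with a DB-API MySQL driver rather than the
JDBC calls mentioned above, and the table name and columns are made-up
placeholders, not the actual schema:

    # Each crawler process, on any machine, writes fetched pages into one
    # central MySQL table. The table and its columns are assumptions.
    import MySQLdb  # any DB-API 2.0 MySQL driver works the same way

    def store_page(conn, url, html):
        cur = conn.cursor()
        # assumed schema: pages(url VARCHAR(767) PRIMARY KEY, html MEDIUMTEXT)
        cur.execute(
            "INSERT INTO pages (url, html) VALUES (%s, %s)"
            " ON DUPLICATE KEY UPDATE html = VALUES(html)",
            (url, html))
        conn.commit()

    # every crawler machine points at the same central database
    conn = MySQLdb.connect(host="db-host", user="crawler",
                           passwd="secret", db="crawl")
    store_page(conn, "http://www.example.com/", "<html>...</html>")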

If you are interested, let me know by email and I can share some more ideas.

Jian

On 4/26/07, John Kleven <[EMAIL PROTECTED]> wrote:
>
> I guess I'm not getting much traction (apologies if this is too
> off-topic), but I'll post one more note before I quit :)
>
> I was looking at the Heritrix crawler as another potential solution
> that, like the Nutch crawler, has more powerful features than wget in
> "mirroring" mode.
>
> Has anyone had any experience with the pros/cons of the Nutch fetcher vs
> Heritrix?  It seems like these two are the larger (largest?) open source
> crawlers that are still actively maintained and can do things like
> JavaScript link extraction and frames, avoid crawler traps, etc.
>
> John
>
>
> On 4/25/07, John Kleven <[EMAIL PROTECTED]> wrote:
> > Any other opinions on how smart/stupid it is to use the Nutch
> > crawler/fetcher exclusively, without the indexer/deduper etc.?  I.e.,
> > is it worth the trouble?  I just need a solid crawler to pull down
> > HTML pages.
> >
> > I spent some time with wget today, including hacking in some missing
> > features (apparently there hasn't been a maintainer in a while); it
> > seems pretty legit for mirroring.  However, unlike the Nutch crawler,
> > there's no JavaScript link "extraction" (I know, I know, it's just a
> > regex).  There's also no way to say "only grab 50,000 pages max"; the
> > only control is depth level (although I'm sure I could hack that in as
> > well).  It's also missing any logic to avoid going down a recursive
> > HTML trap.
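
As an aside, one crude way to get a "grab at most N pages" behavior without
patching wget is to wrap it in a small script that watches the mirror
directory and stops wget once it passes the cap. A rough Python sketch,
where the URL and paths are placeholders and this only approximates the
in-wget feature discussed above:

    import os
    import subprocess
    import time

    def mirror_with_cap(url, dest, max_pages=50000):
        # Run wget in mirror mode and terminate it once the mirror
        # directory holds max_pages files. Crude, but it bounds the crawl.
        proc = subprocess.Popen(["wget", "--mirror", "--no-parent",
                                 "-P", dest, url])
        while proc.poll() is None:
            count = sum(len(files) for _, _, files in os.walk(dest))
            if count >= max_pages:
                proc.terminate()
                break
            time.sleep(5)
        return proc.wait()

    mirror_with_cap("http://www.example.com/", "mirror/example.com")
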
> >
> > Apparently Nutch and wget can both do frames and cookies as well ... a
> > tie there I guess.
> >
> > If anyone wants to chime in ... what would you use?  The Nutch crawler
> > hacked up a bit, wget, or ... something else?  Again, I've got about
> > 3500 domains to crawl, and each one is a large/dynamic site.
> >
> > Thanks all,
> > John
> >
> > On 4/25/07, John Kleven <[EMAIL PROTECTED]> wrote:
> > > Interesting idea, a few negatives:
> > >
> > > 1) You have to roll your own "only hit one domain at a time" (i.e.,
> > > politeness) logic into it.
> > > 2) No PDF/Word file parsing.
> > > 3) What about support for browser/spider traps, i.e., recursive loops?
> > > 4) Scalability on 3000+ large domains? We're talking millions of URLs
> > > here.
> > > 5) No JS link extraction (although I'm not sure how solid that really
> > > is in Nutch anyway).
> > >
> > > Positives: wget is obviously simple ... I just assumed that the Nutch
> > > fetcher would be more advanced.  Am I mistaken?
> > >
> > > I'm assuming that Nutch can do cookies and frames as well??
> > >
> > > Thanks,
> > > John
> > >
> > > On 4/25/07, Briggs <[EMAIL PROTECTED]> wrote:
> > > > If you are just looking to have a seed list of domains, and would
> > > > like to mirror their content for indexing, why not just use the
> > > > unix tool 'wget'?  It will mirror the site on your system and then
> > > > you can just index that.
> > > >
> > > >
> > > >
> > > >
> > > > On 4/25/07, John Kleven <[EMAIL PROTECTED]> wrote:
> > > > > Hello,
> > > > >
> > > > > I am hoping to crawl about 3000 domains using the Nutch crawler +
> > > > > PrefixURLFilter; however, I have no need to actually index the
> > > > > HTML.  Ideally, I would just like each domain's raw HTML pages
> > > > > saved into separate directories.  We already have a parser that
> > > > > converts the HTML into indexes for our particular application.
> > > > >
> > > > > Is there a clean way to accomplish this?
> > > > >
> > > > > My current idea is to create a Python script (similar to the one
> > > > > already on the wiki) that essentially loops through the fetch and
> > > > > update cycles until the depth is reached, and then simply never
> > > > > does the real Lucene indexing and merging.  Now, here's the "there
> > > > > must be a better way" part ... I would then execute the
> > > > > "bin/nutch readseg -dump" tool via Python to extract all the HTML
> > > > > and headers (for each segment) and then, via a regex, save each
> > > > > HTML output back into an HTML file, storing it in a directory
> > > > > according to the domain it came from.
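
For concreteness, a rough sketch of that driver loop: it shells out to the
standard Nutch commands for the fetch/update cycle and then dumps each
segment with readseg instead of running the indexer. The paths, depth, and
-topN value are placeholders (and exact command spellings can differ between
Nutch versions), so treat this as a sketch rather than a tested setup:

    import os
    import subprocess

    def nutch(*args):
        # helper: run a Nutch subcommand and fail loudly if it errors out
        subprocess.check_call(["bin/nutch"] + list(args))

    def crawl(depth=3, topn=50000):
        nutch("inject", "crawl/crawldb", "urls")
        for _ in range(depth):
            nutch("generate", "crawl/crawldb", "crawl/segments",
                  "-topN", str(topn))
            # segments are named by timestamp, so the newest directory
            # under crawl/segments is the one just generated
            segment = os.path.join("crawl/segments",
                                   sorted(os.listdir("crawl/segments"))[-1])
            nutch("fetch", segment)
            nutch("updatedb", "crawl/crawldb", segment)
            # dump raw HTML + headers instead of indexing
            nutch("readseg", "-dump", segment, segment + "_dump")

    crawl()

Splitting the readseg dump back into per-domain HTML files would then be the
separate regex pass described above.
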
> > > > >
> > > > > How stupid/slow is this?  Any better ideas?  I saw someone
> > > > > previously mention something like what I want to do, and someone
> > > > > responded that it was better to just roll your own crawler or
> > > > > something; I doubt that for some reason.  Also, in the future we'd
> > > > > like to take advantage of the Word/PDF downloading/parsing as
> > > > > well.
> > > > >
> > > > > Thanks for what appears to be a great crawler!
> > > > >
> > > > > Sincerely,
> > > > > John
> > > > >
> > > >
> > > >
> > > > --
> > > > "Conscious decisions by conscious minds are what make reality real"
> > > >
> > >
> >
>
