Interesting idea, but a few negatives:

1) You have to roll your own "only hit one domain at a time" politeness into it.
2) No PDF/Word file parsing.
3) What about browser/spider traps, i.e., recursive loops?
4) How does it scale on 3000+ large domains? We're talking millions of URLs here.
5) No JS link extraction (although I'm not sure how solid that really is in Nutch anyway).
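To be concrete about point 1: "politeness" would end up being something you script around wget yourself, e.g. one mirror per domain with a fixed per-request delay. Just a rough, untested sketch -- domains.txt and the mirror/ layout are made up, and real use would still need cross-domain parallelism, retries, etc.:

#!/usr/bin/env python
# Hand-rolled "politeness": one wget mirror per domain, run sequentially,
# with a delay between requests to the same host.
import subprocess

with open("domains.txt") as f:                 # placeholder seed list, one domain per line
    domains = [line.strip() for line in f if line.strip()]

for domain in domains:
    subprocess.call([
        "wget",
        "--mirror",                            # recursive fetch with timestamping
        "--wait=1",                            # 1 second between requests to this host
        "--random-wait",                       # jitter the delay
        "--domains=" + domain,                 # don't wander off-site
        "--directory-prefix=mirror/" + domain, # placeholder output layout
        "http://" + domain + "/",
    ])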
Positives are that wget is obviously simple ... I just assumed that the Nutch fetcher would be more advanced. Am I mistaken? I'm assuming that Nutch can handle cookies and frames as well?

Thanks,
John

On 4/25/07, Briggs <[EMAIL PROTECTED]> wrote:
> If you are just looking to have a seed list of domains, and would like
> to mirror their content for indexing, why not just use the unix tool
> 'wget'? It will mirror the site on your system and then you can just
> index that.
>
> On 4/25/07, John Kleven <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> > I am hoping to crawl about 3000 domains using the Nutch crawler +
> > PrefixURLFilter; however, I have no need to actually index the HTML.
> > Ideally, I would just like each domain's raw HTML pages saved into
> > separate directories. We already have a parser that converts the HTML
> > into indexes for our particular application.
> >
> > Is there a clean way to accomplish this?
> >
> > My current idea is to create a python script (similar to the one already
> > on the wiki) that essentially loops through the fetch/update cycles until
> > the depth is reached, and then simply never does the actual Lucene
> > indexing and merging. Now, here's the "there must be a better way" part
> > ... I would then execute the "bin/nutch readseg -dump" tool via python to
> > extract all the HTML and headers (for each segment) and then, via a
> > regex, save each HTML page back out to a file in a directory named for
> > the domain it came from.
> >
> > How stupid/slow is this? Any better ideas? I saw someone previously
> > mention something like what I want to do, and someone responded that it
> > was better to just roll your own crawler or something? I doubt that for
> > some reason. Also, in the future we'd like to take advantage of the
> > Word/PDF downloading/parsing as well.
> >
> > Thanks for what appears to be a great crawler!
> >
> > Sincerely,
> > John
>
> --
> "Conscious decisions by conscious minds are what make reality real"
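For reference, here is roughly what the fetch/update loop described above could look like. This is an untested sketch only, assuming Nutch 0.8/0.9-style commands (inject, generate, fetch, updatedb, readseg -dump); the install path, seed directory, and depth are placeholders, and splitting the readseg dump out per domain is still left to a separate regex pass as described.

#!/usr/bin/env python
# Sketch of the generate/fetch/update loop, skipping Lucene indexing entirely.
import os
import subprocess

NUTCH_HOME = "/opt/nutch"       # placeholder install location
CRAWLDB = "crawl/crawldb"
SEGMENTS = "crawl/segments"
DEPTH = 3                       # number of generate/fetch/update rounds

def nutch(*args):
    """Run a bin/nutch command and fail loudly if it returns non-zero."""
    subprocess.check_call([os.path.join(NUTCH_HOME, "bin/nutch")] + list(args))

# Seed the crawldb once from the urls/ directory of seed lists (placeholder).
nutch("inject", CRAWLDB, "urls")

for _ in range(DEPTH):
    # Generate a new segment of URLs due for fetching.
    nutch("generate", CRAWLDB, SEGMENTS)
    # Segment directories are timestamped, so the lexicographically last
    # one is the segment that was just generated.
    segment = os.path.join(SEGMENTS, sorted(os.listdir(SEGMENTS))[-1])
    # Fetch it, then fold the results back into the crawldb.
    nutch("fetch", segment)
    nutch("updatedb", CRAWLDB, segment)
    # Dump the raw content (headers + HTML) as text for our own parser;
    # a follow-up script would split this dump into per-domain files.
    nutch("readseg", "-dump", segment, segment + "_dump")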
