To me it seems best to have a core fetcher "class" plus extensions that you plug in for a specific run and only that run.
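A rough sketch of what I mean, in Python for brevity. This is not Nutch code: every name here (`Fetcher`, `build_segments`, `html_extension`, `pdf_extension`, the URL patterns and the fake in-memory "web") is hypothetical and only illustrates the split between a shared core fetch loop and an extension chosen once per run, with the fetch list divided into one segment per content type.

```python
import re
from typing import Callable, Dict, Iterable, List

# Hypothetical names throughout; this is an illustration, not the Nutch API.


class Fetcher:
    """Core fetch loop shared by every run; it knows nothing about content types."""

    def __init__(self, download: Callable[[str], bytes]):
        # The download function is injected so the sketch runs without a network.
        self.download = download

    def run(self, segment: Iterable[str],
            extension: Callable[[str, bytes], dict]) -> List[dict]:
        # The extension is picked once for the whole run, not brokered per document.
        return [extension(url, self.download(url)) for url in segment]


def html_extension(url: str, body: bytes) -> dict:
    """Run-specific extension for html: pull out the outgoing anchors."""
    text = body.decode("utf-8", errors="replace")
    return {"url": url, "type": "html",
            "anchors": re.findall(r'href="([^"]+)"', text)}


def pdf_extension(url: str, body: bytes) -> dict:
    """Run-specific extension for pdf: only checks the file signature here."""
    return {"url": url, "type": "pdf",
            "is_pdf": body.startswith(b"%PDF-"), "size": len(body)}


def build_segments(urls: Iterable[str],
                   patterns: Dict[str, str]) -> Dict[str, List[str]]:
    """Split a fetch list into one segment per content type by URL regex."""
    compiled = {name: re.compile(p, re.IGNORECASE) for name, p in patterns.items()}
    segments: Dict[str, List[str]] = {name: [] for name in patterns}
    for url in urls:
        for name, rx in compiled.items():
            if rx.search(url):
                segments[name].append(url)
                break  # first matching pattern wins; unmatched URLs are skipped
    return segments


# Fake "web" standing in for real HTTP fetches (assumption for the example).
FAKE_WEB = {
    "http://example.com/index.html": b'<a href="/manual.pdf">manual</a>',
    "http://example.com/manual.pdf": b"%PDF-1.4 ...",
}
PATTERNS = {"pdf": r"\.pdf$", "html": r"(\.html?|/)$"}

segments = build_segments(FAKE_WEB, PATTERNS)
fetcher = Fetcher(FAKE_WEB.__getitem__)
html_docs = fetcher.run(segments["html"], html_extension)  # the html-only run
pdf_docs = fetcher.run(segments["pdf"], pdf_extension)     # the pdf-only run
```

Each run could then live on its own fetcher server and be tuned for its data type, while the core loop stays identical.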
I ran a test fetch with a regex exclusion of everything but "http" paths (html only) and the spider is screaming, doing about 300 docs/second.

I'm at the point where I don't think the content decision and processing point should be at the fetcher level; it should be at segment generation. Create a segment specific to the content type so you can have a spider fetch and process that content. I don't think we want to work on scaling the fetcher process with a huge backend when we have yet to have a proven scalable DB in use :)

Why not just create a segment to fetch, put it on a fetcher server, and run a dedicated crawler that handles that data type really well, instead of trying to broker the fetch process at run time? Brokering could be a nightmare to load balance and control, as well as to performance tune and tweak for each data type being grabbed.

Could the pdf fetcher rely on the core fetcher for the base process and then have its own dom/doc parser that makes better decisions about what it is seeing? That could lead to better relevance for non-html doc types: books, pdfs, manuals and such will most likely not have the same incoming anchors or keyword densities, but they may be just as relevant when scored against other similar document types and then intermingled with the search results, in the same fashion that distributed search sorts its results.

--- Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> Hi,
> something comes in my mind just in the moment i was
> deleting the light.
> In case we just 'stupid' fetch and extract the
> content in a second
> process we fetch files we can not handle as well.
> Since traffic is expansive we should only fetch file
> we can handle,
> right?
>
> I see 2 solutions the fetcher ask the content
> extractor factory what
> kind of mime types actually are supported. Since the
> content extraction
> can be done on a second machine that shares a
> Network storage with the
> fetcher this is may be tricky.
> A other solution is to setup what kind of mime type
> are allowed to
> fetch.
>
> May be I'm wrong and we have not such a problem and
> i oversee something
> that already exist.
> Any comments?
>
> Good night!
> Stefan
>
> ---------------------------------------------------------------
> open technology: http://www.media-style.com
> open source: http://www.weta-group.net
> open discussion: http://www.text-mining.org

_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
