Re: [Nutch-dev] fetch unparse-able content

Byron Miller Wed, 19 May 2004 18:01:52 -0700

To me it seems best to have a core fetcher "class" plus
extensions that are specific to a given run and used
only for that run.

I ran a test of fetching with a regex exclusion of
everything but "http" paths (html only) and the spider
is screaming - doing about 300 docs/second.
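
As a rough illustration of that kind of exclusion (this is
not the filter used in the test above, and the rule list is
a made-up assumption rather than Nutch's real configuration),
a first-match-wins regex URL filter might look like this:

```python
import re

# Hypothetical rules for illustration only; the first matching rule wins.
# "-" rejects the URL, "+" accepts it.
RULES = [
    ("-", re.compile(r"\.(pdf|doc|ppt|xls|zip|gz|jpg|jpeg|gif|png|mp3|exe)(\?.*)?$",
                     re.IGNORECASE)),
    ("+", re.compile(r"^http://", re.IGNORECASE)),
]


def accept(url):
    """Return True if the first rule matching the URL is an accept rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # nothing matched: exclude by default


urls = [
    "http://example.com/index.html",
    "http://example.com/manual.pdf",
    "ftp://example.com/file.html",
    "http://example.com/docs/",
]
kept = [u for u in urls if accept(u)]
print(kept)
```

Rejecting by file extension is only a guess at the content
type, but it is cheap and happens before any traffic is spent.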

I'm at the point where I don't think the content
decision and processing point should be at the fetcher
level, but at segment generation.  Create a segment
specific to each content type so you can have a spider
fetch and process just that content.
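
A minimal sketch of per-type segment generation, assuming
the content type can be guessed from the URL extension (in
practice it is only known for sure once the server answers).
The extension table and function names are hypothetical:

```python
from collections import defaultdict
from posixpath import splitext
from urllib.parse import urlparse

# Hypothetical extension table; a real setup would need a fuller mapping.
EXTENSION_TYPES = {
    ".html": "text/html",
    ".htm": "text/html",
    ".pdf": "application/pdf",
    ".doc": "application/msword",
}


def guess_type(url):
    """Guess a content type from the URL path alone."""
    ext = splitext(urlparse(url).path)[1].lower()
    if ext == "":
        return "text/html"  # assumption: extensionless paths serve HTML
    return EXTENSION_TYPES.get(ext, "application/octet-stream")


def generate_segments(urls):
    """Group URLs into one fetch list (segment) per guessed content type."""
    segments = defaultdict(list)
    for url in urls:
        segments[guess_type(url)].append(url)
    return dict(segments)


segments = generate_segments([
    "http://example.com/",
    "http://example.com/guide.pdf",
    "http://example.com/about.html",
    "http://example.com/report.doc",
])
for content_type, fetch_list in segments.items():
    print(content_type, fetch_list)
```

Each resulting list could then be handed to a fetcher that
is tuned for that one content type.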

I don't think we want to work on scaling the fetcher
process with a huge backend when we have yet to have a
proven scalable DB in use :)

Why not just create a segment to fetch, put it on a
fetcher server and run a dedicated crawler that
handles that data type really well?  That beats trying
to broker the fetch process at run time, which could
become a nightmare to load balance and control, as
well as to performance tune and tweak for each data
type being grabbed.

Could the PDF fetcher rely on the core fetcher for the
base process and then have its own DOM/doc parser that
will make better decisions on what it is seeing?
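
One way to picture that split, as a sketch only: the class
and method names below are invented for illustration and are
not Nutch's API, and the download step is a stand-in so the
example runs offline.

```python
class Fetcher:
    """Core fetch loop; parse() is the extension point for each data type."""

    def __init__(self, download):
        self.download = download  # callable: url -> bytes

    def parse(self, raw):
        # Default behaviour: treat everything as HTML text.
        return {"type": "html", "length": len(raw)}

    def run(self, segment):
        # The base process is shared: download each URL, then parse it.
        return {url: self.parse(self.download(url)) for url in segment}


class PdfFetcher(Fetcher):
    """Reuses the core loop and only swaps in its own parser."""

    def parse(self, raw):
        # PDF files begin with the marker "%PDF-"; anything else is skipped.
        if not raw.startswith(b"%PDF-"):
            return {"type": "unparseable", "length": len(raw)}
        # Real text extraction would go here.
        return {"type": "pdf", "length": len(raw)}


def fake_download(url):
    """Stand-in for the network fetch (assumption, for the example only)."""
    return b"%PDF-1.4 fake body" if url.endswith(".pdf") else b"<html></html>"


results = PdfFetcher(fake_download).run([
    "http://example.com/a.pdf",
    "http://example.com/b.html",
])
print(results)
```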

This could make non-HTML doc types more relevant:
books, PDFs, manuals and such will most likely not
have the same incoming anchors or keyword densities,
but they may be just as relevant when scored against
other similar document types and then intermingled
with the search results, in the same way distributed
search sorts its results.
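
A small sketch of that idea, under the assumption that raw
scores are only comparable within one document type: each
type's hits are rescaled against the best hit of that type,
then the ranked lists are merged as a distributed search
would merge results from several servers. The scaling rule
and the sample scores are made up for illustration.

```python
import heapq


def normalize(hits):
    """Scale scores to [0, 1] relative to the best hit of the same type."""
    if not hits:
        return []
    top = max(score for _, score in hits)
    scaled = [(doc, score / top) for doc, score in hits]
    return sorted(scaled, key=lambda hit: hit[1], reverse=True)


def merge(*ranked_lists):
    """Merge lists that are already sorted by descending score."""
    return list(heapq.merge(*ranked_lists, key=lambda hit: hit[1], reverse=True))


# Made-up raw scores: the PDF scores are on a much smaller scale.
html_hits = [("page-a", 8.0), ("page-b", 4.0)]
pdf_hits = [("manual-x", 0.5), ("book-y", 0.375)]

merged = merge(normalize(html_hits), normalize(pdf_hits))
for doc, score in merged:
    print(doc, score)
```

Without the per-type scaling, every PDF here would sort
below every HTML page, regardless of how good a match it is.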

--- Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> Hi,
> something came to my mind just as I was
> turning off the light.
> In case we just 'stupidly' fetch and extract the
> content in a second
> process, we also fetch files we cannot handle.
> Since traffic is expensive we should only fetch files
> we can handle,
> right?
> 
> I see 2 solutions. The fetcher could ask the content
> extractor factory which
> MIME types are actually supported. Since the
> content extraction
> can be done on a second machine that shares
> network storage with the
> fetcher, this may be tricky.
> Another solution is to configure which MIME types
> are allowed to be
> fetched.
> 
> Maybe I'm wrong and we have no such problem, and
> I am overlooking something
> that already exists.
> Any comments?
> 
> Good night!
> Stefan
> 
> 
> ---------------------------------------------------------------
> open technology:   http://www.media-style.com
> open source:           http://www.weta-group.net
> open discussion:    http://www.text-mining.org

_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
