Re: Limiting search/crawl to specific language

Byron Miller Tue, 03 Jan 2006 21:25:20 -0800

Is it possible to use the webdb (link db) to add rules
that are applied as you recursively process your data,
based upon link analysis?

Such as adding a field for language so you can either
score based on language (using language as your
identifying "top" to build your link analysis from) or
simply ignore a language/TLD or some other field
altogether in the generate process (essentially not
refetching, or programmatically refetching, or scaling
your fetch on some priority)?
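
To make the idea concrete, here is a rough sketch in Python
rather than actual Nutch code. Everything in it is hypothetical:
PageRecord, its language field, tld_of, generate_fetch_list and
the unknown_penalty factor are made-up names, not part of the
webdb. It assumes each record already carries a score and an
optional detected language.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass
class PageRecord:
    """Hypothetical page entry; not the real webdb record layout."""
    url: str
    score: float
    language: Optional[str] = None  # None means "not detected yet"


def tld_of(url: str) -> str:
    """Return the last label of the host name, e.g. 'com' or 'de'."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower() if "." in host else ""


def generate_fetch_list(records, wanted_language="en",
                        skipped_tlds=frozenset(),
                        unknown_penalty=0.5, top_n=1000):
    """Pick URLs to fetch, dropping unwanted languages and TLDs.

    Pages with no detected language are kept at a reduced priority
    so they can still be fetched and classified later.
    """
    candidates = []
    for rec in records:
        if tld_of(rec.url) in skipped_tlds:
            continue  # ignore this TLD altogether
        if rec.language is None:
            priority = rec.score * unknown_penalty
        elif rec.language == wanted_language:
            priority = rec.score
        else:
            continue  # known to be another language: never refetch
        candidates.append((priority, rec.url))
    # Highest priority first; URL breaks ties deterministically.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return [url for _, url in candidates[:top_n]]
```

Called with skipped_tlds={"ru"}, this would drop every .ru URL
and every page already identified as non-English, while still
scheduling undetected pages at half their score.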

Does Google have a page rank based upon language, or is
it based upon the entire dataset?  From what I can find
out, Teoma says there are 2 billion English-only pages
and the rest are non-English; is that a safe
assumption, or just what Teoma has been able to
process? I ask because it would mean simplifying
your processing to manageable per-language datasets
compared to trying the "whole internet" in a
single dataset.
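
As a sketch of what per-language processing could look like
(again Python, and an illustration only): split the link graph by
language, then run link analysis on each piece separately. The
pagerank function is a generic textbook power iteration, not
Google's or Nutch's actual method. The function names, the
language_of mapping, and the choice to drop cross-language links
are all assumptions made for the example.

```python
from collections import defaultdict


def split_links_by_language(links, language_of):
    """Group links whose source and target share a known language.

    Cross-language links and links touching pages of unknown
    language are dropped (a simplifying assumption).
    """
    by_lang = defaultdict(list)
    for src, dst in links:
        lang = language_of.get(src)
        if lang is not None and lang == language_of.get(dst):
            by_lang[lang].append((src, dst))
    return dict(by_lang)


def pagerank(links, damping=0.85, iterations=50):
    """Generic power-iteration PageRank over a list of (src, dst)."""
    nodes = sorted({n for edge in links for n in edge})
    if not nodes:
        return {}
    out = defaultdict(list)
    for src, dst in links:
        out[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages with no outlinks is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out[node])
        base = (1.0 - damping) / n + damping * dangling / n
        nxt = {node: base for node in nodes}
        for src in nodes:
            for dst in out[src]:
                nxt[dst] += damping * rank[src] / len(out[src])
        rank = nxt
    return rank


def rank_per_language(links, language_of):
    """Run the link analysis separately on each language's subgraph."""
    parts = split_links_by_language(links, language_of)
    return {lang: pagerank(part) for lang, part in parts.items()}
```

Each language's subgraph is ranked on its own, so no single run
has to hold the "whole internet"; the cost is that links between
languages contribute nothing to the scores.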

I also think it's more feasible for me to focus on
English-based sites as a whole, since the cultural
differences and laws are enough to shy me away from
the legal messes other nations can
potentially enforce or imply :)

-byron

--- Byron Miller <[EMAIL PROTECTED]> wrote:

> I know you can enable language detection during
> index-more; however, is there a method for doing this
> during the crawl?
> 
> I'm interested in building an index as English only
> right now. What is the theory behind that? Anyone
> have any experience?
> 
> Would it be building a huge blacklist, ignoring
> TLDs until you find a computational method, or???
> Thoughts, anyone?
>
