[Nutch-general] Re: Limiting search/crawl to specific language

Byron Miller Tue, 03 Jan 2006 21:26:08 -0800

Is it possible to use the webdb (link db) to add rules
that are applied as you recursively process your data
based upon link analysis?

Such as adding a field for language so you can either
score based on language (using language as your
identifying "top" to build your link analysis from) or
simply ignore a language/TLD or some other field
altogether in the generate process (essentially not
refetching or pragmatically refetching or scaling your
fetch on some priority)?
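
To make the idea concrete, here is a rough sketch in Python of what such
a rule could look like at generate time. It is not Nutch code: PageRecord,
generate_fetch_list, the language field and the unknown_penalty value are
all hypothetical names and numbers made up for illustration. It assumes
each link-db entry already carries a score and, optionally, a detected
language.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set
from urllib.parse import urlparse


@dataclass
class PageRecord:
    """Hypothetical stand-in for a link-db entry; not a Nutch class."""
    url: str
    score: float                    # link-analysis score
    language: Optional[str] = None  # None means not detected yet


def tld_of(url: str) -> str:
    """Return the last label of the host name, lowercased."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def generate_fetch_list(
    records: Iterable[PageRecord],
    wanted_language: str = "en",
    ignored_tlds: Set[str] = frozenset(),
    unknown_penalty: float = 0.5,   # assumed value, tune to taste
    limit: Optional[int] = None,
) -> List[PageRecord]:
    """Pick pages to fetch, highest priority first."""
    candidates = []
    for rec in records:
        # Rule 1: ignore whole TLDs outright.
        if tld_of(rec.url) in ignored_tlds:
            continue
        if rec.language is None:
            # Rule 2: unknown language is still fetched, at lower priority.
            priority = rec.score * unknown_penalty
        elif rec.language == wanted_language:
            priority = rec.score
        else:
            # Rule 3: pages known to be in another language are skipped.
            continue
        candidates.append((priority, rec))
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    ordered = [rec for _, rec in candidates]
    return ordered if limit is None else ordered[:limit]
```

Pages with no detected language are kept at a reduced priority rather than
dropped, since the language can only be known after a first fetch.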

Does Google have a page rank based upon language, or is
it based upon the entire dataset?  From what I can find
out, Teoma says there are 2 billion English-only pages
and the rest are non-English; is that a safe
assumption or just what Teoma has been able to
process? I ask because it would mean simplifying
your processing to manageable datasets for many
languages compared to trying the "whole internet" in a
single dataset.

I also think it's more feasible for me to focus on
English-based sites as a whole, since the cultural
differences and laws are enough to shy me away from
getting into the legal messes other nations can
potentially enforce or imply :)

-byron

--- Byron Miller <[EMAIL PROTECTED]> wrote:

> I know you can enable language detect during
> index-more; however, is there a method for doing this
> during the crawl?
> 
> I'm interested in building an index as English only
> right now. What is the theory behind that? Anyone
> have any experience?
> 
> Would it be building a huge blacklist, ignoring
> TLDs until you find a computational method, or???
> Thoughts, anyone?
> 

