Is it possible to use the webdb (link db) to add rules as you recursively process your data based on link analysis?
For example, adding a field for language so you can either score based on language (using language as your identifying "top" to build your link analysis from) or simply ignore a language/TLD or some other field altogether in the generate process (essentially not refetching, pragmatically refetching, or scaling your fetch on some priority)?

Does Google have a page rank based upon language, or is it based upon the entire dataset? From what I can find out, Teoma says there are 2 billion English-only pages and the rest are non-English; is that a safe assumption, or just what Teoma has been able to process? I ask because it would mean simplifying your processing into manageable datasets for many languages, compared to trying the "whole internet" in a single dataset.

I also think it's more feasible for me to focus on English-based sites as a whole, since the cultural differences and laws are enough to shy me away from getting into the legal messes other nations can potentially enforce or imply :)

-byron

--- Byron Miller <[EMAIL PROTECTED]> wrote:

> I know you can enable language detect during
> index-more, however is there a method for doing this
> during the crawl?
>
> I'm interested in building an index as English only
> right now. What is the theory behind that? Anyone
> have any experience?
>
> Would it be building a huge blacklist, ignoring
> TLDs until you find a computational method, or???
> Thoughts anyone?
>
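
To make the idea concrete, here is a rough sketch of the kind of rule I have in mind for the generate step. It is only an illustration in Python, not Nutch code: PageEntry, tld_of, generate_fetch_list and the unknown_language_weight parameter are all hypothetical names I made up, and the real webdb does not necessarily store a language field. The assumption is that each link db entry carries a link-analysis score plus an optional detected language, and that pages with no detected language yet are kept at a reduced priority rather than dropped.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass
class PageEntry:
    """Hypothetical link-db record; not the actual Nutch webdb schema."""
    url: str
    score: float                    # link-analysis score for the page
    language: Optional[str] = None  # e.g. "en"; None if not yet detected


def tld_of(url: str) -> str:
    """Return the last label of the hostname, e.g. 'de' for www.example.de."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def generate_fetch_list(entries, allowed_languages, blocked_tlds,
                        unknown_language_weight=0.5, top_n=1000):
    """Pick URLs to fetch, skipping blocked TLDs and unwanted languages."""
    candidates = []
    for entry in entries:
        # Rule 1: ignore whole TLDs outright.
        if tld_of(entry.url) in blocked_tlds:
            continue
        if entry.language is None:
            # Rule 2: language unknown, so keep the page at a scaled-down priority.
            priority = entry.score * unknown_language_weight
        elif entry.language in allowed_languages:
            # Rule 3: wanted language, so use the link-analysis score as-is.
            priority = entry.score
        else:
            # Rule 4: detected but unwanted language, so never (re)fetch.
            continue
        candidates.append((priority, entry.url))
    # Highest priority first, capped at top_n URLs.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in candidates[:top_n]]
```

Calling generate_fetch_list(entries, {"en"}, {"fr"}) would drop every .fr host and every page detected as non-English, and order the rest by score. Whether something like this could hang off the webdb itself is exactly what I'm asking.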
