On Nov 5, 2007 11:11 PM, Emmanuel <[EMAIL PROTECTED]> wrote:
> Template/Menu Detection
>
> After reading many different papers about this serious problem, I decided
> to implement a simple way to eliminate the noisy information contained
> within the web pages of a single web site.
> I would like to share my view with you and get some feedback or even some
> ideas.
>
> # Problem:
> We have too much noisy information stored in the index, which can reduce
> the relevance and accuracy of query results.
> Besides, links contained within those templates are used in the scoring
> algorithm, which makes the scoring of web pages less relevant.
I have also done some study on this issue, and the result is satisfying.
Just to share some experience here: there are many papers about so-called
"main text extraction" on the web. Below is an example:
http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/

> # Concept:
> Divide the HTML into chunks and consider as noisy elements all items which
> are seen over a specific ratio. Those items could be either a URL or a
> TEXT.
>
> * Definition of a TEXT item: I've seen many different approaches; for this
> first version I chose the simplest way, which is to extract the text
> between 2 HTML tags.
>
> * Definition of a URL item: it consists of all outlinks and anchors within
> a web page.

It seems text and URLs can already be extracted by the current DcomUtils,
right?

> * Ratio: We could have 2 ratios, one for URL items and another one for
> TEXT items. All items counted above this limit will be defined as noisy.
> For this first version I defined my URL and TEXT ratio as 70% of the total
> pages for the same hostname.

I am not completely clear on the method you described, but it seems we used
a similar approach. First, we build a DOM tree of the HTML pages using an
HTML parser like Neko, then divide the DOM tree into several main parts by
<Table> and <Div>, and calculate the ratio between URLs and text length for
each main part. We set a ratio limit, and the main parts that exceed the
limit are discarded.
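To make the block/ratio idea more concrete, here is a rough sketch in Java
of how such a filter could look. It only assumes an org.w3c.dom.Document
produced by an HTML parser such as Neko; the class name, the reading of the
ratio as "anchor text length over total text length", and the 0.7 limit are
illustrative choices of mine, not existing Nutch code.

import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/**
 * Sketch of the block/ratio idea described above.
 * Class and method names are placeholders, not existing Nutch code.
 */
public class BlockFilter {

  // A block is kept only if its link-to-text ratio stays under this limit.
  // 0.7 is just a placeholder value to tune against real pages.
  private final double ratioLimit;

  public BlockFilter(double ratioLimit) {
    this.ratioLimit = ratioLimit;
  }

  /** Candidate "main parts": every TABLE and DIV element in the document. */
  public List<Element> mainParts(Document doc) {
    List<Element> parts = new ArrayList<Element>();
    // Neko upper-cases tag names by default; adjust if your parser differs.
    collect(doc.getElementsByTagName("TABLE"), parts);
    collect(doc.getElementsByTagName("DIV"), parts);
    return parts;
  }

  private void collect(NodeList nodes, List<Element> out) {
    for (int i = 0; i < nodes.getLength(); i++) {
      out.add((Element) nodes.item(i));
    }
  }

  /** Keeps blocks whose anchor-text share of the total text is below the limit. */
  public List<Element> keepContentBlocks(Document doc) {
    List<Element> kept = new ArrayList<Element>();
    for (Element part : mainParts(doc)) {
      int total = part.getTextContent().length();
      int anchorText = 0;
      NodeList anchors = part.getElementsByTagName("A");
      for (int i = 0; i < anchors.getLength(); i++) {
        anchorText += anchors.item(i).getTextContent().length();
      }
      double ratio = total == 0 ? 1.0 : (double) anchorText / total;
      if (ratio < ratioLimit) {
        kept.add(part);   // likely real content
      }
      // blocks over the limit are treated as navigation/template noise
    }
    return kept;
  }
}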
> # Nutch Object Impact:
> * Creation of a new object named "Noise": it contains all text chunks to
> be removed from the index and all URLs to avoid in the scoring operation.
> * Creation of a new object named "ContentToFilter": it contains a Noise
> object and a Content object. It will be used in the parsing process to
> remove noisy information.
> * Improvement of the "Outlink" object to add a new boolean parameter named
> "useForScoring" which will determine if this link will be used in the
> scoring operation.
>
> # Nutch Operation:
> 1- Fetch a segment without parsing.
>
> 2- Create a new process (I will name it NoiseDetector) to parse the
> segment in order to extract all chunk data contained within the web pages
> according to a specific algorithm (as discussed above). This step will be
> done using a map/reduce operation.
> =>> MAP input segment <url, content>
> -> Implementation: It is almost similar to a parsing operation, except
> that we will use a new method in DOMComUtils to extract text chunks and
> all outlinks from the HTML content.
> We will have 2 types of output, either URL or TEXT.
> -> Output <"website hostname", "TEXT text chunk extracted from the page">
> -> Output <"website hostname", "URL url extracted from the page">
> =>> REDUCE <"website hostname", "textchunk/url list">
> -> Implementation: It will store and count all text chunks in one Map and
> all URLs in another map. We will define a ratio which decides whether a
> text chunk or a URL is considered noisy. Then we will output the
> corresponding text chunks and URLs in an object of type Noise for each
> website hostname and store them on disk in a new folder named "Filter".
> -> Output <"website hostname", "noise object">
>
> 3- Create a new process (I will call it ParseFilterSegment) to parse the
> segment and remove all noisy items. This will use 2 Map/Reduce operations.
> First M/R, which consists of associating a Noise with a Content:
> =>> MAP input segment, noise <url, content/noise>
> -> Implementation: This will extract the hostname of each content and
> output it with the corresponding content.
> -> Output <"website hostname", "content or noise">
> =>> REDUCE <"website hostname", "content or noise">
> -> Implementation: It will associate a Noise object with each Content
> within a ContentToFilter object.
> -> Output <"url of content", "ContentToFilter">
> Second M/R, which consists of parsing the content and removing the noisy
> elements:
> =>> MAP input <url, "ContentToFilter">
> -> Implementation: This will parse the segment and eliminate the noisy
> elements from the Outlinks list and from the text to be indexed. Then all
> data will be stored on disk the usual way.
> -> Output <"url", "NutchWritable">

It is not clear to me why your operations are so complicated. I just add a
new method like extractMainText() in DcomUtils and set the parse text to
use the main text extracted; other information, like the title and outlinks
in HtmlParser, remains unchanged.
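For comparison, here is a minimal sketch of that simpler path, reusing the
BlockFilter sketch above. MainTextExtractor and extractMainText are
placeholder names of mine, and the wiring into HtmlParser is only described
in the comments, not actual Nutch code.

import org.w3c.dom.Document;
import org.w3c.dom.Element;

/**
 * Sketch of the "extractMainText" idea: build the indexed text only from
 * the blocks that survived the ratio filter.
 */
public class MainTextExtractor {

  private final BlockFilter filter = new BlockFilter(0.7); // placeholder limit

  public String extractMainText(Document doc) {
    StringBuilder sb = new StringBuilder();
    // Note: nested TABLE/DIV blocks may duplicate text; a real version
    // would only walk top-level parts.
    for (Element block : filter.keepContentBlocks(doc)) {
      sb.append(block.getTextContent()).append(' ');
    }
    return sb.toString().trim();
  }
}

In HtmlParser the returned string would then be used when building the parse
text, while title and outlink extraction stay as they are, which matches the
"other information remains unchanged" point above.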
> # Suggestion:
> We should make DomComUtils more customizable. I mean we should allow
> everybody to have their own implementation in order to extract or filter
> specific information within a web page. For instance, I may want to
> extract every text within the page except for SELECT tags, or later I may
> want to change my text chunk algorithm to extract more information.
> So I would suggest making this object abstract with a default
> implementation; we could then extend this object to define a new
> implementation and declare it in the config file.
> Don't you think?
>
> I would appreciate your comments, suggestions or ideas on this
> implementation. I think it could be useful for the Nutch community.
>
> E
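Regarding the suggestion to make DomComUtils pluggable, something along
these lines could work. This is only a rough sketch: the abstract class,
the default subclass and the "parser.html.domutils.impl" property are
hypothetical names, not existing Nutch code or configuration keys.

import org.apache.hadoop.conf.Configuration;
import org.w3c.dom.Node;

/**
 * Sketch of the "make it pluggable" suggestion: an abstract base with a
 * default implementation, selected through a (hypothetical) config key.
 */
public abstract class AbstractDomUtils {

  /** Extract the text that should end up in the index. */
  public abstract String extractText(Node root);

  /** Pick the implementation named in the config, falling back to a default. */
  public static AbstractDomUtils get(Configuration conf) throws Exception {
    String clazz = conf.get("parser.html.domutils.impl",
        DefaultDomUtils.class.getName());
    return (AbstractDomUtils) Class.forName(clazz).newInstance();
  }
}

/** Default behaviour: keep everything, e.g. by delegating to the current code. */
class DefaultDomUtils extends AbstractDomUtils {
  public String extractText(Node root) {
    return root.getTextContent();
  }
}

HtmlParser would then call AbstractDomUtils.get(conf) once and use whatever
implementation the configuration names, so the default behaviour stays the
same for everybody who does not override the property.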