Template/Menu Detection

After reading a number of papers on this problem, I decided to implement a
simple way to eliminate the noisy information that is repeated across the
pages of a single web site.
I would like to share my approach with you and get some feedback or
ideas.

# Problem:
We store too much noisy information in the index, which can reduce the
relevance and accuracy of query results.
Besides, the links contained in those templates are used by the scoring
algorithm, which makes the score of a web page less relevant.

#Concept:
Divide the HTML into chunks and treat as noisy every item that is seen
above a specific ratio. An item can be either a URL or a TEXT.
* Definition of a TEXT item: I have seen many different approaches. For this
first version I chose the simplest one, which is to extract the text
between 2 HTML tags.
* Definition of a URL item: it consists of all the outlinks and anchors
within a web page.
* Ratio: we could have 2 ratios, one for URL items and another for TEXT
items. Every item counted above this limit is defined as noisy.
For this first version I set both the URL and TEXT ratios to 70% of the
total number of pages for the same hostname.
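
To make the ratio rule concrete, here is a minimal sketch in plain Java. It is not Nutch code: the class and method names are hypothetical, and it assumes that an item repeated several times on one page counts only once for that page.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Nutch.
final class NoiseRatio {
    // pages: for one hostname, the set of items (TEXT chunks or URLs) seen on each page.
    // ratio: e.g. 0.7 for the 70% limit proposed above.
    static Set<String> noisyItems(List<Set<String>> pages, double ratio) {
        // Count on how many pages each item appears.
        Map<String, Integer> pageCount = new HashMap<>();
        for (Set<String> page : pages) {
            for (String item : page) {
                pageCount.merge(item, 1, Integer::sum);
            }
        }
        // An item is noisy when it appears on more than ratio * (number of pages).
        Set<String> noisy = new TreeSet<>();
        for (Map.Entry<String, Integer> e : pageCount.entrySet()) {
            if (e.getValue() > ratio * pages.size()) {
                noisy.add(e.getKey());
            }
        }
        return noisy;
    }
}
```

With 4 pages and a ratio of 0.7, an item has to appear on at least 3 pages to be flagged. The same method could be called once with the TEXT items and once with the URL items if the two ratios differ.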

#Nutch Object Impact:
* Creation of a new object named "Noise": it contains all the text chunks to
be removed from the index and all the URLs to ignore in the scoring operation.
* Creation of a new object named "ContentToFilter": it contains a Noise
object and a Content object. It will be used in the parsing process to remove
noisy information.
* Improvement of the "Outlink" object: add a new boolean parameter named
"useForScoring", which determines whether the link is used in the scoring
operation.
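
As a rough illustration of these three objects, here is a sketch of their shape only. These are not the real Nutch classes: a String stands in for the Content object, ScoredOutlink stands in for the improved Outlink, and an actual implementation would presumably also need Hadoop serialization, which is left out.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical shapes only; names other than Noise, ContentToFilter and
// useForScoring are assumptions made for this sketch.
final class Noise {
    final Set<String> textChunks = new HashSet<>(); // text to drop from the index
    final Set<String> urls = new HashSet<>();       // links to ignore when scoring
}

final class ContentToFilter {
    final Noise noise;
    final String content; // stand-in for the Nutch Content object

    ContentToFilter(Noise noise, String content) {
        this.noise = noise;
        this.content = content;
    }
}

final class ScoredOutlink { // stand-in for the improved Outlink
    final String toUrl;
    final String anchor;
    final boolean useForScoring;

    ScoredOutlink(String toUrl, String anchor, boolean useForScoring) {
        this.toUrl = toUrl;
        this.anchor = anchor;
        this.useForScoring = useForScoring;
    }
}
```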

#Nutch Operation:
1- Fetch a segment without parsing

2- Create a new process (I will name it NoiseDetector) that parses the segment
to extract all the chunks contained in each web page, using the algorithm
discussed above. This step is done with a map/reduce operation.
=>> MAP input segment <url, content>
-> Implementation: It is almost the same as a parsing operation, except that
we use a new method in DOMContentUtils to extract the text chunks and all
outlinks from the HTML content.
There are 2 types of output, either URL or TEXT.
-> Output <"website hostname", "TEXT text chunk extracted from the web page">
-> Output <"website hostname", "URL url extracted from the web page">
=>> REDUCE <"website hostname", "textchunk/url list">
-> Implementation: It stores and counts all text chunks in one Map and all
URLs in another Map. A ratio determines whether a text chunk or a URL is
considered noisy. We then output the noisy text chunks and URLs in an object
of type Noise for each website hostname and store it on disk in a new folder
named "Filter".
-> Output <"website hostname", "noise object">
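
The MAP side of step 2 could look roughly like the following. This is a plain Java sketch without any Hadoop or Nutch types. It uses regular expressions instead of a DOM walk, which is an assumption made to keep the example short, and the class and method names are hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the map step; not the DOM-based extraction itself.
final class NoiseMapSketch {
    // Text found between the end of one tag and the start of the next.
    private static final Pattern TEXT = Pattern.compile(">([^<>]+)<");
    // Double-quoted href attribute values.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // The map key: the hostname of the page URL.
    static String hostOf(String pageUrl) {
        return URI.create(pageUrl).getHost();
    }

    // The map values: one "TEXT ..." or "URL ..." string per extracted item.
    static List<String> map(String html) {
        List<String> values = new ArrayList<>();
        Matcher text = TEXT.matcher(html);
        while (text.find()) {
            String chunk = text.group(1).trim();
            if (!chunk.isEmpty()) {
                values.add("TEXT " + chunk);
            }
        }
        Matcher href = HREF.matcher(html);
        while (href.find()) {
            values.add("URL " + href.group(1));
        }
        return values;
    }
}
```

The reducer would then receive all these values for one hostname, split them on the TEXT/URL prefix and count them.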

3- Create a new process (I will call it ParseFilterSegment) that parses the
segment and removes all noisy items. This uses 2 map/reduce operations.
First M/R: associate a Noise with each Content.
=>> MAP input segment, noise <url, content/noise>
-> Implementation: This extracts the hostname of each content and outputs
it with the corresponding content.
-> Output <"website hostname", "content or noise">
=>> REDUCE <"website hostname", "content or noise">
-> Implementation: It associates a Noise object with each Content inside a
ContentToFilter object.
-> Output <"url of content", "ContentToFilter">
Second M/R: parse the content and remove the noisy elements.
=>> MAP input <url, "ContentToFilter">
-> Implementation: This parses the segment and eliminates the noisy
elements from the Outlinks list and from the text to be indexed. All data is
then stored on disk in the usual way.
-> Output <"url", "NutchWritable">
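
The filtering done in the second M/R could be sketched as below. Again this is plain Java with hypothetical names. It assumes that noisy links are kept but flagged as not usable for scoring (the "useForScoring" idea), rather than dropped from the list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the filtering step; not Nutch code.
final class ParseFilterSketch {
    // Keep only the text chunks that are not listed as noise for this host.
    static List<String> keepText(List<String> chunks, Set<String> noisyText) {
        List<String> kept = new ArrayList<>();
        for (String chunk : chunks) {
            if (!noisyText.contains(chunk)) {
                kept.add(chunk);
            }
        }
        return kept;
    }

    // Map each outlink URL to its useForScoring flag (false when the URL is noisy).
    static Map<String, Boolean> flagOutlinks(List<String> outlinks, Set<String> noisyUrls) {
        Map<String, Boolean> flags = new LinkedHashMap<>();
        for (String url : outlinks) {
            flags.put(url, !noisyUrls.contains(url));
        }
        return flags;
    }
}
```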

#Suggestion:
We should make DOMContentUtils more customizable. I mean that we should allow
everybody to plug in their own implementation to extract or filter
specific information within a web page. For instance, I may want to extract
all the text in the page except for SELECT tags, or later change my
textChunk algorithm to extract more information.
So I would suggest making this object abstract, with a default
implementation, so that we can extend it to define a new
implementation and select it in the config file.
What do you think?
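
Here is a small sketch of what I have in mind, with loud caveats: the class names and the property name "chunk.extractor.class" are made up for the example, java.util.Properties stands in for the Nutch configuration, and the extraction itself is a simple regex rather than a DOM walk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical abstract extractor; the implementation is chosen by class name.
abstract class ChunkExtractor {
    abstract List<String> extract(String html);

    // "chunk.extractor.class" is an invented property name for this sketch.
    static ChunkExtractor load(Properties conf) {
        String name = conf.getProperty("chunk.extractor.class",
                                       DefaultChunkExtractor.class.getName());
        try {
            return (ChunkExtractor) Class.forName(name)
                                         .getDeclaredConstructor()
                                         .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load extractor " + name, e);
        }
    }
}

// Default behaviour: every non-empty text found between two tags.
class DefaultChunkExtractor extends ChunkExtractor {
    private static final Pattern TEXT = Pattern.compile(">([^<>]+)<");

    @Override
    List<String> extract(String html) {
        List<String> chunks = new ArrayList<>();
        Matcher m = TEXT.matcher(html);
        while (m.find()) {
            String chunk = m.group(1).trim();
            if (!chunk.isEmpty()) {
                chunks.add(chunk);
            }
        }
        return chunks;
    }
}

// Custom behaviour: same as the default, but SELECT elements are ignored.
class SkipSelectExtractor extends DefaultChunkExtractor {
    private static final Pattern SELECT = Pattern.compile(
        "<select\\b.*?</select>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    @Override
    List<String> extract(String html) {
        return super.extract(SELECT.matcher(html).replaceAll(""));
    }
}
```

Changing the extraction algorithm would then only mean writing a new subclass and naming it in the configuration.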

I would appreciate your comments, suggestions, or ideas on this
implementation. I think it could be useful for the Nutch community.

E
