Hi Aditya,
You can use any HTML parser if you are fetching/crawling a page from Wikipedia
and ignore the sections that are repeated on every page.
If you are using the Jericho parser, here is what you can do:
import java.net.URL;
import net.htmlparser.jericho.*; // Jericho 3.x package name

URL u = new URL("any english wikipedia page"); // placeholder: put the page URL here
Source src = new Source(u.openConnection().getInputStream());
// Override excludeElement() so the extractor skips the page furniture.
TextExtractor textExtractor = new TextExtractor(src) {
    public boolean excludeElement(StartTag startTag) {
        String cls = startTag.getAttributeValue("class");
        String id = startTag.getAttributeValue("id");
        return HTMLElementName.HEAD.equals(startTag.getName())
            || "printfooter".equalsIgnoreCase(cls)
            || "footer".equalsIgnoreCase(id)
            || "references".equalsIgnoreCase(cls)
            || "infobox sisterproject".equalsIgnoreCase(cls)
            || "siteSub".equalsIgnoreCase(id)
            || "dablink".equalsIgnoreCase(cls)
            || "portlet".equalsIgnoreCase(cls)
            || "jump-to-nav".equalsIgnoreCase(id)
            || "mw-hidden-cats-hidden".equalsIgnoreCase(cls)
            || "generated-sidebar portlet".equalsIgnoreCase(cls);
    }
};
// Plain text of the page without the excluded elements or attribute values.
String parsedText = textExtractor.setIncludeAttributes(false).toString();
The above code does not remove all of the repetitive parts, though, so you need
to dig a little more into the page markup to find the rest. If you are not
crawling the wiki pages and are using the XML dump instead, take any MediaWiki
parser that gives you the HTML and run the above code on its output, but yes,
that will be a duplication of effort.
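
One thing to watch for when you extend the list: the check above compares the
whole class attribute, so an element with class="portlet collapsed" would not
match "portlet". A rough sketch of a token-based check follows, using only the
JDK. The class name BoilerplateFilter and its two method names are made up for
illustration and are not part of Jericho; the class and id values are the ones
from the list above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not part of Jericho): decides from the raw "class" and
// "id" attribute values whether an element is repeated page furniture.
final class BoilerplateFilter {

    // Each rule is a set of class tokens that must ALL be present.
    // "generated-sidebar portlet" is already covered by the "portlet" rule.
    private static final List<Set<String>> CLASS_RULES = Arrays.asList(
            rule("printfooter"),
            rule("references"),
            rule("infobox", "sisterproject"),
            rule("dablink"),
            rule("portlet"),
            rule("mw-hidden-cats-hidden"));

    // Ids are stored lower-cased because the lookup lower-cases its input.
    private static final Set<String> EXCLUDED_IDS =
            new HashSet<>(Arrays.asList("footer", "sitesub", "jump-to-nav"));

    private static Set<String> rule(String... tokens) {
        return new HashSet<>(Arrays.asList(tokens));
    }

    // True if the whitespace-separated class tokens satisfy any rule.
    static boolean hasExcludedClass(String classAttr) {
        if (classAttr == null) {
            return false;
        }
        String normalized = classAttr.trim().toLowerCase(Locale.ROOT);
        Set<String> tokens = new HashSet<>(Arrays.asList(normalized.split("\\s+")));
        for (Set<String> required : CLASS_RULES) {
            if (tokens.containsAll(required)) {
                return true;
            }
        }
        return false;
    }

    // True if the id matches one of the excluded ids, ignoring case.
    static boolean hasExcludedId(String idAttr) {
        return idAttr != null
                && EXCLUDED_IDS.contains(idAttr.trim().toLowerCase(Locale.ROOT));
    }
}
```

Inside excludeElement() you would then pass startTag.getAttributeValue("class")
and startTag.getAttributeValue("id") to these two methods instead of chaining
the equalsIgnoreCase() calls.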
--Thanks and Regards
Vaijanath N. Rao
----- Original Message -----
From: "Aditya" <[email protected]>
To: [email protected]
Sent: Saturday, May 2, 2009 4:19:33 PM GMT +05:30 Chennai, Kolkata, Mumbai, New
Delhi
Subject: REPOST from another list: Question related to improving search results
Hi,
New to this group.
Question:
Generally, sites like Wikipedia have a template, and every page follows it.
These templates contain words that occur on every page.
For example, the Wikipedia template has the list of languages in the left panel.
These words get indexed every time, since they are not (and cannot be) stop
words.
If a user searches for "Galego", for example, every Wikipedia page will be in
the search results, which is wrong, as not every Wikipedia page talks about
"Galego".
Any takes on how to solve this problem?
Best Regards,
Aditya