Answers to your questions, Prakash:

>> Are you using your own crawler/spider?

I initially used Scrapy. Now it is just a combination of urllib2 and BeautifulSoup. BeautifulSoup scrapes the front page and gets the URLs. Then urllib2 fetches the individual pages, and BeautifulSoup scrapes the main text from the returned pages.

>> Is it search by url or keyword?

Get the headlines from the main page, find the URL associated with each one, and then get the main content.

>> How do you extract the news title? Are you using Named Entity Extraction for extraction of some main info?

Titles are already given on the eKantipur and Nagarik sites; I use BeautifulSoup to extract them. Words for a title can also be generated by counting word frequencies in the article and combining the most frequently used words (excluding stopwords). One caution, though: this sometimes gives funny results compared to the titles from the site itself, as titles on the site may be based on rare words.

>> What is the basis of summary?

Sentence clustering around the title.

>> Do you use any classification or clustering technique for grouping the past similar news?

I find the MASI distance between one headline and another. I use 0.65 as the threshold, based on trial and error.

>> Have you made any corpus for it?

Why would you need a corpus for it?
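For anyone who wants to try the frequency-count idea for title words, here is a minimal sketch. It is not the code I run: the function name `title_words`, the tiny English `STOPWORDS` set, and the regex tokenizer are all assumptions made for illustration, and the tokenizer and stopword list would need replacing for Nepali text.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {
    "a", "an", "the", "and", "or", "of", "in", "on", "to", "is", "are",
    "was", "were", "for", "with", "that", "this", "it", "as", "at", "by",
}


def title_words(text, n=5, stopwords=STOPWORDS):
    """Return the n most frequent non-stopword words in text."""
    # Lowercase and split on runs of word characters (an assumed tokenizer).
    words = re.findall(r"\w+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return [word for word, _ in counts.most_common(n)]
```

Because it only looks at frequent words, a title built this way will miss the rare words a human editor might pick, which is where the funny results come from.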
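The headline grouping can be sketched roughly as below. This is an illustration under assumptions, not my actual code: it reimplements MASI distance with the standard library (NLTK also ships a `masi_distance` in `nltk.metrics.distance`), treats each headline as a set of lowercased whitespace-separated words, and assumes that a distance below the 0.65 threshold means "same story". The helper names `headline_tokens` and `group_headlines` are hypothetical.

```python
def masi_distance(a, b):
    """MASI distance between two sets: 1 - Jaccard * monotonicity weight."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # two empty sets are treated as identical
    inter = a & b
    union = a | b
    if a == b:
        weight = 1.0          # identical sets
    elif a <= b or b <= a:
        weight = 2.0 / 3.0    # one set contains the other
    elif inter:
        weight = 1.0 / 3.0    # overlap, but neither contains the other
    else:
        weight = 0.0          # disjoint sets
    return 1.0 - (len(inter) / len(union)) * weight


def headline_tokens(headline):
    """Assumed tokenization: lowercased whitespace-separated words."""
    return set(headline.lower().split())


def group_headlines(headlines, threshold=0.65):
    """Greedily put each headline in the first group whose first member is close."""
    groups = []
    for headline in headlines:
        tokens = headline_tokens(headline)
        for group in groups:
            if masi_distance(tokens, headline_tokens(group[0])) < threshold:
                group.append(headline)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([headline])
    return groups
```

Comparing only against the first headline of each group keeps the sketch short; comparing against every member would be stricter but slower.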
Pravin

--
FOSS Nepal mailing list: [email protected]
http://groups.google.com/group/foss-nepal
To unsubscribe, e-mail: [email protected]
Mailing List Guidelines: http://wiki.fossnepal.org/index.php?title=Mailing_List_Guidelines
Community website: http://www.fossnepal.org/
