Answers to your questions, Prakash:
>> Are you using your own crawler/spider?
Initially I used Scrapy. Now it is just a combo of urllib2 and BeautifulSoup. 
BeautifulSoup scrapes the first page and gets the URLs. Then urllib2 fetches 
the individual pages. Then BeautifulSoup scrapes the main text from the 
returned pages.
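A minimal sketch of the first step of that pipeline (pull headline links out of a front page), under some loud assumptions: it is written for Python 3, where urllib2 became urllib.request, and it uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra packages. The names HeadlineLinkParser, extract_headlines and fetch are made up for illustration, and treating every <a href> as a headline is a simplification; a real site needs a narrower selector.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class HeadlineLinkParser(HTMLParser):
    """Collect (link text, absolute URL) pairs from every <a href> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the front-page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            title = " ".join("".join(self._text).split())
            if title:
                self.links.append((title, self._href))
            self._href = None


def extract_headlines(html, base_url):
    """Return a list of (headline, url) pairs found in the given HTML."""
    parser = HeadlineLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def fetch(url):
    """Download a page; decoding as UTF-8 is an assumption about the site."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

The second step would then be something like calling fetch(url) for each pair returned by extract_headlines(fetch(front_page_url), front_page_url) and extracting the article body from each result; how to find the body depends on each site's markup.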
>> Is it search by url or keyword?
Get the headlines from the main page, find the associated URL for each, and 
then get the main content.
>> how do you extract the news title are you using Named Entity Extraction 
for extraction of some main info?
Titles are already given on the eKantipur and Nagarik sites. I use 
BeautifulSoup to extract them. Words for a title can also be generated by 
doing a frequency count of the words in the article and taking a combo of 
the most frequently used words (excluding stopwords). One caution though: 
this sometimes gives funny results compared to the titles from the site 
itself, as titles on the site may be based on rare words.
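The frequency-count idea can be sketched roughly as follows. The stopword list here is a tiny illustrative one (an assumption, not the list actually used), and title_words is a hypothetical helper name. Tokenizing is plain whitespace splitting with punctuation stripped, which is crude but also copes with the Devanagari danda.

```python
import string
from collections import Counter

# Tiny illustrative stopword list (an assumption); real use needs a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "was", "for"}

# ASCII punctuation plus the Devanagari danda used as a full stop in Nepali.
PUNCT = string.punctuation + "।"


def title_words(article, n=3):
    """Return the n most frequent non-stopword words in the article."""
    words = (w.strip(PUNCT).lower() for w in article.split())
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]
```

This also shows why the results can look odd: a headline built around a word that appears only once in the article will never surface from a pure frequency count.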
>> What is the basis of summary?
Sentence clustering around the title.
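One possible reading of this, offered only as a sketch: score each sentence by how many words it shares with the title and keep the best few in their original order. This is simple word-overlap ranking standing in for a real clustering step, and the function name summarize, the parameter k and the sentence splitter are all assumptions.

```python
import re


def summarize(title, article, k=2):
    """Return the k sentences sharing the most words with the title."""
    title_words = set(title.lower().split())
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", article) if s.strip()]

    def score(sentence):
        words = {w.strip(".,!?").lower() for w in sentence.split()}
        return len(words & title_words)

    # Pick the k best-scoring sentence indices, then restore document order.
    best = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]
```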
>> Do you use any classification or clustering technique for grouping the 
past similar news?
Find the MASI distance between one headline and another. I use 0.65 as the 
cutoff, based on trial and error.
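For reference, a rough sketch of the MASI comparison. NLTK ships masi_distance in nltk.metrics.distance; the version below reimplements the same weighting in plain Python so it runs without NLTK. Two things are assumptions here: headlines are compared as sets of lowercased whitespace-separated words, and two headlines count as similar when the distance is below 0.65 (the direction of the comparison is not stated above). The helper name similar is hypothetical.

```python
def masi_distance(a, b):
    """MASI distance between two sets: 0.0 for identical, 1.0 for disjoint."""
    if not a and not b:
        return 0.0  # two empty sets are treated as identical
    inter = len(a & b)
    union = len(a | b)
    if len(a) == len(b) == inter:
        m = 1.0    # identical sets
    elif inter == min(len(a), len(b)):
        m = 0.67   # one set is contained in the other
    elif inter > 0:
        m = 0.33   # partial overlap
    else:
        m = 0.0    # no overlap
    return 1 - inter / union * m


THRESHOLD = 0.65  # value from the text; "below means similar" is an assumption


def similar(headline1, headline2, threshold=THRESHOLD):
    """Compare two headlines as sets of lowercased words."""
    words1 = set(headline1.lower().split())
    words2 = set(headline2.lower().split())
    return masi_distance(words1, words2) < threshold
```

With these definitions, "Nepal wins cricket match" and "Nepal wins cricket match today" come out at a distance of about 0.46 and would be grouped, while two headlines with no words in common sit at 1.0.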
>> Have you made any corpus for it?
Why would you need a corpus for it?

Pravin

-- 
FOSS Nepal mailing list: [email protected]
http://groups.google.com/group/foss-nepal
To unsubscribe, e-mail: [email protected]

Mailing List Guidelines: 
http://wiki.fossnepal.org/index.php?title=Mailing_List_Guidelines
Community website: http://www.fossnepal.org/
