Hi Mithun,

The goldstandard.txt file is the reference text against which the parsed text of each HTML page coming from Nutch is checked. It has no particular format; it is just plain text.
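To make the idea of "checking against" the gold standard concrete, here is a rough sketch in Python. It is not the plugin's actual code: it assumes the comparison is a cosine similarity over word counts, with common stop words dropped first, and the file contents are inlined as hypothetical constants (GOLD_STANDARD, STOP_WORDS) so that it runs on its own.

```python
import math
import re
from collections import Counter

# Hypothetical stand-ins for goldstandard.txt and stopwords.txt,
# inlined here so the sketch is self-contained.
GOLD_STANDARD = "Robotics robots autonomous artificial intelligence sensors actuators"
STOP_WORDS = {"and", "the", "then", "a", "of", "is", "to", "in"}


def term_vector(text, stop_words):
    """Lowercase the text, split it into words, drop stop words, count the rest."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(w for w in words if w not in stop_words)


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_page(page_text, gold_text=GOLD_STANDARD, stop_words=STOP_WORDS):
    """Score one page's parsed text against the gold standard text."""
    return cosine_similarity(term_vector(page_text, stop_words),
                             term_vector(gold_text, stop_words))


on_topic = score_page("The robots use artificial intelligence and sensors to navigate")
off_topic = score_page("The recipe needs flour, sugar and butter")
print(on_topic > off_topic)
```

A page that shares many terms with the gold standard scores closer to 1, and a page that shares none scores 0, which is why the choice of terms in the file matters.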
For example: if you wanted to score pages by how similar they are to a topic such as Robotics, your goldstandard.txt should contain words like Autonomous, Artificial Intelligence, Robots, etc. (or you could even paste in the entire Wikipedia article on Robotics). In short, you can put all the relevant terms in your gold standard. But remember that pasting in a lot of loosely related text could introduce noise.

The stopwords.txt file is used to filter irrelevant words such as "and", "the", "then", etc. out of the vocabulary. You could search online for a list of common stop words, or even add your own.

I hope the above explanation helps you get started. I will update the wiki with an example soon.

Best,
Sujen

Regards,
Sujen Shah
M.S - Computer Science (Class of 2016)
University of Southern California
http://www.linkedin.com/in/sujenshah

On Wed, Oct 7, 2015 at 6:52 AM, Christian Alan Mattmann <[email protected]> wrote:

> Sujen can you provide an example on the existing Scoring
> Similarity wiki page of what the gold standard file
> should have in it and how it should be formatted.
>
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Adjunct Associate Professor, Computer Science Department
> University of Southern California
> Los Angeles, CA 90089 USA
> Email: [email protected]
> WWW: http://sunset.usc.edu/
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Mithun Maragiri <[email protected]>
> Date: Wednesday, October 7, 2015 at 1:39 AM
> To: jpluser <[email protected]>
> Subject: Team 18 : Similarity scoring: goldstandard.txt, stopwords.txt
> contents
>
> >Hello Professor,
> >
> >I am trying to implement the 7th question of scoring similarity.
> >I read the material given in the link and understood how to enable it and
> >how to use it.
> >But I did not get what to fill in the goldstandard.txt and stopwords.txt.
> >The link mentions that stopwords.txt should have stop words one per line
> >but I dont know what are my stop words and what to write in
> >goldstandard.txt
> >
> >Can you please tell what to write in the goldstandard and stopwords.txt
> >files? Is there any example or reference which we can use to implement?
> >
> >Thanks,
> >Mithun

