Yeah, I would like to take up this task. In fact, I've already downloaded
hundreds of pages with my own script, and I am now working on the extraction part.
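
For the extraction part, my current approach is roughly the sketch below: scan each downloaded HTML page for anchors that link into the Wikipedia article namespace and record the anchor text as the mention (occurrence) together with the target article. This is just a minimal standard-library sketch (the class and variable names are my own, not from the Spotlight codebase), without the boilerplate removal that boilerpipe would add:

```python
# Minimal sketch: extract Wikipedia-link "occurrences" from one
# downloaded HTML page, using only the Python standard library.
from html.parser import HTMLParser
from urllib.parse import urlparse, unquote

class WikiLinkExtractor(HTMLParser):
    """Collect (surface form, Wikipedia article title) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.occurrences = []   # list of (anchor text, article title)
        self._target = None     # article title of the <a> we are inside
        self._text = []         # text fragments collected inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        parsed = urlparse(href)
        # Only keep links into the Wikipedia article namespace (/wiki/...).
        if parsed.netloc.endswith("wikipedia.org") and parsed.path.startswith("/wiki/"):
            self._target = unquote(parsed.path[len("/wiki/"):])
            self._text = []

    def handle_data(self, data):
        if self._target is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._target is not None:
            surface = "".join(self._text).strip()
            if surface:
                self.occurrences.append((surface, self._target))
            self._target = None

html_page = '<p>See <a href="http://en.wikipedia.org/wiki/Berlin">Berlin</a> today.</p>'
extractor = WikiLinkExtractor()
extractor.feed(html_page)
print(extractor.occurrences)  # [('Berlin', 'Berlin')]
```

It looks for links to Wikipedia on the page regardless of what Google found, as Max suggested.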


Best regards,
Zhiwei


2013/4/17 Max Jakob <[email protected]>

> Sounds good to me.
>
> @Zhiwei, you could start with downloading a couple of pages that are
> listed in the mention corpus with a script and then extract the
> mentions (we call them occurrences) from it. I suggest you do it
> regardless of what Google found and look for links to Wikipedia on the
> respective pages.
> Please let us know if you would like to take up this task.
>
> Cheers,
> Max
>
> On Tue, Apr 16, 2013 at 7:38 PM, Pablo N. Mendes <[email protected]>
> wrote:
> > We have the cluster set up and last week I had already downloaded and
> > preprocessed the corpus there to get a subset of mentions of my
> interest. I
> > was going to install and run Nutch myself, but other priorities came to
> the
> > top of my priority queue.
> >
> > This is why I suggested that Zhiwei starts with a small set, so that he
> can
> > test in his single machine setup. If it works, we can take whatever he
> has
> > and run with a larger set on the cluster.
> >
> > Cheers,
> > Pablo
> >
> >
> > On Tue, Apr 16, 2013 at 6:49 PM, Max Jakob <[email protected]> wrote:
> >>
> >> Yes, crawling from one machine is not feasible. If we really go
> >> through with extracting these mentions ourselves, Nutch is hence a
> >> good option, or some other kind of parallel downloader, since we
> >> don't actually need the crawler functionality. Common Crawl is
> >> another cool option.
> >> In both cases we would need some kind of OccurrenceSource from html.
> >> boilerpipe is already there as a dependency anyways.
> >>
> >> Maybe it is worth pinging Sameer Singh who maintains [2] to ask about
> >> the time line of the release of the complete context dataset? If it
> >> will take long, we could start with the extraction of occurrences from
> >> html and see if we can arrange a cluster somewhere to download the
> >> pages.
> >>
> >> What do you guys think?
> >>
> >> Cheers,
> >> Max
> >>
> >> [2] http://www.iesl.cs.umass.edu/data/wiki-links
> >>
> >>
> >> On Tue, Apr 16, 2013 at 11:20 AM, Joachim Daiber
> >> <[email protected]> wrote:
> >> > Ah, good that you spotted this! Well, this might take a while to crawl
> >> > :)
> >> > Maybe we could also extract the relevant pages from the common crawl
> >> > corpus
> >> > if crawling ourselves takes too long.
> >
> >
> >
> >
> > --
> >
> > Pablo N. Mendes
> > http://pablomendes.com
>
_______________________________________________
Dbpedia-gsoc mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc