We have the cluster set up, and last week I downloaded and preprocessed
the corpus there to get the subset of mentions I am interested in. I was
going to install and run Nutch myself, but other priorities came to the
top of my queue.

This is why I suggested that Zhiwei start with a small set, so that he can
test on his single-machine setup. If it works, we can take whatever he has
and run it with a larger set on the cluster.

Cheers,
Pablo


On Tue, Apr 16, 2013 at 6:49 PM, Max Jakob <[email protected]> wrote:

> Yes, crawling from one machine is not feasible. Nutch is hence a good
> option if we really go through with extracting these mentions
> ourselves, or some other kind of parallel download because we don't
> need the crawler functionality. Common Crawl is another cool option.
> In both cases we would need some kind of OccurrenceSource from html.
> boilerpipe is already there as a dependency anyways.
>
> Maybe it is worth pinging Sameer Singh who maintains [2] to ask about
> the timeline for the release of the complete context dataset? If it
> will take long, we could start with the extraction of occurrences from
> html and see if we can arrange a cluster somewhere to download the
> pages.
>
> What do you guys think?
>
> Cheers,
> Max
>
> [2] http://www.iesl.cs.umass.edu/data/wiki-links
>
>
> On Tue, Apr 16, 2013 at 11:20 AM, Joachim Daiber
> <[email protected]> wrote:
> > Ah, good that you spotted this! Well, this might take a while to crawl :)
> > Maybe we could also extract the relevant pages from the Common Crawl corpus
> > if crawling ourselves takes too long.
>



-- 

Pablo N. Mendes
http://pablomendes.com
