Using Nutch?
Cheers,
pablo
On Wed, Apr 17, 2013 at 5:14 AM, Cai Zhiwei <[email protected]> wrote:
> Yeah, I would like to take up this task. In fact, I've already downloaded
> hundreds of pages with my own script. I am working on the extraction part.
>
>
> Best regards,
> Zhiwei
>
>
> 2013/4/17 Max Jakob <[email protected]>
>
>> Sounds good to me.
>>
>> @Zhiwei, you could start by downloading with a script a couple of the
>> pages that are listed in the mention corpus, and then extract the
>> mentions (we call them occurrences) from them. I suggest you do this
>> regardless of what Google found and look for links to Wikipedia on the
>> respective pages.
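[Editor's sketch, not code from the thread: the extraction step Max describes, i.e. scanning a downloaded HTML page for links into Wikipedia, could look roughly like this using only the Python standard library. The names `WikiLinkParser` and `extract_wikipedia_links` are made up for illustration; in practice the pages would first be fetched with a script or a crawler like Nutch.]

```python
from html.parser import HTMLParser

class WikiLinkParser(HTMLParser):
    """Collects (target URL, anchor text) pairs for links into Wikipedia."""

    def __init__(self):
        super().__init__()
        self.links = []       # list of (href, anchor_text) candidate mentions
        self._href = None     # href of the <a> tag currently open, if any
        self._text = []       # anchor text fragments of that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if "wikipedia.org/wiki/" in href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def extract_wikipedia_links(html):
    """Return (Wikipedia URL, surface form) pairs found in an HTML string."""
    parser = WikiLinkParser()
    parser.feed(html)
    return parser.links

# Toy page standing in for one downloaded from the mention corpus:
page = ('<p>See <a href="http://en.wikipedia.org/wiki/Apache_Nutch">Nutch</a> '
        'and <a href="http://example.com/other">other</a> links.</p>')
print(extract_wikipedia_links(page))
# prints [('http://en.wikipedia.org/wiki/Apache_Nutch', 'Nutch')]
```

The Wikipedia URL gives the entity and the anchor text gives the surface form, which together make up one occurrence in the sense used above.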
>> Please let us know if you would like to take up this task.
>>
>> Cheers,
>> Max
>>
>> On Tue, Apr 16, 2013 at 7:38 PM, Pablo N. Mendes <[email protected]>
>> wrote:
>> > We have the cluster set up, and last week I had already downloaded and
>> > preprocessed the corpus there to get a subset of mentions of my
>> > interest. I was going to install and run Nutch myself, but other
>> > priorities came to the top of my priority queue.
>> >
>> > This is why I suggested that Zhiwei starts with a small set, so that
>> > he can test in his single-machine setup. If it works, we can take
>> > whatever he has and run with a larger set on the cluster.
>> >
>> > Cheers,
>> > Pablo
>> >
>> >
>> > On Tue, Apr 16, 2013 at 6:49 PM, Max Jakob <[email protected]> wrote:
>> >>
>> >> Yes, crawling from one machine is not feasible. Nutch is hence a good
>> >> option if we really go through with extracting these mentions
>> >> ourselves, or some other kind of parallel downloader, since we don't
>> >> need the crawler functionality. Common Crawl is another cool option.
>> >> In both cases we would need some kind of OccurrenceSource from HTML;
>> >> boilerpipe is already there as a dependency anyway.
>> >>
>> >> Maybe it is worth pinging Sameer Singh, who maintains [2], to ask
>> >> about the timeline of the release of the complete context dataset. If
>> >> it will take long, we could start with the extraction of occurrences
>> >> from HTML and see if we can arrange a cluster somewhere to download
>> >> the pages.
>> >>
>> >> What do you guys think?
>> >>
>> >> Cheers,
>> >> Max
>> >>
>> >> [2] http://www.iesl.cs.umass.edu/data/wiki-links
>> >>
>> >>
>> >> On Tue, Apr 16, 2013 at 11:20 AM, Joachim Daiber
>> >> <[email protected]> wrote:
>> >> > Ah, good that you spotted this! Well, this might take a while to
>> >> > crawl :)
>> >> > Maybe we could also extract the relevant pages from the Common
>> >> > Crawl corpus if crawling ourselves takes too long.
>> >
>> >
>> >
>> >
>> > --
>> >
>> > Pablo N. Mendes
>> > http://pablomendes.com
>>
>
>
--
Pablo N. Mendes
http://pablomendes.com
_______________________________________________
Dbpedia-gsoc mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc