No, I use a simple Python script I wrote myself. I only want to download
some test pages to get things started.
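
The script is nothing fancy; a minimal sketch along these lines (urls.txt
and the pages/ directory are placeholder names, not a fixed layout)
captures the idea:

    # download_pages.py: fetch each URL from a plain-text list and save
    # the raw HTML to disk, one file per page.
    import hashlib
    import os
    import urllib.request

    os.makedirs("pages", exist_ok=True)

    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    for url in urls:
        # Name each file by a hash of its URL to avoid collisions.
        name = hashlib.md5(url.encode("utf-8")).hexdigest() + ".html"
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                data = resp.read()
            with open(os.path.join("pages", name), "wb") as out:
                out.write(data)
            print("saved", url)
        except Exception as exc:
            print("failed", url, exc)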
Best regards,
Zhiwei
2013/4/17 Pablo N. Mendes <[email protected]>
>
> Using Nutch?
>
> Cheers,
> pablo
>
>
> On Wed, Apr 17, 2013 at 5:14 AM, Cai Zhiwei <[email protected]> wrote:
>
>> Yeah, I would like to take up this task. In fact, I've already downloaded
>> hundreds of pages with my own script. I am now working on the extraction
>> part.
>>
>>
>> Best regards,
>> Zhiwei
>>
>>
>> 2013/4/17 Max Jakob <[email protected]>
>>
>>> Sounds good to me.
>>>
>>> @Zhiwei, you could start by downloading, with a script, a couple of
>>> pages that are listed in the mention corpus, and then extract the
>>> mentions (we call them occurrences) from them. I suggest you do this
>>> regardless of what Google found and look for links to Wikipedia on the
>>> respective pages yourself.
>>> Please let us know if you would like to take up this task.
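>>>
>>> A minimal sketch of that extraction step, using only Python's standard
>>> library (the file name is a placeholder, and the link test is just a
>>> first approximation), could look like this:
>>>
>>>     # extract_occurrences.py: collect Wikipedia links and their anchor
>>>     # texts (the "occurrences") from a downloaded HTML page.
>>>     from html.parser import HTMLParser
>>>
>>>     class WikiLinkParser(HTMLParser):
>>>         def __init__(self):
>>>             super().__init__()
>>>             self.links = []      # (link target, anchor text) pairs
>>>             self._href = None    # target of the <a> tag we are inside
>>>             self._text = []      # anchor text collected so far
>>>
>>>         def handle_starttag(self, tag, attrs):
>>>             if tag == "a":
>>>                 href = dict(attrs).get("href") or ""
>>>                 if "wikipedia.org/wiki/" in href:
>>>                     self._href, self._text = href, []
>>>
>>>         def handle_data(self, data):
>>>             if self._href is not None:
>>>                 self._text.append(data)
>>>
>>>         def handle_endtag(self, tag):
>>>             if tag == "a" and self._href is not None:
>>>                 text = "".join(self._text).strip()
>>>                 self.links.append((self._href, text))
>>>                 self._href = None
>>>
>>>     parser = WikiLinkParser()
>>>     with open("page.html", encoding="utf-8", errors="ignore") as f:
>>>         parser.feed(f.read())
>>>     for target, surface in parser.links:
>>>         print(target, "\t", surface)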
>>>
>>> Cheers,
>>> Max
>>>
>>> On Tue, Apr 16, 2013 at 7:38 PM, Pablo N. Mendes <[email protected]>
>>> wrote:
>>> > We have the cluster set up, and last week I had already downloaded and
>>> > preprocessed the corpus there to get a subset of mentions of my
>>> > interest. I was going to install and run Nutch myself, but other
>>> > priorities came to the top of my priority queue.
>>> >
>>> > This is why I suggested that Zhiwei start with a small set, so that he
>>> > can test on his single-machine setup. If it works, we can take
>>> > whatever he has and run it with a larger set on the cluster.
>>> >
>>> > Cheers,
>>> > Pablo
>>> >
>>> >
>>> > On Tue, Apr 16, 2013 at 6:49 PM, Max Jakob <[email protected]> wrote:
>>> >>
>>> >> Yes, crawling from one machine is not feasible. Nutch is hence a good
>>> >> option if we really go through with extracting these mentions
>>> >> ourselves; so is some other kind of parallel downloader, since we
>>> >> don't actually need the crawler functionality. Common Crawl is
>>> >> another cool option. In both cases we would need some kind of
>>> >> OccurrenceSource from HTML. boilerpipe is already there as a
>>> >> dependency anyway.
>>> >>
>>> >> Maybe it is worth pinging Sameer Singh, who maintains [2], to ask
>>> >> about the timeline for the release of the complete context dataset.
>>> >> If it will take long, we could start with the extraction of
>>> >> occurrences from HTML and see if we can arrange a cluster somewhere
>>> >> to download the pages.
>>> >>
>>> >> What do you guys think?
>>> >>
>>> >> Cheers,
>>> >> Max
>>> >>
>>> >> [2] http://www.iesl.cs.umass.edu/data/wiki-links
>>> >>
>>> >>
>>> >> On Tue, Apr 16, 2013 at 11:20 AM, Joachim Daiber
>>> >> <[email protected]> wrote:
>>> >> > Ah, good that you spotted this! Well, this might take a while to
>>> >> > crawl :)
>>> >> > Maybe we could also extract the relevant pages from the Common
>>> >> > Crawl corpus if crawling ourselves takes too long.
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> >
>>> > Pablo N. Mendes
>>> > http://pablomendes.com
>>>
>>
>>
>
>
> --
>
> Pablo N. Mendes
> http://pablomendes.com
>