Hi Pablo, Max, Jo,
Good news! We don't have to crawl pages anymore. I contacted Sameer, as Max
suggested, and he told me that their code for processing the Google corpus
and the expanded corpus with context has been released here:
<https://code.google.com/p/wiki-link/>.
Best regards,
Zhiwei
On Fri, Apr 19, 2013 at 11:40 PM, Cai Zhiwei <[email protected]> wrote:
> Hi,
>
> 1. I've finished extracting Wikipedia links and text from the HTML files of
> the Google corpus. I store the context and links in the format shown in the
> example at the end of this mail (a rough sketch of the extraction step
> follows the example), where [[A|B]] means a link to the Wikipedia page A
> labeled B. I only keep sentences that contain links to Wikipedia. Do we
> need more context information?
>
> 2. I think we don't need to generate the final Spotlight format, only some
> intermediate format that pignlproc or the lucene-core indexer can handle.
> I think this format is getting close, since it's similar to the Wikipedia
> dumps. But I'm not sure how to proceed, because I still have some
> difficulty understanding the details of the index core. Or maybe I could
> transform all these pages into the Wikipedia dump format?
>
>
> "Transactional Lock Elision and Priority Boosting
> Real-time applications that use locking can be subject to priority inversion
> [[Priority_inversion|priority inversion]], where a low-priority thread
> holding a lock is preempted by a medium-priority CPU-bound thread.
>
> Beef straganoff [[Beef_Stroganoff|Beef straganoff]] which I wanted to
> order but didn't.
> ..."
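>
> For reference, a simplified sketch of the kind of extraction step I mean
> (not the actual script; BeautifulSoup and the naive sentence split are
> only for illustration):
>
>     import re
>     from bs4 import BeautifulSoup
>
>     WIKI_HREF = re.compile(r'https?://en\.wikipedia\.org/wiki/([^#?"]+)')
>
>     def sentences_with_links(html):
>         """Return sentences that link to Wikipedia, with every link
>         rewritten in the [[Target|anchor text]] notation."""
>         soup = BeautifulSoup(html, "html.parser")
>         found = False
>         for a in soup.find_all("a", href=True):
>             m = WIKI_HREF.search(a["href"])
>             if m:
>                 # Rewrite the anchor in place using the [[A|B]] notation.
>                 a.replace_with("[[%s|%s]]" % (m.group(1), a.get_text()))
>                 found = True
>         if not found:
>             return []
>         text = soup.get_text(" ", strip=True)
>         # Naive sentence split; keep only sentences that still carry a link.
>         return [s for s in re.split(r'(?<=[.!?])\s+', text) if "[[" in s]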
>
> Best regards,
> Zhiwei
>
>
> 2013/4/17 Cai Zhiwei <[email protected]>
>
>> No, I use a simple Python script I wrote myself. I only want to download
>> some test pages to get things started.
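>>
>> Roughly something like this (a minimal sketch only; the URL list file,
>> the output directory, and the use of requests are just placeholders):
>>
>>     import os
>>     import time
>>     import requests
>>
>>     def download_pages(url_file, out_dir="pages"):
>>         """Fetch each URL listed in url_file (one per line) and save the HTML."""
>>         os.makedirs(out_dir, exist_ok=True)
>>         with open(url_file) as f:
>>             urls = [line.strip() for line in f if line.strip()]
>>         for i, url in enumerate(urls):
>>             try:
>>                 resp = requests.get(url, timeout=30)
>>                 resp.raise_for_status()
>>             except requests.RequestException as err:
>>                 print("skipping %s: %s" % (url, err))
>>                 continue
>>             out_path = os.path.join(out_dir, "%06d.html" % i)
>>             with open(out_path, "w", encoding="utf-8") as out:
>>                 out.write(resp.text)
>>             time.sleep(1)  # be polite to the servers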
>>
>>
>> Best regards,
>> Zhiwei
>>
>>
>> 2013/4/17 Pablo N. Mendes <[email protected]>
>>
>>
>>> Using Nutch?
>>>
>>> Cheers,
>>> pablo
>>>
>>>
>>> On Wed, Apr 17, 2013 at 5:14 AM, Cai Zhiwei <[email protected]> wrote:
>>>
>>>> Yeah, I would like to take up this task. In fact, I've already downloaded
>>>> hundreds of pages with my own script. I am working on the extraction part.
>>>>
>>>>
>>>> Best regards,
>>>> Zhiwei
>>>>
>>>>
>>>> 2013/4/17 Max Jakob <[email protected]>
>>>>
>>>>> Sounds good to me.
>>>>>
>>>>> @Zhiwei, you could start with downloading a couple of pages that are
>>>>> listed in the mention corpus with a script and then extract the
>>>>> mentions (we call them occurrences) from it. I suggest you do it
>>>>> regardless of what Google found and look for links to Wikipedia on the
>>>>> respective pages.
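>>>>>
>>>>> For illustration, that occurrence check over already-downloaded pages
>>>>> could look roughly like this (a sketch only; the regex and the file
>>>>> layout are placeholders):
>>>>>
>>>>>     import glob
>>>>>     import re
>>>>>
>>>>>     # Any anchor whose href points at a Wikipedia article counts as an
>>>>>     # occurrence; capture the page title and the anchor text.
>>>>>     LINK = re.compile(
>>>>>         r'href="https?://[^"/]*wikipedia\.org/wiki/([^"#?]+)"[^>]*>(.*?)</a>',
>>>>>         re.IGNORECASE | re.DOTALL)
>>>>>
>>>>>     def occurrences(pages_dir="pages"):
>>>>>         """Yield (surface form, Wikipedia page title) pairs from saved pages."""
>>>>>         for path in glob.glob(pages_dir + "/*.html"):
>>>>>             with open(path, encoding="utf-8") as f:
>>>>>                 html = f.read()
>>>>>             for title, anchor in LINK.findall(html):
>>>>>                 surface = re.sub(r"<[^>]+>", "", anchor).strip()
>>>>>                 if surface:
>>>>>                     yield surface, title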
>>>>> Please let us know if you would like to take up this task.
>>>>>
>>>>> Cheers,
>>>>> Max
>>>>>
>>>>> On Tue, Apr 16, 2013 at 7:38 PM, Pablo N. Mendes <
>>>>> [email protected]> wrote:
>>>>> > We have the cluster set up and last week I had already downloaded and
>>>>> > preprocessed the corpus there to get a subset of mentions of my
>>>>> interest. I
>>>>> > was going to install and run Nutch myself, but other priorities came
>>>>> to the
>>>>> > top of my priority queue.
>>>>> >
>>>>> > This is why I suggested that Zhiwei starts with a small set, so that
>>>>> he can
>>>>> > test in his single machine setup. If it works, we can take whatever
>>>>> he has
>>>>> > and run with a larger set on the cluster.
>>>>> >
>>>>> > Cheers,
>>>>> > Pablo
>>>>> >
>>>>> >
>>>>> > On Tue, Apr 16, 2013 at 6:49 PM, Max Jakob <[email protected]>
>>>>> wrote:
>>>>> >>
>>>>> >> Yes, crawling from one machine is not feasible. Nutch is hence a
>>>>> good
>>>>> >> option if we really go through with extracting these mentions
>>>>> >> ourselves, or some other kind of parallel download because we don't
>>>>> >> need the crawler functionality. Common Crawl is another cool option.
>>>>> >> In both cases we would need some kind of OccurrenceSource from html.
>>>>> >> boilerpipe is already there as a dependency anyways.
>>>>> >>
>>>>> >> Maybe it is worth pinging Sameer Singh who maintains [2] to ask
>>>>> about
>>>>> >> the time line of the release of the complete context dataset? If it
>>>>> >> will take long, we could start with the extraction of occurrences
>>>>> from
>>>>> >> html and see if we can arrange a cluster somewhere to download the
>>>>> >> pages.
>>>>> >>
>>>>> >> What do you guys think?
>>>>> >>
>>>>> >> Cheers,
>>>>> >> Max
>>>>> >>
>>>>> >> [2] http://www.iesl.cs.umass.edu/data/wiki-links
>>>>> >>
>>>>> >>
>>>>> >> On Tue, Apr 16, 2013 at 11:20 AM, Joachim Daiber
>>>>> >> <[email protected]> wrote:
>>>>> >> > Ah, good that you spotted this! Well, this might take a while to
>>>>> crawl
>>>>> >> > :)
>>>>> >> > Maybe we could also extract the relevant pages from the common
>>>>> crawl
>>>>> >> > corpus
>>>>> >> > if crawling ourselves takes too long.
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> >
>>>>> > Pablo N. Mendes
>>>>> > http://pablomendes.com
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Pablo N. Mendes
>>> http://pablomendes.com
>>>
>>
>>
>