Hi,
1. I've finished extracting Wikipedia links and text from the HTML files of
the Google corpus. I store the context and links in the format shown in the
sample attached at the end, where [[A|B]] means a link to the Wikipedia page
A labeled B. I only store sentences that contain links to Wikipedia. Do we
need more context information?
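Roughly, the extraction step works like this (a simplified Python sketch, not
my exact script; it assumes BeautifulSoup and the helper names are only
illustrative):

import re
from urllib.parse import unquote
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

WIKI_LINK = re.compile(r'^https?://[a-z]+\.wikipedia\.org/wiki/(.+)$')

def wiki_target(href):
    """Return the Wikipedia page name if href points to Wikipedia, else None."""
    m = WIKI_LINK.match(href or '')
    return unquote(m.group(1)) if m else None

def linked_sentences(html):
    """Yield sentences that contain at least one Wikipedia link,
    with every such link rewritten as [[Target|anchor text]]."""
    soup = BeautifulSoup(html, 'html.parser')
    found = False
    for a in soup.find_all('a', href=True):
        target = wiki_target(a['href'])
        if target:
            # Rewrite the anchor in place using the [[A|B]] markup.
            a.replace_with('[[%s|%s]]' % (target, a.get_text()))
            found = True
    if not found:
        return
    text = soup.get_text(' ', strip=True)
    # Very rough sentence split; keep only sentences containing a link.
    for sentence in re.split(r'(?<=[.!?])\s+', text):
        if '[[' in sentence:
            yield sentence

(The sentence splitting here is naive; I would swap in a proper sentence
splitter later.)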
2. I don't think we need to generate the final Spotlight format; we only need
to produce some intermediate format that pignlproc or the lucene-core indexer
can consume. I think this format is getting close, since it's similar to the
Wikipedia dumps. But I'm not sure, and I don't know how to proceed because I
still have some difficulty understanding some details of the index core. Or
maybe I could transform all these pages into the Wikipedia dump format? (See
the sketch after the sample below.)
Sample of the stored format:

"Transactional Lock Elision and Priority BoostingReal-time applications
that use locking can be subject to priority inversion
[[Priority_inversion|priority inversion]], where a low-priority thread
holding a lock is preempted by a medium-priority CPU-bound thread.

Beef straganoff [[Beef_Stroganoff|Beef straganoff]] which I wanted to
order but didn't.
..."
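As for the dump-like idea in point 2, maybe something along these lines would
be enough: wrap each source page's stored sentences into a minimal
MediaWiki-dump-style <page> element (just a sketch; I don't know yet whether
pignlproc needs more fields, e.g. revision ids or siteinfo):

from xml.sax.saxutils import escape

def page_element(title, sentences):
    """Render one <page> whose revision text is the [[A|B]] sentences."""
    # Minimal dump-like skeleton; real dumps carry more fields (ids, siteinfo).
    text = '\n'.join(sentences)
    return ('  <page>\n'
            '    <title>%s</title>\n'
            '    <revision>\n'
            '      <text xml:space="preserve">%s</text>\n'
            '    </revision>\n'
            '  </page>\n') % (escape(title), escape(text))

def write_dump(pages, out_path):
    """pages: iterable of (title, [sentence, ...]) pairs."""
    with open(out_path, 'w', encoding='utf-8') as out:
        out.write('<mediawiki>\n')
        for title, sentences in pages:
            out.write(page_element(title, sentences))
        out.write('</mediawiki>\n')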
Best regards,
Zhiwei
2013/4/17 Cai Zhiwei <[email protected]>
> No, I use a simple Python script I wrote myself. I only want to download
> some test pages to get things started.
>
>
> Best regards,
> Zhiwei
>
>
> 2013/4/17 Pablo N. Mendes <[email protected]>
>
>
>> Using Nutch?
>>
>> Cheers,
>> pablo
>>
>>
>>> On Wed, Apr 17, 2013 at 5:14 AM, Cai Zhiwei <[email protected]> wrote:
>>
>>> Yeah, I would like to take up this task. In fact, I've already downloaded
>>> hundreds of pages with my own script. I am working on the extraction part.
>>>
>>>
>>> Best regards,
>>> Zhiwei
>>>
>>>
>>> 2013/4/17 Max Jakob <[email protected]>
>>>
>>>> Sounds good to me.
>>>>
>>>> @Zhiwei, you could start by downloading a couple of pages that are
>>>> listed in the mention corpus with a script and then extracting the
>>>> mentions (we call them occurrences) from them. I suggest you do it
>>>> regardless of what Google found and look for links to Wikipedia on the
>>>> respective pages.
>>>> Please let us know if you would like to take up this task.
>>>>
>>>> Cheers,
>>>> Max
>>>>
>>>> On Tue, Apr 16, 2013 at 7:38 PM, Pablo N. Mendes <[email protected]>
>>>> wrote:
>>>> > We have the cluster set up and last week I had already downloaded and
>>>> > preprocessed the corpus there to get a subset of mentions of my
>>>> > interest. I
>>>> > was going to install and run Nutch myself, but other priorities came
>>>> > to the
>>>> > top of my priority queue.
>>>> >
>>>> > This is why I suggested that Zhiwei starts with a small set, so that
>>>> > he can
>>>> > test in his single machine setup. If it works, we can take whatever
>>>> > he has
>>>> > and run with a larger set on the cluster.
>>>> >
>>>> > Cheers,
>>>> > Pablo
>>>> >
>>>> >
>>>> > On Tue, Apr 16, 2013 at 6:49 PM, Max Jakob <[email protected]>
>>>> wrote:
>>>> >>
>>>> >> Yes, crawling from one machine is not feasible. Nutch is hence a good
>>>> >> option if we really go through with extracting these mentions
>>>> >> ourselves, or some other kind of parallel download because we don't
>>>> >> need the crawler functionality. Common Crawl is another cool option.
>>>> >> In both cases we would need some kind of OccurrenceSource from html.
>>>> >> boilerpipe is already there as a dependency anyways.
>>>> >>
>>>> >> Maybe it is worth pinging Sameer Singh, who maintains [2], to ask about
>>>> >> the timeline of the release of the complete context dataset? If it
>>>> >> will take long, we could start with the extraction of occurrences from
>>>> >> html and see if we can arrange a cluster somewhere to download the
>>>> >> pages.
>>>> >>
>>>> >> What do you guys think?
>>>> >>
>>>> >> Cheers,
>>>> >> Max
>>>> >>
>>>> >> [2] http://www.iesl.cs.umass.edu/data/wiki-links
>>>> >>
>>>> >>
>>>> >> On Tue, Apr 16, 2013 at 11:20 AM, Joachim Daiber
>>>> >> <[email protected]> wrote:
>>>> >> > Ah, good that you spotted this! Well, this might take a while to
>>>> >> > crawl
>>>> >> > :)
>>>> >> > Maybe we could also extract the relevant pages from the Common Crawl
>>>> >> > corpus
>>>> >> > if crawling ourselves takes too long.
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> >
>>>> > Pablo N. Mendes
>>>> > http://pablomendes.com
>>>>
>>>
>>>
>>
>>
>> --
>>
>> Pablo N. Mendes
>> http://pablomendes.com
>>
>
>
_______________________________________________
Dbpedia-gsoc mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc