Kasun,

The SkosCategoriesExtractor produces the file I mentioned a few mails
ago [1].
This is why I don't think we need another extractor to process category
data from Wikipedia articles, unless the current code is wrong, of course :)

I personally think you could start working directly on the SKOS hierarchy
and focus on the interesting part of your investigation, i.e. leaves and
parents that have only leaves as children.
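
To make the idea concrete, here is a rough sketch of how leaves and leaf-only parents could be read off the SKOS file. It assumes the dump is N-Triples and that the hierarchy is expressed with skos:broader (child broader parent). The function names and the tiny sample are made up for illustration, and a category that never appears in a skos:broader triple is not seen at all.

```python
import re
from collections import defaultdict

SKOS_BROADER = "http://www.w3.org/2004/02/skos/core#broader"
BASE = "http://dbpedia.org/resource/Category:"
# Matches N-Triples lines whose subject, predicate and object are all IRIs.
TRIPLE = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.$")


def read_broader_pairs(lines):
    """Yield (child, parent) for every skos:broader triple in the input."""
    for line in lines:
        match = TRIPLE.match(line.strip())
        if match and match.group(2) == SKOS_BROADER:
            yield match.group(1), match.group(3)


def find_leaves(pairs):
    """Return (leaves, parents whose children are all leaves)."""
    children = defaultdict(set)
    nodes = set()
    for child, parent in pairs:
        children[parent].add(child)
        nodes.update((child, parent))
    # A leaf is a category that is never the parent of another category.
    leaves = {node for node in nodes if node not in children}
    leaf_only_parents = {
        parent for parent, kids in children.items() if kids <= leaves
    }
    return leaves, leaf_only_parents


def _triple(child, parent):
    # Hypothetical helper that builds one sample N-Triples line.
    return f"<{BASE}{child}> <{SKOS_BROADER}> <{BASE}{parent}> ."


# Toy hierarchy: A has children B and C; B has children D and E.
sample = [_triple("B", "A"), _triple("C", "A"),
          _triple("D", "B"), _triple("E", "B")]
leaves, leaf_only_parents = find_leaves(read_broader_pairs(sample))
```

For the real file you would pass an open file handle instead of the sample list. The child-to-parent map for the full English hierarchy should fit in memory on a normal machine, but I have not measured it.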

WDYT? Mentors?

Regards
Andrea

[1] http://wiki.dbpedia.org/Downloads38#categories-skos


2013/6/27 kasun perera <[email protected]>

>
> Hi Andrea
>
> On Thu, Jun 27, 2013 at 8:43 PM, Andrea Di Menna <[email protected]> wrote:
>
>>
>> [1] org.dbpedia.extraction.mappings.SkosCategoriesExtractor
>>
>
> Actually, I didn't know it existed; that's why I didn't try or mention it.
> If it is the right option, I will use it :)
>
> thanks
>
>
>
>
>>
>>
>> 2013/6/27 kasun perera <[email protected]>
>>
>>>
>>> On Thu, Jun 27, 2013 at 1:42 PM, Alessio Palmero Aprosio <[email protected]> wrote:
>>>
>>>>  Dear Kasun,
>>>> I had to deal with the same problem some months ago, and I managed it
>>>> using the XML article file: you can detect categories by the
>>>> "Category:" prefix, and you can infer the parent-child relation from the
>>>> <title> tag (if the <title> starts with "Category:", all the categories
>>>> listed on this page are possible ancestors).
>>>> The Wikipedia category taxonomy is quite a mess, so good luck!
>>>>
>>>> Alessio
>>>>
>>>
>>>
>>> Hi Alessio
>>>
>>> Yes, I will try this; it seems like a good option. I hope this is the
>>> file you are referring to:
>>> enwiki-20130604-pages-articles.xml.bz2 (9.2 GB)
>>> <http://dumps.wikimedia.org/enwiki/20130604/enwiki-20130604-pages-articles.xml.bz2>
>>>
>>> @Marco
>>> Is it a good idea to try several options (1. what I proposed previously,
>>> 2. Alessio's suggestion, 3. any other option) and evaluate them to
>>> find out which is the best method for getting leaf nodes? Maybe they
>>> would all give the same output?
>>>
>>> Thanks
>>>
>>>
>>>>
>>>> Il 27/06/13 05:24, kasun perera ha scritto:
>>>>
>>>>  As discussed with Marco, these are the next tasks I will be
>>>> working on.
>>>>
>>>>  1. Identification of leaf categories
>>>> 2. Prominent leaves discovery
>>>> 3. Pages clustering based on prominent leaves
>>>>
>>>>  For task 1 above, I'm planning to use the Wikipedia category and
>>>> categorylinks SQL tables available here:
>>>> http://dumps.wikimedia.org/enwiki/20130604/
>>>>
>>>>  The dump files above are fairly large, 20 MB and 1.2 GB in size
>>>> respectively.
>>>> I'm thinking of loading this data into a MySQL database and doing the
>>>> processing there rather than processing the files in memory. Also, the
>>>> number of leaf categories and prominent nodes would be large, so they
>>>> would need to be pushed to MySQL tables.
>>>>
>>>>  I want to know whether this code should be written within the
>>>> extraction-framework code and, if so, where I should plug it in.
>>>> Or is it a good idea to write it separately and push it to a new
>>>> repo? If I write it separately, can I use a language other than Scala?
>>>>
>>>>
>>>>  --
>>>> Regards
>>>>
>>>> Kasun Perera
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> This SF.net email is sponsored by Windows:
>>>>
>>>> Build for Windows Store.
>>>> http://p.sf.net/sfu/windows-dev2dev
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Dbpedia-developers mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Regards
>>>
>>> Kasun Perera
>>>
>>>
>>>
>>>
>>>
>>
>
>
> --
> Regards
>
> Kasun Perera
>
>