Hi Andrea
On Thu, Jun 27, 2013 at 8:43 PM, Andrea Di Menna <[email protected]> wrote:
>
> [1] org.dbpedia.extraction.mappings.SkosCategoriesExtractor
>
Actually, I didn't know it was there; that's why I didn't try or mention it. If
it is the correct option, I will use it :)
Thanks
>
>
> 2013/6/27 kasun perera <[email protected]>
>
>>
>> On Thu, Jun 27, 2013 at 1:42 PM, Alessio Palmero Aprosio
>> <[email protected]>wrote:
>>
>>> Dear Kasun,
>>> I had to deal with the same problem some months ago, and I managed to
>>> use the XML article file: you can intercept categories using the
>>> "Category:" prefix, and you can infer parent-child relations using the <title>
>>> tag (if the <title> starts with "Category:", all the categories linked from
>>> that page are possible ancestors).
>>> The Wikipedia category taxonomy is quite a mess, so good luck!
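This scan can be sketched as a single streaming pass over the dump (a minimal sketch, assuming Python 3.8+ and the stdlib ElementTree; the function name and the category-link regex are illustrative, not part of the extraction framework):

```python
import re
import xml.etree.ElementTree as ET

# Category membership appears in wikitext as [[Category:Name]]
# or [[Category:Name|sortkey]].
CATEGORY_LINK = re.compile(r"\[\[Category:([^\]|]+)")

def category_edges(xml_file):
    """Yield (child, parent) pairs: for every page whose <title> starts
    with "Category:", each [[Category:...]] link in its text names a parent."""
    for _, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag.rsplit("}", 1)[-1] != "page":  # ignore the export namespace
            continue
        title = elem.findtext("{*}title") or ""
        text = elem.findtext("{*}revision/{*}text") or ""
        if title.startswith("Category:"):
            child = title[len("Category:"):]
            for match in CATEGORY_LINK.finditer(text):
                yield child, match.group(1).strip()
        elem.clear()  # keep memory flat while streaming the multi-GB dump
```

With iterparse the 9.2 GB pages-articles file is never loaded into memory at once, only one page element at a time.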
>>>
>>> Alessio
>>>
>>
>>
>> Hi Alessio
>>
>> Yes, I will try this; it seems like a good option. I hope
>> "enwiki-20130604-pages-articles.xml.bz2<http://dumps.wikimedia.org/enwiki/20130604/enwiki-20130604-pages-articles.xml.bz2>
>> (9.2 GB)" is the file you are referring to.
>>
>> @Marco
>> Would it be a good idea to try several options (1. what I described
>> previously, 2. Alessio's suggestion, 3. any other option) and do some
>> evaluation to find out which method works best for getting leaf nodes?
>> Maybe they would all give the same output?
>>
>> Thanks
>>
>>
>>>
>>> Il 27/06/13 05:24, kasun perera ha scritto:
>>>
>>> As discussed with Marco, these are the next tasks that I will be
>>> working on.
>>>
>>> 1. Identification of leaf categories
>>> 2. Prominent leaves discovery
>>> 3. Pages clustering based on prominent leaves
>>>
>>> For task 1 above, I'm planning to use the Wikipedia category and
>>> categorylinks SQL tables available here:
>>> http://dumps.wikimedia.org/enwiki/20130604/
>>>
>>> The dump files above are somewhat large: 20 MB and 1.2 GB in size,
>>> respectively.
>>> I'm thinking of putting these data into a MySQL database and doing the
>>> processing there, rather than processing the files in memory. Also, the
>>> number of leaf categories and prominent nodes will be large and will need
>>> to be pushed to MySQL tables.
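For the leaf test itself: in the categorylinks table, a row with cl_type = 'subcat' means the page cl_from is a subcategory of cl_to, so a leaf category is one that never appears as the target of such an edge (if I read the schema correctly, the category table's cat_subcats counter offers the same check, roughly `SELECT cat_title FROM category WHERE cat_subcats = 0`). A minimal in-memory sketch of the same check, with made-up toy data:

```python
def leaf_categories(subcat_edges, all_categories):
    """Return categories that no subcategory edge points to as a parent,
    i.e. categories with no subcategories of their own."""
    parents = {parent for _child, parent in subcat_edges}
    return sorted(c for c in all_categories if c not in parents)

# Toy data standing in for the categorylinks / category tables.
edges = [("Birds", "Animals"), ("Mammals", "Animals")]
cats = ["Animals", "Birds", "Mammals"]
print(leaf_categories(edges, cats))  # Birds and Mammals have no subcategories
```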
>>>
>>> I want to know whether this code should be written under the
>>> extraction-framework code, and if so, where should I plug it in?
>>> Or would it be better to write it separately and push it to a new
>>> repo? If I write it separately, can I use a language other than Scala?
>>>
>>>
>>> --
>>> Regards
>>>
>>> Kasun Perera
>>>
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> This SF.net email is sponsored by Windows:
>>>
>>> Build for Windows Store.
>>> http://p.sf.net/sfu/windows-dev2dev
>>>
>>>
>>>
>>> _______________________________________________
>>> Dbpedia-developers mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>>>
>>>
>>>
>>
>>
>> --
>> Regards
>>
>> Kasun Perera
>>
>>
>>
>>
>>
>
--
Regards
Kasun Perera