I haven't dug into the logic behind the SKOS categories extractor.
Maybe @Chris has the answer?

On 6/27/13 5:32 PM, Andrea Di Menna wrote:
> Kasun,
>
> the SkosCategoriesExtractor will produce the file I mentioned a few mails
> ago [1].
> This is why I don't think we need another extractor to process category
> data from Wikipedia articles, unless the current code is wrong, of course :)
>
> I personally think you could start working directly on the SKOS
> hierarchy and focus on the interesting part of your investigation, i.e.
> leaves and parents that have only leaves as children.
>
> WDYT? Mentors?
>
> Regards
> Andrea
>
> [1] http://wiki.dbpedia.org/Downloads38#categories-skos
>
>
> 2013/6/27 kasun perera <[email protected]
> <mailto:[email protected]>>
>
>
>     Hi Andrea
>
>     On Thu, Jun 27, 2013 at 8:43 PM, Andrea Di Menna <[email protected]
>     <mailto:[email protected]>> wrote:
>
>
>         [1] org.dbpedia.extraction.mappings.SkosCategoriesExtractor
>
>
>     Actually, I didn't know it was there; that's why I didn't try or
>     mention it. If it is the right option, I will use it :)
>
>     thanks
>
>
>
>
>         2013/6/27 kasun perera <[email protected]
>         <mailto:[email protected]>>
>
>
>             On Thu, Jun 27, 2013 at 1:42 PM, Alessio Palmero Aprosio
>             <[email protected] <mailto:[email protected]>> wrote:
>
>                 Dear Kasun,
>                 I had to deal with the same problem some months ago, and
>                 I managed using the XML articles dump: you can pick out
>                 categories by their "Category:" prefix, and you can
>                 infer the parent-child relation from the <title> tag (if
>                 the <title> starts with "Category:", all the categories
>                 listed on that page are possible ancestors).
>                 The Wikipedia category taxonomy is quite a mess, so good
>                 luck!
>
>                 Alessio
>
>
>
>             Hi Alessio
>
>             Yes, I will try this. It seems to be a good option. I hope
>             this is the file you are referring to:
>             "enwiki-20130604-pages-articles.xml.bz2" (9.2 GB)
>             <http://dumps.wikimedia.org/enwiki/20130604/enwiki-20130604-pages-articles.xml.bz2>
>
>             @marco
>             Is it a good idea to try several options (1. what I
>             described previously, 2. Alessio's suggestion, 3. any other
>             option) and do some evaluation to find out which is the best
>             method for getting leaf nodes? Maybe they would all give the
>             same output?
>
>             Thanks
>
>
>
>                 Il 27/06/13 05:24, kasun perera ha scritto:
>>                 As discussed with Marco, these are the next tasks that
>>                 I will be working on.
>>
>>                 1. Identification of leaf categories
>>                 2. Prominent leaves discovery
>>                 3. Pages clustering based on prominent leaves
>>
>>                 For task 1 above, I'm planning to use the Wikipedia
>>                 category and categorylinks SQL tables available here:
>>                 http://dumps.wikimedia.org/enwiki/20130604/
>>
>>                 These dump files are fairly large, about 20 MB and
>>                 1.2 GB respectively.
>>                 I'm thinking of loading this data into a MySQL
>>                 database and doing the processing there rather than
>>                 processing the files in memory. The number of leaf
>>                 categories and prominent nodes will also be large, so
>>                 they would need to be pushed to MySQL tables.
>>
>>                 I want to know whether this code should be written
>>                 inside the extraction-framework code base, and if so,
>>                 where I should plug it in.
>>                 Or would it be better to write it separately and push
>>                 it to a new repo? If I write it separately, can I use
>>                 a language other than Scala?
>>
>>
>>                 --
>>                 Regards
>>
>>                 Kasun Perera
>>
>>
>>
>>                 
>> ------------------------------------------------------------------------------
>>                 This SF.net email is sponsored by Windows:
>>
>>                 Build for Windows Store.
>>
>>                 http://p.sf.net/sfu/windows-dev2dev
>>
>>
>>                 _______________________________________________
>>                 Dbpedia-developers mailing list
>>                 [email protected]  
>> <mailto:[email protected]>
>>                 
>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>
>
>
>
>             --
>             Regards
>
>             Kasun Perera
>
>
>
>
>
>
>
>     --
>     Regards
>
>     Kasun Perera
>
>
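
For reference, the leaf identification discussed in the thread above could be sketched roughly as follows. This is only a minimal illustration, not the extraction framework's code. It assumes the SKOS categories file is in N-Triples format and that each skos:broader triple links a child category to its parent. The function names are made up for this example, and only categories that appear in at least one skos:broader triple are considered.

```python
import re
from collections import defaultdict

SKOS_BROADER = "http://www.w3.org/2004/02/skos/core#broader"

# Matches an N-Triples line whose subject, predicate and object are all IRIs.
# Lines with literal objects (e.g. skos:prefLabel) do not match and are skipped.
TRIPLE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")


def read_broader_edges(lines):
    """Yield (child, parent) pairs, one per skos:broader triple."""
    for line in lines:
        match = TRIPLE.match(line.strip())
        if match and match.group(2) == SKOS_BROADER:
            yield match.group(1), match.group(3)


def find_leaves(edges):
    """Return (leaves, parents whose children are all leaves)."""
    children = defaultdict(set)  # parent -> set of direct children
    categories = set()
    for child, parent in edges:
        children[parent].add(child)
        categories.update((child, parent))
    # A leaf is a category that is never the broader (parent) of another one.
    leaves = {c for c in categories if c not in children}
    # Parents whose direct children are all leaves.
    leaf_only_parents = {
        parent for parent, kids in children.items() if kids <= leaves
    }
    return leaves, leaf_only_parents
```

With the real dump, one would pass the open file object (it iterates line by line) to read_broader_edges, so the file is never loaded in one piece. Only the parent-to-children map is kept in memory, which may or may not be acceptable for the full English category graph.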

-- 
Marco Fossati
http://about.me/marco.fossati
Twitter: @hjfocs
Skype: hell_j

