On Thu, Jun 27, 2013 at 1:42 PM, Alessio Palmero Aprosio <[email protected]> wrote:

> Dear Kasun,
> I had to deal with the same problem some months ago, and I managed to use
> the XML article file: you can identify category pages by the "Category:"
> prefix, and you can infer the parent-child relation from the <title> tag (if
> the <title> starts with "Category:", all the categories listed on that page
> are possible ancestors).
> The Wikipedia category taxonomy is quite a mess, so good luck!
>
> Alessio
>


Hi Alessio,

Yes, I will try this; it seems like a good option. I hope this is the
file you are referring to:
enwiki-20130604-pages-articles.xml.bz2 (9.2 GB)
http://dumps.wikimedia.org/enwiki/20130604/enwiki-20130604-pages-articles.xml.bz2
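
To check my understanding of the suggestion, here is a rough Python sketch of how the category edges could be read from the XML dump. It is only an illustration, not tested on the full dump: the function names (category_edges, local_name) and the regular expression are my own assumptions, and it does no normalisation of category names (underscores, capitalisation, redirects).

```python
import re
import xml.etree.ElementTree as ET

CATEGORY_PREFIX = "Category:"
# Matches [[Category:Name]] and [[Category:Name|sort key]] in wikitext.
# Links written as [[:Category:Name]] are plain links, not memberships,
# and are deliberately not matched.
CATEGORY_LINK = re.compile(r"\[\[\s*Category\s*:\s*([^\]|#]+)", re.IGNORECASE)


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def category_edges(stream):
    """Yield (child, parent) pairs for every category page in the dump.

    `stream` is a binary file-like object holding the (decompressed) XML.
    """
    title, text = None, ""
    for _, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            # Only pages whose <title> starts with "Category:" are categories;
            # the categories they are tagged with are their possible parents.
            if title and title.startswith(CATEGORY_PREFIX):
                child = title[len(CATEGORY_PREFIX):].strip()
                for match in CATEGORY_LINK.finditer(text):
                    yield child, match.group(1).strip()
            title, text = None, ""
            elem.clear()  # drop the page's content so memory use stays low
```

The compressed file could be passed in directly with the standard bz2 module, e.g. category_edges(bz2.open(path, "rb")), so the 9.2 GB dump never has to be unpacked to disk.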

@Marco
Is it a good idea to try several options (1. what I described previously,
2. Alessio's suggestion, 3. any other option) and do some evaluation to
find out which method is best for getting the leaf nodes? Maybe they would
all give the same output?
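
For the evaluation, something like the sketch below might be enough as a first step. It assumes each method can produce a list of (child, parent) category pairs, and it treats a leaf as a category that is never the parent of another category. The function names and the choice of Jaccard overlap as the agreement measure are my own assumptions.

```python
def leaf_categories(edges):
    """Return categories that never appear as the parent of another category.

    `edges` is an iterable of (child, parent) category-name pairs. A category
    with neither parents nor children never shows up in any edge, so it is
    not reported here.
    """
    children, parents = set(), set()
    for child, parent in edges:
        children.add(child)
        parents.add(parent)
    return children - parents


def compare_leaf_sets(a, b):
    """Summarise how far two leaf sets agree: Jaccard overlap plus differences."""
    union = a | b
    return {
        "jaccard": len(a & b) / len(union) if union else 1.0,
        "only_in_a": a - b,
        "only_in_b": b - a,
    }
```

If the methods really give the same output, the Jaccard value will be 1.0 and both difference sets will be empty; otherwise the difference sets show which categories to inspect by hand.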

Thanks


>
> On 27/06/13 05:24, kasun perera wrote:
>
> As discussed with Marco, these are the next tasks that I will be working on:
>
> 1. Identification of leaf categories
> 2. Prominent leaves discovery
> 3. Pages clustering based on prominent leaves
>
> For task 1 above, I'm planning to use the Wikipedia category and
> categorylinks SQL tables available here:
> http://dumps.wikimedia.org/enwiki/20130604/
>
> These dump files are fairly large, about 20 MB and 1.2 GB respectively.
> I'm thinking of loading this data into a MySQL database and doing the
> processing there rather than processing the files in memory. Also, the number
> of leaf categories and prominent nodes would be large, so they would need to
> be pushed to MySQL tables.
>
> I want to know whether this code should be written under the
> extraction-framework code and, if so, where I should plug it in.
> Or would it be a good idea to write it separately and push it to a new repo?
> If I write it separately, can I use a language other than Scala?
>
>
>  --
> Regards
>
> Kasun Perera
>


-- 
Regards

Kasun Perera
_______________________________________________
Dbpedia-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
