On Thu, Jun 27, 2013 at 1:42 PM, Alessio Palmero Aprosio <[email protected]> wrote:
> Dear Kasun,
> I had to deal with the same problem some months ago, and I managed to use
> the XML article file: you can detect categories by their "Category:"
> prefix, and you can infer the parent-child relation from the <title> tag
> (if the <title> starts with "Category:", all the categories listed on that
> page are its possible ancestors).
> The Wikipedia category taxonomy is quite a mess, so good luck!
>
> Alessio
>
Hi Alessio,
Yes, I will try this; it seems like a good option. I hope this is the
file you are referring to:
enwiki-20130604-pages-articles.xml.bz2 (9.2 GB)
<http://dumps.wikimedia.org/enwiki/20130604/enwiki-20130604-pages-articles.xml.bz2>
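
For reference, the approach Alessio describes could be sketched roughly as
follows. This is only a minimal sketch, not a tested pipeline: it assumes
Python (any language that can stream XML would do), it assumes the parent
categories appear as [[Category:...]] links in the page text, and the names
category_edges and edges_from_dump are hypothetical. Categories that are added
through templates would be missed by this.

```python
import bz2
import re
import xml.etree.ElementTree as ET

CATEGORY_PREFIX = "Category:"
# Matches [[Category:Name]] and [[Category:Name|sort key]] in wikitext.
CATEGORY_LINK = re.compile(r"\[\[\s*Category\s*:\s*([^\]|#]+)", re.IGNORECASE)


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}title' -> 'title'."""
    return tag.rsplit("}", 1)[-1]


def category_edges(stream):
    """Yield (child, parent) pairs for every category page in the dump."""
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # keep the root so finished pages can be dropped
    title, text = None, ""
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if title and title.startswith(CATEGORY_PREFIX):
                child = title[len(CATEGORY_PREFIX):].strip()
                for match in CATEGORY_LINK.finditer(text):
                    parent = match.group(1).strip().replace("_", " ")
                    if parent:
                        yield child, parent
            title, text = None, ""
            root.clear()  # free parsed pages; the dump does not fit in memory


def edges_from_dump(path):
    """Stream (child, parent) pairs straight from a .xml.bz2 dump file."""
    with bz2.open(path, "rb") as stream:
        yield from category_edges(stream)
```

Calling edges_from_dump("enwiki-20130604-pages-articles.xml.bz2") would then
stream the pairs one page at a time instead of loading the whole dump.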
@Marco
Is it a good idea to try several options (1. what I proposed previously,
2. Alessio's suggestion, 3. any other option) and do some evaluation to
find out which method is best for getting the leaf nodes? Maybe they would
all give the same output?
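
One way such an evaluation could look, as a rough sketch: each method produces
a list of (child, parent) category pairs, a leaf is taken to be a category that
never appears as a parent, and the leaf sets of two methods are compared by
their overlap. The function names, the dictionary keys and the toy data below
are hypothetical, and the leaf definition is my assumption. A category that has
neither parents nor subcategories never shows up in the pairs, so it would not
be counted.

```python
def leaf_categories(edges):
    """Return categories that occur as a child but never as a parent."""
    children, parents = set(), set()
    for child, parent in edges:
        children.add(child)
        parents.add(parent)
    return children - parents


def compare_leaf_sets(a, b):
    """Summarise how far two sets of leaf categories agree."""
    union = a | b
    shared = a & b
    return {
        "only_a": len(a - b),
        "only_b": len(b - a),
        "both": len(shared),
        # Jaccard index: 1.0 means the two methods found identical leaves.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Toy data standing in for the output of two extraction methods.
sql_edges = [("Apples", "Fruit"), ("Pears", "Fruit"), ("Fruit", "Food")]
xml_edges = [("Apples", "Fruit"), ("Fruit", "Food")]

report = compare_leaf_sets(leaf_categories(sql_edges),
                           leaf_categories(xml_edges))
print(report)
```

If the Jaccard value comes out at or near 1.0 on the real data, the methods are
effectively interchangeable and the cheapest one to run could be kept.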
Thanks
>
> On 27/06/13 05:24, kasun perera wrote:
>
> As discussed with Marco, these are the next tasks I will be working on:
>
> 1. Identification of leaf categories
> 2. Prominent leaves discovery
> 3. Pages clustering based on prominent leaves
>
> For task 1 above, I'm planning to use the Wikipedia category and
> categorylinks SQL tables available here:
> http://dumps.wikimedia.org/enwiki/20130604/
>
> These dump files are fairly large, about 20 MB and 1.2 GB respectively.
> I'm thinking of loading this data into a MySQL database and doing the
> processing there rather than processing the files in memory. The number of
> leaf categories and prominent nodes will also be large, so they would need
> to be written to MySQL tables as well.
>
> I want to know whether this code should be written inside the
> extraction-framework code base; if so, where should I plug it in?
> Or is it a better idea to write it separately and push it to a new repo?
> If I write it separately, can I use a language other than Scala?
>
>
> --
> Regards
>
> Kasun Perera
>
>
>
> ------------------------------------------------------------------------------
> This SF.net email is sponsored by Windows:
>
> Build for Windows Store.
> http://p.sf.net/sfu/windows-dev2dev
>
>
>
> _______________________________________________
> Dbpedia-developers mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>
>
>
--
Regards
Kasun Perera