Re: [Dbpedia-developers] Processing Wikipedia Categories

Alessio Palmero Aprosio Fri, 28 Jun 2013 03:05:07 -0700

Dear Kasun,
I'm looking into graph DBs at the moment, but I haven't tried any yet.

In my implementation, I'm using a Lucene index to store categories. I have two fields: category name and parent. The parent is null if there is no parent at all. Whenever I need a path, I start from the category and go for parents. If I encounter a category I already encountered before, I stop the loop (otherwise it will go on forever).
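
The parent walk described above can be sketched roughly as follows. This is only an illustration: a plain Python dict stands in for the Lucene index, and the names `path_to_root` and `parent_of` are hypothetical, not taken from the actual implementation. It assumes one stored parent per category, as in the two-field layout.

```python
from typing import Dict, List, Optional


def path_to_root(category: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Follow parent links from `category` until there is no parent
    or until a category that was already visited shows up again."""
    path: List[str] = []
    seen = set()
    current: Optional[str] = category
    while current is not None and current not in seen:
        seen.add(current)
        path.append(current)
        # .get() returns None for unknown categories and for stored nulls
        current = parent_of.get(current)
    return path


# Toy data (made up): one clean chain and one two-category cycle.
parent_of = {
    "Graph databases": "Databases",
    "Databases": "Computing",
    "Computing": None,
    "A": "B",
    "B": "A",
}

print(path_to_root("Graph databases", parent_of))
print(path_to_root("A", parent_of))
```

The `seen` set is what stops the loop on a cycle; without it the second call would never return.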

You can also use a simple MySQL database with two fields, but I think Lucene is faster.


Alessio


On 28/06/13 10:25, kasun perera wrote:

Hi Alessio

On Thu, Jun 27, 2013 at 1:42 PM, Alessio Palmero Aprosio <[email protected]> wrote:


    Dear Kasun,
    I had to deal with the same problem some months ago,

Just curious about how you stored the edge and vertex relationships when processing the categories. In-memory processing would be difficult since there is a huge number of edges and vertices, so I think it's good to store them in a database. I have heard about graph databases [1], but haven't worked with them. Did you use something like that or a simple MySQL database?


[1] http://en.wikipedia.org/wiki/Graph_database

    and I managed to use the XML article file: you can intercept
    categories using the "Category:" prefix, and you can infer
    father-son relation using the <title> tag (if the <title> starts
    with "Category:", all the categories for this page are possible
    ancestors).
    The Wikipedia category taxonomy is quite a mess, so good luck!
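
A minimal sketch of the extraction described above, assuming a MediaWiki-style XML export where each `<page>` has a `<title>` and a `<text>` element, and where category membership appears in the wikitext as `[[Category:...]]` links. The function name `category_edges`, the regular expression, and the sample XML are illustrative assumptions, not the code that was actually used.

```python
import io
import re
import xml.etree.ElementTree as ET

# Matches [[Category:Name]] and [[Category:Name|sort key]]; group 1 is the name.
CATEGORY_LINK = re.compile(r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]")


def local(tag):
    """Drop an XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def category_edges(xml_stream):
    """Yield (child, parent) pairs from pages whose title starts with 'Category:'."""
    title, text = None, ""
    for _, elem in ET.iterparse(xml_stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if title and title.startswith("Category:"):
                child = title[len("Category:"):]
                for match in CATEGORY_LINK.finditer(text):
                    yield child, match.group(1).strip()
            title, text = None, ""
            elem.clear()  # release the finished page's subtree


# Made-up sample: one category page and one ordinary article.
SAMPLE = b"""<mediawiki>
  <page>
    <title>Category:Graph databases</title>
    <revision><text>[[Category:Databases]] [[Category:Graph theory|Databases]]</text></revision>
  </page>
  <page>
    <title>Neo4j</title>
    <revision><text>[[Category:Graph databases]]</text></revision>
  </page>
</mediawiki>"""

edges = list(category_edges(io.BytesIO(SAMPLE)))
print(edges)
```

The ordinary article is skipped because its title lacks the "Category:" prefix, so only category-to-category edges are produced.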

    Alessio


    On 27/06/13 05:24, kasun perera wrote:

    As discussed with Marco, these are the next tasks that I will be
    working on.

    1. Identification of leaf categories
    2. Prominent leaves discovery
    3. Pages clustering based on prominent leaves

    For task 1 above, I'm planning to use the Wikipedia category and
    categorylinks SQL tables available here.
    http://dumps.wikimedia.org/enwiki/20130604/

    The above dump files are fairly large: 20 MB and 1.2 GB
    respectively.
    I'm thinking of loading this data into a MySQL database and doing
    the processing there rather than processing the files in memory.
    Also, the number of leaf categories and prominent nodes would be
    large, so they would need to be pushed to MySQL tables as well.
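
One way the leaf-category step could look once the edges are in a database is sketched below. It is a sketch under stated assumptions: the built-in `sqlite3` module stands in for MySQL so the example is self-contained, the table `category_edges(child, parent)` is a hypothetical simplified layout rather than the real dump schema, and a "leaf" is taken to mean a category that never appears as the parent of another category.

```python
import sqlite3

# In-memory SQLite database used here only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE category_edges (child TEXT NOT NULL, parent TEXT NOT NULL)"
)
# An index on parent keeps the NOT EXISTS lookup cheap on large tables.
conn.execute("CREATE INDEX idx_parent ON category_edges (parent)")

# Made-up subcategory edges: (child category, parent category).
conn.executemany(
    "INSERT INTO category_edges VALUES (?, ?)",
    [
        ("Databases", "Computing"),
        ("Graph databases", "Databases"),
        ("Relational databases", "Databases"),
        ("Graph databases", "Graph theory"),
        ("Graph theory", "Mathematics"),
    ],
)

# A leaf is a category that no other category lists as its parent.
LEAF_QUERY = """
SELECT DISTINCT e.child
FROM category_edges AS e
WHERE NOT EXISTS (
    SELECT 1 FROM category_edges AS p WHERE p.parent = e.child
)
ORDER BY e.child
"""

leaves = [row[0] for row in conn.execute(LEAF_QUERY)]
print(leaves)
```

The query itself is plain SQL, so the same idea should carry over to MySQL with little change.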

    I want to know whether this code should be written under the
    extraction-framework code, and if so, where should I plug it in?
    Or is it a good idea to write it separately and push it to a
    new repo? If I write it separately, can I use a language other
    than Scala?

    --
    Regards


    Kasun Perera



    
------------------------------------------------------------------------------
    This SF.net email is sponsored by Windows:

    Build for Windows Store.

    http://p.sf.net/sfu/windows-dev2dev


    _______________________________________________
    Dbpedia-developers mailing list
    [email protected]  
<mailto:[email protected]>
    https://lists.sourceforge.net/lists/listinfo/dbpedia-developers





--
Regards

Kasun Perera

