Re: [Dbpedia-developers] Processing Wikipedia Categories

Marco Fossati Fri, 05 Jul 2013 05:40:56 -0700

Hi Kasun,

On 7/5/13 2:17 PM, kasun perera wrote:
> I was working with skos_categories, but here are some reasons why I
> would avoid using that dataset for parent-child relationship detection.
>
> Reason 1
> I need to get all leaf categories AND their child-parent relationships.
> Categories that don't have a broader category are not included in the
> skos_categories dump. This claim is discussed here:
> https://github.com/dbpedia/dbpedia-links/issues/16
This is a finding that should be documented, if it is not already.
Could you please check and, if needed, update the documentation?
Also, if this is not the intended behavior of the category extractor, it
should be reported as an issue in the extraction-framework repo.


Thanks Kasun for these useful analytics.
Cheers!
>
> Reason 2
> We need to be concerned about data freshness when dealing with tasks
> related to knowledge representation. The latest DBpedia dumps (1.8) are
> nearly one year old. This work also needs to deal with other datasets
> such as the Wikipedia page_edit_history, interlanguage links, etc. So all
> the datasets need to be in sync with each other, i.e. they must have the
> same dates. If I use the DBpedia dumps, it is hard to find synchronized
> datasets.
>
>
>
> On Fri, Jun 28, 2013 at 3:33 PM, Alessio Palmero Aprosio <[email protected]
> <mailto:[email protected]>> wrote:
>
>     Dear Kasun,
>     I'm investigating graph DBs at the moment, but I haven't tried any yet.
>     In my implementation, I'm using a Lucene index to store categories.
>     I have two fields: category name and parent. The parent is null if
>     there is no parent at all.
>     Whenever I need a path, I start from the category and follow the
>     parent links. If I reach a category I have already encountered, I
>     stop the loop (otherwise it would go on forever).
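
The parent walk described above can be sketched in a few lines of Python. This is only a minimal illustration, not Alessio's Lucene code: a plain dict stands in for the index, and the names `path_to_root` and `parent_of` are hypothetical. It assumes one parent per category, as in the two-field layout described.

```python
def path_to_root(category, parent_of):
    """Follow parent links from `category` until a root or a repeat is hit."""
    path = []
    seen = set()  # categories already visited, used to break cycles
    current = category
    while current is not None and current not in seen:
        seen.add(current)
        path.append(current)
        current = parent_of.get(current)  # None when there is no parent
    return path


# Hypothetical sample data: one parent per category, None for a root.
parent_of = {
    "Italian physicists": "Physicists",
    "Physicists": "Scientists",
    "Scientists": None,
    "A": "B",  # A and B form a cycle on purpose
    "B": "A",
}

print(path_to_root("Italian physicists", parent_of))
print(path_to_root("A", parent_of))
```

The `seen` set is what stops the walk on cyclic categories. A real Wikipedia category usually has several parents, so a full solution would need a graph search rather than a single chain.
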
>
>     You can also use a simple MySQL database with two fields, but I
>     think Lucene is faster.
>
>     Alessio
>
>
>     On 28/06/13 10:25, kasun perera wrote:
>>     Hi Alessio
>>
>>     On Thu, Jun 27, 2013 at 1:42 PM, Alessio Palmero Aprosio
>>     <[email protected] <mailto:[email protected]>> wrote:
>>
>>         Dear Kasun,
>>         I had to deal with the same problem some months ago,
>>
>>
>>     Just curious about how you stored the edge and vertex
>>     relationships when processing the categories.
>>     In-memory processing would be difficult since there is a huge number
>>     of edges and vertices, so I think it's good to store them in a
>>     database.
>>     I have heard about graph databases[1], but haven't worked with
>>     them. Did you use something like that or a simple MySQL database?
>>
>>     [1]http://en.wikipedia.org/wiki/Graph_database
>>
>>         and I managed to use the XML article file: you can intercept
>>         categories using the "Category:" prefix, and you can infer
>>         parent-child relation using the <title> tag (if the <title>
>>         starts with "Category:", all the categories for this page are
>>         possible ancestors).
>>         The Wikipedia category taxonomy is quite a mess, so good luck!
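
A rough sketch of the XML approach described above, using only the Python standard library. The function name `category_edges`, the regular expression and the tiny sample document are assumptions made for illustration. A real dump is far larger and declares an XML namespace, which the helper strips. Categories assigned through templates do not appear as plain links in the wikitext, so this would likely miss some edges.

```python
import io
import re
import xml.etree.ElementTree as ET

# Matches [[Category:Name]] and [[Category:Name|sort key]] in wikitext.
CATEGORY_LINK = re.compile(r"\[\[Category:([^\]|#]+)")


def local(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def category_edges(xml_source):
    """Yield (child, parent) pairs found on category pages of a dump."""
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, text = None, ""
        for node in elem.iter():
            name = local(node.tag)
            if name == "title":
                title = node.text
            elif name == "text":
                text = node.text or ""
        # Only pages whose title starts with "Category:" describe categories.
        if title and title.startswith("Category:"):
            child = title[len("Category:"):]
            for match in CATEGORY_LINK.finditer(text):
                yield child, match.group(1).strip()
        elem.clear()  # free memory, since the real dump is very large


# Hypothetical miniature dump with one category page and one article page.
SAMPLE = """<mediawiki>
  <page>
    <title>Category:Italian physicists</title>
    <revision><text>[[Category:Physicists by nationality|Italy]]
[[Category:Italian scientists]]</text></revision>
  </page>
  <page>
    <title>Enrico Fermi</title>
    <revision><text>[[Category:Italian physicists]]</text></revision>
  </page>
</mediawiki>"""

edges = list(category_edges(io.BytesIO(SAMPLE.encode("utf-8"))))
print(edges)
```

The article page is skipped because its title lacks the "Category:" prefix, so only category-to-category edges are produced.
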
>>
>>         Alessio
>>
>>
>>         On 27/06/13 05:24, kasun perera wrote:
>>>         As discussed with Marco, these are the next tasks I will be
>>>         working on.
>>>
>>>         1. Identification of leaf categories
>>>         2. Prominent leaves discovery
>>>         3. Pages clustering based on prominent leaves
>>>
>>>         For task 1 above, I'm planning to use the Wikipedia category and
>>>         categorylinks SQL tables available here:
>>>         http://dumps.wikimedia.org/enwiki/20130604/
>>>
>>>         The dump files are fairly large, 20 MB and 1.2 GB in size
>>>         respectively.
>>>         I'm thinking of putting this data into a MySQL database and
>>>         doing the processing there rather than processing the files in-memory.
>>>         Also, the number of leaf categories and prominent nodes would
>>>         be large, so they would need to be pushed to MySQL tables.
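
For task 1, a database-backed leaf search might look like the sketch below. It rests on several assumptions: SQLite from the Python standard library stands in for MySQL, the two tables `categories` and `subcat_links` are a simplified, hypothetical schema and not the real dump layout, and a "leaf" is taken to mean a category that has no subcategories.

```python
import sqlite3


def find_leaf_categories(conn):
    """Return categories that never appear as the parent of another category."""
    rows = conn.execute(
        """
        SELECT c.title
        FROM categories AS c
        LEFT JOIN subcat_links AS l ON l.parent = c.title
        WHERE l.parent IS NULL
        ORDER BY c.title
        """
    )
    return [title for (title,) in rows]


# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE categories (title TEXT PRIMARY KEY);
    CREATE TABLE subcat_links (child TEXT, parent TEXT);
    CREATE INDEX idx_subcat_parent ON subcat_links (parent);
    """
)
conn.executemany(
    "INSERT INTO categories VALUES (?)",
    [("Scientists",), ("Physicists",), ("Italian physicists",), ("Chemists",)],
)
conn.executemany(
    "INSERT INTO subcat_links VALUES (?, ?)",
    [
        ("Physicists", "Scientists"),
        ("Chemists", "Scientists"),
        ("Italian physicists", "Physicists"),
    ],
)

leaves = find_leaf_categories(conn)
print(leaves)
```

The index on `parent` matters at dump scale, because the anti-join probes that column once per category. The same query shape should carry over to MySQL with little change.
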
>>>
>>>         I want to know whether this code should be written within the
>>>         extraction-framework code; if so, where should I plug it in?
>>>         Or would it be a good idea to write it separately and push it
>>>         to a new repo? If I write it separately, can I use a language
>>>         other than Scala?
>>>
>>>
>>>         --
>>>         Regards
>>>
>>>         Kasun Perera
>>>
>>>
>>>
>>>         
>>> ------------------------------------------------------------------------------
>>>         This SF.net email is sponsored by Windows:
>>>
>>>         Build for Windows Store.
>>>
>>>         http://p.sf.net/sfu/windows-dev2dev
>>>
>>>
>>>         _______________________________________________
>>>         Dbpedia-developers mailing list
>>>         [email protected]  
>>> <mailto:[email protected]>
>>>         https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>>
>>
>>
>>
>>     --
>>     Regards
>>
>>     Kasun Perera
>>
>
>
>
>
> --
> Regards
>
> Kasun Perera
>

-- 
Marco Fossati
http://about.me/marco.fossati
Twitter: @hjfocs
Skype: hell_j

