On top of this:

$ bzgrep "1612_establishments_in_Mexico" skos_categories_en.nt.bz2
<http://dbpedia.org/resource/Category:1612_establishments_in_Mexico> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .

So the category you are referring to in [1] is present in skos_categories.
It has no parent category in the dump in this case, but I think this is
because the parent categories are added through a template, which might be
unsupported by the extraction framework:
{{estcatCountry|161|2|Mexico}}
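
To double-check, this is one way to scan the dump for skos:broader triples
of that category programmatically (just an untested sketch; it assumes the
dump file sits in the working directory and uses Apache Commons Compress to
stream the bzip2 archive):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

public class BroaderCheck {
    public static void main(String[] args) throws Exception {
        String subject =
            "<http://dbpedia.org/resource/Category:1612_establishments_in_Mexico>";
        String broader = "<http://www.w3.org/2004/02/skos/core#broader>";
        // Stream the bzip2-compressed N-Triples dump line by line.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new BZip2CompressorInputStream(
                    new FileInputStream("skos_categories_en.nt.bz2"))))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Print only the skos:broader triples of this category.
                if (line.startsWith(subject) && line.contains(broader)) {
                    System.out.println(line);
                }
            }
        }
    }
}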

Am I wrong?

[1] https://github.com/dbpedia/dbpedia-links/issues/16


2013/7/5 Andrea Di Menna <[email protected]>

> Hi Marco, Kasun,
>
> I am not sure I understand now.
>
> I have not yet directly verified your statement about missing categories
> in the skos data, but here is a rough check:
>
> $ zcat skos_categories_en.nt.gz | grep -v "^#" | cut -d">" -f1 | sort | uniq | wc -l
> 862826
> $ zcat category_labels_en.nt.gz | grep -v "^#" | sort | uniq | wc -l
> 862826
>
> That suggests all the categories appear in skos_categories (I presume
> as skos:Concept).
>
> What confuses me about your statement is this: you say that
> "categories that don't have a broader category are not included in dump
> 2" (i.e. the skos_categories file), but then you also say that
> http://en.wikipedia.org/wiki/Category:1612_establishments_in_Mexico, which
> has the parent
> http://en.wikipedia.org/wiki/Category:1610s_establishments_in_Mexico, does
> not appear in skos_categories.
> Does that mean that categories which do have broader categories are also
> not included in skos_categories?
> Could you please elaborate?
>
> Kasun, as regards data freshness (Reason 2), you can recreate a new
> version of the DBpedia dataset for your personal use whenever you like
> (using the extraction framework).
>
> Cheers
> Andrea
>
>
> 2013/7/5 Marco Fossati <[email protected]>
>
>> Hi Kasun,
>>
>> On 7/5/13 2:17 PM, kasun perera wrote:
>> > I was working with skos_categories, but these are some reasons why I
>> > would avoid using that dataset for parent-child relationship detection.
>> >
>> > Reason 1
>> > I need to get all leaf categories AND their child-parent relationships.
>> > Categories that don't have a broader category are not included in the
>> > skos_categories dump. This claim is discussed here:
>> > https://github.com/dbpedia/dbpedia-links/issues/16
>> This is a finding that should be documented, if it is not already.
>> Could you please check and, if needed, update the documentation?
>> Also, if this is not the intended behavior of the category extractor, it
>> should be reported as an issue in the extraction-framework repo.
>>
>> Thanks, Kasun, for this useful analysis.
>> Cheers!
>> >
>> > Reason 2
>> > We need to be concerned about data freshness when dealing with tasks
>> > related to knowledge representation. The latest DBpedia dumps (3.8) are
>> > nearly one year old. This work also needs to deal with other datasets,
>> > such as the Wikipedia page edit history, interlanguage links, etc., so
>> > all the datasets need to be in sync with each other, i.e. they must
>> > have the same dates. If I use the DBpedia dumps there is the problem of
>> > finding synchronized datasets.
>> >
>> >
>> >
>> > On Fri, Jun 28, 2013 at 3:33 PM, Alessio Palmero Aprosio
>> > <[email protected]> wrote:
>> >
>> >     Dear Kasun,
>> >     I'm looking into graph DBs at the moment, but I haven't tried any
>> >     yet.
>> >     In my implementation, I'm using a Lucene index to store categories.
>> >     I have two fields: category name and parent. The parent is null if
>> >     there is no parent at all.
>> >     Whenever I need a path, I start from the category and walk up its
>> >     parents. If I reach a category I have already seen before, I stop
>> >     the loop (otherwise it would go on forever).
>> >
>> >     You can also use a simple MySQL database with two fields, but I
>> >     think Lucene is faster.
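>> >
>> >     Roughly, the walk looks like this (an untested sketch, not my
>> >     actual code; the "category" and "parent" field names match the two
>> >     fields above, and I am assuming a plain TermQuery lookup):
>> >
>> >     import java.nio.file.Paths;
>> >     import java.util.ArrayList;
>> >     import java.util.HashSet;
>> >     import java.util.List;
>> >     import java.util.Set;
>> >     import org.apache.lucene.index.DirectoryReader;
>> >     import org.apache.lucene.index.Term;
>> >     import org.apache.lucene.search.IndexSearcher;
>> >     import org.apache.lucene.search.TermQuery;
>> >     import org.apache.lucene.search.TopDocs;
>> >     import org.apache.lucene.store.FSDirectory;
>> >
>> >     public class CategoryPaths {
>> >         // Walk from a category towards the root, guarding against cycles.
>> >         static List<String> path(IndexSearcher searcher, String start)
>> >                 throws Exception {
>> >             List<String> path = new ArrayList<>();
>> >             Set<String> seen = new HashSet<>();
>> >             String current = start;
>> >             while (current != null && seen.add(current)) { // stop on a repeat
>> >                 path.add(current);
>> >                 TopDocs hits = searcher.search(
>> >                     new TermQuery(new Term("category", current)), 1);
>> >                 if (hits.scoreDocs.length == 0) break; // not indexed
>> >                 current = searcher.doc(hits.scoreDocs[0].doc).get("parent");
>> >             }
>> >             return path;
>> >         }
>> >
>> >         public static void main(String[] args) throws Exception {
>> >             try (DirectoryReader reader = DirectoryReader.open(
>> >                     FSDirectory.open(Paths.get("category-index")))) {
>> >                 System.out.println(path(new IndexSearcher(reader),
>> >                     "1612_establishments_in_Mexico"));
>> >             }
>> >         }
>> >     }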
>> >
>> >     Alessio
>> >
>> >
>> >     On 28/06/13 10:25, kasun perera wrote:
>> >>     Hi Alessio
>> >>
>> >>     On Thu, Jun 27, 2013 at 1:42 PM, Alessio Palmero Aprosio
>> >>     <[email protected]> wrote:
>> >>
>> >>         Dear Kasun,
>> >>         I had to deal with the same problem some months ago,
>> >>
>> >>
>> >>     Just curious how you stored the edge and vertex relationships
>> >>     when processing the categories.
>> >>     In-memory processing would be difficult, since there is a huge
>> >>     number of edges and vertices, so I think it is better to store
>> >>     them in a database.
>> >>     I have heard about graph databases [1], but haven't worked with
>> >>     them. Did you use something like that, or a simple MySQL database?
>> >>
>> >>     [1] http://en.wikipedia.org/wiki/Graph_database
>> >>
>> >>         and I managed to use the XML article file: you can intercept
>> >>         categories using the "Category:" prefix, and you can infer the
>> >>         parent-child relation using the <title> tag (if the <title>
>> >>         starts with "Category:", all the categories linked on that
>> >>         page are possible ancestors).
>> >>         The Wikipedia category taxonomy is quite a mess, so good luck!
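>> >>
>> >>         In code it boils down to something like this (an untested StAX
>> >>         sketch; the regex is simplified and ignores sort keys and
>> >>         lowercase "category" link variants):
>> >>
>> >>         import java.io.FileInputStream;
>> >>         import java.util.regex.Matcher;
>> >>         import java.util.regex.Pattern;
>> >>         import javax.xml.stream.XMLInputFactory;
>> >>         import javax.xml.stream.XMLStreamConstants;
>> >>         import javax.xml.stream.XMLStreamReader;
>> >>
>> >>         public class CategoryParents {
>> >>             // Matches [[Category:Name]] and [[Category:Name|sort key]].
>> >>             static final Pattern LINK =
>> >>                 Pattern.compile("\\[\\[Category:([^\\]|]+)");
>> >>
>> >>             public static void main(String[] args) throws Exception {
>> >>                 XMLStreamReader xml = XMLInputFactory.newInstance()
>> >>                     .createXMLStreamReader(
>> >>                         new FileInputStream("enwiki-pages-articles.xml"));
>> >>                 String title = null;
>> >>                 while (xml.hasNext()) {
>> >>                     if (xml.next() != XMLStreamConstants.START_ELEMENT)
>> >>                         continue;
>> >>                     if ("title".equals(xml.getLocalName())) {
>> >>                         title = xml.getElementText();
>> >>                     } else if ("text".equals(xml.getLocalName())
>> >>                             && title != null
>> >>                             && title.startsWith("Category:")) {
>> >>                         // Category links on a category page point to
>> >>                         // its possible ancestors.
>> >>                         Matcher m = LINK.matcher(xml.getElementText());
>> >>                         while (m.find()) {
>> >>                             System.out.println(title.substring(9)
>> >>                                 + " -> " + m.group(1));
>> >>                         }
>> >>                     }
>> >>                 }
>> >>             }
>> >>         }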
>> >>
>> >>         Alessio
>> >>
>> >>
>> >>         On 27/06/13 05:24, kasun perera wrote:
>> >>>         As discussed with Marco, these are the next tasks I will be
>> >>>         working on:
>> >>>
>> >>>         1. Identification of leaf categories
>> >>>         2. Prominent leaves discovery
>> >>>         3. Pages clustering based on prominent leaves
>> >>>
>> >>>         For task 1 above, I'm planning to use the Wikipedia category
>> >>>         and categorylinks SQL tables available here:
>> >>>         http://dumps.wikimedia.org/enwiki/20130604/
>> >>>
>> >>>         The above dump files are somewhat large, 20 MB and 1.2 GB in
>> >>>         size respectively.
>> >>>         I'm thinking of putting these data into a MySQL database and
>> >>>         doing the processing there, rather than processing the files
>> >>>         in memory.
>> >>>         Also, the number of leaf categories and prominent nodes will
>> >>>         be large and will need to be pushed to MySQL tables as well.
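>> >>>
>> >>>         For task 1, the query I have in mind is something like the
>> >>>         sketch below (connection details are placeholders, and I am
>> >>>         assuming the cat_subcats counter that MediaWiki maintains in
>> >>>         the category table):
>> >>>
>> >>>         import java.sql.Connection;
>> >>>         import java.sql.DriverManager;
>> >>>         import java.sql.ResultSet;
>> >>>         import java.sql.Statement;
>> >>>
>> >>>         public class LeafCategories {
>> >>>             public static void main(String[] args) throws Exception {
>> >>>                 // A leaf category has no subcategories; MediaWiki
>> >>>                 // keeps a per-category subcategory count in `category`.
>> >>>                 String sql =
>> >>>                     "SELECT cat_title FROM category WHERE cat_subcats = 0";
>> >>>                 try (Connection db = DriverManager.getConnection(
>> >>>                          "jdbc:mysql://localhost/enwiki", "user", "pass");
>> >>>                      Statement st = db.createStatement();
>> >>>                      ResultSet rs = st.executeQuery(sql)) {
>> >>>                     while (rs.next()) {
>> >>>                         System.out.println(rs.getString("cat_title"));
>> >>>                     }
>> >>>                 }
>> >>>             }
>> >>>         }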
>> >>>
>> >>>         I want to know whether this code should be written inside the
>> >>>         extraction-framework codebase (and if so, where I should plug
>> >>>         it in), or whether it is a better idea to write it separately
>> >>>         and push it to a new repo. If I write it separately, can I use
>> >>>         a language other than Scala?
>> >>>
>> >>>
>> >>>         --
>> >>>         Regards
>> >>>
>> >>>         Kasun Perera
>> >>>
>> >>
>> >>
>> >>
>> >>
>> >>     --
>> >>     Regards
>> >>
>> >>     Kasun Perera
>> >>
>> >
>> >
>> >
>> >
>> > --
>> > Regards
>> >
>> > Kasun Perera
>> >
>>
>> --
>> Marco Fossati
>> http://about.me/marco.fossati
>> Twitter: @hjfocs
>> Skype: hell_j
>>
>>
>>
>
>
