Hi Marco, Kasun,
I am not sure I understand now.
I have not yet directly verified your statement about missing categories in
the skos data, but here is a rough check:
$ zcat skos_categories_en.nt.gz | grep -v "^#" | cut -d">" -f1 | sort | uniq | wc -l
862826
$ zcat category_labels_en.nt.gz | grep -v "^#" | sort | uniq | wc -l
862826
That suggests all the categories appear in skos_categories (I presume
as skos:Concept).
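To check the claim directly rather than just comparing counts, one could diff the
subject sets of the two dumps. A minimal sketch of that comparison (with tiny
illustrative triples standing in for the real dump files):

```python
def subjects(lines):
    """Extract the subject URI from each N-Triples line, skipping comments."""
    return {line.split(">", 1)[0] + ">" for line in lines
            if line.strip() and not line.startswith("#")}

# Tiny illustrative samples standing in for the two dumps.
skos = [
    '<http://dbpedia.org/resource/Category:A> <http://www.w3.org/2004/02/skos/core#broader> <http://dbpedia.org/resource/Category:B> .',
]
labels = [
    '<http://dbpedia.org/resource/Category:A> <http://www.w3.org/2000/01/rdf-schema#label> "A"@en .',
    '<http://dbpedia.org/resource/Category:B> <http://www.w3.org/2000/01/rdf-schema#label> "B"@en .',
]

# Categories that have a label but never appear as a subject in the skos dump.
missing = subjects(labels) - subjects(skos)
print(sorted(missing))  # -> ['<http://dbpedia.org/resource/Category:B>']
```

On the real gzipped dumps, the shell equivalent would be comm on the two sorted
subject lists.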
What confuses me about your statement is this: you say that "categories
that don't have a broader category are not included in dump 2" (i.e. the
skos_categories file), but then you also say that
http://en.wikipedia.org/wiki/Category:1612_establishments_in_Mexico, which
has the parent
http://en.wikipedia.org/wiki/Category:1610s_establishments_in_Mexico, does
not appear in skos_categories.
Does that mean that categories which do have a broader category are also
excluded from skos_categories?
Could you please elaborate?
Kasun, as for data freshness (Reason 2): you can recreate a new version of
the DBpedia dataset for your personal use whenever you prefer, using the
extraction framework.
Cheers
Andrea
2013/7/5 Marco Fossati <[email protected]>
> Hi Kasun,
>
> On 7/5/13 2:17 PM, kasun perera wrote:
> > I was working with skos_categories, but these are some reasons why I
> > would avoid using that dataset for parent-child relationship detection.
> >
> > Reason 1
> > I need to get all leaf categories AND their child-parent relationships.
> > Categories that don't have a broader category are not included in the
> > skos_categories dump. This claim is discussed here:
> > https://github.com/dbpedia/dbpedia-links/issues/16
> This is a finding that should be documented if not already done.
> Could you please check and, if needed, update the documentation?
> Also, if this is not the intended behavior of the category extractor, it
> should be reported as an issue in the extraction-framework repo.
>
> Thanks Kasun for these useful analytics.
> Cheers!
> >
> > Reason 2
> > We need to be concerned about data freshness when dealing with tasks
> > related to knowledge representation. The latest DBpedia dumps (3.8) are
> > nearly one year old. This work also needs to deal with other datasets,
> > such as the Wikipedia page_edit_history and interlanguage links. So all
> > the datasets need to be in sync with each other, i.e. they must have the
> > same dates. If I use the DBpedia dumps, there is the problem of finding
> > synchronized datasets.
> >
> >
> >
> > On Fri, Jun 28, 2013 at 3:33 PM, Alessio Palmero Aprosio
> > <[email protected] <mailto:[email protected]>> wrote:
> >
> > Dear Kasun,
> > I'm investigating graph DBs at the moment, but I haven't tried any yet.
> > In my implementation, I'm using a Lucene index to store categories.
> > I have two fields: category name and parent. The parent is null if
> > there is no parent at all.
> > Whenever I need a path, I start from the category and follow its
> > parents. If I encounter a category I have already seen before, I
> > stop the loop (otherwise it would go on forever).
> >
> > You can also use a simple MySQL database with two fields, but I
> > think Lucene is faster.
> >
> > Alessio
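[A minimal sketch of the traversal Alessio describes above, with a plain dict
standing in for the Lucene name/parent index; the category names are
illustrative data, not his actual implementation:]

```python
# Hypothetical stand-in for the Lucene index: category name -> parent
# (None means the category has no parent at all).
parent = {
    "1612_establishments_in_Mexico": "1610s_establishments_in_Mexico",
    "1610s_establishments_in_Mexico": "Establishments_in_Mexico",
    "Establishments_in_Mexico": None,
}

def path_to_root(category):
    """Follow parent links upward, stopping on a repeat to avoid cycles."""
    seen, path = set(), []
    while category is not None and category not in seen:
        seen.add(category)
        path.append(category)
        category = parent.get(category)
    return path

print(path_to_root("1612_establishments_in_Mexico"))
# -> ['1612_establishments_in_Mexico', '1610s_establishments_in_Mexico',
#     'Establishments_in_Mexico']
```

The `seen` set is what makes the loop terminate even when the Wikipedia
category graph contains cycles.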
> >
> >
> > On 28/06/13 10:25, kasun perera wrote:
> >> Hi Alessio
> >>
> >> On Thu, Jun 27, 2013 at 1:42 PM, Alessio Palmero Aprosio
> >> <[email protected] <mailto:[email protected]>> wrote:
> >>
> >> Dear Kasun,
> >> I had to deal with the same problem some months ago,
> >>
> >>
> >> Just curious about how you stored the edge and vertex
> >> relationships when processing the categories.
> >> In-memory processing would be difficult since there is a huge number
> >> of edges and vertices, so I think it's good to store them in a
> >> database.
> >> I have heard about graph databases [1], but haven't worked with
> >> them. Did you use something like that, or a simple MySQL database?
> >>
> >> [1]http://en.wikipedia.org/wiki/Graph_database
> >>
> >> and I managed to use the XML article file: you can intercept
> >> categories using the "Category:" prefix, and you can infer the
> >> father-son relation using the <title> tag (if the <title>
> >> starts with "Category:", all the categories of this page are
> >> possible ancestors).
> >> The Wikipedia category taxonomy is quite a mess, so good luck!
> >>
> >> Alessio
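[The interception trick described above could be sketched like this; the tag
names follow the MediaWiki export format (namespaces omitted), and the tiny
inline XML stands in for the real pages-articles dump:]

```python
import re
import xml.etree.ElementTree as ET

# Tiny inline sample standing in for the pages-articles XML dump.
SAMPLE = """<mediawiki>
  <page>
    <title>Category:1612 establishments in Mexico</title>
    <text>[[Category:1610s establishments in Mexico]]</text>
  </page>
</mediawiki>"""

CAT_LINK = re.compile(r"\[\[Category:([^\]|]+)")

edges = []
for page in ET.fromstring(SAMPLE).iter("page"):
    title = page.findtext("title", "")
    if title.startswith("Category:"):          # this page is itself a category
        child = title[len("Category:"):]
        for m in CAT_LINK.finditer(page.findtext("text", "")):
            edges.append((child, m.group(1)))  # (child, possible ancestor)

print(edges)
# -> [('1612 establishments in Mexico', '1610s establishments in Mexico')]
```

For the full multi-GB dump one would stream with ET.iterparse instead of
loading the whole file, but the interception logic is the same.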
> >>
> >>
> >> On 27/06/13 05:24, kasun perera wrote:
> >>> As discussed with Marco, these are the next tasks that I will
> >>> be working on.
> >>>
> >>> 1. Identification of leaf categories
> >>> 2. Prominent leaves discovery
> >>> 3. Pages clustering based on prominent leaves
> >>>
> >>> For task 1 above, I'm planning to use the Wikipedia category and
> >>> categorylinks SQL tables available here:
> >>> http://dumps.wikimedia.org/enwiki/20130604/
> >>>
> >>> The above dump files are somewhat large, 20 MB and 1.2 GB in size
> >>> respectively.
> >>> I'm thinking of putting these data into a MySQL database and
> >>> doing the processing there, rather than processing the files
> >>> in-memory.
> >>> Also, the number of leaf categories and prominent nodes will be
> >>> large and will need to be pushed into MySQL tables.
> >>>
> >>> I want to know whether this code should be written under the
> >>> extraction-framework code, and if so, where should I plug it in?
> >>> Or is it a good idea to write it separately and push it to a new
> >>> repo? If I write it separately, can I use a language other than
> >>> Scala?
> >>>
> >>>
> >>> --
> >>> Regards
> >>>
> >>> Kasun Perera
> >>>
> >>>
> >>>
> >>>
> >>> ------------------------------------------------------------------------------
> >>> This SF.net email is sponsored by Windows:
> >>>
> >>> Build for Windows Store.
> >>>
> >>> http://p.sf.net/sfu/windows-dev2dev
> >>>
> >>>
> >>> _______________________________________________
> >>> Dbpedia-developers mailing list
> >>> [email protected] <mailto:[email protected]>
> >>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
> >>
> >>
> >>
> >>
> >> --
> >> Regards
> >>
> >> Kasun Perera
> >>
> >
> >
> >
> >
> > --
> > Regards
> >
> > Kasun Perera
> >
>
> --
> Marco Fossati
> http://about.me/marco.fossati
> Twitter: @hjfocs
> Skype: hell_j
>
>
>