Hi Marco, Kasun,
I am not sure I understand now.
I have not yet directly verified your statement about missing categories in
the skos data, but here is a rough check:
$ zcat skos_categories_en.nt.gz | grep -v "^#" | cut -d">" -f1 | sort | uniq | wc -l
862826
$ zcat category_labels_en.nt.gz | grep -v "^#" | sort | uniq | wc -l
862826
That suggests all the categories appear in skos_categories (I presume
as skos:Concept).
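To check the claim directly rather than just comparing counts, one could diff the
subject sets of the two dumps. A minimal sketch of that comparison (with tiny
illustrative triples standing in for the real dump files):

```python
def subjects(lines):
    """Extract the subject URI from each N-Triples line, skipping comments."""
    return {line.split(">", 1)[0] + ">" for line in lines
            if line.strip() and not line.startswith("#")}

# Tiny illustrative samples standing in for the two dumps.
skos = [
    '<http://dbpedia.org/resource/Category:A> <http://www.w3.org/2004/02/skos/core#broader> <http://dbpedia.org/resource/Category:B> .',
]
labels = [
    '<http://dbpedia.org/resource/Category:A> <http://www.w3.org/2000/01/rdf-schema#label> "A"@en .',
    '<http://dbpedia.org/resource/Category:B> <http://www.w3.org/2000/01/rdf-schema#label> "B"@en .',
]

# Categories that have a label but never appear as a subject in the skos dump.
missing = subjects(labels) - subjects(skos)
print(sorted(missing))  # -> ['<http://dbpedia.org/resource/Category:B>']
```

On the real gzipped dumps, the shell equivalent would be comm on the two sorted
subject lists.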
What confuses me about your statement is this: you say that "categories
that don't have a broader category are not included in dump 2" (i.e. the
skos_categories file), but then you also say that
http://en.wikipedia.org/wiki/Category:1612_establishments_in_Mexico, which
has the parent
http://en.wikipedia.org/wiki/Category:1610s_establishments_in_Mexico, does
not appear in skos_categories.
Does that mean that categories which do have a broader category are also
excluded from skos_categories?
Could you please elaborate?
Kasun, as for data freshness (Reason 2): you can recreate a new version of
the DBpedia dataset for your personal use whenever you prefer, using the
extraction framework.
Cheers
Andrea
2013/7/5 Marco Fossati <[email protected]>
> Hi Kasun,
>
> On 7/5/13 2:17 PM, kasun perera wrote:
> > I was working with skos_categories, but these are some reasons why I
> > would avoid using that dataset for parent-child relationship detection.
> >
> > Reason 1
> > I need to get all leaf categories AND their child-parent relationships.
> > Categories that don't have a broader category are not included in the
> > skos_categories dump. This claim is discussed here:
> > https://github.com/dbpedia/dbpedia-links/issues/16
> This is a finding that should be documented if not already done.
> Could you please check and, if needed, update the documentation?
> Also, if this is not the intended behavior of the category extractor, it
> should be reported as an issue in the extraction-framework repo.
>
> Thanks Kasun for these useful analytics.
> Cheers!
> >
> > Reason 2
> > We need to be concerned about data freshness when dealing with tasks
> > related to knowledge representation. The latest DBpedia dumps (3.8) are
> > nearly one year old. This work also needs to deal with other datasets,
> > such as the Wikipedia page_edit_history and interlanguage links. So all
> > the datasets need to be in sync with each other, i.e. they must have the
> > same dates. If I use the DBpedia dumps, there is the problem of finding
> > synchronized datasets.
> >
> >
> >
> > On Fri, Jun 28, 2013 at 3:33 PM, Alessio Palmero Aprosio
> > <[email protected] <mailto:[email protected]>> wrote:
> >
> > Dear Kasun,
> > I'm investigating graph DBs at the moment, but I haven't tried any yet.
> > In my implementation, I'm using a Lucene index to store categories.
> > I have two fields: category name and parent. The parent is null if
> > there is no parent at all.
> > Whenever I need a path, I start from the category and follow its
> > parents. If I encounter a category I have already seen before, I
> > stop the loop (otherwise it would go on forever).
> >
> > You can also use a simple MySQL database with two fields, but I
> > think Lucene is faster.
> >
> > Alessio
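[A minimal sketch of the traversal Alessio describes above, with a plain dict
standing in for the Lucene name/parent index; the category names are
illustrative data, not his actual implementation:]

```python
# Hypothetical stand-in for the Lucene index: category name -> parent
# (None means the category has no parent at all).
parent = {
    "1612_establishments_in_Mexico": "1610s_establishments_in_Mexico",
    "1610s_establishments_in_Mexico": "Establishments_in_Mexico",
    "Establishments_in_Mexico": None,
}

def path_to_root(category):
    """Follow parent links upward, stopping on a repeat to avoid cycles."""
    seen, path = set(), []
    while category is not None and category not in seen:
        seen.add(category)
        path.append(category)
        category = parent.get(category)
    return path

print(path_to_root("1612_establishments_in_Mexico"))
# -> ['1612_establishments_in_Mexico', '1610s_establishments_in_Mexico',
#     'Establishments_in_Mexico']
```

The `seen` set is what makes the loop terminate even when the Wikipedia
category graph contains cycles.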
> >
> >
> > On 28/06/13 10:25, kasun perera wrote:
> >> Hi Alessio
> >>
> >> On Thu, Jun 27, 2013 at 1:42 PM, Alessio Palmero Aprosio
> >> <[email protected] <mailto:[email protected]>> wrote:
> >>
> >> Dear Kasun,
> >> I had to deal with the same problem some months ago,
> >>
> >>
> >> Just curious about how you stored the edge and vertex
> >> relationships when processing the categories.
> >> In-memory processing would be difficult since there is a huge number
> >> of edges and vertices, so I think it's good to store them in a
> >> database.
> >> I have heard about graph databases [1], but haven't worked with
> >> them. Did you use something like that, or a simple MySQL database?
> >>
> >> [1]http://en.wikipedia.org/wiki/Graph_database
> >>
> >> and I managed to use the XML article file: you can intercept
> >> categories using the "Category:" prefix, and you can infer the
> >> father-son relation using the <title> tag (if the <title>
> >> starts with "Category:", all the categories of this page are
> >> possible ancestors).
> >> The Wikipedia category taxonomy is quite a mess, so good luck!
> >>
> >> Alessio
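[The interception trick described above could be sketched like this; the tag
names follow the MediaWiki export format (namespaces omitted), and the tiny
inline XML stands in for the real pages-articles dump:]

```python
import re
import xml.etree.ElementTree as ET

# Tiny inline sample standing in for the pages-articles XML dump.
SAMPLE = """<mediawiki>
  <page>
    <title>Category:1612 establishments in Mexico</title>
    <text>[[Category:1610s establishments in Mexico]]</text>
  </page>
</mediawiki>"""

CAT_LINK = re.compile(r"\[\[Category:([^\]|]+)")

edges = []
for page in ET.fromstring(SAMPLE).iter("page"):
    title = page.findtext("title", "")
    if title.startswith("Category:"):          # this page is itself a category
        child = title[len("Category:"):]
        for m in CAT_LINK.finditer(page.findtext("text", "")):
            edges.append((child, m.group(1)))  # (child, possible ancestor)

print(edges)
# -> [('1612 establishments in Mexico', '1610s establishments in Mexico')]
```

For the full multi-GB dump one would stream with ET.iterparse instead of
loading the whole file, but the interception logic is the same.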
> >>
> >>
> >> On 27/06/13 05:24, kasun perera wrote:
> >>> As discussed with Marco, these are the next tasks that I will
> >>> be working on.
> >>>
> >>> 1. Identification of leaf categories
> >>> 2. Prominent leaves discovery
> >>> 3. Pages clustering based on prominent leaves
> >>>
> >>> For task 1 above, I'm planning to use the Wikipedia category and
> >>> categorylinks SQL tables available here:
> >>> http://dumps.wikimedia.org/enwiki/20130604/
> >>>
> >>> The above dump files are somewhat large, 20 MB and 1.2 GB in size
> >>> respectively.
> >>> I'm thinking of putting these data into a MySQL database and
> >>> doing the processing there, rather than processing the files
> >>> in-memory.
> >>> Also, the number of leaf categories and prominent nodes will be
> >>> large and will need to be pushed into MySQL tables.
> >>>
> >>> I want to know whether this code should be written under the
> >>> extraction-framework code, and if so, where should I plug it in?
> >>> Or is it a good idea to write it separately and push it to a new
> >>> repo? If I write it separately, can I use a language other than
> >>> Scala?
> >>>
> >>>
> >>> --
> >>> Regards
> >>>
> >>> Kasun Perera
> >>>
> >>>
> >>>
> >>>
> >>> ------------------------------------------------------------------------------
> >>> This SF.net email is sponsored by Windows:
> >>>
> >>> Build for Windows Store.
> >>>
> >>> http://p.sf.net/sfu/windows-dev2dev
> >>>
> >>>
> >>> _______________________________________________
> >>> Dbpedia-developers mailing list
> >>> [email protected] <mailto:[email protected]>
> >>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
> >>
> >>
> >>
> >>
> >> --
> >> Regards
> >>
> >> Kasun Perera
> >>
> >
> >
> >
> >
> > --
> > Regards
> >
> > Kasun Perera
> >
>
> --
> Marco Fossati
> http://about.me/marco.fossati
> Twitter: @hjfocs
> Skype: hell_j
>
>
>