Hi Andrea,

You are checking the wrong datasets. We are using [1] and [2].

Cheers,
[1] http://downloads.dbpedia.org/preview.php?file=3.8_sl_en_sl_article_categories_en.ttl.bz2
[2] http://downloads.dbpedia.org/preview.php?file=3.8_sl_en_sl_skos_categories_en.ttl.bz2

On 7/5/13 3:17 PM, Andrea Di Menna wrote:
> On top of this:
>
> $ bzgrep "1612_establishments_in_Mexico" skos_categories_en.nt.bz2
> <http://dbpedia.org/resource/Category:1612_establishments_in_Mexico>
> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
> <http://www.w3.org/2004/02/skos/core#Concept> .
>
> So the category you are referring to in [1] is present in skos_categories.
> It does not have a parent category in this case, but I think this is
> because parent categories are added using a template which might be
> unsupported by the extraction framework:
> {{estcatCountry|161|2|Mexico}}
>
> Am I wrong?
>
> [1] https://github.com/dbpedia/dbpedia-links/issues/16
>
> 2013/7/5 Andrea Di Menna <[email protected]>
>
> Hi Marco, Kasun,
>
> I am not sure I understand now.
>
> I have not yet directly verified your statement about missing categories
> in the skos data, but after a rough check:
>
> $ zcat skos_categories_en.nt.gz | grep -v "^#" | cut -d">" -f1 | sort | uniq | wc -l
> 862826
> $ zcat category_labels_en.nt.gz | grep -v "^#" | sort | uniq | wc -l
> 862826
>
> That suggests that all the categories appear in skos_categories (I
> presume as skos:Concept).
>
> What confuses me about your statement is this: you say that "categories
> that don't have a broader category are not included in dump 2" (i.e. the
> skos_categories file), but then you say that
> http://en.wikipedia.org/wiki/Category:1612_establishments_in_Mexico,
> which has a parent,
> http://en.wikipedia.org/wiki/Category:1610s_establishments_in_Mexico,
> does not appear in skos_categories.
> Does that mean that categories which have broader categories are also
> not included in skos_categories?
> Could you please elaborate?
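Andrea's rough shell check above can be reproduced in Python. This is only a sketch: it assumes the dumps have been decompressed to plain .nt files, with the file names taken from the shell commands in the thread.

```python
def distinct_subjects(lines):
    """Collect the distinct subject URIs from N-Triples lines,
    skipping comment and blank lines."""
    subjects = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # In N-Triples the subject is the first whitespace-delimited
        # token, e.g. <http://dbpedia.org/resource/Category:...>
        subjects.add(line.split()[0])
    return subjects

# Quick self-check on a triple taken from the thread:
sample = [
    "# a comment line",
    "<http://dbpedia.org/resource/Category:1612_establishments_in_Mexico> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://www.w3.org/2004/02/skos/core#Concept> .",
]
print(len(distinct_subjects(sample)))  # 1

# Against the real (decompressed) dumps one could then compare the sets
# directly instead of only the counts:
# skos = distinct_subjects(open("skos_categories_en.nt"))
# labels = distinct_subjects(open("category_labels_en.nt"))
# print(len(labels - skos))  # categories with a label but no skos entry
```

Comparing the sets rather than the counts would show exactly which categories (if any) are missing from one dump.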
> Kasun, as regards data freshness (Reason 2), you can recreate a new
> version of the DBpedia dataset for your personal use whenever you prefer
> (using the extraction framework).
>
> Cheers
> Andrea
>
> 2013/7/5 Marco Fossati <[email protected]>
>
> Hi Kasun,
>
> On 7/5/13 2:17 PM, kasun perera wrote:
> > I was working with skos_categories, but these are some reasons why I
> > would avoid using that dataset for parent-child relationship detection.
> >
> > Reason 1
> > I need to get all leaf categories AND their child-parent relationships.
> > Categories that don't have a broader category are not included in the
> > skos_categories dump. This claim is discussed here:
> > https://github.com/dbpedia/dbpedia-links/issues/16
> This is a finding that should be documented if not already done.
> Could you please check and update the documentation if needed?
> Also, if this is not the intended behavior of the category extractor, it
> should be reported as an issue in the extraction-framework repo.
>
> Thanks Kasun for these useful analytics.
> Cheers!
>
> > Reason 2
> > We need to be concerned about data freshness when dealing with tasks
> > related to knowledge representation. The latest DBpedia dumps (3.8) are
> > nearly one year old. This work also needs to deal with other datasets,
> > such as the Wikipedia page edit history and interlanguage links, so all
> > the datasets need to be in sync with each other, i.e. have the same
> > dates. If I use DBpedia dumps there is the problem of finding
> > synchronized datasets.
> >
> > On Fri, Jun 28, 2013 at 3:33 PM, Alessio Palmero Aprosio
> > <[email protected]> wrote:
> >
> > Dear Kasun,
> > I'm investigating graph DBs in this period, but I haven't tried any yet.
> > In my implementation, I'm using a Lucene index to store categories.
> > I have two fields: category name and parent. The parent is null if
> > there is no parent at all.
> > Whenever I need a path, I start from the category and follow its
> > parents. If I encounter a category I have already seen before, I stop
> > the loop (otherwise it would go on forever).
> >
> > You can also use a simple MySQL database with two fields, but I think
> > Lucene is faster.
> >
> > Alessio
> >
> > On 28/06/13 10:25, kasun perera wrote:
> >> Hi Alessio
> >>
> >> On Thu, Jun 27, 2013 at 1:42 PM, Alessio Palmero Aprosio
> >> <[email protected]> wrote:
> >>
> >>     Dear Kasun,
> >>     I had to deal with the same problem some months ago,
> >>
> >> Just curious about how you stored the edge and vertex relationships
> >> when processing the categories.
> >> In-memory processing would be difficult since there is a huge number
> >> of edges and vertices, so I think it is better to store them in a
> >> database.
> >> I have heard about graph databases [1], but haven't worked with them.
> >> Did you use something like that, or a simple MySQL database?
> >>
> >> [1] http://en.wikipedia.org/wiki/Graph_database
> >>
> >>     and I managed to use the XML article file: you can intercept
> >>     categories using the "Category:" prefix, and you can infer the
> >>     father-son relation using the <title> tag (if the <title> starts
> >>     with "Category:", all the categories of that page are possible
> >>     ancestors).
> >>     The Wikipedia category taxonomy is quite a mess, so good luck!
> >>
> >>     Alessio
> >>
> >>     On 27/06/13 05:24, kasun perera wrote:
> >>>     As discussed with Marco, these are the next tasks I will be
> >>>     working on:
> >>>
> >>>     1. Identification of leaf categories
> >>>     2. Prominent leaves discovery
> >>>     3. Pages clustering based on prominent leaves
> >>>
> >>>     For task 1, I'm planning to use the Wikipedia category and
> >>>     categorylinks SQL tables available here:
> >>>     http://dumps.wikimedia.org/enwiki/20130604/
> >>>
> >>>     The dump files above are somewhat large, 20 MB and 1.2 GB in
> >>>     size respectively, so I'm thinking of putting the data into a
> >>>     MySQL database and doing the processing there, rather than
> >>>     processing the files in memory.
> >>>     Also, the number of leaf categories and prominent nodes will be
> >>>     large and will need to be pushed to MySQL tables.
> >>>
> >>>     I want to know whether this code should be written under the
> >>>     extraction-framework code, and if so, where should I plug it in?
> >>>     Or is it a good idea to write it separately and push it to a new
> >>>     repo? If I write it separately, can I use a language other than
> >>>     Scala?
> >>>
> >>>     --
> >>>     Regards
> >>>
> >>>     Kasun Perera
> >>>
> >>> ------------------------------------------------------------------------------
> >>> This SF.net email is sponsored by Windows:
> >>>
> >>> Build for Windows Store.
> >>>
> >>> http://p.sf.net/sfu/windows-dev2dev
> >>> _______________________________________________
> >>> Dbpedia-developers mailing list
> >>> [email protected]
> >>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
> >>
> >> --
> >> Regards
> >>
> >> Kasun Perera
> >
> > --
> > Regards
> >
> > Kasun Perera
>
> --
> Marco Fossati
> http://about.me/marco.fossati
> Twitter: @hjfocs
> Skype: hell_j
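Alessio's parent-following loop can be sketched in Python, with a plain dict standing in for his Lucene index (category mapped to parent, None when there is no parent at all). The hierarchy below is illustrative only, not taken from the real category graph.

```python
def path_to_root(parents, category):
    """Follow parent links upward from a category, stopping at a missing
    parent or at the first category already seen -- the cycle guard
    Alessio describes, without which the loop could run forever."""
    path, seen = [], set()
    while category is not None and category not in seen:
        seen.add(category)
        path.append(category)
        category = parents.get(category)
    return path

# Illustrative hierarchy (the third category name is made up):
parents = {
    "1612_establishments_in_Mexico": "1610s_establishments_in_Mexico",
    "1610s_establishments_in_Mexico": "Establishments_in_Mexico",
    "Establishments_in_Mexico": None,
}
print(path_to_root(parents, "1612_establishments_in_Mexico"))

# The cycle guard in action:
print(path_to_root({"A": "B", "B": "A"}, "A"))  # ['A', 'B']
```

The same loop works whatever the backing store is (Lucene, MySQL, or an in-memory map); only the `parents` lookup changes.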
--
Marco Fossati
http://about.me/marco.fossati
Twitter: @hjfocs
Skype: hell_j
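A minimal sketch of Kasun's task 1 (leaf category identification): given (child, parent) category pairs, however they are loaded (from the categorylinks SQL table or from the dumps), the leaf categories are simply those that never appear as the parent of another category. The pairs below are illustrative only.

```python
def leaf_categories(edges):
    """Given (child, parent) category pairs, return the categories
    that are never a parent, i.e. the leaves of the hierarchy."""
    categories, parents = set(), set()
    for child, parent in edges:
        categories.add(child)
        categories.add(parent)
        parents.add(parent)
    return categories - parents

edges = [
    ("1612_establishments_in_Mexico", "1610s_establishments_in_Mexico"),
    ("1610s_establishments_in_Mexico", "Establishments_in_Mexico"),
]
print(sorted(leaf_categories(edges)))  # ['1612_establishments_in_Mexico']
```

At Wikipedia scale the same set difference can be pushed into MySQL (a NOT EXISTS or LEFT JOIN over the edge table) instead of being computed in memory.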
