Re: [Dbpedia-developers] Processing Wikipedia Categories

Marco Fossati Fri, 05 Jul 2013 06:49:40 -0700


On 7/5/13 3:26 PM, Andrea Di Menna wrote:
> Hi Marco,
>
> $ bzgrep "1612_establishments_in_Mexico" skos_categories_en.ttl.bz2
> <http://dbpedia.org/resource/Category:1612_establishments_in_Mexico> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
> <http://dbpedia.org/resource/Category:1612_establishments_in_Mexico> <http://www.w3.org/2004/02/skos/core#prefLabel> "1612 establishments in Mexico"@en .
>
> So the category referred to in the issue seems to be there in the file.
@Kasun, can you check this please?
>
> Would you mind:
> 1. Amending the issue with the correct datafile? (it refers to N-Triples
> version and not the Turtle one - i.e.
> http://downloads.dbpedia.org/3.8/en/skos_categories_en.nt.bz2)
Sorry, bad copy/paste in the issue. Fixed, thanks for pointing it out :-)
> 2. Checking again the statement "categories that don't have a broader
> category are not included in dump 2" which seems to be wrong
>
> Is it possible that what you meant is that not all of the categories
> referred to in the articles appear in the skos file?
>
> Cheers
> Andrea
>
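A minimal sketch of how statement 2 could be checked without grep: for one category, report whether it appears as a subject in the dump and whether any of its triples uses skos:broader. It assumes the dump has one triple per line, as the bzgrep output above suggests. The function names are made up for illustration.

```python
import bz2

SKOS_BROADER = "<http://www.w3.org/2004/02/skos/core#broader>"


def category_status(lines, category_uri):
    """Return (present, has_broader) for one category.

    `lines` is any iterable of N-Triples/Turtle lines, one triple per line.
    """
    subject = "<" + category_uri + ">"
    present = False
    has_broader = False
    for line in lines:
        parts = line.split(None, 2)  # subject, predicate, rest of the triple
        if len(parts) < 3 or parts[0] != subject:
            continue
        present = True
        if parts[1] == SKOS_BROADER:
            has_broader = True
    return present, has_broader


def category_status_in_dump(path, category_uri):
    # bz2.open in text mode streams the archive without unpacking it to disk.
    with bz2.open(path, "rt", encoding="utf-8") as dump:
        return category_status(dump, category_uri)
```

Going by the bzgrep output quoted above, calling category_status_in_dump on skos_categories_en.ttl.bz2 with http://dbpedia.org/resource/Category:1612_establishments_in_Mexico should report the category as present. Whether it has a broader link is exactly what needs checking.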
>
> 2013/7/5 Marco Fossati <[email protected]>
>
>     Hi Andrea,
>
>     You are checking the wrong datasets.
>     We are using [1] and [2].
>     Cheers,
>
>     [1] http://downloads.dbpedia.org/preview.php?file=3.8_sl_en_sl_article_categories_en.ttl.bz2
>     [2] http://downloads.dbpedia.org/preview.php?file=3.8_sl_en_sl_skos_categories_en.ttl.bz2
>
>
>     On 7/5/13 3:17 PM, Andrea Di Menna wrote:
>
>         On top of this:
>
>         $ bzgrep "1612_establishments_in_Mexico" skos_categories_en.nt.bz2
>         <http://dbpedia.org/resource/Category:1612_establishments_in_Mexico> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
>
>         So the category you are referring to in [1] is present in
>         skos_categories.
>         It does not have a parent category in this case, but I think this is
>         because parent categories are added using a template which might be
>         unsupported in the extraction framework:
>         {{estcatCountry|161|2|Mexico}}
>
>         Am I wrong?
>
>         [1] https://github.com/dbpedia/dbpedia-links/issues/16
>
>
>         2013/7/5 Andrea Di Menna <[email protected]>
>
>
>              Hi Marco, Kasun,
>
>              I am not sure I understand now.
>
>              I have not yet directly verified your statement about missing
>              categories in the skos data, but after a rough check:
>
>              $ zcat skos_categories_en.nt.gz | grep -v "^#" | cut -d">" -f1 | sort | uniq | wc -l
>              862826
>              $ zcat category_labels_en.nt.gz | grep -v "^#" | sort | uniq | wc -l
>              862826
>
>              That suggests all the categories appear in skos_categories (I
>              presume as skos:Concept).
>
>              What confuses me about your statement is this: you say that
>              "categories that don't have a broader category are not included in
>              dump 2" (i.e. the skos_categories file), but then you say that
>              http://en.wikipedia.org/wiki/Category:1612_establishments_in_Mexico,
>              which has a parent
>              http://en.wikipedia.org/wiki/Category:1610s_establishments_in_Mexico,
>              does not appear in skos_categories.
>              Does that mean that categories which do have broader categories
>              are also missing from skos_categories?
>              Could you please elaborate?
>
>              Kasun, regarding data freshness (Reason 2), you can generate a
>              new version of the DBpedia datasets for your personal use
>              whenever you like, using the extraction framework.
>
>              Cheers
>              Andrea
>
>
>              2013/7/5 Marco Fossati <[email protected]>
>
>
>                  Hi Kasun,
>
>                  On 7/5/13 2:17 PM, kasun perera wrote:
>                   > I was working with skos_categories, but these are some reasons
>                   > why I would avoid using that dataset for parent-child
>                   > relationship detection.
>                   >
>                   > Reason 1
>                   > I need to get all leaf categories AND their child-parent
>                   > relationships.
>                   > Categories that don't have a broader category are not included
>                   > in the skos_categories dump. This claim is discussed here:
>                   > https://github.com/dbpedia/dbpedia-links/issues/16
>                  This is a finding that should be documented, if that has not
>                  already been done.
>                  Could you please check and, if necessary, update the
>                  documentation?
>                  Also, if this is not the intended behavior of the category
>                  extractor, it should be reported as an issue in the
>                  extraction-framework repo.
>
>                  Thanks Kasun for this useful analysis.
>                  Cheers!
>                   >
>                   > Reason 2
>                   > We need to be concerned about data freshness when dealing with
>                   > tasks related to knowledge representation. The latest DBpedia
>                   > dumps (3.8) are nearly one year old. This work also needs to
>                   > deal with other datasets such as Wikipedia page_edit_history,
>                   > interlanguage links, etc. So all the datasets need to be in
>                   > sync with each other, i.e. they must have the same dates. If I
>                   > use DBpedia dumps, it is hard to find synchronized datasets.
>                   >
>                   >
>                   >
>                   > On Fri, Jun 28, 2013 at 3:33 PM, Alessio Palmero Aprosio
>                   > <[email protected]> wrote:
>                   >
>                   >     Dear Kasun,
>                   >     I'm investigating graph DBs at the moment, but I haven't tried
>                   >     any yet.
>                   >     In my implementation, I'm using a Lucene index to store
>                   >     categories.
>                   >     I have two fields: category name and parent. The parent is
>                   >     null if there is no parent at all.
>                   >     Whenever I need a path, I start from the category and follow
>                   >     its parents. If I encounter a category I have already seen, I
>                   >     stop the loop (otherwise it would go on forever).
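The walk Alessio describes can be sketched as follows. A plain dict stands in for the Lucene index here (a simplification for illustration, not his implementation), and `path_to_root` is a made-up name.

```python
def path_to_root(category, parent_of):
    """Follow parent links from `category`, stopping at a root or a cycle.

    `parent_of` maps a category name to its parent, or to None when the
    category has no parent.
    """
    path = []
    seen = set()
    current = category
    while current is not None and current not in seen:
        seen.add(current)  # remember visited categories to break cycles
        path.append(current)
        current = parent_of.get(current)
    return path
```

The `seen` set is what keeps the loop from running forever when the category graph contains a cycle.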
>                   >
>                   >     You can also use a simple MySQL database with two fields, but
>                   >     I think Lucene is faster.
>                   >
>                   >     Alessio
>                   >
>                   >
>                   >     On 28/06/13 10:25, kasun perera wrote:
>                   >>     Hi Alessio
>                   >>
>                   >>     On Thu, Jun 27, 2013 at 1:42 PM, Alessio Palmero Aprosio
>                   >>     <[email protected]> wrote:
>                   >>
>                   >>         Dear Kasun,
>                   >>         I had to deal with the same problem some months ago,
>                   >>
>                   >>
>                   >>     Just curious about how you stored the edge and vertex
>                   >>     relationships when processing the categories.
>                   >>     In-memory processing would be difficult since there is a
>                   >>     huge number of edges and vertices, so I think it's better to
>                   >>     store them in a database.
>                   >>     I have heard about graph databases [1], but haven't worked
>                   >>     with them. Did you use something like that, or a simple MySQL
>                   >>     database?
>                   >>
>                   >>     [1] http://en.wikipedia.org/wiki/Graph_database
>                   >>
>                   >>         and I managed to use the XML article file: you can detect
>                   >>         categories using the "Category:" prefix, and you can infer
>                   >>         the parent-child relation using the <title> tag (if the
>                   >>         <title> starts with "Category:", all the categories listed
>                   >>         on this page are possible ancestors).
>                   >>         The Wikipedia category taxonomy is quite a mess, so good
>                   >>         luck!
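A minimal sketch of this XML-based approach, assuming the usual pages-articles layout with <page>, <title> and <text> elements. The function names are invented for illustration, and only explicit [[Category:...]] links in the wikitext are matched, so categories added through templates such as {{estcatCountry|...}} would presumably be missed.

```python
import re
import xml.etree.ElementTree as ET

# Captures the category name in [[Category:Name]] or [[Category:Name|sort key]].
CATEGORY_LINK = re.compile(r"\[\[Category:([^\]|]+)")


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def category_edges(xml_source):
    """Yield (child, parent) pairs for every category page in the dump.

    `xml_source` is a file name or a binary file object.
    """
    title, text = None, ""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if title and title.startswith("Category:"):
                child = title[len("Category:"):]
                for match in CATEGORY_LINK.finditer(text):
                    yield child, match.group(1).strip()
            title, text = None, ""
            elem.clear()  # drop the finished page's children to limit memory use
```

The pairs it yields can be written straight into whatever store is used for the graph (Lucene, MySQL, or a dict for small tests).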
>                   >>
>                   >>         Alessio
>                   >>
>                   >>
>                   >>         On 27/06/13 05:24, kasun perera wrote:
>                   >>>         As discussed with Marco, these are the next tasks that I
>                   >>>         will be working on.
>                   >>>
>                   >>>         1. Identification of leaf categories
>                   >>>         2. Prominent leaves discovery
>                   >>>         3. Pages clustering based on prominent leaves
>                   >>>
>                   >>>         For task 1 above, I'm planning to use the Wikipedia
>                   >>>         category and category_links SQL tables available here:
>                   >>>         http://dumps.wikimedia.org/enwiki/20130604/
>                   >>>
>                   >>>         The dump files are fairly large: 20 MB and 1.2 GB in size
>                   >>>         respectively.
>                   >>>         I'm thinking of loading this data into a MySQL database
>                   >>>         and doing the processing there, rather than processing
>                   >>>         these files in memory.
>                   >>>         Also, the number of leaf categories and prominent nodes
>                   >>>         will be large, so they need to be pushed to MySQL tables.
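For task 1, once child-parent pairs are available from whatever source, leaf categories are simply those that never occur as a parent. A minimal in-memory sketch; the function name and the pair format are assumptions for illustration, and the same logic could be written as an anti-join if the data lives in MySQL.

```python
def leaf_categories(edges):
    """Return the categories that are never the parent of another category.

    `edges` is an iterable of (child, parent) category pairs.
    """
    children = set()
    parents = set()
    for child, parent in edges:
        children.add(child)
        parents.add(parent)
    return children - parents
```

Note that a category with neither a parent nor a subcategory never shows up in the pairs at all, so it would have to be added from the full category list.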
>                   >>>
>                   >>>         I want to know whether this code should be written under
>                   >>>         the extraction-framework code; if so, where should I plug
>                   >>>         it in? Or is it a good idea to write it separately and
>                   >>>         push it to a new repo? If I write it separately, can I
>                   >>>         use a language other than Scala?
>                   >>>
>                   >>>
>                   >>>         --
>                   >>>         Regards
>                   >>>
>                   >>>         Kasun Perera
>                   >>>
>                   >>>
>                   >>>
>                   >>>
>
>                   >>>         ------------------------------------------------------------------------------
>                   >>>         This SF.net email is sponsored by Windows:
>                   >>>
>                   >>>         Build for Windows Store.
>                   >>>
>                   >>>         http://p.sf.net/sfu/windows-dev2dev
>                   >>>         _______________________________________________
>                   >>>         Dbpedia-developers mailing list
>                   >>>         Dbpedia-developers@lists.sourceforge.net
>                   >>>         https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>                   >>
>                   >>
>                   >>
>                   >>
>                   >>     --
>                   >>     Regards
>                   >>
>                   >>     Kasun Perera
>                   >>
>                   >
>                   >
>                   >
>                   >
>                   > --
>                   > Regards
>                   >
>                   > Kasun Perera
>                   >
>
>                  --
>                  Marco Fossati
>                  http://about.me/marco.fossati
>                  Twitter: @hjfocs
>                  Skype: hell_j
>
>
>
>
>
>
>     --
>     Marco Fossati
>     http://about.me/marco.fossati
>     Twitter: @hjfocs
>     Skype: hell_j
>
>


-- 
Marco Fossati
http://about.me/marco.fossati
Twitter: @hjfocs
Skype: hell_j

------------------------------------------------------------------------------
This SF.net email is sponsored by Windows:

Build for Windows Store.

http://p.sf.net/sfu/windows-dev2dev
_______________________________________________
Dbpedia-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
