Hi Marco,
$ bzgrep "1612_establishments_in_Mexico" skos_categories_en.ttl.bz2
<http://dbpedia.org/resource/Category:1612_establishments_in_Mexico> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
<http://dbpedia.org/resource/Category:1612_establishments_in_Mexico> <http://www.w3.org/2004/02/skos/core#prefLabel> "1612 establishments in Mexico"@en .
So the category referred to in the issue seems to be there in the file.
Would you mind:
1. Amending the issue to point at the correct data file? (It refers to the
N-Triples version and not the Turtle one, i.e.
http://downloads.dbpedia.org/3.8/en/skos_categories_en.nt.bz2)
2. Checking again the statement "categories that don't have a broader
category are not included in dump 2", which seems to be wrong.
Is it possible that what you meant is that not all the categories
referred to in the articles are in the skos file?
Cheers
Andrea
2013/7/5 Marco Fossati <[email protected]>
> Hi Andrea,
>
> You are checking the wrong datasets.
> We are using [1] and [2].
> Cheers,
>
> [1] http://downloads.dbpedia.org/preview.php?file=3.8_sl_en_sl_article_categories_en.ttl.bz2
> [2] http://downloads.dbpedia.org/preview.php?file=3.8_sl_en_sl_skos_categories_en.ttl.bz2
>
>
> On 7/5/13 3:17 PM, Andrea Di Menna wrote:
>
>> On top of this:
>>
>> $ bzgrep "1612_establishments_in_Mexico" skos_categories_en.nt.bz2
>> <http://dbpedia.org/resource/Category:1612_establishments_in_Mexico> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
>>
>> So the category you are referring to in [1] is present in skos_categories.
>> It does not have a parent category in this case, but I think this is
>> because parent categories are added using a template which might be
>> unsupported in the extraction framework:
>> {{estcatCountry|161|2|Mexico}}
>>
>> Am I wrong?
>>
>> [1] https://github.com/dbpedia/dbpedia-links/issues/16
>>
>>
>> 2013/7/5 Andrea Di Menna <[email protected]>
>>
>>
>> Hi Marco, Kasun,
>>
>> I am not sure I understand now.
>>
>> I have not yet verified your statement about missing categories in the
>> skos data directly, but here is a rough check:
>>
>> $ zcat skos_categories_en.nt.gz | grep -v "^#" | cut -d">" -f1 | sort | uniq | wc -l
>> 862826
>> $ zcat category_labels_en.nt.gz | grep -v "^#" | sort | uniq | wc -l
>> 862826
>>
>> That suggests all the categories appear in skos_categories (I presume
>> as skos:Concept).
>>
>> What confuses me about your statement is this:
>> "categories that don't have a broader category are not included in
>> dump 2" (i.e. the skos_categories file), but then you say that
>> http://en.wikipedia.org/wiki/Category:1612_establishments_in_Mexico,
>> which has the parent
>> http://en.wikipedia.org/wiki/Category:1610s_establishments_in_Mexico,
>> does not appear in skos_categories.
>> Does that mean that categories which do have broader categories are
>> also not included in skos_categories?
>> Could you please elaborate?
>>
>> Kasun, regarding data freshness (Reason 2): you can recreate a new
>> version of the DBpedia dataset for your personal use whenever you
>> like (using the extraction framework).
>>
>> Cheers
>> Andrea
>>
>>
>> 2013/7/5 Marco Fossati <[email protected]>
>>
>>
>> Hi Kasun,
>>
>> On 7/5/13 2:17 PM, kasun perera wrote:
>> > I was working with skos_categories, but these are some reasons why I
>> > would avoid using that dataset for parent-child relationship
>> > detection.
>> >
>> > Reason 1
>> > I need to get all leaf categories AND their child-parent
>> > relationships. Categories that don't have a broader category are not
>> > included in the skos_categories dump. This claim is discussed here:
>> > https://github.com/dbpedia/dbpedia-links/issues/16
>> This is a finding that should be documented if not already done.
>> Could you please check and eventually update the documentation?
>> Also, if this is not the intended behavior of the category extractor,
>> it should be reported as an issue in the extraction-framework repo.
>>
>> Thanks Kasun for these useful analytics.
>> Cheers!
>> >
>> > Reason 2
>> > We need to be concerned about data freshness when dealing with tasks
>> > related to knowledge representation. The latest DBpedia dumps (3.8)
>> > are nearly one year old. This work also needs to deal with other
>> > datasets such as the Wikipedia page edit history, interlanguage
>> > links, etc., so all the datasets need to be in sync with each other,
>> > i.e. they must have the same dates. If I use the DBpedia dumps there
>> > is a problem of finding synchronized datasets.
>> >
>> >
>> >
>> > On Fri, Jun 28, 2013 at 3:33 PM, Alessio Palmero Aprosio
>> > <[email protected]> wrote:
>> >
>> > Dear Kasun,
>> > I'm investigating graph DBs at the moment, but I haven't tried any
>> > yet.
>> > In my implementation, I'm using a Lucene index to store categories.
>> > I have two fields: category name and parent. The parent is null if
>> > there is no parent at all.
>> > Whenever I need a path, I start from the category and walk up
>> > through its parents. If I encounter a category I have already seen
>> > before, I stop the loop (otherwise it would go on forever).
>> >
>> > You can also use a simple MySQL database with two fields, but I
>> > think Lucene is faster.
>> >
>> > Alessio
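The parent-walk with cycle detection that Alessio describes can be sketched roughly as follows (a minimal sketch with a hypothetical function name; his actual implementation reads the name/parent pairs from a Lucene index rather than an in-memory dict):

```python
def category_path(category, parent_of):
    """Return the chain from `category` up through its parents.

    `parent_of` maps a category name to its parent name (or None at the
    top). The walk stops as soon as a category repeats, so cyclic data
    cannot make the loop run forever.
    """
    path = []
    seen = set()
    current = category
    while current is not None and current not in seen:
        seen.add(current)
        path.append(current)
        current = parent_of.get(current)
    return path
```

The `seen` set is exactly the "stop if I already encountered this category" rule from the email.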
>> >
>> >
>> > On 28/06/13 10:25, kasun perera wrote:
>> >> Hi Alessio
>> >>
>> >> On Thu, Jun 27, 2013 at 1:42 PM, Alessio Palmero Aprosio
>> >> <[email protected]> wrote:
>> >>
>> >> Dear Kasun,
>> >> I had to deal with the same problem some months ago,
>> >>
>> >>
>> >> Just curious how you stored the edge and vertex relationships
>> >> when processing the categories.
>> >> In-memory processing would be difficult since there is a huge
>> >> number of edges and vertices, so I think it's good to store them
>> >> in a database.
>> >> I have heard about graph databases [1], but haven't worked with
>> >> them. Did you use something like that, or a simple MySQL
>> >> database?
>> >>
>> >>
>> >> [1] http://en.wikipedia.org/wiki/Graph_database
>> >>
>> >> and I managed to use the XML article file: you can intercept
>> >> categories using the "Category:" prefix, and you can infer
>> >> father-son relations using the <title> tag (if the <title>
>> >> starts with "Category:", all the categories for this page are
>> >> possible ancestors).
>> >> The Wikipedia category taxonomy is quite a mess, so good luck!
>> >>
>> >> Alessio
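The XML-dump trick Alessio describes (a page whose <title> starts with "Category:" is a child, and the [[Category:...]] links in its wikitext are its possible parents) can be sketched like this. This is a rough illustration, not his actual code; a real run over the enwiki dump would also need to handle redirects and template-generated category links:

```python
import re
import xml.etree.ElementTree as ET

def category_edges(xml_source):
    """Yield (parent, child) pairs from a Wikipedia pages XML dump.

    Only pages whose <title> starts with "Category:" are treated as
    children; every [[Category:...]] link in their text is a possible
    parent.
    """
    child = None
    for _, elem in ET.iterparse(xml_source):
        tag = elem.tag.rsplit('}', 1)[-1]  # drop the XML namespace if any
        if tag == 'title':
            title = elem.text or ''
            child = title if title.startswith('Category:') else None
        elif tag == 'text' and child is not None:
            # category links in wikitext look like [[Category:Name]]
            # or [[Category:Name|sort key]]
            for parent in re.findall(r'\[\[(Category:[^\]|]+)', elem.text or ''):
                yield parent, child
            elem.clear()  # free memory as we stream through the dump
```

`iterparse` streams the dump instead of loading it into memory, which matters at the size of the enwiki file.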
>> >>
>> >>
>> >> On 27/06/13 05:24, kasun perera wrote:
>> >>> As discussed with Marco, these are the next tasks I will be
>> >>> working on:
>> >>>
>> >>> 1. Identification of leaf categories
>> >>> 2. Prominent leaves discovery
>> >>> 3. Pages clustering based on prominent leaves
>> >>>
>> >>> For task 1, I'm planning to use the Wikipedia category and
>> >>> categorylinks SQL tables available here:
>> >>> http://dumps.wikimedia.org/enwiki/20130604/
>> >>>
>> >>> The above dump files are somewhat large, 20 MB and 1.2 GB in
>> >>> size respectively.
>> >>> I'm thinking of putting these data into a MySQL database and
>> >>> doing the processing there rather than processing these files
>> >>> in-memory. Also, the number of leaf categories and prominent
>> >>> nodes would be large and would need to be pushed to MySQL
>> >>> tables.
>> >>>
>> >>> I want to know whether this code should be written under the
>> >>> extraction-framework code; if so, where should I plug it in?
>> >>> Or is it a good idea to write it separately and push it to a
>> >>> new repo? If I write it separately, can I use a language other
>> >>> than Scala?
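Once the parent-child edges are available (whether from MySQL or anywhere else), the leaf categories in Kasun's task 1 are simply the categories that never appear as a parent. A minimal in-memory sketch of that step, with a hypothetical function name (with the real categorylinks table the same idea would be an anti-join in SQL):

```python
def leaf_categories(edges):
    """Given (parent, child) category pairs, return the categories
    that never act as a parent, i.e. the leaves of the category graph.
    """
    parents = set()
    nodes = set()
    for parent, child in edges:
        parents.add(parent)
        nodes.add(parent)
        nodes.add(child)
    return nodes - parents
```

For the full enwiki graph this set would be large, which is why Kasun proposes pushing the results into MySQL tables rather than keeping them in memory.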
>> >>>
>> >>>
>> >>> --
>> >>> Regards
>> >>>
>> >>> Kasun Perera
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> ------------------------------------------------------------------------------
>> >>> This SF.net email is sponsored by Windows:
>> >>>
>> >>> Build for Windows Store.
>> >>>
>> >>> http://p.sf.net/sfu/windows-dev2dev
>> >>>
>> >>> _______________________________________________
>> >>> Dbpedia-developers mailing list
>> >>> [email protected]
>> >>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Regards
>> >>
>> >> Kasun Perera
>> >>
>> >
>> >
>> >
>> >
>> > --
>> > Regards
>> >
>> > Kasun Perera
>> >
>>
>> --
>> Marco Fossati
>> http://about.me/marco.fossati
>> Twitter: @hjfocs
>> Skype: hell_j
>>
>>
>>
>>
>>
> --
> Marco Fossati
> http://about.me/marco.fossati
> Twitter: @hjfocs
> Skype: hell_j
>