Re: [Dbpedia-discussion] Italian short / long abstract problem

Sebastian Hellmann Tue, 27 Sep 2011 00:28:00 -0700

Hello all,
I think the first important step is to produce a test case.
Can you provide cases where the extraction fails?

The best would be to have, for each article, one file that contains the original Wikipedia source and one file that contains the expected output.


This will be a good basis for regression test cases in the future.
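
A minimal sketch of how such file pairs could be checked automatically. It assumes a hypothetical layout in which each article has `<name>.wiki` (the original Wikipedia source) and `<name>.expected` (the expected abstract) in one directory, and `extract` stands in for whatever abstract extraction function is under test. None of these names come from the DBpedia code base.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def load_cases(case_dir: Path) -> List[Tuple[str, str, str]]:
    """Collect (name, wikitext, expected_abstract) for every complete file pair."""
    cases = []
    for source in sorted(case_dir.glob("*.wiki")):
        expected = source.with_suffix(".expected")
        if expected.exists():  # skip sources that have no expected output yet
            cases.append((
                source.stem,
                source.read_text(encoding="utf-8"),
                expected.read_text(encoding="utf-8").strip(),
            ))
    return cases


def run_cases(case_dir: Path, extract: Callable[[str], str]) -> List[str]:
    """Return the names of the articles whose extracted abstract differs from the expected one."""
    failures = []
    for name, wikitext, expected in load_cases(case_dir):
        if extract(wikitext).strip() != expected:
            failures.append(name)
    return failures
```

An empty list from `run_cases` would mean every provided article is extracted as expected; any returned names point at the articles to look at first.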

After you have provided these test cases, there are two choices:

1. You can wait until the next Wikipedia dump in 6 months; I guess that is the time when we will try to fix the problem based on the test cases you provided.
2. Try to fix it yourself (with a little help from us) and make a new Italian dump (or the abstract data sets), which we will use to replace the current one. (This would probably be much faster, and you would learn how to produce/tune the DBpedia data.)


All the best,
Sebastian




On 09/23/2011 09:45 AM, [email protected] wrote:

Hi Piero,

This seems to be a very common problem for the Italian dataset(s). I've done
some random tests, and more than 50% of the abstracts are messed up (this is a very
random test, so the numbers are only indicative).

   Just for example:

http://dbpedia.org/page/Vasco_Rossi
http://dbpedia.org/page/Gina_Lollobrigida
http://dbpedia.org/page/Tom_Cruise

   and many others. Another problem: 'Ornella Muti' is missing.

As you can see, the problem affects only Italian; abstracts in other languages are OK.


Please let me know. I can help to fix that.

Alessio

    -------- Original Message --------
    Subject: Re: [Dbpedia-discussion] Italian short / long abstract
    problem
    From: Piero Molino <[email protected]
    <http://[email protected]>>
    Date: Thu, September 22, 2011 10:12 am
    To: Sebastian Hellmann <[email protected]
    <mailto:[email protected]>>
    Cc: [email protected]
    <mailto:[email protected]>,
    [email protected]
    <mailto:[email protected]>

    Hello Alessio,

    the fact that the abstract starts from the second sentence is
    probably because many first sentences are generated
    from templates in the Italian Wikipedia.
    A clear example is the bio template for people.
    So please take a look at the original Wikipedia page source and
    see if this is the problem. If not, please give us some examples
    of the messed-up abstracts.
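
    A rough sketch of how that check on the page source could be automated. It assumes the template is invoked as `{{Bio ...}}` in the wikitext (the exact template name is an assumption here), and `uses_template` is a hypothetical helper, not part of the DBpedia extraction framework.

```python
import re


def uses_template(wikitext: str, template_name: str = "Bio") -> bool:
    """Return True if the wikitext invokes the given template (case-insensitive).

    The name must be followed by '|' or '}}' (optionally after whitespace),
    so a longer template name that merely starts with the same letters is not matched.
    """
    pattern = r"\{\{\s*" + re.escape(template_name) + r"\s*(\||\}\})"
    return re.search(pattern, wikitext, flags=re.IGNORECASE) is not None
```

    Articles for which this returns True and whose abstract lacks its first sentence would be consistent with the template explanation; the others would be worth sending as separate examples.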

    Regards,
    Piero Molino



    Il giorno 22/set/2011, alle ore 19:06, Sebastian Hellmann ha scritto:

    Dear Alessio,
    Sorry, but this is actually not at the top of our to-do list.
    We could assist you a little in fixing the problem.
    Would you be willing to try it?
    Sebastian

    On 09/22/2011 02:02 PM, [email protected] wrote:

    Hello,

       In DBpedia 3.7, a large number of long and short abstracts for
    the Italian language are messed up.
    They start from the second sentence of the Wikipedia article,
    skipping the first one, so the
    abstract as a whole is of little use, as the subject is often
    unclear.

       Is it possible to fix that problem?

    Alessio


    
------------------------------------------------------------------------------
    All the data continuously generated in your IT infrastructure contains a
    definitive record of customers, application performance, security
    threats, fraudulent activity and more. Splunk takes this data and makes
    sense of it. Business sense. IT sense. Common sense.
    http://p.sf.net/sfu/splunk-d2dcopy1


    _______________________________________________
    Dbpedia-discussion mailing list
    [email protected]
    https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion


    
