Re: [Dbpedia-discussion] Italian short / long abstract problem

contacts Tue, 27 Sep 2011 01:00:37 -0700

Hello,

Option 2 is the only viable one for us, given the time constraints. So if you have a link or a
document about that step, it would be a great way to start.

I can produce test cases. For example, the abstracts for http://dbpedia.org/page/Vasco_Rossi, http://dbpedia.org/page/Gina_Lollobrigida
and http://dbpedia.org/page/Tom_Cruise start from the second sentence of the related Wikipedia articles.

Thanks,

Alessio@ComplexityIntelligence

-------- Original Message --------

Subject: Re: [Dbpedia-discussion] Italian short / long abstract problem
From: Sebastian Hellmann <[email protected]>
Date: Tue, September 27, 2011 12:26 am
To: [email protected]
Cc: Piero Molino <[email protected]>,
[email protected]

Hello all,
I think the first important step is to produce a test case.
Can you provide cases where the extraction fails?

Ideally there would be, for each article, one file containing the original Wikipedia source and one file containing the expected output.

This will be a good basis for regression test cases in the future.
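
The file layout described above could be checked with something like the following. This is only a minimal sketch, not part of the DBpedia extraction framework: the `.wiki` / `.expected` file extensions, the `find_failures` name, and the `extract` callable (standing in for whatever produces the abstract from the page source) are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable, List


def find_failures(test_dir: str, extract: Callable[[str], str]) -> List[str]:
    """Return the names of articles whose extracted abstract differs from the expected one.

    Assumed (hypothetical) layout: <article>.wiki holds the original Wikipedia
    source and <article>.expected holds the expected abstract.
    """
    failures = []
    for source_file in sorted(Path(test_dir).glob("*.wiki")):
        expected_file = source_file.with_suffix(".expected")
        if not expected_file.exists():
            continue  # no expected output recorded for this article yet
        source = source_file.read_text(encoding="utf-8")
        expected = expected_file.read_text(encoding="utf-8").strip()
        # Compare ignoring leading/trailing whitespace only.
        if extract(source).strip() != expected:
            failures.append(source_file.stem)
    return failures
```

An empty list would mean every recorded article still produces its expected abstract; any name in the list is a case to look at.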

Once you have provided these test cases, there are two choices:
1. You can wait until the next Wikipedia dump in 6 months; I guess that is when we will try to fix the problem based on the test cases you provided.
2. Try to fix it yourself (with a little help from us) and make a new Italian dump (or just the abstract data sets), which we will use to replace the current one. (This would probably be much faster, and you would learn how to produce/tune the DBpedia data.)

All the best,
Sebastian

On 09/23/2011 09:45 AM, [email protected] wrote:
Hi Piero,

This seems to be a very common problem for the Italian dataset(s). I've done
some random tests, and more than 50% of the abstracts are messed up (this was a very
rough random test, so the numbers are only indicative).

   Just for example:

    http://dbpedia.org/page/Vasco_Rossi

    http://dbpedia.org/page/Gina_Lollobrigida

    http://dbpedia.org/page/Tom_Cruise

   and many others. Another problem: 'Ornella Muti' is missing.

   As you can see, the problem only affects Italian; the abstracts in other languages are OK.

Please let me know. I can help to fix that.

Alessio



-------- Original Message --------
Subject: Re: [Dbpedia-discussion] Italian short / long abstract problem
From: Piero Molino <[email protected]>
Date: Thu, September 22, 2011 10:12 am
To: Sebastian Hellmann <[email protected]>
Cc: [email protected],
[email protected]

Hello Alessio,

The fact that the abstract starts from the second sentence is probably because many first sentences are generated from templates in the Italian Wikipedia.

A clear example is the Bio template for people.

So please take a look at the original Wikipedia page source and see if this is the problem. If not, please give us some examples of the messed-up abstracts.
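
To illustrate the suspected cause, here is a small sketch. It is not the DBpedia extraction code: `strip_templates` is a hypothetical stand-in for an extractor that discards templates, and the sample wikitext is made up (a simplified Bio template followed by ordinary prose), not taken from a real article.

```python
import re


def strip_templates(wikitext: str) -> str:
    """Remove non-nested {{...}} templates, as a naive extractor might."""
    return re.sub(r"\{\{[^{}]*\}\}", "", wikitext).strip()


# Made-up page source: on the rendered page the first sentence would come
# from the template, so it does not exist as plain text in the source.
sample = (
    "{{Bio|Nome = Mario|Cognome = Rossi|Attività = cantante}}\n"
    "Ha pubblicato molti album."
)

print(strip_templates(sample))  # prints "Ha pubblicato molti album."
```

If the page source of a failing article looks like this, the text left after removing the template begins with what the reader sees as the second sentence.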

Regards,

Piero Molino

On 22 Sep 2011, at 19:06, Sebastian Hellmann wrote:

Dear Alessio,
Sorry, but this is actually not on the top of our Todo list.
We could assist you a little in fixing the problem.
Would you be willing to try it?
Sebastian

On 09/22/2011 02:02 PM, [email protected] wrote:

Hello,

   In DBpedia 3.7, a large number of the long and short abstracts for the Italian language are messed up.
They start from the second sentence of the Wikipedia article, skipping the first one, so the
abstract as a whole is of little use, as the subject is often unclear.

   Is it possible to fix this problem?

Alessio

------------------------------------------------------------------------------ All the data continuously generated in your IT infrastructure contains a definitive record of customers, application performance, security threats, fraudulent activity and more. Splunk takes this data and makes sense of it. Business sense. IT sense. Common sense. http://p.sf.net/sfu/splunk-d2dcopy1

_______________________________________________ Dbpedia-discussion mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion

