Hello all,
I think the first important step is to produce test cases.
Can you provide cases where the extraction fails?
The best would be to have, for each article, one file that contains the
original Wikipedia source and one file that contains the expected output.
This will be a good basis for regression tests in the future.
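To make the file layout concrete, here is a minimal sketch of how such test
cases could be run. The naming scheme (<article>.wiki for the source,
<article>.expected for the expected abstract) and the `extract` callable are
assumptions made for illustration only; `extract` stands in for whatever
actually produces the abstract and is not part of the DBpedia code.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def load_cases(case_dir: Path) -> List[Tuple[str, str, str]]:
    """Pair each <name>.wiki source file with its <name>.expected abstract."""
    cases = []
    for source_path in sorted(case_dir.glob("*.wiki")):
        expected_path = source_path.with_suffix(".expected")
        if not expected_path.exists():
            raise FileNotFoundError(f"missing expected output for {source_path.name}")
        cases.append((
            source_path.stem,
            source_path.read_text(encoding="utf-8"),
            expected_path.read_text(encoding="utf-8").strip(),
        ))
    return cases


def run_regression(case_dir: Path, extract: Callable[[str], str]) -> List[str]:
    """Return the names of the articles whose extracted abstract differs."""
    failures = []
    for name, source, expected in load_cases(case_dir):
        # `extract` is a hypothetical stand-in for the real abstract extraction.
        if extract(source).strip() != expected:
            failures.append(name)
    return failures
```

Calling run_regression(Path("testcases"), my_extractor) would then list the
articles that still fail, which can be re-checked after every change.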
Once you have provided these test cases, there are two choices:
1. You can wait until the next Wikipedia dump in 6 months; I guess that
is when we will try to fix the problem based on the test cases
you provided.
2. Try to fix it yourself (with a little help from us) and make a new
Italian dump (or the abstract data sets), which we will use to replace
the current one. (This would probably be much faster, and you would learn
how to produce/tune the DBpedia data.)
All the best,
Sebastian
On 09/23/2011 09:45 AM, [email protected] wrote:
Hi Piero,
This seems to be a very common problem for the Italian dataset(s).
I've done some random tests, and more than 50% of the abstracts are
messed up (this was a very rough random test, so the numbers are only
indicative).
Just for example:
http://dbpedia.org/page/Vasco_Rossi
http://dbpedia.org/page/Gina_Lollobrigida
http://dbpedia.org/page/Tom_Cruise
and many others. Another problem: 'Ornella Muti' is missing.
As you can see, the problem occurs only for Italian; the abstracts in
other languages are OK.
Please let me know. I can help to fix that.
Alessio
-------- Original Message --------
Subject: Re: [Dbpedia-discussion] Italian short / long abstract
problem
From: Piero Molino <[email protected]>
Date: Thu, September 22, 2011 10:12 am
To: Sebastian Hellmann <[email protected]>
Cc: [email protected], [email protected]
Hello Alessio,
the fact that the abstract starts from the second sentence is
probably because many first sentences are generated
from templates in the Italian Wikipedia.
A clear example is the Bio template for people.
So please take a look at the original Wikipedia page source and
see if this is the problem. If not, please give us some examples
of the messed-up abstracts.
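As a rough illustration of that check, the sketch below tests whether a page
opens with a template before any plain prose, and shows what is left once
templates are dropped. Both function names are hypothetical, the sample
wikitext is made up, and the brace-depth stripping is only a simplification;
it is not how the DBpedia extraction handles templates.

```python
import re


def strip_templates(wikitext: str) -> str:
    """Remove {{...}} templates, including nested ones, by tracking brace depth."""
    out = []
    depth = 0
    i = 0
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
        else:
            # Keep only characters that sit outside every template.
            if depth == 0:
                out.append(wikitext[i])
            i += 1
    return "".join(out)


def lead_comes_from_template(wikitext: str, template: str = "Bio") -> bool:
    """True if the given template appears before any prose outside templates."""
    match = re.search(r"\{\{\s*" + re.escape(template) + r"\b", wikitext, re.IGNORECASE)
    if match is None:
        return False
    return strip_templates(wikitext[:match.start()]).strip() == ""


# Made-up sample: the rendered first sentence would come from the template,
# so only the later plain-text sentence survives template stripping.
sample = "{{Bio\n|Nome = Example\n}}\nPlain sentence written in the page source."
print(lead_comes_from_template(sample))
print(strip_templates(sample).strip())
```

Pages for which lead_comes_from_template returns True are the likely
candidates for the missing first sentence.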
Regards,
Piero Molino
On 22 Sep 2011, at 19:06, Sebastian Hellmann wrote:
Dear Alessio,
Sorry, but this is actually not at the top of our to-do list.
We could assist you a little in fixing the problem.
Would you be willing to try it?
Sebastian
On 09/22/2011 02:02 PM, [email protected] wrote:
Hello,
In DBpedia 3.7, a large number of long and short abstracts for
the Italian language are messed up.
They start from the second sentence of the Wikipedia article,
skipping the first one, so the abstract as a whole is of little use,
as the subject is often unclear.
Is it possible to fix that problem?
Alessio
------------------------------------------------------------------------------
All the data continuously generated in your IT infrastructure contains a
definitive record of customers, application performance, security
threats, fraudulent activity and more. Splunk takes this data and makes
sense of it. Business sense. IT sense. Common sense.
http://p.sf.net/sfu/splunk-d2dcopy1
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion