I completely misunderstood what you were saying. I thought you were
asking me for general feedback on abstract generation quality. Now I
realize that you are referring to the fact that I generated abstracts
without a local MediaWiki instance. What I did, however, may be
different from what you suspect.
Here's what I did:
- I saw that you invoke api.php of a local MediaWiki instance to parse
wiki text. I didn't bother to set one up, so I just replaced the URL
with that of the actual Wikipedia instance for the language I worked
on. As a result, the wiki text was rendered with templates substituted.
- After this modification I parsed wiki text from the XML database dump
using SimpleWikiParser and passed the PageNode to the
getAbstractWikiText method in the modified AbstractExtractor.
- I saw that the returned text contained HTML markup, so I removed it
using an HTML sanitizer. I assumed that you use the "modified" MediaWiki
to cover this part, but I wasn't sure.
- I was not happy with the short method in AbstractExtractor because it
didn't recognize sentence boundaries correctly. I created my own
shortening routine using java.text.BreakIterator with additional
abbreviation checks.
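
The shortening step in the last bullet can be sketched roughly as
follows. This is only an illustration, not the routine actually used:
the class name AbstractShortener, the shorten signature, and the tiny
abbreviation list are all assumptions made up for the example. Only
java.text.BreakIterator itself comes from the discussion.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class AbstractShortener {
    // Hypothetical abbreviation list; a real one would be language specific.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("dr.", "prof.", "e.g.", "i.e."));

    /**
     * Returns the longest prefix of whole sentences that fits in maxLength.
     * Returns the empty string if not even the first sentence fits.
     */
    static String shorten(String text, int maxLength, Locale locale) {
        if (text.length() <= maxLength) {
            return text;
        }
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        int best = 0;
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            // Ignore the whitespace that follows the sentence terminator.
            int trimmedEnd = end;
            while (trimmedEnd > 0 && Character.isWhitespace(text.charAt(trimmedEnd - 1))) {
                trimmedEnd--;
            }
            if (trimmedEnd > maxLength) {
                break;
            }
            // Skip boundaries that fall right after a known abbreviation.
            if (!endsWithAbbreviation(text, trimmedEnd)) {
                best = trimmedEnd;
            }
        }
        return text.substring(0, best);
    }

    // True if the last whitespace-delimited token before 'end' is an abbreviation.
    private static boolean endsWithAbbreviation(String text, int end) {
        int start = end;
        while (start > 0 && !Character.isWhitespace(text.charAt(start - 1))) {
            start--;
        }
        return ABBREVIATIONS.contains(text.substring(start, end).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String text = "Dr. Smith arrived. He left.";
        System.out.println(shorten(text, 20, Locale.ENGLISH));
    }
}
```

The abbreviation check is applied after BreakIterator has proposed a
boundary, so it works the same whether or not the iterator itself
splits after a token such as "Dr.".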
From what you're saying below, I suspect that you are interested in
generating abstracts without invoking MediaWiki at all, either locally
or remotely. That I haven't tried to do.
Sorry for the confusion but I'm very new to all this and I'm just trying
to use some of the extraction framework code for my purposes. Are we on
the same page now?
Regards,
Piotr
On 2012-10-03 10:08, Pablo N. Mendes wrote:
I have searched a bit through the list and only found an example in
Italian.
*Article:*
http://it.wikipedia.org/wiki/Vasco_Rossi
*Rendered text:*
Vasco Rossi, anche noto come Vasco o con l'appellativo Il Blasco[7]
(Zocca, 7 febbraio 1952), è un cantautore italiano.
*Source:*
{{Bio
|Nome = Vasco
|Cognome = Rossi
|PostCognomeVirgola = anche noto come '''Vasco''' o con l'appellativo '''''Il Blasco'''''<ref>[http://archivio.lastampa.it/LaStampaArchivio/main/History/tmpl_viewObj.jsp?objid=1092556 Ma Vasco Rossi torna a giugno. Il Blasco piace sempre] archivio.lastampa.it</ref>
|Sesso = M
|LuogoNascita = Zocca
|GiornoMeseNascita = 7 febbraio
|AnnoNascita = 1952
|LuogoMorte =
|GiornoMeseMorte =
|AnnoMorte =
|Attività = cantautore
|Nazionalità = italiano
}}
If you could compare the output for both solutions with a few such
pages, we could have an initial assessment of "text quality" as
Dimitris put it.
Cheers,
Pablo
On Wed, Oct 3, 2012 at 9:30 AM, Dimitris Kontokostas <[email protected]> wrote:
I don't have a concrete test case; I would have to search blindly.
What I was thinking is that if we could create the abstracts in
exactly the same way as the modified MW, we could do a string
comparison and test how many are different and how. Depending on
the number and frequency of the text-rendering templates that
appear in the resulting abstracts, we could try to resolve them manually.
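
The string comparison proposed here could be sketched roughly as
follows. This is a hypothetical illustration, not existing framework
code: the class name AbstractComparison, the map-based inputs keyed by
page title, and the whitespace normalization are all assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class AbstractComparison {
    /**
     * Returns the sorted titles whose abstract is missing from the candidate
     * output or differs from the reference output after whitespace
     * normalization.
     */
    static List<String> differingTitles(Map<String, String> reference,
                                        Map<String, String> candidate) {
        List<String> differing = new ArrayList<>();
        for (Map.Entry<String, String> entry : reference.entrySet()) {
            String other = candidate.get(entry.getKey());
            if (other == null || !normalize(entry.getValue()).equals(normalize(other))) {
                differing.add(entry.getKey());
            }
        }
        Collections.sort(differing);
        return differing;
    }

    // Assumption: differences in whitespace alone should not count as a mismatch.
    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }
}
```

Dividing the size of the returned list by the number of reference
abstracts gives the share of abstracts that differ, and the titles
themselves point at the templates worth inspecting by hand.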
Removing the local Wikipedia mirror dependency for the extraction
could be a huge plus but we shouldn't compromise on quality.
Any other ideas?
Best,
Dimitris
On Wed, Oct 3, 2012 at 9:41 AM, Pablo N. Mendes <[email protected]> wrote:
Perhaps it would help the discussion if we got more concrete.
Dimitris, do you have a favorite abstract that is problematic
(therefore justifies using the modified MediaWiki)? Perhaps
you can paste the wiki markup source and the desired outcome
and Piotr can respond with the rendering by his patch.
On Oct 3, 2012 8:31 AM, "Dimitris Kontokostas" <[email protected]> wrote:
>
>
> On Wed, Oct 3, 2012 at 12:42 AM, Piotr Jagielski <[email protected]> wrote:
>>
>> What do you mean by text quality? The text itself is as good as the
>> first couple of sentences in the Wikipedia article you take it
>> from, right?
>
>
> Well, that is what I am asking :) Is it (exactly) the same text?
> The problem is with some templates that render text (e.g. date
> templates). If we can measure the extent of their usage, we can
> see if this is the way to go.
>
> Best,
> Dimitris
>
>>
>>
>> Piotr
>>
>>
>> On 2012-10-02 22:49, Dimitris Kontokostas wrote:
>>>
>>> Our main interest is the text quality. If we get this right, the
>>> shortening / tweaking should be the easy part :)
>>>
>>> Could you please give us some text quality feedback? If it is
>>> good, maybe we can start testing it on other languages as well.
>>>
>>> Best,
>>> Dimitris
>>>
>>> On Tue, Oct 2, 2012 at 11:11 PM, Piotr Jagielski <[email protected]> wrote:
>>>>
>>>> I haven't done extensive tests, but one thing to improve for sure
>>>> is the abstract shortening algorithm. You currently use a simple
>>>> regex to solve the complex problem of breaking natural language
>>>> text into sentences. java.text.BreakIterator yields better
>>>> results and is also locale-sensitive. You might also want to take
>>>> a look at the more advanced boundary analysis library at
>>>> http://userguide.icu-project.org/boundaryanalysis.
>>>>
>>>> Regards,
>>>> Piotr
--
Kontokostas Dimitris
--
---
Pablo N. Mendes
http://pablomendes.com
Events: http://wole2012.eurecom.fr
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion