I'm interested in exploring the wikitext-to-text approach. I'm just
wondering what your idea is for doing it without either a local
MediaWiki or remote Wikipedia calls. Do you have any code in the
extraction framework that can be used to parse wiki markup? I tried
toPlainText() on PageNode, but it appears only to replace links
while keeping all the wiki formatting like bold, italics, lists,
etc.
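To give an idea of what I mean, here is a rough sketch of the kind
of post-processing I would have to add on top of toPlainText(). The
regexes are naive and purely illustrative, not framework code:

    import java.util.regex.Pattern;

    // Naive, illustrative stripping of leftover wiki formatting.
    // These patterns are simplistic and miss many cases.
    public class WikiMarkupStripper {

        private static final Pattern BOLD_ITALIC = Pattern.compile("'{2,5}");
        private static final Pattern LIST_MARKER = Pattern.compile("(?m)^[*#:;]+\\s*");
        private static final Pattern HEADING     = Pattern.compile("(?m)^=+\\s*(.*?)\\s*=+\\s*$");

        public static String strip(String wikiText) {
            String text = BOLD_ITALIC.matcher(wikiText).replaceAll(""); // '''bold''', ''italics''
            text = LIST_MARKER.matcher(text).replaceAll("");            // *, #, :, ; markers
            text = HEADING.matcher(text).replaceAll("$1");              // == headings ==
            return text;
        }

        public static void main(String[] args) {
            System.out.println(strip("'''Vasco Rossi''' è un ''cantautore'' italiano.\n* primo punto"));
        }
    }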
Regards,
Piotr
On 2012-10-04 08:16, Dimitris Kontokostas wrote:
I think you did exactly that, just with an unnecessary call to
Wikipedia. The PageNode is a parameter to AbstractExtractor.extract,
so you could call that directly.
The patched MediaWiki is here:
http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/dbpedia/file/efc0afb0faa3/abstractExtraction
I was thinking we have two future tasks regarding this:
1) Create an "abstract" MediaWiki extension and get rid of the
patched old MediaWiki
2) See if a wikitext2text approach works (what you tried to do)
You could use the shortening function from the MediaWiki code and
then maybe contribute your code back ;-)
Best,
Dimitris
PS: anyone else from the community who has some time to implement
#1 is welcome.
On Wed, Oct 3, 2012 at 9:58 PM, Piotr Jagielski
<[email protected]> wrote:
I completely misunderstood what you were saying. I thought you
were asking me for feedback on abstract generation quality in
general. Now I realize that you are referring to the fact that I
generated abstracts without a local MediaWiki instance. What I did,
however, may be different from what you suspect.
Here's what I did:
- I saw that you invoke api.php of the local MediaWiki instance to
parse wiki text. I didn't bother to set one up, so I just replaced
the URL with the actual Wikipedia instance of the language I was
working on. This caused the wiki text to be rendered with templates
substituted.
- After this modification I parsed the wiki text from the XML
database dump using SimpleWikiParser and passed the PageNode to the
getAbstractWikiText method in the modified AbstractExtractor.
- I saw that the returned text contains HTML markup, so I removed
it using an HTML sanitizer. I assumed that you use the "modified"
MediaWiki to cover this part, but I wasn't sure.
- I was not happy with the short method in AbstractExtractor
because it didn't recognize sentence boundaries correctly. I
created my own shortening routine using java.text.BreakIterator
with additional abbreviation checks (see the sketch after this
list).
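For illustration, a simplified sketch of that kind of
BreakIterator-based shortening (the abbreviation list here is just
an example; the real checks are more involved):

    import java.text.BreakIterator;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Locale;
    import java.util.Set;

    // Simplified sketch of a sentence-based abstract shortener.
    public class AbstractShortener {

        // Example abbreviations that should not end a sentence (illustrative only).
        private static final Set<String> ABBREVIATIONS =
            new HashSet<String>(Arrays.asList("e.g.", "i.e.", "Dr.", "St.", "vs."));

        // Returns the shortest prefix of whole sentences with at least minLength chars.
        public static String shorten(String text, int minLength, Locale locale) {
            BreakIterator it = BreakIterator.getSentenceInstance(locale);
            it.setText(text);
            int end = it.first();
            for (int boundary = it.next(); boundary != BreakIterator.DONE; boundary = it.next()) {
                String candidate = text.substring(0, boundary).trim();
                // Skip boundaries that are really just abbreviations, not sentence ends.
                if (endsWithAbbreviation(candidate)) {
                    continue;
                }
                end = boundary;
                if (candidate.length() >= minLength) {
                    break;
                }
            }
            return end > 0 ? text.substring(0, end).trim() : text;
        }

        private static boolean endsWithAbbreviation(String s) {
            for (String abbr : ABBREVIATIONS) {
                if (s.endsWith(abbr)) {
                    return true;
                }
            }
            return false;
        }
    }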
From what you're saying below I suspect that you are interested in
generating abstracts without the need to invoke MediaWiki either
locally or remotely. That I haven't tried to do.
Sorry for the confusion, but I'm very new to all this and I'm just
trying to use some of the extraction framework code for my own
purposes. Are we on the same page now?
Regards,
Piotr
On 2012-10-03 10:08, Pablo N. Mendes wrote:
I have searched a bit through the list and only found an example
in Italian.
*Article:*
http://it.wikipedia.org/wiki/Vasco_Rossi
*Rendered text:*
Vasco Rossi, anche noto come Vasco o con l'appellativo Il
Blasco[7] (Zocca, 7 febbraio 1952), è un cantautore italiano.
*Source:*
{{Bio
|Nome = Vasco
|Cognome = Rossi
|PostCognomeVirgola = anche noto come '''Vasco''' o con
l'appellativo '''''Il
Blasco'''''<ref>[http://archivio.lastampa.it/LaStampaArchivio/main/History/tmpl_viewObj.jsp?objid=1092556
Ma Vasco Rossi torna a giugno. Il Blasco piace sempre]
archivio.lastampa.it</ref>
|Sesso = M
|LuogoNascita = Zocca
|GiornoMeseNascita = 7 febbraio
|AnnoNascita = 1952
|LuogoMorte =
|GiornoMeseMorte =
|AnnoMorte =
|Attività = cantautore
|Nazionalità = italiano
}}
If you could compare the output for both solutions with a few
such pages, we could have an initial assessment of "text quality"
as Dimitris put it.
Cheers,
Pablo
On Wed, Oct 3, 2012 at 9:30 AM, Dimitris Kontokostas
<[email protected]> wrote:
I don't have a concrete test case; I have to search blind.
What I was thinking is that if we could create the abstracts in
exactly the same way as the modified MediaWiki, we could make a
string comparison and test how many are different and how.
Depending on the number and frequency of the text-rendering
templates that appear in the resulting abstracts, we could try to
resolve them manually.
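For example, a rough comparison along these lines could give a
first count (assuming both runs write one "title<TAB>abstract" line
per page; the file names are made up):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    // Rough sketch: count how many abstracts differ between two extraction runs.
    public class AbstractDiffCounter {

        public static void main(String[] args) throws IOException {
            Map<String, String> modifiedMw = load("abstracts_modified_mw.tsv");
            Map<String, String> wikitext2text = load("abstracts_wikitext2text.tsv");

            int same = 0, different = 0, missing = 0;
            for (Map.Entry<String, String> e : modifiedMw.entrySet()) {
                String other = wikitext2text.get(e.getKey());
                if (other == null) {
                    missing++;
                } else if (other.equals(e.getValue())) {
                    same++;
                } else {
                    different++;
                }
            }
            System.out.println("same=" + same + " different=" + different
                    + " missing=" + missing);
        }

        // Reads "title<TAB>abstract" lines into a map.
        private static Map<String, String> load(String file) throws IOException {
            Map<String, String> abstracts = new HashMap<String, String>();
            BufferedReader in = new BufferedReader(new FileReader(file));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    int tab = line.indexOf('\t');
                    if (tab > 0) {
                        abstracts.put(line.substring(0, tab), line.substring(tab + 1));
                    }
                }
            } finally {
                in.close();
            }
            return abstracts;
        }
    }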
Removing the local Wikipedia mirror dependency for the extraction
could be a huge plus, but we shouldn't compromise on quality.
Any other ideas?
Best,
Dimitris
On Wed, Oct 3, 2012 at 9:41 AM, Pablo N. Mendes
<[email protected]> wrote:
Perhaps it would help the discussion if we got more concrete.
Dimitris, do you have a favorite abstract that is problematic (and
therefore justifies using the modified MediaWiki)? Perhaps you can
paste the wiki markup source and the desired outcome, and Piotr can
respond with the rendering produced by his patch.
On Oct 3, 2012 8:31 AM, "Dimitris Kontokostas"
<[email protected]> wrote:
>
>
> On Wed, Oct 3, 2012 at 12:42 AM, Piotr Jagielski
<[email protected]> wrote:
>>
>> What do you mean by text quality? The text itself is
as good as the first couple of sentences in the Wikipedia
article you take it from, right?
>
>
> Well, that is what I am asking :) Is it (exactly) the same text?
> The problem is with some templates that render text (e.g. date
templates). If we can measure the extent of their usage, we could
see if this is the way to go.
>
> Best,
> Dimitris
>
>>
>>
>> Piotr
>>
>>
>> On 2012-10-02 22:49, Dimitris Kontokostas wrote:
>>>
>>> Our main interest is the text quality; if we get this right,
the shortening / tweaking should be the easy part :)
>>>
>>> Could you please give us some text quality feedback, and if it
is good maybe we can start testing it on other languages as well.
>>>
>>> Best,
>>> Dimitris
>>>
>>> On Tue, Oct 2, 2012 at 11:11 PM, Piotr Jagielski
<[email protected]> wrote:
>>>>
>>>> I haven't done extensive tests, but one thing to improve for
sure is the abstract shortening algorithm. You currently use a
simple regex to solve the complex problem of breaking natural
language text down into sentences. java.text.BreakIterator yields
better results and is also locale-sensitive. You might also want to
take a look at the more advanced boundary analysis library at
http://userguide.icu-project.org/boundaryanalysis.
>>>>
>>>> Regards,
>>>> Piotr
--
Kontokostas Dimitris
--
---
Pablo N. Mendes
http://pablomendes.com
Events: http://wole2012.eurecom.fr
--
Kontokostas Dimitris
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion