I don't have a concrete test case, so I have to search blindly.
What I was thinking is that if we could create the abstracts in exactly
the same way as the modified MediaWiki, we could do a string comparison and
see how many are different and how. Depending on the number and frequency of
the text-rendering templates that appear in the abstract results, we could
try to resolve them manually.
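The comparison described above could be sketched roughly like this (the class, method, and map names are mine for illustration, not part of the extraction framework; the inputs are assumed to be page-title-to-abstract maps produced by the two pipelines):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class AbstractDiff {
    // Compare abstracts produced by two pipelines (hypothetical input:
    // page title -> abstract text) and collect the titles whose text differs.
    static List<String> differingTitles(Map<String, String> modifiedMw,
                                        Map<String, String> plainExtraction) {
        List<String> differing = new ArrayList<>();
        for (Map.Entry<String, String> e : modifiedMw.entrySet()) {
            String other = plainExtraction.get(e.getKey());
            // A missing or non-identical abstract counts as a difference.
            if (other == null || !other.equals(e.getValue())) {
                differing.add(e.getKey());
            }
        }
        return differing;
    }

    public static void main(String[] args) {
        Map<String, String> a = Map.of(
            "Berlin", "Berlin is the capital of Germany.",
            "Paris", "Paris is the capital of France.");
        Map<String, String> b = Map.of(
            "Berlin", "Berlin is the capital of Germany.",
            "Paris", "Paris is the capital city of France.");
        // Only "Paris" differs between the two maps.
        System.out.println(differingTitles(a, b));
    }
}
```

Counting how often specific templates occur in the differing abstracts would then tell us which ones are worth resolving manually.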
Removing the local Wikipedia mirror dependency for the extraction would be
a huge plus, but we shouldn't compromise on quality.
Any other ideas?
Best,
Dimitris
On Wed, Oct 3, 2012 at 9:41 AM, Pablo N. Mendes <[email protected]> wrote:
>
> Perhaps it would help the discussion if we got more concrete. Dimitris, do
> you have a favorite abstract that is problematic (therefore justifies using
> the modified MediaWiki)? Perhaps you can paste the wiki markup source and
> the desired outcome and Piotr can respond with the rendering by his patch.
>
> On Oct 3, 2012 8:31 AM, "Dimitris Kontokostas" <[email protected]> wrote:
> >
> >
> > On Wed, Oct 3, 2012 at 12:42 AM, Piotr Jagielski <[email protected]>
> wrote:
> >>
> >> What do you mean by text quality? The text itself is as good as the
> first couple of sentences in the Wikipedia article you take it from, right?
> >
> >
> > Well, that is what I am asking :) Is it (exactly) the same text?
> > The problem is with some templates that render text (e.g. date
> templates). If we can measure the extent of their usage, we can see if this
> is the way to go.
> >
> > Best,
> > Dimitris
> >
> >>
> >>
> >> Piotr
> >>
> >>
> >> On 2012-10-02 22:49, Dimitris Kontokostas wrote:
> >>>
> >>> Our main interest is the text quality; if we get this right, the
> shortening / tweaking should be the easy part :)
> >>>
> >>> Could you please give us some feedback on the text quality? If it is
> good, maybe we can start testing it on other languages as well.
> >>>
> >>> Best,
> >>> Dimitris
> >>>
> >>> On Tue, Oct 2, 2012 at 11:11 PM, Piotr Jagielski <
> [email protected]> wrote:
> >>>>
> >>>> I haven't done extensive tests, but one thing to improve for sure is
> the abstract-shortening algorithm. You currently use a simple regex to
> solve the complex problem of breaking natural-language text into
> sentences. java.text.BreakIterator yields better results and is also
> locale-sensitive. You might also want to take a look at the more advanced
> boundary analysis library at http://userguide.icu-project.org/boundaryanalysis.
> >>>>
> >>>> Regards,
> >>>> Piotr
>
--
Kontokostas Dimitris
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion