On Wed, Oct 3, 2012 at 12:42 AM, Piotr Jagielski <[email protected]> wrote:
> What do you mean by text quality? The text itself is as good as the
> first couple of sentences in the Wikipedia article you take it from, right?
>
Well, that is what I am asking :) Is it (exactly) the same text?
The problem is with some templates that render text (e.g. date templates).
If we can measure the extent of their usage, we can see whether this is the way to go.
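
One rough way to measure that extent is to count template names in the wikitext of the article introductions. The following is only a sketch: the class and method names are made up for illustration, the regex is a heuristic rather than a real wikitext parser, and folding the whole name to lower case is a simplification.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the extraction framework.
final class TemplateUsageSketch {
    // Matches "{{name" up to the first "|" or "}}". Names containing "#"
    // (e.g. parser functions such as {{#if: ...}}) are skipped.
    private static final Pattern TEMPLATE_START =
        Pattern.compile("\\{\\{\\s*([^|{}#]+?)\\s*(?=\\||\\}\\})");

    // Returns how often each template name occurs in the given wikitext.
    static Map<String, Integer> countTemplates(String wikitext) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = TEMPLATE_START.matcher(wikitext);
        while (m.find()) {
            // Simplification: fold the whole name to lower case.
            String name = m.group(1).toLowerCase(Locale.ROOT);
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample = "Born {{Birth date|1975|1|17}}, see {{lang|pl|boks}}, "
            + "debut {{birth date|1995|3|4}}.";
        System.out.println(countTemplates(sample));
    }
}
```

Running this over a sample of introductions and sorting by count would show whether text-rendering templates such as date templates are frequent enough to matter.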
Best,
Dimitris
>
> Piotr
>
>
> On 2012-10-02 22:49, Dimitris Kontokostas wrote:
>
> Our main interest is the text quality; if we get this right, the shortening
> / tweaking should be the easy part :)
>
> Could you please give us some feedback on the text quality? If it is good,
> maybe we can start testing it on other languages as well.
>
> Best,
> Dimitris
>
> On Tue, Oct 2, 2012 at 11:11 PM, Piotr Jagielski <[email protected]> wrote:
>
>> I haven't done extensive tests, but one thing to improve for sure is the
>> abstract shortening algorithm. You currently use a simple regex to solve the
>> complex problem of breaking natural language text into sentences.
>> java.text.BreakIterator yields better results and is also locale-sensitive.
>> You might also want to take a look at the more advanced boundary analysis
>> library at http://userguide.icu-project.org/boundaryanalysis.
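
A minimal sketch of sentence-based shortening with java.text.BreakIterator, as suggested above. The class name, the method name and the length limits are hypothetical and not taken from the extraction framework. It keeps as many whole sentences as fit within the limit and, as a design assumption, returns the first sentence whole if that sentence alone is already too long.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper, not part of the extraction framework.
final class AbstractShortenerSketch {
    // Returns the longest prefix of whole sentences whose trimmed length
    // does not exceed maxLength.
    static String shorten(String text, int maxLength, Locale locale) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(locale);
        sentences.setText(text);
        int end = 0;
        for (int b = sentences.next(); b != BreakIterator.DONE; b = sentences.next()) {
            if (text.substring(0, b).trim().length() > maxLength) {
                break;
            }
            end = b;
        }
        if (end == 0) {
            // Assumption: never cut inside a sentence, so keep the first one whole.
            sentences.first();
            int firstEnd = sentences.next();
            end = (firstEnd == BreakIterator.DONE) ? text.length() : firstEnd;
        }
        return text.substring(0, end).trim();
    }

    public static void main(String[] args) {
        String text = "First sentence. Second sentence is longer. Third.";
        System.out.println(shorten(text, 20, Locale.ENGLISH));
    }
}
```

Passing the locale of the Wikipedia edition (for example `new Locale("pl")`) selects the sentence rules for that language where the JDK provides them.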
>>
>> Regards,
>> Piotr
>>
>>
>> On 2012-10-01 07:42, Dimitris Kontokostas wrote:
>>
>> Hi Piotr,
>>
>> Thank you for the patch. Although it catches an error case, it seems safe
>> to include in the framework.
>> About the PageNode abstracts, can you give us some feedback on their
>> quality? It is something we always wanted to test but couldn't find the time.
>>
>> Best,
>> Dimitris
>>
>> On Fri, Sep 28, 2012 at 5:57 PM, Piotr Jagielski <[email protected]> wrote:
>>
>>> OK, I submitted a bug with a proposed fix and test cases at
>>> https://sourceforge.net/tracker/?func=detail&aid=3572779&group_id=190976&atid=935521
>>> .
>>>
>>> Thanks for the link to the documentation. Now I know where the confusion
>>> came from. I should have mentioned that I tweaked the code locally a little
>>> bit in order to generate abstracts without a local MediaWiki instance :-) I
>>> used SimpleWikiParser to create a PageNode to pass to AbstractExtractor. The
>>> issue is in SimpleWikiParser.
>>>
>>> Piotr
>>>
>>>
>>> On 2012-09-13 11:51, Pablo N. Mendes wrote:
>>>
>>>
>>> This question keeps coming up, so I added hints to the documentation.
>>>
>>> 4.3. Running Abstract Extraction
>>> http://wiki.dbpedia.org/Documentation#h25-8
>>>
>>> Cheers,
>>> Pablo
>>>
>>> On Thu, Sep 13, 2012 at 7:13 AM, Dimitris Kontokostas <[email protected]> wrote:
>>>
>>>> Hi Piotr,
>>>>
>>>> We will happily accept your patch :)
>>>> You can take a look at [1] & [2] for more details on abstract
>>>> extraction.
>>>>
>>>> Best,
>>>> Dimitris
>>>>
>>>> [1]
>>>> http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/d580c99b5bbc/core/src/main/scala/org/dbpedia/extraction/mappings/AbstractExtractor.scala#l66
>>>> [2]
>>>> http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/dbpedia/file/efc0afb0faa3/abstractExtraction/README.txt
>>>>
>>>>
>>>> On Wed, Sep 12, 2012 at 10:37 PM, Piotr Jagielski <
>>>> [email protected]> wrote:
>>>>
>>>>> Dimitris,
>>>>>
>>>>> I guess I'm confused about the project structure. I looked at
>>>>> AbstractExtractor.scala. It clearly uses PageNode to figure out what the
>>>>> abstract is, and I figured out that PageNode is created by
>>>>> SimpleWikiParser. I now see that there is also PHP code for a lot of
>>>>> things, including abstract extraction. I don't understand the
>>>>> relationship between the Scala extraction framework and the PHP code, and
>>>>> I'm wondering if you mean the latter when you refer to a "modified
>>>>> MediaWiki installation". When I used AbstractExtractor.scala to generate
>>>>> the abstract for http://pl.dbpedia.org/page/Agnieszka_Rylik I got a
>>>>> similar result, because a strangely formatted template was not parsed
>>>>> correctly.
>>>>>
>>>>> Anyway, I can now access the bug tracker so I will submit a patch
>>>>> there.
>>>>>
>>>>> Regards,
>>>>> Piotr
>>>>>
>>>>>
>>>>>
>>>>> On 2012-09-11 08:39, Dimitris Kontokostas wrote:
>>>>>
>>>>> Hi Piotr,
>>>>>
>>>>> Any contribution is always welcome! However, the case you are
>>>>> referring to seems strange.
>>>>> Abstracts are not generated by the SimpleWikiParser; they are produced
>>>>> by a local Wikipedia clone using a modified MediaWiki installation.
>>>>>
>>>>> Best,
>>>>> Dimitris
>>>>>
>>>>> On Mon, Sep 10, 2012 at 7:30 PM, Piotr Jagielski <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Any thoughts on this? I wrote some test cases and a fix that I can
>>>>>> contribute in case you are interested.
>>>>>>
>>>>>> Piotr
>>>>>>
>>>>>> On 2012-09-06 01:13, Piotr Jagielski wrote:
>>>>>> > Hello,
>>>>>> >
>>>>>> > There is an issue with SimpleWikiParser in the extraction framework
>>>>>> > regarding template parsing. Strangely formatted templates like this
>>>>>> > one: {{template | value |= }} are parsed as text nodes instead of
>>>>>> > template nodes. Apart from preventing data extraction, this results in
>>>>>> > incorrect abstracts on the Polish DBpedia. For example, on
>>>>>> > http://pl.dbpedia.org/page/Agnieszka_Rylik the abstract contains
>>>>>> > infobox parameter values.
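
To make the failing input concrete, here is a small sketch of one tolerant way to read such a template. It is not the SimpleWikiParser code and not the submitted fix: the class name, the "#name" key and the choice to keep the empty parameter name as an empty key are all assumptions for illustration, and nested templates or links containing "|" are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is not the SimpleWikiParser code.
final class TemplateParseSketch {
    // Parses "{{name | positional | key = value}}" into an ordered map.
    // Positional parameters get the keys "1", "2", ...; the template name is
    // stored under "#name". An empty parameter name, as in "|= ", is kept as
    // an empty key instead of making the whole template fall back to text.
    static Map<String, String> parse(String source) {
        String s = source.trim();
        if (s.length() < 4 || !s.startsWith("{{") || !s.endsWith("}}")) {
            throw new IllegalArgumentException("not a template: " + source);
        }
        String[] parts = s.substring(2, s.length() - 2).split("\\|", -1);
        Map<String, String> result = new LinkedHashMap<>();
        result.put("#name", parts[0].trim());
        int positional = 0;
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq < 0) {
                result.put(String.valueOf(++positional), parts[i].trim());
            } else {
                result.put(parts[i].substring(0, eq).trim(),
                           parts[i].substring(eq + 1).trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("{{template | value |= }}"));
    }
}
```

The same input string would be a natural starting point for a parser test case that checks that a template node, not a text node, is produced.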
>>>>>> >
>>>>>> > BTW, I noticed a couple of issues when trying to report this.
>>>>>> > 1) I couldn't submit a bug on SourceForge at
>>>>>> > https://sourceforge.net/tracker/?group_id=190976&atid=935520. I got
>>>>>> > a permission denied error. Is there any reason to restrict bug
>>>>>> > reporting to project members only?
>>>>>> > 2) I wanted to create a test case for it but I couldn't find any
>>>>>> > tests for the parser part in the repository. Are there any?
>>>>>> >
>>>>>> > Regards,
>>>>>> > Piotr
>>>>>> >
>>>>>> >
>>>>>> ------------------------------------------------------------------------------
>>>>>> > Live Security Virtual Conference
>>>>>> > Exclusive live event will cover all the ways today's security and
>>>>>> > threat landscape has changed and how IT managers can respond.
>>>>>> Discussions
>>>>>> > will include endpoint security, mobile security and the latest in
>>>>>> malware
>>>>>> > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
>>>>>> > _______________________________________________
>>>>>> > Dbpedia-discussion mailing list
>>>>>> > [email protected]
>>>>>> > https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>>>>> >
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Kontokostas Dimitris
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Kontokostas Dimitris
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> ---
>>> Pablo N. Mendes
>>> http://pablomendes.com
>>> Events: http://wole2012.eurecom.fr
>>>
>>>
>>>
>>
>>
>> --
>> Kontokostas Dimitris
>>
>>
>>
>
>
> --
> Kontokostas Dimitris
>
>
>
--
Kontokostas Dimitris
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion