I haven't done extensive tests, but one thing that could certainly be improved is the abstract shortening algorithm. You currently use a simple regex to solve the complex problem of breaking natural-language text into sentences. java.text.BreakIterator yields better results and is also locale-sensitive. You might also want to take a look at the more advanced boundary analysis library from the ICU project at http://userguide.icu-project.org/boundaryanalysis.
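
To make the BreakIterator suggestion concrete, here is a minimal sketch of sentence-aware shortening. It is only an illustration under stated assumptions: the class name AbstractShortener, the shorten method and the maxLength parameter are hypothetical and are not part of the extraction framework, and the policy of always keeping at least the first sentence is one possible choice rather than the framework's current behaviour.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper, not part of the DBpedia extraction framework.
final class AbstractShortener {

    private AbstractShortener() {}

    // Returns the longest run of whole leading sentences whose trimmed
    // length is at most maxLength. If even the first sentence is longer
    // than maxLength, that first sentence is returned whole (assumption:
    // a complete sentence is preferable to a cut-off one).
    static String shorten(String text, int maxLength, Locale locale) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(locale);
        sentences.setText(text);

        int end = 0;
        sentences.first();
        for (int boundary = sentences.next();
             boundary != BreakIterator.DONE;
             boundary = sentences.next()) {
            // Boundaries fall after trailing whitespace, so measure the trimmed prefix.
            boolean fits = text.substring(0, boundary).trim().length() <= maxLength;
            if (!fits && end > 0) {
                break; // keep the sentences accepted so far
            }
            end = boundary; // the first sentence is accepted even if too long
            if (!fits) {
                break;
            }
        }
        return text.substring(0, end).trim();
    }
}
```

A call such as AbstractShortener.shorten(abstractText, 500, Locale.forLanguageTag("pl")) would then pick sentence boundaries using the rules for the Polish locale; the limit of 500 is an arbitrary value chosen for the example.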

Regards,
Piotr

On 2012-10-01 07:42, Dimitris Kontokostas wrote:
Hi Piotr,

Thank you for the patch. Although it catches an error case, it seems safe to include in the framework. As for the PageNode abstracts, can you give us some feedback on their quality? It is something we have always wanted to test but never found the time for.

Best,
Dimitris

On Fri, Sep 28, 2012 at 5:57 PM, Piotr Jagielski <[email protected]> wrote:

    OK, I submitted a bug with a proposed fix and test cases at
    https://sourceforge.net/tracker/?func=detail&aid=3572779&group_id=190976&atid=935521.

    Thanks for the link to documentation. Now I know where the
    confusion came from. I should have mentioned that I tweaked the
    code locally a little bit in order to generate abstracts without a
    local MediaWiki instance :-) I used SimpleWikiParser to create the
    PageNode to pass to AbstractExtractor. The issue is in
    SimpleWikiParser.

    Piotr


    On 2012-09-13 11:51, Pablo N. Mendes wrote:

    This question keeps coming up, so I added hints to the
    documentation.

    4.3. Running Abstract Extraction
    http://wiki.dbpedia.org/Documentation#h25-8

    Cheers,
    Pablo

    On Thu, Sep 13, 2012 at 7:13 AM, Dimitris Kontokostas
    <[email protected]> wrote:

        Hi Piotr,

        We will happily accept your patch :)
        You can take a look at [1] & [2] for more details on abstract
        extraction.

        Best,
        Dimitris

        [1] http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/d580c99b5bbc/core/src/main/scala/org/dbpedia/extraction/mappings/AbstractExtractor.scala#l66
        [2] http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/dbpedia/file/efc0afb0faa3/abstractExtraction/README.txt



        On Wed, Sep 12, 2012 at 10:37 PM, Piotr Jagielski
        <[email protected]> wrote:

            Dimitris,
            I guess I'm confused about the project structure. I
            looked at AbstractExtractor.scala. It clearly uses
            PageNode to figure out what the abstract is and I figured
            out that PageNode is created by SimpleWikiParser. I now
            see that there is some PHP code for a lot of stuff
            including abstract extraction. I don't understand the
            relationship between the Scala extraction framework and the
            PHP code, and I'm wondering if you mean the latter when you
            refer to "modified mediawiki installation". When I used
            AbstractExtractor.scala to generate the abstract for
            http://pl.dbpedia.org/page/Agnieszka_Rylik I got a similar
            result because of a strangely formatted template that was
            not parsed correctly.

            Anyway, I can now access the bug tracker so I will submit
            a patch there.
            Regards,
            Piotr



            On 2012-09-11 08:39, Dimitris Kontokostas wrote:
            Hi Piotr,

            Any contribution is always welcome! However, the case
            you are referring to seems strange.
            Abstracts are not generated by the SimpleWikiParser,
            they are produced by a local wikipedia clone using a
            modified mediawiki installation.

            Best,
            Dimitris

            On Mon, Sep 10, 2012 at 7:30 PM, Piotr Jagielski
            <[email protected]> wrote:

                Any thoughts on this? I wrote some test cases and a fix
                that I can contribute in case you are interested.

                Piotr

                On 2012-09-06 01:13, Piotr Jagielski wrote:
                > Hello,
                >
                > There is an issue with SimpleWikiParser in the extraction framework
                > regarding template parsing. Strangely formatted templates like this one:
                > {{template | value |= }} are parsed not as template nodes but as text
                > nodes. Apart from preventing data extraction, this results in
                > incorrect abstracts on the Polish DBpedia. For example, on
                > http://pl.dbpedia.org/page/Agnieszka_Rylik the abstract contains infobox
                > parameter values.
                >
                > BTW, I noticed a couple of issues when trying to report this issue.
                > 1) I couldn't submit a bug on SourceForge at
                > https://sourceforge.net/tracker/?group_id=190976&atid=935520. I got a
                > permission denied error. Is there any reason to restrict bug reporting
                > to project members only?
                > 2) I wanted to create a test case for it but I couldn't find any tests
                > for the parser part in the repository. Are there any?
                >
                > Regards,
                > Piotr
                >
                >
                
                > ------------------------------------------------------------------------------
                > Live Security Virtual Conference
                > Exclusive live event will cover all the ways today's security and
                > threat landscape has changed and how IT managers can respond. Discussions
                > will include endpoint security, mobile security and the latest in malware
                > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
                > _______________________________________________
                > Dbpedia-discussion mailing list
                > [email protected]
                > https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
                >


                




-- Kontokostas Dimitris





        




    --
    Pablo N. Mendes
    http://pablomendes.com
    Events: http://wole2012.eurecom.fr





--
Kontokostas Dimitris

------------------------------------------------------------------------------
Don't let slow site performance ruin your business. Deploy New Relic APM
Deploy New Relic app performance management and know exactly
what is happening inside your Ruby, Python, PHP, Java, and .NET app
Try New Relic at no cost today and get our sweet Data Nerd shirt too!
http://p.sf.net/sfu/newrelic-dev2dev
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
