I haven't done extensive tests, but one thing that could clearly be
improved is the abstract shortening algorithm. You currently use a simple
regex to solve the complex problem of breaking natural-language text into
sentences. java.text.BreakIterator yields better results and is also
locale sensitive. You might also want to take a look at the more advanced
boundary analysis library at http://userguide.icu-project.org/boundaryanalysis.
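To make the suggestion concrete, here is a minimal sketch of sentence-based shortening with java.text.BreakIterator. It is not the framework's code: the class name SentenceSplitter, the shorten method and the maxLength parameter are assumptions made up for illustration, and the rule "keep whole sentences until the limit is reached" is only one possible shortening policy.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the extraction framework.
public class SentenceSplitter {

    /** Splits text into trimmed sentences using the locale's sentence rules. */
    public static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    /**
     * Keeps whole sentences while the result stays within maxLength characters.
     * The first sentence is always kept, even if it is longer than maxLength.
     */
    public static String shorten(String text, int maxLength, Locale locale) {
        StringBuilder sb = new StringBuilder();
        for (String sentence : sentences(text, locale)) {
            if (sb.length() > 0 && sb.length() + 1 + sentence.length() > maxLength) {
                break;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(sentence);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "First sentence. Second sentence! Third one?";
        System.out.println(shorten(text, 35, Locale.ENGLISH));
    }
}
```

Passing the Locale of the Wikipedia edition being processed (for example new Locale("pl") for Polish) is what makes this approach locale sensitive, which a single regex cannot offer.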
Regards,
Piotr
On 2012-10-01 07:42, Dimitris Kontokostas wrote:
Hi Piotr,
Thank you for the patch. Although it catches an error case, it seems
safe to include in the framework.
About the PageNode abstracts, can you give us some feedback on their
quality? It is something we always wanted to test but couldn't find the time.
Best,
Dimitris
On Fri, Sep 28, 2012 at 5:57 PM, Piotr Jagielski
<[email protected]> wrote:
OK, I submitted a bug with a proposed fix and test cases at
https://sourceforge.net/tracker/?func=detail&aid=3572779&group_id=190976&atid=935521.
Thanks for the link to the documentation. Now I know where the
confusion came from. I should have mentioned that I tweaked the
code locally a little bit in order to generate abstracts without a
local MediaWiki instance :-) I used SimpleWikiParser to create a
PageNode to pass to AbstractExtractor. The issue is in
SimpleWikiParser.
Piotr
On 2012-09-13 11:51, Pablo N. Mendes wrote:
This question keeps coming up, so I added hints to the
documentation.
4.3. Running Abstract Extraction
http://wiki.dbpedia.org/Documentation#h25-8
Cheers,
Pablo
On Thu, Sep 13, 2012 at 7:13 AM, Dimitris Kontokostas
<[email protected]> wrote:
Hi Piotr,
We will happily accept your patch :)
You can take a look at [1] & [2] for more details on abstract
extraction.
Best,
Dimitris
[1]
http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/d580c99b5bbc/core/src/main/scala/org/dbpedia/extraction/mappings/AbstractExtractor.scala#l66
[2]
http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/dbpedia/file/efc0afb0faa3/abstractExtraction/README.txt
On Wed, Sep 12, 2012 at 10:37 PM, Piotr Jagielski
<[email protected]> wrote:
Dimitris,
I guess I'm confused about the project structure. I
looked at AbstractExtractor.scala. It clearly uses
PageNode to figure out what the abstract is, and I figured
out that PageNode is created by SimpleWikiParser. I now
see that there is some PHP code for a lot of stuff,
including abstract extraction. I don't understand the
relationship between the Scala extraction framework and the
PHP code, and I'm wondering if you mean the latter when you
refer to "modified mediawiki installation". When I used
AbstractExtractor.scala to generate the abstract for
http://pl.dbpedia.org/page/Agnieszka_Rylik I got a similar
result because a strangely formatted template was not
parsed correctly.
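For illustration, the kind of leniency the original report asks for can be sketched as follows. This is not SimpleWikiParser and not the submitted patch: the class LenientTemplateParser and its methods are hypothetical, and the sketch handles only flat templates (no nested templates or links). It shows one way to accept an argument with an empty key, as in {{template | value |= }}, instead of giving up and treating the whole template as text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the real parser in the framework works differently.
public class LenientTemplateParser {

    /** Splits "{{name | arg | key=value}}" into raw parts; flat templates only. */
    private static String[] splitTemplate(String wikiText) {
        String t = wikiText.trim();
        if (t.length() < 4 || !t.startsWith("{{") || !t.endsWith("}}")) {
            throw new IllegalArgumentException("not a template: " + wikiText);
        }
        // Limit -1 keeps trailing empty parts such as the one in "{{name|}}".
        return t.substring(2, t.length() - 2).split("\\|", -1);
    }

    /** Returns the template name, i.e. the text before the first pipe. */
    public static String name(String wikiText) {
        return splitTemplate(wikiText)[0].trim();
    }

    /**
     * Returns the arguments in order. Positional arguments get the keys
     * "1", "2", ...; an argument like "= " is kept with an empty key
     * rather than causing the template to be rejected.
     */
    public static Map<String, String> args(String wikiText) {
        String[] parts = splitTemplate(wikiText);
        Map<String, String> args = new LinkedHashMap<>();
        int position = 1;
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq < 0) {
                args.put(String.valueOf(position++), parts[i].trim());
            } else {
                String key = parts[i].substring(0, eq).trim();
                String value = parts[i].substring(eq + 1).trim();
                args.put(key, value);
            }
        }
        return args;
    }

    public static void main(String[] args) {
        String wikiText = "{{template | value |= }}";
        System.out.println(name(wikiText) + " " + args(wikiText));
    }
}
```

With the input recognized as a template node, its parameter values would no longer leak into the abstract as plain text.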
Anyway, I can now access the bug tracker so I will submit
a patch there.
Regards,
Piotr
On 2012-09-11 08:39, Dimitris Kontokostas wrote:
Hi Piotr,
Any contribution is always welcome! However, the case
you are referring to seems strange.
Abstracts are not generated by the SimpleWikiParser;
they are produced by a local Wikipedia clone using a
modified mediawiki installation.
Best,
Dimitris
On Mon, Sep 10, 2012 at 7:30 PM, Piotr Jagielski
<[email protected]> wrote:
Any thoughts on this? I wrote some test cases and a fix that I can
contribute in case you are interested.
Piotr
On 2012-09-06 01:13, Piotr Jagielski wrote:
> Hello,
>
> There is an issue with SimpleWikiParser in the extraction framework
> regarding template parsing. Strangely formatted templates like this one:
> {{template | value |= }} are parsed not as template nodes but as text
> nodes. Apart from preventing data extraction, this results in
> incorrect abstracts on Polish Dbpedia. For example, on
> http://pl.dbpedia.org/page/Agnieszka_Rylik the abstract contains infobox
> parameter values.
>
> BTW, I noticed a couple of issues when trying to report this issue.
> 1) I couldn't submit a bug on SourceForge at
> https://sourceforge.net/tracker/?group_id=190976&atid=935520. I got
> a permission denied error. Is there any reason to restrict bug reporting
> to project members only?
> 2) I wanted to create a test case for it but I couldn't find any tests
> for the parser part in the repository. Are there any?
>
> Regards,
> Piotr
>
>
------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways
today's security and
> threat landscape has changed and how IT managers
can respond. Discussions
> will include endpoint security, mobile security
and the latest in malware
> threats.
http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Dbpedia-discussion mailing list
> [email protected]
<mailto:[email protected]>
>
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>
--
Kontokostas Dimitris
--
---
Pablo N. Mendes
http://pablomendes.com
Events: http://wole2012.eurecom.fr <http://wole2012.eurecom.fr/>
--
Kontokostas Dimitris
------------------------------------------------------------------------------
Don't let slow site performance ruin your business. Deploy New Relic APM
Deploy New Relic app performance management and know exactly
what is happening inside your Ruby, Python, PHP, Java, and .NET app
Try New Relic at no cost today and get our sweet Data Nerd shirt too!
http://p.sf.net/sfu/newrelic-dev2dev
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion