Re: [Dbpedia-discussion] Problem with extracted data

Jona Christopher Sahnwaldt Wed, 24 Apr 2013 11:52:36 -0700

Cool, thanks! Your code looks good. I peppered your pull request with
a few comments. None of them are major problems, but if you have time,
please have a look at them. If you don't have time, please copy the
comments into TODO comments in the code and we can fix them later.


On 24 April 2013 16:34, Julien Plu <[email protected]> wrote:
> You were right, Jona: the problem came from a wrong path in the package.
>
> I just sent a new pull request with my extractor :-)
>
> Best.
>
> Julien.
>
>
> 2013/4/23 Jona Christopher Sahnwaldt <[email protected]>
>>
>> Hi Julien,
>>
>> On 23 April 2013 23:16, Julien Plu <[email protected]>
>> wrote:
>> > OK, I'm finished: I now have an extractor that works as we expected :-)
>> > I don't think what I wrote is well made, but it works.
>>
>> Cool! You can always improve it later. :-)
>>
>> >
>> > Only one problem remains: if I move my "PopulationExtractor.scala" file
>> > from the "mappings" folder into an "fr" folder inside "mappings", the
>> > extraction fails because the class named in the configuration file,
>> > "PopulationExtractor", cannot be found, no matter whether I write
>> > "fr.PopulationExtractor" or
>> > "org.dbpedia.extraction.mappings.fr.PopulationExtractor". Any idea
>> > what's going on?
>>
>> Does the package declaration in the class file include the ".fr"?
>> Scala is less strict than Java here: the compiler does not require the
>> directory layout to match the package declaration.
>>
>> If you send a pull request, we can have a look at your code and merge
>> it into the main repository, so others can run this extraction as
>> well.
>>
>> https://github.com/dbpedia/extraction-framework/wiki/Contributing
>>
>> >
>> > One last thing: I added a dataset to the file "DBpediaDatasets.scala",
>> > so that I have my own archive containing only the population information.
>>
>> Right, that's one more thing you need to add.
>>
>> Thanks!
>>
>> JC
>>
>>
>> >
>> > Best.
>> >
>> > Julien.
>> >
>> >
>> > 2013/4/23 Julien Plu <[email protected]>
>> >>
>> >> Yes, I know IDEs are really useful, but my work machine runs Windows
>> >> and I'm really not familiar with it. So I use a Linux distribution in a
>> >> virtual machine, but the VM is too slow for coding in a graphical IDE,
>> >> so I have to connect to it over SSH and use only the shell.
>> >>
>> >> I think I will force myself to use Windows; that will be easier than
>> >> continuing to work like this :-D
>> >>
>> >> By the way, I found the problem in my code. It came from my regex:
>> >> instead of """|pop=(\d+)""".r I now use """pop=(\d+)""".r, and I get
>> >> the value I want :-)
>> >>
>> >> Best.
>> >>
>> >> Julien.
>> >>
>> >>
>> >> 2013/4/23 Dimitris Kontokostas <[email protected]>
>> >>>
>> >>> You should use an IDE for this, it will make your life a lot easier ;)
>> >>> I use the IntelliJ IDEA default debugger and it works pretty well. I
>> >>> could send you instructions to set it up.
>> >>>
>> >>> Best,
>> >>> Dimitris
>> >>>
>> >>>
>> >>> On Tue, Apr 23, 2013 at 3:59 PM, Julien Plu
>> >>> <[email protected]> wrote:
>> >>>>
>> >>>> No, I don't have a debugger, because I'm coding on a remote machine via
>> >>>> SSH.
>> >>>>
>> >>>> And even with this code:
>> >>>>
>> >>>>
>> >>>> override def extract(page: PageNode, subjectUri: String,
>> >>>>                      pageContext: PageContext): Seq[Quad] = {
>> >>>>     if (page.title.namespace != Namespace.Template || page.isRedirect
>> >>>>         || !page.title.decoded.contains("évolution population"))
>> >>>>         return Seq.empty
>> >>>>
>> >>>>     for (property <- findPropertyNodes(page)) {
>> >>>>         println(property.toWikiText)
>> >>>>     }
>> >>>>     Seq.empty // debugging only: no quads are generated yet
>> >>>> }
>> >>>>
>> >>>> private def findPropertyNodes(node : Node) : List[PropertyNode] = {
>> >>>>     node match {
>> >>>>         case propertyNode : PropertyNode => List(propertyNode)
>> >>>>         case _ => node.children.flatMap(findPropertyNodes)
>> >>>>     }
>> >>>> }
>> >>>>
>> >>>> Absolutely nothing is displayed, because the list returned by
>> >>>> "findPropertyNodes" is empty, and I don't understand why. I know it's
>> >>>> empty because if I do this:
>> >>>>
>> >>>> if (findPropertyNodes(page).isEmpty) {
>> >>>>     println("empty")
>> >>>> }
>> >>>> else {
>> >>>>     println("no empty")
>> >>>> }
>> >>>>
>> >>>> "empty" is displayed, whereas if I print "page.children" I get all
>> >>>> the template code. So the "findPropertyNodes" function doesn't find
>> >>>> any property inside this template code :-(
>> >>>>
>> >>>> Best.
>> >>>>
>> >>>> Julien.
>> >>>>
>> >>>>
>> >>>>
>> >>>> 2013/4/23 Jona Christopher Sahnwaldt <[email protected]>
>> >>>>>
>> >>>>> On 23 April 2013 12:01, Julien Plu
>> >>>>> <[email protected]> wrote:
>> >>>>> > Sorry, but I really don't understand how the AST works (or Scala,
>> >>>>> > for that matter). I'm trying to retrieve all the PropertyNode
>> >>>>> > objects contained in a PageNode, so I do this:
>> >>>>> >
>> >>>>> >
>> >>>>> > override def extract(page: PageNode, subjectUri: String,
>> >>>>> >                      pageContext: PageContext): Seq[Quad] = {
>> >>>>> >     if (page.title.namespace != Namespace.Template || page.isRedirect
>> >>>>> >         || !page.title.decoded.contains("évolution population"))
>> >>>>> >         return Seq.empty
>> >>>>> >
>> >>>>>
>> >>>>> I think it would be good if you could get a picture of the structure
>> >>>>> of the tree. It's usually not complicated, but a bit hard to explain
>> >>>>> in text. Can you use a debugger? If so, set a breakpoint at the
>> >>>>> following line and let the debugger show the page variable. Then
>> >>>>> click into it, look at its children, and so on.
>> >>>>>
>> >>>>> We should add a toString() method to Node.scala (and some
>> >>>>> sub-classes) that shows the structure.
>> >>>>>
>> >>>>> >     for (node <- page.children) {
>> >>>>> >         for (property <- allPropertiesNode(node)) {
>> >>>>> >             println(property.toWikiText)
>> >>>>> >         }
>> >>>>> >     }
>> >>>>> > }
>> >>>>> >
>> >>>>> > private def allPropertiesNode(node : Node) : List[PropertyNode] = {
>> >>>>> >     node match {
>> >>>>> >         case propertyNode : PropertyNode => List(propertyNode)
>> >>>>> >         case _ = node.children
>> >>>>> >    }
>> >>>>>
>> >>>>> This is almost right. If I understand correctly, you want to walk
>> >>>>> through the whole tree and collect all property nodes. Change this
>> >>>>> line:
>> >>>>>
>> >>>>>     case _ = node.children
>> >>>>>
>> >>>>> (does that even compile? I don't understand how... :-) ) to
>> >>>>>
>> >>>>>     case _ => node.children.flatMap(allPropertiesNode)
>> >>>>>
>> >>>>> (I think that should work, I'm not 100% sure.)
>> >>>>>
>> >>>>> Oh by the way, the method name should be allPropertyNodes. :-) Or
>> >>>>> maybe findPropertyNodes is even better.
>> >>>>>
>> >>>>> Once the method works, you can drop the main loop in extract().
>> >>>>> Instead of
>> >>>>>
>> >>>>> for (node <- page.children) {
>> >>>>>     for (property <- allPropertiesNode(node)) {
>> >>>>>         println(property.toWikiText)
>> >>>>>     }
>> >>>>> }
>> >>>>>
>> >>>>> you can just write
>> >>>>>
>> >>>>> for (property <- findPropertyNodes(page)) {
>> >>>>>     println(property.toWikiText)
>> >>>>> }
>> >>>>>
>> >>>>> But that's just cosmetic surgery, it has the same effect.
>> >>>>>
>> >>>>> Cheers,
>> >>>>> JC
>> >>>>>
>> >>>>> > }
>> >>>>> >
>> >>>>> >
>> >>>>> > And nothing is displayed on my screen :-(
>> >>>>> >
>> >>>>> > Any idea what I'm doing wrong?
>> >>>>> >
>> >>>>> > Best.
>> >>>>> >
>> >>>>> > Julien.
>> >>>>> >
>> >>>>> >
>> >>>>> > 2013/4/23 Julien Plu <[email protected]>
>> >>>>> >>
>> >>>>> >> Hi,
>> >>>>> >>
>> >>>>> >> "param" came from a bad copy-paste; the right variable is "pop".
>> >>>>> >>
>> >>>>> >> By the way, thank you for the hint about the AST. I will take a
>> >>>>> >> look at these classes and see how I can use them. I won't
>> >>>>> >> hesitate to ask if I get stuck :-)
>> >>>>> >>
>> >>>>> >> Best.
>> >>>>> >>
>> >>>>> >> Julien.
>> >>>>> >>
>> >>>>> >>
>> >>>>> >> 2013/4/22 Jona Christopher Sahnwaldt <[email protected]>
>> >>>>> >>>
>> >>>>> >>> Hi Julien,
>> >>>>> >>>
>> >>>>> >>> On 22 April 2013 21:43, Julien Plu
>> >>>>> >>> <[email protected]>
>> >>>>> >>> wrote:
>> >>>>> >>> > I started writing the code for the extractor and I have a
>> >>>>> >>> > problem with a regex in Scala. The text I'm matching comes from
>> >>>>> >>> > this page:
>> >>>>> >>> >
>> >>>>> >>> >
>> >>>>> >>> >
>> >>>>> >>> > http://fr.wikipedia.org/w/index.php?title=Mod%C3%A8le:Donn%C3%A9es/Antony/%C3%A9volution_population&action=edit
>> >>>>> >>> >
>> >>>>> >>> > My regex is: val populationRegex = """|pop=(\d+)""".r
>> >>>>> >>> >
>> >>>>> >>> > And I use this piece of code:
>> >>>>> >>> >
>> >>>>> >>> > populationRegex findAllIn page.children.toString foreach (_ match {
>> >>>>> >>> >     case populationRegex (pop) => println(page.title.decoded +
>> >>>>> >>> >         " : pop : " + param)
>> >>>>> >>>
>> >>>>> >>> What is param?
>> >>>>> >>>
>> >>>>> >>> But more generally - did you try using the AST (abstract syntax
>> >>>>> >>> tree) built by the parser, i.e. the tree whose root node is the
>> >>>>> >>> PageNode? I'm not sure how good our parser is at dealing with
>> >>>>> >>> stuff like "<includeonly>" and "{{#switch ...}}", but I think it
>> >>>>> >>> works and page.children should contain a ParserFunctionNode [1]
>> >>>>> >>> object for the #switch, which in turn has a child for each
>> >>>>> >>> branch, e.g. one child for an=2010 and one for pop=61793. These
>> >>>>> >>> children are PropertyNode [2] objects, which have a key and (who
>> >>>>> >>> would have thought) more children. Well, in this case, just one
>> >>>>> >>> child, which is a TextNode. In a nutshell: Find the "#switch"
>> >>>>> >>> node, find children with keys "an" and "pop", and generate
>> >>>>> >>> triples for their values.
>> >>>>> >>>
>> >>>>> >>> >     case _ =>
>> >>>>> >>> > })
>> >>>>> >>> >
>> >>>>> >>> > Instead of getting "Données/Antony/évolution population : pop :
>> >>>>> >>> > 61793" just once,
>> >>>>> >>> >
>> >>>>> >>> > I get "Données/Antony/évolution population : pop : null" many
>> >>>>> >>> > times, as many as there are lines in the string.
>> >>>>> >>> >
>> >>>>> >>> > Any idea what I'm doing wrong?
>> >>>>> >>> >
>> >>>>> >>> > I'm a total beginner in Scala :-( sorry.
>> >>>>> >>>
>> >>>>> >>> Your code excerpt looks pretty good to me. :-)
>> >>>>> >>>
>> >>>>> >>> The AST is usually much safer and cleaner than regexes. Regexes
>> >>>>> >>> are more suitable for unstructured strings, but here you're
>> >>>>> >>> dealing with pretty clean structures. So I would suggest you
>> >>>>> >>> write some code that walks through the PageNode tree. If you have
>> >>>>> >>> any questions, don't hesitate to ask. We're looking forward to
>> >>>>> >>> your contributions. Thanks!
>> >>>>> >>>
>> >>>>> >>> Cheers,
>> >>>>> >>> JC
>> >>>>> >>>
>> >>>>> >>> [1]
>> >>>>> >>> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/ParserFunctionNode.scala
>> >>>>> >>> [2]
>> >>>>> >>> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/PropertyNode.scala
>> >>>>> >>>
>> >>>>> >>> >
>> >>>>> >>> > Best.
>> >>>>> >>> >
>> >>>>> >>> > Julien.
>> >>>>> >>> >
>> >>>>> >>> >
>> >>>>> >>> > 2013/4/22 Jona Christopher Sahnwaldt <[email protected]>
>> >>>>> >>> >>
>> >>>>> >>> >> The templates where data is stored are not used directly in
>> >>>>> >>> >> the main pages. It's a complicated process: page Toulouse
>> >>>>> >>> >> uses template X, X uses Y, Y uses Z, and Z contains the data.
>> >>>>> >>> >> Something like that; I'm not 100% sure, but the details don't
>> >>>>> >>> >> matter. This means that wikiPageUsesTemplate and
>> >>>>> >>> >> InfoboxExtractor won't help.
>> >>>>> >>> >>
>> >>>>> >>> >> Generating a separate file is probably the best idea. We could
>> >>>>> >>> >> also send these new triples to the main mapping based file,
>> >>>>> >>> >> but that might be confusing: first, they're not mapping based;
>> >>>>> >>> >> second, new triples about a city would be added in a
>> >>>>> >>> >> completely different place in the file. (That's not a big
>> >>>>> >>> >> problem though.)
>> >>>>> >>> >>
>> >>>>> >>> >> Cheers,
>> >>>>> >>> >> JC
>> >>>>> >>> >
>> >>>>> >>> >
>> >>>>> >>
>> >>>>> >>
>> >>>>> >
>> >>>>
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Kontokostas Dimitris
>> >>
>> >>
>> >
>
>
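
Note on the regex discussed above: in Java/Scala regular expressions an
unescaped "|" is the alternation operator, so """|pop=(\d+)""" reads as
"the empty string, or pop= followed by digits". The empty alternative
matches at every position and leaves the capture group unset, which would
explain the repeated "null" values. The sketch below compares the three
variants. The object name, the helper "populations" and the sample wikitext
are illustrative assumptions; only an=2010 and pop=61793 come from the thread.

```scala
import scala.util.matching.Regex

object PopulationRegexDemo {
  // Hypothetical template source; only the an/pop values are from the thread.
  val sample: String = "{{#switch: {{{1}}}\n|an=2010\n|pop=61793\n}}"

  // Unescaped '|': alternation between the empty string and "pop=(\d+)".
  val unescaped: Regex = """|pop=(\d+)""".r
  // The fix used in the thread: drop the pipe.
  val plain: Regex = """pop=(\d+)""".r
  // Alternative: keep the pipe but escape it so it is matched literally.
  val escaped: Regex = """\|pop=(\d+)""".r

  // Collects capture group 1 of every match; group(1) is null when the
  // group did not take part in the match.
  def populations(regex: Regex, text: String): List[String] =
    regex.findAllMatchIn(text).map(m => m.group(1)).toList

  def main(args: Array[String]): Unit = {
    println(populations(unescaped, sample)) // only null entries
    println(populations(plain, sample))
    println(populations(escaped, sample))
  }
}
```

The escaped variant is slightly stricter than the plain one, because it
only accepts "pop=" when it directly follows a template parameter separator.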

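Note on the tree walk discussed above: the recursive findPropertyNodes
with flatMap can be tried outside the framework on a toy tree. The node
classes below are simplified stand-ins written for this sketch, not the real
org.dbpedia.extraction.wikiparser classes, and the sample page is built by
hand following the structure JC describes (a #switch node with one
PropertyNode per branch, each holding a single TextNode). The helper
"values" is also an assumption added for illustration.

```scala
object PropertyNodeWalkDemo {
  // Simplified stand-ins for the framework's node types (assumption).
  sealed trait Node { def children: List[Node] }
  final case class TextNode(text: String) extends Node {
    val children: List[Node] = Nil
  }
  final case class PropertyNode(key: String, children: List[Node]) extends Node
  final case class ParserFunctionNode(name: String, children: List[Node]) extends Node
  final case class PageNode(title: String, children: List[Node]) extends Node

  // Depth-first walk: a PropertyNode is returned as-is, any other node
  // contributes whatever its children contribute.
  def findPropertyNodes(node: Node): List[PropertyNode] = node match {
    case propertyNode: PropertyNode => List(propertyNode)
    case _ => node.children.flatMap(findPropertyNodes)
  }

  // Hand-built tree mirroring the structure described in the thread.
  val page: PageNode = PageNode("Données/Antony/évolution population", List(
    ParserFunctionNode("#switch", List(
      PropertyNode("an", List(TextNode("2010"))),
      PropertyNode("pop", List(TextNode("61793")))
    ))
  ))

  // Maps each property key to the concatenated text of its TextNode children.
  def values(root: Node): Map[String, String] =
    findPropertyNodes(root).map { p =>
      p.key -> p.children.collect { case TextNode(t) => t }.mkString.trim
    }.toMap

  def main(args: Array[String]): Unit =
    values(page).foreach { case (key, value) => println(key + " = " + value) }
}
```

If the real parser returned an empty list for a page, printing the class
name of each child (for example with node.getClass.getSimpleName) would be
one way to see what the tree actually contains without a debugger.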
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
