Re: [Dbpedia-discussion] Problem with extracted data

Jona Christopher Sahnwaldt Tue, 23 Apr 2013 14:28:06 -0700

Hi Julien,

On 23 April 2013 23:16, Julien Plu <[email protected]> wrote:
> OK, I'm finished: I now have an extractor that works as we expected :-) I
> don't think what I did is well written, but it works.


Cool! You can always improve it later. :-)

>
> Anyway, only one problem remains: if I move my "PopulationExtractor.scala"
> file from the "mappings" folder into an "fr" folder inside "mappings", the
> extraction configuration file fails because it can't find the
> "PopulationExtractor" class, no matter whether I write
> "fr.PopulationExtractor" or
> "org.dbpedia.extraction.mappings.fr.PopulationExtractor". Any idea what's
> going on?

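Does the package declaration in the class file include the ".fr"?
Scala is less strict than Java here: the compiler does not complain
when a source file's folder and its package declaration disagree, so
moving the file alone does not change the package of the class.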
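
As a minimal sketch of why the declaration matters: the name a class can be
loaded under comes from its package clause, not from the folder that holds the
.scala file. This assumes the configuration resolves extractor classes by fully
qualified name through reflection, which this thread does not confirm. The
empty class body and the PackageDemo object are hypothetical stand-ins, not the
real extractor.

```scala
// Hypothetical stand-in for the real extractor: only the package clause matters.
// This source file may live in any folder; the compiler does not check that.
package org.dbpedia.extraction.mappings.fr {
  class PopulationExtractor
}

object PackageDemo {
  // Assumption: the class is looked up by its fully qualified name.
  val className = "org.dbpedia.extraction.mappings.fr.PopulationExtractor"

  def main(args: Array[String]): Unit = {
    // Succeeds only because the package clause above includes ".fr".
    println(Class.forName(className).getName)
  }
}
```

If the package clause still said org.dbpedia.extraction.mappings, the same
lookup would throw a ClassNotFoundException, whatever folder the file is in.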

If you send a pull request, we can have a look at your code and merge
it into the main repository, so others can run this extraction as
well.

https://github.com/dbpedia/extraction-framework/wiki/Contributing

>
> One last thing: I added a dataset to the file "DBpediaDatasets.scala", so
> that I have my own archive containing only the population information.

Right, that's one more thing you need to add.

Thanks!

JC


>
> Best.
>
> Julien.
>
>
> 2013/4/23 Julien Plu <[email protected]>
>>
>> Yes, I know IDEs are really useful, but my work machine runs Windows and
>> I'm really not familiar with it. So I use a Linux distribution in a virtual
>> machine, but that VM is too slow for coding in a graphical IDE, so I have
>> to connect to it over SSH and use only the shell.
>>
>> I think I will force myself to use Windows; that will be easier than
>> continuing to work like this :-D
>>
>> By the way, I found the problem in my code. It came from my regex: instead
>> of """|pop=(\d+)""".r I now use """pop=(\d+)""".r, and I get the value I
>> want :-)
>>
>> Best.
>>
>> Julien.
>>
>>
>> 2013/4/23 Dimitris Kontokostas <[email protected]>
>>>
>>> You should use an IDE for this, it will make your life a lot easier ;)
>>> I use the default IntelliJ IDEA debugger and it works pretty well. I could
>>> send you instructions to set it up.
>>>
>>> Best,
>>> Dimitris
>>>
>>>
>>> On Tue, Apr 23, 2013 at 3:59 PM, Julien Plu
>>> <[email protected]> wrote:
>>>>
>>>> No, I don't have a debugger because I'm coding on a remote machine via
>>>> SSH.
>>>>
>>>> And even with this code:
>>>>
>>>>
>>>> override def extract(page: PageNode, subjectUri: String, pageContext: PageContext): Seq[Quad] = {
>>>>     if (page.title.namespace != Namespace.Template || page.isRedirect
>>>>         || !page.title.decoded.contains("évolution population")) return Seq.empty
>>>>
>>>>     // Debugging only: print every property node found in the page.
>>>>     for (property <- findPropertyNodes(page)) {
>>>>         println(property.toWikiText)
>>>>     }
>>>>     Seq.empty // no quads are generated yet
>>>> }
>>>>
>>>> private def findPropertyNodes(node : Node) : List[PropertyNode] = {
>>>>     node match {
>>>>         case propertyNode : PropertyNode => List(propertyNode)
>>>>         case _ => node.children.flatMap(findPropertyNodes)
>>>>     }
>>>> }
>>>>
>>>> Absolutely nothing is displayed, because the list returned by
>>>> "findPropertyNodes" is empty and I don't understand why. I know it's
>>>> empty because if I do this:
>>>>
>>>> if (findPropertyNodes(page).isEmpty) {
>>>>     println("empty")
>>>> }
>>>> else {
>>>>     println("no empty")
>>>> }
>>>>
>>>> "empty" is displayed, whereas if I display "page.children" I get all
>>>> the template code. The "findPropertyNodes" function doesn't find any
>>>> property inside this template code :-(
>>>>
>>>> Best.
>>>>
>>>> Julien.
>>>>
>>>>
>>>>
>>>> 2013/4/23 Jona Christopher Sahnwaldt <[email protected]>
>>>>>
>>>>> On 23 April 2013 12:01, Julien Plu
>>>>> <[email protected]> wrote:
>>>>> > Sorry, but I really don't understand how the AST works (or Scala). I'm
>>>>> > trying to retrieve all the PropertyNode objects contained in a
>>>>> > PageNode, so I do:
>>>>> >
>>>>> >
>>>>> > override def extract(page: PageNode, subjectUri: String, pageContext: PageContext): Seq[Quad] = {
>>>>> >     if (page.title.namespace != Namespace.Template || page.isRedirect
>>>>> >         || !page.title.decoded.contains("évolution population")) return Seq.empty
>>>>> >
>>>>>
>>>>> I think it would be good if you could get a picture of the structure
>>>>> of the tree. It's usually not complicated, but a bit hard to explain
>>>>> in text. Can you use a debugger? If so, set a breakpoint at the
>>>>> following line and let the debugger show the page variable. Then click
>>>>> into it, look at its children, and so on.
>>>>>
>>>>> We should add a toString() method to Node.scala (and some sub-classes)
>>>>> that shows the structure.
>>>>>
>>>>> >     for (node <- page.children) {
>>>>> >         for (property <- allPropertiesNode(node)) {
>>>>> >             println(property.toWikiText)
>>>>> >         }
>>>>> >     }
>>>>> > }
>>>>> >
>>>>> > private def allPropertiesNode(node : Node) : List[PropertyNode] = {
>>>>> >     node match {
>>>>> >         case propertyNode : PropertyNode => List(propertyNode)
>>>>> >         case _ = node.children
>>>>> >    }
>>>>>
>>>>> This is almost right. If I understand correctly, you want to walk
>>>>> through the whole tree and collect all property nodes. Change this
>>>>> line:
>>>>>
>>>>>     case _ = node.children
>>>>>
>>>>> (does that even compile? I don't understand how... :-) ) to
>>>>>
>>>>>     case _ => node.children.flatMap(allPropertiesNode)
>>>>>
>>>>> (I think that should work, I'm not 100% sure.)
>>>>>
>>>>> Oh by the way, the method name should be allPropertyNodes. :-) Or
>>>>> maybe findPropertyNodes is even better.
>>>>>
>>>>> Once the method works, you can drop the main loop in extract(). Instead
>>>>> of
>>>>>
>>>>> for (node <- page.children) {
>>>>>     for (property <- allPropertiesNode(node)) {
>>>>>         println(property.toWikiText)
>>>>>     }
>>>>> }
>>>>>
>>>>> you can just write
>>>>>
>>>>> for (property <- findPropertyNodes(page)) {
>>>>>     println(property.toWikiText)
>>>>> }
>>>>>
>>>>> But that's just cosmetic surgery, it has the same effect.
>>>>>
>>>>> Cheers,
>>>>> JC
>>>>>
>>>>> > }
>>>>> >
>>>>> >
>>>>> > And nothing is displayed on my screen :-(
>>>>> >
>>>>> > Any idea what I'm doing wrong?
>>>>> >
>>>>> > Best.
>>>>> >
>>>>> > Julien.
>>>>> >
>>>>> >
>>>>> > 2013/4/23 Julien Plu <[email protected]>
>>>>> >>
>>>>> >> Hi,
>>>>> >>
>>>>> >> "param" comes from a bad copy-paste; "pop" is the correct variable.
>>>>> >>
>>>>> >> By the way, thank you for the hint about the AST. I will take a look
>>>>> >> at these classes and see how I can use them. I won't hesitate to ask
>>>>> >> if I get stuck :-)
>>>>> >>
>>>>> >> Best.
>>>>> >>
>>>>> >> Julien.
>>>>> >>
>>>>> >>
>>>>> >> 2013/4/22 Jona Christopher Sahnwaldt <[email protected]>
>>>>> >>>
>>>>> >>> Hi Julien,
>>>>> >>>
>>>>> >>> On 22 April 2013 21:43, Julien Plu
>>>>> >>> <[email protected]>
>>>>> >>> wrote:
>>>>> >>> > I started the code for the extractor and I have a problem with a
>>>>> >>> > regex in Scala. The string is:
>>>>> >>> >
>>>>> >>> >
>>>>> >>> > http://fr.wikipedia.org/w/index.php?title=Mod%C3%A8le:Donn%C3%A9es/Antony/%C3%A9volution_population&action=edit
>>>>> >>> >
>>>>> >>> > And my regex is: val populationRegex = """|pop=(\d+)""".r
>>>>> >>> >
>>>>> >>> > And I use this piece of code:
>>>>> >>> >
>>>>> >>> > populationRegex findAllIn page.children.toString foreach (_ match {
>>>>> >>> >     case populationRegex(pop) =>
>>>>> >>> >         println(page.title.decoded + " : pop : " + param)
>>>>> >>>
>>>>> >>> What is param?
>>>>> >>>
>>>>> >>> But more generally - did you try using the AST (abstract syntax tree)
>>>>> >>> built by the parser, i.e. the tree whose root node is the PageNode?
>>>>> >>> I'm not sure how good our parser is at dealing with stuff like
>>>>> >>> "<includeonly>" and "{{#switch ...}}", but I think it works and
>>>>> >>> page.children should contain a ParserFunctionNode [1] object for the
>>>>> >>> #switch, which in turn has a child for each branch, e.g. one child for
>>>>> >>> an=2010 and one for pop=61793. These children are PropertyNode [2]
>>>>> >>> objects, which have a key and (who would have thought) more children.
>>>>> >>> Well, in this case, just one child, which is a TextNode. In a
>>>>> >>> nutshell: Find the "#switch" node, find children with keys "an" and
>>>>> >>> "pop", and generate triples for their values.
>>>>> >>>
>>>>> >>> >     case _ =>
>>>>> >>> > })
>>>>> >>> >
>>>>> >>> > And instead of getting "Données/Antony/évolution population : pop :
>>>>> >>> > 61793" just once,
>>>>> >>> >
>>>>> >>> > I get "Données/Antony/évolution population : pop : null" many
>>>>> >>> > times, as many as there are lines in the string.
>>>>> >>> >
>>>>> >>> > Any idea what I'm doing wrong?
>>>>> >>> >
>>>>> >>> > I'm a total beginner in Scala :-( sorry.
>>>>> >>>
>>>>> >>> Your code excerpt looks pretty good to me. :-)
>>>>> >>>
>>>>> >>> The AST is usually much safer and cleaner than regexes. Regexes are
>>>>> >>> more suitable for unstructured strings, but here you're dealing with
>>>>> >>> pretty clean structures. So I would suggest you write some code that
>>>>> >>> walks through the PageNode tree. If you have any questions, don't
>>>>> >>> hesitate to ask. We're looking forward to your contributions.
>>>>> >>> Thanks!
>>>>> >>>
>>>>> >>> Cheers,
>>>>> >>> JC
>>>>> >>>
>>>>> >>> [1]
>>>>> >>>
>>>>> >>> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/ParserFunctionNode.scala
>>>>> >>> [2]
>>>>> >>>
>>>>> >>> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/PropertyNode.scala
>>>>> >>>
>>>>> >>> >
>>>>> >>> > Best.
>>>>> >>> >
>>>>> >>> > Julien.
>>>>> >>> >
>>>>> >>> >
>>>>> >>> > 2013/4/22 Jona Christopher Sahnwaldt <[email protected]>
>>>>> >>> >>
>>>>> >>> >> The templates where data is stored are not used directly in the
>>>>> >>> >> main pages. It's a complicated process: page Toulouse uses template
>>>>> >>> >> X, X uses Y, Y uses Z, and Z contains the data. Something like
>>>>> >>> >> that, I'm not 100% sure, but the details don't matter. This means
>>>>> >>> >> that wikiPageUsesTemplate and InfoboxExtractor won't help.
>>>>> >>> >>
>>>>> >>> >> Generating a separate file is probably the best idea. We could also
>>>>> >>> >> send these new triples to the main mapping based file, but that
>>>>> >>> >> might be confusing: first, they're not mapping based; second, new
>>>>> >>> >> triples about a city would be added in a completely different place
>>>>> >>> >> in the file. (That's not a big problem though.)
>>>>> >>> >>
>>>>> >>> >> Cheers,
>>>>> >>> >> JC
>>>>> >>> >
>>>>> >>> >
>>>>> >>
>>>>> >>
>>>>> >
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Kontokostas Dimitris
>>
>>
>
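
The tree walk discussed in the quoted messages above can be tried out without
the framework. The following is a minimal sketch on toy node classes: the
Node, TextNode, PropertyNode, ParserFunctionNode and PageNode definitions here
are simplified assumptions that only mimic the shape described in the thread,
not the real classes in org.dbpedia.extraction.wikiparser.

```scala
// Toy stand-ins for the parser's node classes (assumed shapes, not the real API).
sealed trait Node { def children: List[Node] }
case class TextNode(text: String) extends Node { val children: List[Node] = Nil }
case class PropertyNode(key: String, children: List[Node]) extends Node
case class ParserFunctionNode(title: String, children: List[Node]) extends Node
case class PageNode(title: String, children: List[Node]) extends Node

object TreeWalk {
  // Collect every PropertyNode at or below the given node. The walk stops
  // at the first PropertyNode on each path and does not descend into it.
  def findPropertyNodes(node: Node): List[PropertyNode] = node match {
    case propertyNode: PropertyNode => List(propertyNode)
    case _ => node.children.flatMap(findPropertyNodes)
  }

  // A page shaped like the #switch template described in the thread.
  val page: Node = PageNode("Données/Antony/évolution population", List(
    ParserFunctionNode("#switch", List(
      PropertyNode("an", List(TextNode("2010"))),
      PropertyNode("pop", List(TextNode("61793")))))))

  def main(args: Array[String]): Unit =
    for (property <- findPropertyNodes(page)) println(property.key)
}
```

With "case _ => node.children" instead of the flatMap, the second branch would
return a List[Node] and the method would not type-check against its declared
List[PropertyNode] result, which is why the recursion is needed.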

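The regex fix quoted earlier in the thread (dropping the leading "|") is easy
to reproduce in isolation. A hedged sketch: the sample text and the RegexDemo
object are made up for illustration, and the escaped variant is an alternative
that nobody in the thread proposed.

```scala
import scala.util.matching.Regex

object RegexDemo {
  // Made-up sample resembling two branches of the #switch template.
  val text = "|an=2010\n|pop=61793"

  // An unescaped '|' is alternation: "empty string OR pop=(\d+)". The empty
  // alternative is tried first and succeeds everywhere, so group 1 stays null.
  val broken: Regex = """|pop=(\d+)""".r
  // The fix from the thread: no leading '|'.
  val fixed: Regex = """pop=(\d+)""".r
  // Alternative (not from the thread): keep the pipe, but escape it.
  val escaped: Regex = """\|pop=(\d+)""".r

  // Return the content of group 1 for every match in the sample text.
  def pops(regex: Regex): List[String] =
    regex.findAllMatchIn(text).map(_.group(1)).toList

  def main(args: Array[String]): Unit = {
    println(pops(broken).forall(_ == null))
    println(pops(fixed))
    println(pops(escaped))
  }
}
```

This is consistent with the many "pop : null" lines reported above: each empty
match satisfies the pattern while leaving the capture group unset.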
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
