Re: [Dbpedia-discussion] Problem with extracted data

Julien Plu Tue, 23 Apr 2013 14:18:06 -0700

OK, I'm finished: I now have an extractor that works as we expected :-) I
don't think what I wrote is well designed, but it works.


Only one problem remains: if I move my "PopulationExtractor.scala" file
from the "mappings" folder into an "fr" folder inside "mappings", the
extraction configuration fails because it cannot find the
"PopulationExtractor" class, no matter whether I write
"fr.PopulationExtractor" or
"org.dbpedia.extraction.mappings.fr.PopulationExtractor". Any idea
what's going on?
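
One thing worth checking (an assumption on my side, since the header of the
file is not shown in this thread): in Scala the fully qualified class name
comes from the "package" clause in the source file, not from the folder the
file sits in. The sketch below uses a hypothetical empty class as a stand-in
for the real extractor, only to show which name the JVM ends up seeing.

```scala
// Hypothetical stand-in for the real extractor; only the package clause matters.
package org.dbpedia.extraction.mappings.fr {
  class PopulationExtractor
}

object PackageCheck {
  // The name below is derived from the package clause above,
  // wherever the .scala file is stored on disk.
  val name: String =
    classOf[org.dbpedia.extraction.mappings.fr.PopulationExtractor].getName

  def main(args: Array[String]): Unit = println(name)
}
```

If the moved file still declares "package org.dbpedia.extraction.mappings",
the class keeps its old name and neither of the two "fr" spellings can match.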

One last thing: I added a dataset to the file "DBpediaDatasets.scala" so
that I get my own archive containing only the population information.

Best.

Julien.


2013/4/23 Julien Plu <[email protected]>

> Yes, I know IDEs are really useful, but my work machine runs Windows, which
> I'm really not familiar with. So I use a Linux distribution in a virtual
> machine, but this virtual machine is too slow to run a graphical IDE, so I
> have to connect to it over SSH and use only the shell.
>
> I think I will force myself to use Windows; that will be easier than
> continuing to work like this :-D
>
> By the way, I found the problem in my code. It came from my regex:
> instead of """|pop=(\d+)""".r I now use """pop=(\d+)""".r and I get the
> value I want :-)
>
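
For anyone hitting the same thing, here is a minimal sketch of why the
leading pipe mattered. An unescaped "|" is the alternation operator, so the
first pattern reads "empty string OR pop=(\d+)", and the empty alternative is
tried first at every position. The sample text is made up from the values
quoted in this thread; the object and value names are only for illustration.

```scala
object RegexPipeDemo {
  // Sample wikitext fragment built from the values mentioned in the thread.
  val text = "|an=2010\n|pop=61793"

  // Unescaped pipe: alternation between the empty string and pop=(\d+).
  val unescaped = """|pop=(\d+)""".r
  // Every match is the empty alternative, so group 1 never participates (null).
  val unescapedGroups: List[Option[String]] =
    unescaped.findAllMatchIn(text).map(m => Option(m.group(1))).toList

  // No pipe at all (the fix described above).
  val plain = """pop=(\d+)""".r
  val plainGroups: List[String] = plain.findAllMatchIn(text).map(_.group(1)).toList

  // Escaped pipe, which matches a literal "|" in front of the key.
  val escaped = """\|pop=(\d+)""".r
  val escapedGroups: List[String] = escaped.findAllMatchIn(text).map(_.group(1)).toList

  def main(args: Array[String]): Unit = {
    println(unescapedGroups.flatten) // no captured values at all
    println(plainGroups)
    println(escapedGroups)
  }
}
```

Escaping the pipe keeps the match anchored to a template parameter, which may
be preferable to dropping it if "pop=" can also appear elsewhere in the text.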
> Best.
>
> Julien.
>
>
> 2013/4/23 Dimitris Kontokostas <[email protected]>
>
>> You should use an IDE for this, it will make your life a lot easier ;)
>> I use the default IntelliJ IDEA debugger and it works pretty well. I could
>> send you instructions to set it up.
>>
>> Best,
>> Dimitris
>>
>>
>> On Tue, Apr 23, 2013 at 3:59 PM, Julien Plu <
>> [email protected]> wrote:
>>
>>> No, I don't have a debugger because I'm coding on a remote machine over
>>> SSH.
>>>
>>> And even with this code:
>>>
>>>
>>> override def extract(page: PageNode, subjectUri: String,
>>>                      pageContext: PageContext): Seq[Quad] = {
>>>     // Only handle non-redirect population templates.
>>>     if (page.title.namespace != Namespace.Template || page.isRedirect ||
>>>         !page.title.decoded.contains("évolution population")) return Seq.empty
>>>
>>>     for (property <- findPropertyNodes(page)) {
>>>         println(property.toWikiText)
>>>     }
>>>     Seq.empty // debugging only: no quads are generated yet
>>> }
>>>
>>> // Recursively collects every PropertyNode in the subtree rooted at node.
>>> private def findPropertyNodes(node : Node) : List[PropertyNode] = {
>>>     node match {
>>>         case propertyNode : PropertyNode => List(propertyNode)
>>>         case _ => node.children.flatMap(findPropertyNodes)
>>>     }
>>> }
>>>
>>> Absolutely nothing is displayed, because the list returned by
>>> "findPropertyNodes" is empty and I don't understand why. I know it's empty
>>> because if I do this:
>>>
>>> if (findPropertyNodes(page).isEmpty) {
>>>     println("empty")
>>> }
>>> else {
>>>     println("no empty")
>>> }
>>>
>>> "empty" is displayed, whereas if I print "page.children" I see all
>>> the template code; yet the "findPropertyNodes" function doesn't find any
>>> property inside this template code :-(
>>>
>>> Best.
>>>
>>> Julien.
>>>
>>>
>>>
>>> 2013/4/23 Jona Christopher Sahnwaldt <[email protected]>
>>>
>>>> On 23 April 2013 12:01, Julien Plu <[email protected]>
>>>> wrote:
>>>> > Sorry, but I really don't understand how the AST works (or Scala, for
>>>> > that matter). I'm trying to retrieve all the PropertyNode objects
>>>> > contained in a PageNode, so I do:
>>>> >
>>>> >
>>>> > override def extract(page: PageNode, subjectUri: String,
>>>> >                      pageContext: PageContext): Seq[Quad] = {
>>>> >     if (page.title.namespace != Namespace.Template || page.isRedirect ||
>>>> >         !page.title.decoded.contains("évolution population")) return Seq.empty
>>>> >
>>>>
>>>> I think it would be good if you could get a picture of the structure
>>>> of the tree. It's usually not complicated, but a bit hard to explain
>>>> in text. Can you use a debugger? If so, set a breakpoint at the
>>>> following line and let the debugger show the page variable. Then click
>>>> into it, look at its children, and so on.
>>>>
>>>> We should add a toString() method to Node.scala (and some sub-classes)
>>>> that shows the structure.
>>>>
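
Until such a toString() exists, a small println-based dump can stand in for a
debugger over SSH. This is only a sketch on a hypothetical two-class tree
(Branch and Leaf are made-up names, not the framework's Node classes); the
idea is one line per node, indented by depth.

```scala
// Hypothetical minimal tree; the framework's real Node classes have more fields.
sealed trait Node { def children: List[Node] }
case class Branch(label: String, children: List[Node]) extends Node
case class Leaf(text: String) extends Node { val children: List[Node] = Nil }

object TreeDump {
  // Renders one line per node, indented two spaces per level of depth.
  def dump(node: Node, depth: Int = 0): String = {
    val label = node match {
      case Branch(l, _) => l
      case Leaf(t)      => "\"" + t + "\""
    }
    val line = ("  " * depth) + label
    (line :: node.children.map(dump(_, depth + 1))).mkString("\n")
  }

  // Tree shaped like the one described in this thread (an assumption, not parser output).
  val tree: Node = Branch("PageNode", List(
    Branch("ParserFunctionNode #switch", List(
      Branch("PropertyNode pop", List(Leaf("61793")))))))

  def main(args: Array[String]): Unit = println(dump(tree))
}
```

With the real classes, the match would label each node by its class name and
key or title instead of the made-up "label" field.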
>>>> >     for (node <- page.children) {
>>>> >         for (property <- allPropertiesNode(node)) {
>>>> >             println(property.toWikiText)
>>>> >         }
>>>> >     }
>>>> > }
>>>> >
>>>> > private def allPropertiesNode(node : Node) : List[PropertyNode] = {
>>>> >     node match {
>>>> >         case propertyNode : PropertyNode => List(propertyNode)
>>>> >         case _ = node.children
>>>> >    }
>>>>
>>>> This is almost right. If I understand correctly, you want to walk
>>>> through the whole tree and collect all property nodes. Change this
>>>> line:
>>>>
>>>>     case _ = node.children
>>>>
>>>> (does that even compile? I don't understand how... :-) ) to
>>>>
>>>>     case _ => node.children.flatMap(allPropertiesNode)
>>>>
>>>> (I think that should work, I'm not 100% sure.)
>>>>
>>>> Oh by the way, the method name should be allPropertyNodes. :-) Or
>>>> maybe findPropertyNodes is even better.
>>>>
>>>> Once the method works, you can drop the main loop in extract(). Instead
>>>> of
>>>>
>>>> for (node <- page.children) {
>>>>     for (property <- allPropertiesNode(node)) {
>>>>         println(property.toWikiText)
>>>>     }
>>>> }
>>>>
>>>> you can just write
>>>>
>>>> for (property <- findPropertyNodes(page)) {
>>>>     println(property.toWikiText)
>>>> }
>>>>
>>>> But that's just cosmetic surgery, it has the same effect.
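
To make the recursion concrete, here is a self-contained sketch of the same
flatMap walk. The node classes are simplified, hypothetical stand-ins that
only borrow the names of the framework's wikiparser classes (the real
constructors take more arguments), and the tree is built by hand to match the
shape described in this thread rather than produced by the parser.

```scala
// Simplified, hypothetical stand-ins for the framework's node classes.
sealed trait Node { def children: List[Node] }
case class TextNode(text: String) extends Node { val children: List[Node] = Nil }
case class PropertyNode(key: String, children: List[Node]) extends Node
case class ParserFunctionNode(title: String, children: List[Node]) extends Node
case class PageNode(title: String, children: List[Node]) extends Node

object FindPropertyNodesDemo {
  // Depth-first walk: keep PropertyNodes, recurse into everything else.
  def findPropertyNodes(node: Node): List[PropertyNode] = node match {
    case p: PropertyNode => List(p)
    case _               => node.children.flatMap(findPropertyNodes)
  }

  // Hand-built tree: page -> #switch -> one PropertyNode per branch.
  val page = PageNode("Données/Antony/évolution population", List(
    TextNode("<includeonly>"),
    ParserFunctionNode("#switch", List(
      PropertyNode("an", List(TextNode("2010"))),
      PropertyNode("pop", List(TextNode("61793")))
    ))
  ))

  // Maps each property key to the concatenated text of its TextNode children.
  val values: Map[String, String] = findPropertyNodes(page).map { p =>
    p.key -> p.children.collect { case TextNode(t) => t }.mkString
  }.toMap

  def main(args: Array[String]): Unit = println(values)
}
```

If the walk returns nothing on a real page, the first thing to compare is
whether the parser really nests the nodes the way this hand-built tree does.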
>>>>
>>>> Cheers,
>>>> JC
>>>>
>>>> > }
>>>> >
>>>> >
>>>> > And nothing is displayed on my screen :-(
>>>> >
>>>> > Any idea what I'm doing wrong?
>>>> >
>>>> > Best.
>>>> >
>>>> > Julien.
>>>> >
>>>> >
>>>> > 2013/4/23 Julien Plu <[email protected]>
>>>> >>
>>>> >> Hi,
>>>> >>
>>>> >> "param" comes from a bad copy-paste; the correct variable is "pop".
>>>> >>
>>>> >> By the way, thank you for the hint about the AST. I will take a look at
>>>> >> these classes and see how I can use them. I won't hesitate to ask if I
>>>> >> get stuck :-)
>>>> >>
>>>> >> Best.
>>>> >>
>>>> >> Julien.
>>>> >>
>>>> >>
>>>> >> 2013/4/22 Jona Christopher Sahnwaldt <[email protected]>
>>>> >>>
>>>> >>> Hi Julien,
>>>> >>>
>>>> >>> On 22 April 2013 21:43, Julien Plu <
>>>> [email protected]>
>>>> >>> wrote:
>>>> >>> > I started writing the code for the extractor and I have a problem
>>>> >>> > with a regex in Scala. The string is:
>>>> >>> >
>>>> >>> >
>>>> http://fr.wikipedia.org/w/index.php?title=Mod%C3%A8le:Donn%C3%A9es/Antony/%C3%A9volution_population&action=edit
>>>> >>> >
>>>> >>> > And my regex is: val populationRegex = """|pop=(\d+)""".r
>>>> >>> >
>>>> >>> > And I use this piece of code:
>>>> >>> >
>>>> >>> > populationRegex findAllIn page.children.toString foreach (_ match {
>>>> >>> >     case populationRegex(pop) =>
>>>> >>> >         println(page.title.decoded + " : pop : " + param)
>>>> >>>
>>>> >>> What is param?
>>>> >>>
>>>> >>> But more generally - did you try using the AST (abstract syntax tree)
>>>> >>> built by the parser, i.e. the tree whose root node is the PageNode?
>>>> >>> I'm not sure how good our parser is at dealing with stuff like
>>>> >>> "<includeonly>" and "{{#switch ...}}", but I think it works and
>>>> >>> page.children should contain a ParserFunctionNode [1] object for the
>>>> >>> #switch, which in turn has a child for each branch, e.g. one child for
>>>> >>> an=2010 and one for pop=61793. These children are PropertyNode [2]
>>>> >>> objects, which have a key and (who would have thought) more children.
>>>> >>> Well, in this case, just one child, which is a TextNode. In a
>>>> >>> nutshell: Find the "#switch" node, find children with keys "an" and
>>>> >>> "pop", and generate triples for their values.
>>>> >>>
>>>> >>> >     case _ =>
>>>> >>> > })
>>>> >>> >
>>>> >>> > And instead of getting "Données/Antony/évolution population : pop :
>>>> >>> > 61793" just once, I get "Données/Antony/évolution population : pop :
>>>> >>> > null" many times, as many as there are lines in the string.
>>>> >>> >
>>>> >>> > Any idea what I'm doing wrong?
>>>> >>> >
>>>> >>> > I'm a total beginner in Scala :-( sorry.
>>>> >>>
>>>> >>> Your code excerpt looks pretty good to me. :-)
>>>> >>>
>>>> >>> The AST is usually much safer and cleaner than regexes. Regexes are
>>>> >>> more suitable for unstructured strings, but here you're dealing with
>>>> >>> pretty clean structures. So I would suggest you write some code that
>>>> >>> walks through the PageNode tree. If you have any questions, don't
>>>> >>> hesitate to ask. We're looking forward to your contributions. Thanks!
>>>> >>>
>>>> >>> Cheers,
>>>> >>> JC
>>>> >>>
>>>> >>> [1]
>>>> >>>
>>>> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/ParserFunctionNode.scala
>>>> >>> [2]
>>>> >>>
>>>> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/PropertyNode.scala
>>>> >>>
>>>> >>> >
>>>> >>> > Best.
>>>> >>> >
>>>> >>> > Julien.
>>>> >>> >
>>>> >>> >
>>>> >>> > 2013/4/22 Jona Christopher Sahnwaldt <[email protected]>
>>>> >>> >>
>>>> >>> >> The templates where data is stored are not used directly in the main
>>>> >>> >> pages. It's a complicated process: page Toulouse uses template X, X
>>>> >>> >> uses Y, Y uses Z, and Z contains the data. Something like that; I'm
>>>> >>> >> not 100% sure, but the details don't matter. This means that
>>>> >>> >> wikiPageUsesTemplate and InfoboxExtractor won't help.
>>>> >>> >>
>>>> >>> >> Generating a separate file is probably the best idea. We could also
>>>> >>> >> send these new triples to the main mapping-based file, but that might
>>>> >>> >> be confusing: first, they're not mapping-based; second, new triples
>>>> >>> >> about a city would be added in a completely different place in the
>>>> >>> >> file. (That's not a big problem though.)
>>>> >>> >>
>>>> >>> >> Cheers,
>>>> >>> >> JC
>>>> >>> >
>>>> >>> >
>>>> >>
>>>> >>
>>>> >
>>>>
>>>
>>>
>>
>>
>> --
>> Kontokostas Dimitris
>>
>
>
