Re: [Dbpedia-discussion] Problem with extracted data

Julien Plu Tue, 23 Apr 2013 07:59:27 -0700

Yes, I know IDEs are really useful, but my work machine runs Windows, which
I'm really not familiar with. So I use a Linux distribution in a virtual
machine, but that VM is too slow to run a graphical IDE, so I have to
connect to it over SSH and use only the shell.


I think I will force myself to use Windows; that will be easier than
continuing to work like this :-D

By the way, I found the problem in my code. It came from my regex: instead
of """|pop=(\d+)""".r I now use """pop=(\d+)""".r, and I get the value I
want :-)
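
A minimal sketch of why the leading | mattered. In a Java/Scala regex an
unescaped | is alternation, so """|pop=(\d+)""" means "the empty pattern OR
pop=(\d+)". The empty alternative is tried first and always succeeds, so every
match is empty and group 1 stays null, which is consistent with the
"pop : null" output reported earlier in the thread. The sample text and the
names PopRegexDemo and groupsOf are made up for illustration; only the
patterns and the values an=2010 and pop=61793 come from this thread.

```scala
import scala.util.matching.Regex

object PopRegexDemo {
  // Hypothetical stand-in for the template source text (an assumption).
  val sample: String = "|an=2010\n|pop=61793"

  // Unescaped leading '|': alternation between the empty pattern and pop=(\d+).
  val broken: Regex = """|pop=(\d+)""".r
  // The corrected pattern, without the pipe.
  val fixed: Regex = """pop=(\d+)""".r
  // Alternative: keep the pipe but escape it so it matches a literal '|'.
  val escaped: Regex = """\|pop=(\d+)""".r

  // Collects capture group 1 of every match; None when the group did not participate.
  def groupsOf(regex: Regex, text: String): List[Option[String]] =
    regex.findAllMatchIn(text).map(m => Option(m.group(1))).toList

  def main(args: Array[String]): Unit = {
    println(groupsOf(broken, sample).forall(_.isEmpty)) // true: group 1 is always null
    println(groupsOf(fixed, sample))                    // List(Some(61793))
    println(groupsOf(escaped, sample))                  // List(Some(61793))
  }
}
```

Dropping the pipe and escaping it both work here; the escaped form is
stricter because it only accepts pop= when it directly follows a literal |.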

Best.

Julien.


2013/4/23 Dimitris Kontokostas <[email protected]>

> You should use an IDE for this, it will make your life a lot easier ;)
> I use the IntelliJ IDEA default debugger and it works pretty well. I could
> send you instructions to set it up.
>
> Best,
> Dimitris
>
>
> On Tue, Apr 23, 2013 at 3:59 PM, Julien Plu <
> [email protected]> wrote:
>
>> No, I don't have a debugger because I'm coding on a remote machine via SSH.
>>
>> And even with this code:
>>
>>
>> override def extract(page: PageNode, subjectUri: String, pageContext: PageContext): Seq[Quad] = {
>>     // Only handle non-redirect template pages about "évolution population".
>>     if (page.title.namespace != Namespace.Template || page.isRedirect ||
>>         !page.title.decoded.contains("évolution population")) return Seq.empty
>>
>>     for (property <- findPropertyNodes(page)) {
>>         println(property.toWikiText)
>>     }
>>     Seq.empty // no quads are generated yet; this version only prints
>> }
>>
>> // Walks the whole tree below `node` and collects every PropertyNode.
>> private def findPropertyNodes(node: Node): List[PropertyNode] = {
>>     node match {
>>         case propertyNode: PropertyNode => List(propertyNode)
>>         case _ => node.children.flatMap(findPropertyNodes)
>>     }
>> }
>>
>> Absolutely nothing is displayed, because the list returned by
>> "findPropertyNodes" is empty, and I don't understand why. I know it's empty
>> because of this check:
>>
>> if (findPropertyNodes(page).isEmpty) {
>>     println("empty")
>> }
>> else {
>>     println("no empty")
>> }
>>
>> It displays "empty", whereas if I display "page.children" I get all the
>> template code, so "findPropertyNodes" is not finding any property inside
>> that template code :-(
>>
>> Best.
>>
>> Julien.
>>
>>
>>
>> 2013/4/23 Jona Christopher Sahnwaldt <[email protected]>
>>
>>> On 23 April 2013 12:01, Julien Plu <[email protected]>
>>> wrote:
>>> > Sorry, but I really don't understand how the AST works (or Scala, for
>>> > that matter). I am trying to retrieve all the PropertyNodes contained in
>>> > a PageNode, so I do:
>>> >
>>> >
>>> > override def extract(page: PageNode, subjectUri: String, pageContext:
>>> > PageContext): Seq[Quad] = {
>>> >     if (page.title.namespace != Namespace.Template || page.isRedirect ||
>>> > !page.title.decoded.contains("évolution population")) return Seq.empty
>>> >
>>>
>>> I think it would be good if you could get a picture of the structure
>>> of the tree. It's usually not complicated, but a bit hard to explain
>>> in text. Can you use a debugger? If so, set a breakpoint at the
>>> following line and let the debugger show the page variable. Then click
>>> into it, look at its children, and so on.
>>>
>>> We should add a toString() method to Node.scala (and some sub-classes)
>>> that shows the structure.
>>>
>>> >     for (node <- page.children) {
>>> >         for (property <- allPropertiesNode(node)) {
>>> >             println(property.toWikiText)
>>> >         }
>>> >     }
>>> > }
>>> >
>>> > private def allPropertiesNode(node : Node) : List[PropertyNode] = {
>>> >     node match {
>>> >         case propertyNode : PropertyNode => List(propertyNode)
>>> >         case _ = node.children
>>> >    }
>>>
>>> This is almost right. If I understand correctly, you want to walk
>>> through the whole tree and collect all property nodes. Change this
>>> line:
>>>
>>>     case _ = node.children
>>>
>>> (does that even compile? I don't understand how... :-) ) to
>>>
>>>     case _ => node.children.flatMap(allPropertiesNode)
>>>
>>> (I think that should work, I'm not 100% sure.)
>>>
>>> Oh by the way, the method name should be allPropertyNodes. :-) Or
>>> maybe findPropertyNodes is even better.
>>>
>>> Once the method works, you can drop the main loop in extract(). Instead
>>> of
>>>
>>> for (node <- page.children) {
>>>     for (property <- allPropertiesNode(node)) {
>>>         println(property.toWikiText)
>>>     }
>>> }
>>>
>>> you can just write
>>>
>>> for (property <- findPropertyNodes(page)) {
>>>     println(property.toWikiText)
>>> }
>>>
>>> But that's just cosmetic surgery, it has the same effect.
>>>
>>> Cheers,
>>> JC
>>>
>>> > }
>>> >
>>> >
>>> > And nothing is displayed on my screen :-(
>>> >
>>> > Any idea what I'm doing wrong?
>>> >
>>> > Best.
>>> >
>>> > Julien.
>>> >
>>> >
>>> > 2013/4/23 Julien Plu <[email protected]>
>>> >>
>>> >> Hi,
>>> >>
>>> >> "param" comes from a bad copy-paste; the correct variable is "pop".
>>> >>
>>> >> By the way, thank you for the hint about the AST. I will take a look at
>>> >> these classes and see how I can use them. I won't hesitate to ask if I
>>> >> get stuck :-)
>>> >>
>>> >> Best.
>>> >>
>>> >> Julien.
>>> >>
>>> >>
>>> >> 2013/4/22 Jona Christopher Sahnwaldt <[email protected]>
>>> >>>
>>> >>> Hi Julien,
>>> >>>
>>> >>> On 22 April 2013 21:43, Julien Plu <[email protected]>
>>> >>> wrote:
>>> >>> > I started the code for the extractor and I have a problem with a
>>> >>> > regex in Scala. The string is:
>>> >>> >
>>> >>> > http://fr.wikipedia.org/w/index.php?title=Mod%C3%A8le:Donn%C3%A9es/Antony/%C3%A9volution_population&action=edit
>>> >>> >
>>> >>> > And my regex is : val populationRegex = """|pop=(\d+)""".r
>>> >>> >
>>> >>> > And I use this piece of code :
>>> >>> >
>>> >>> > populationRegex findAllIn page.children.toString foreach (_ match {
>>> >>> >     case populationRegex(pop) => println(page.title.decoded + " : pop : " +
>>> >>> > param)
>>> >>>
>>> >>> What is param?
>>> >>>
>>> >>> But more generally - did you try using the AST (abstract syntax tree)
>>> >>> built by the parser, i.e. the tree whose root node is the PageNode?
>>> >>> I'm not sure how good our parser is at dealing with stuff like
>>> >>> "<includeonly>" and "{{#switch ...}}", but I think it works and
>>> >>> page.children should contain a ParserFunctionNode [1] object for the
>>> >>> #switch, which in turn has a child for each branch, e.g. one child for
>>> >>> an=2010 and one for pop=61793. These children are PropertyNode [2]
>>> >>> objects, which have a key and (who would have thought) more children.
>>> >>> Well, in this case, just one child, which is a TextNode. In a
>>> >>> nutshell: Find the "#switch" node, find children with keys "an" and
>>> >>> "pop", and generate triples for their values.
>>> >>>
>>> >>> >     case _ =>
>>> >>> > })
>>> >>> >
>>> >>> > And instead of getting "Données/Antony/évolution population : pop :
>>> >>> > 61793" just once,
>>> >>> >
>>> >>> > I get "Données/Antony/évolution population : pop : null" many times,
>>> >>> > as many as there are lines in the string.
>>> >>> >
>>> >>> > Any idea what I'm doing wrong?
>>> >>> >
>>> >>> > I'm a total beginner in Scala :-( sorry.
>>> >>>
>>> >>> Your code excerpt looks pretty good to me. :-)
>>> >>>
>>> >>> The AST is usually much safer and cleaner than regexes. Regexes are
>>> >>> more suitable for unstructured strings, but here you're dealing with
>>> >>> pretty clean structures. So I would suggest you write some code that
>>> >>> walks through the PageNode tree. If you have any questions, don't
>>> >>> hesitate to ask. We're looking forward to your contributions. Thanks!
>>> >>>
>>> >>> Cheers,
>>> >>> JC
>>> >>>
>>> >>> [1] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/ParserFunctionNode.scala
>>> >>> [2] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/PropertyNode.scala
>>> >>>
>>> >>> >
>>> >>> > Best.
>>> >>> >
>>> >>> > Julien.
>>> >>> >
>>> >>> >
>>> >>> > 2013/4/22 Jona Christopher Sahnwaldt <[email protected]>
>>> >>> >>
>>> >>> >> The templates where data is stored are not used directly in the
>>> >>> >> main pages. It's a complicated process: page Toulouse uses template
>>> >>> >> X, X uses Y, Y uses Z, and Z contains the data. Something like that;
>>> >>> >> I'm not 100% sure, but the details don't matter. This means that
>>> >>> >> wikiPageUsesTemplate and InfoboxExtractor won't help.
>>> >>> >>
>>> >>> >> Generating a separate file is probably the best idea. We could also
>>> >>> >> send these new triples to the main mapping-based file, but that
>>> >>> >> might be confusing: first, they're not mapping-based; second, new
>>> >>> >> triples about a city would be added in a completely different place
>>> >>> >> in the file. (That's not a big problem though.)
>>> >>> >>
>>> >>> >> Cheers,
>>> >>> >> JC
>>> >>> >
>>> >>> >
>>> >>
>>> >>
>>> >
>>>
>>
>>
>
>
> --
> Kontokostas Dimitris
>

_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
