Hi Julien,

On 23 April 2013 23:16, Julien Plu <[email protected]> wrote:
> Ok, I finished, now I made an extractor which works like we expected :-) I
> don't think that what I did is well made but it works.

Cool! You can always improve it later. :-)

>
> Anyway, only one problem remains: if I move my "PopulationExtractor.scala"
> file from the "mappings" folder into the "fr" folder inside "mappings", the
> extraction configuration file fails because it doesn't find the
> "PopulationExtractor" class, no matter whether I write
> "fr.PopulationExtractor" or
> "org.dbpedia.extraction.mappings.fr.PopulationExtractor". Any idea of what's
> going on?

Does the package declaration in the class file include the ".fr"?
Scala is less strict than Java here: the compiler does not require the
folder to match the package, so moving the file alone does not change
the package the class lives in.

If you send a pull request, we can have a look at your code and merge
it into the main repository, so others can run this extraction as
well.

https://github.com/dbpedia/extraction-framework/wiki/Contributing

>
> Last thing: I added a dataset inside the file "DBpediaDatasets.scala", so
> that I have my own archive containing only the population information.

Right, that's one more thing you need to add. Thanks!

JC

>
> Best.
>
> Julien.
>
>
> 2013/4/23 Julien Plu <[email protected]>
>>
>> Yes, I know IDEs are really useful, but my working machine is on Windows and
>> I'm really not familiar with it. So I use a Linux distrib via a virtual
>> machine, but this virtual machine is too slow for coding with a graphical
>> IDE, so I have to connect to this VM over ssh and use only the shell.
>>
>> I think that I will force myself to use Windows; that will be easier than
>> continuing to work like that :-D
>>
>> By the way, I found the problem in my code. It came from my regex, so
>> instead of using """|pop=(\d+)""".r I use """pop=(\d+)""".r and now I have
>> the value that I want :-)
>>
>> Best.
>>
>> Julien.
>>
>>
>> 2013/4/23 Dimitris Kontokostas <[email protected]>
>>>
>>> You should use an IDE for this, it will make your life a lot easier ;)
>>> I use the IntelliJ IDEA default debugger and it works pretty well.
>>> I could send you instructions to set it up
>>>
>>> Best,
>>> Dimitris
>>>
>>>
>>> On Tue, Apr 23, 2013 at 3:59 PM, Julien Plu
>>> <[email protected]> wrote:
>>>>
>>>> No, I don't have a debugger because I'm coding on a remote machine via
>>>> ssh.
>>>>
>>>> And even with this code:
>>>>
>>>>
>>>> override def extract(page: PageNode, subjectUri: String, pageContext: PageContext): Seq[Quad] = {
>>>>   if (page.title.namespace != Namespace.Template || page.isRedirect
>>>>       || !page.title.decoded.contains("évolution population")) return Seq.empty
>>>>
>>>>   for (property <- findPropertyNodes(page)) {
>>>>     println(property.toWikiText)
>>>>   }
>>>> }
>>>> private def findPropertyNodes(node : Node) : List[PropertyNode] = {
>>>>
>>>>   node match {
>>>>     case propertyNode : PropertyNode => List(propertyNode)
>>>>     case _ => node.children.flatMap(findPropertyNodes)
>>>>   }
>>>> }
>>>>
>>>> Absolutely nothing is displayed, because the list returned by
>>>> "findPropertyNodes" is empty and I don't understand why. I know it's
>>>> empty because if I do this:
>>>>
>>>> if (findPropertyNodes(page).isEmpty) {
>>>>   println("empty")
>>>> }
>>>> else {
>>>>   println("no empty")
>>>> }
>>>>
>>>> then "empty" is displayed, whereas if I display "page.children" I have all
>>>> the template code, but the "findPropertyNodes" function doesn't find any
>>>> property inside this template code :-(
>>>>
>>>> Best.
>>>>
>>>> Julien.
>>>>
>>>>
>>>>
>>>> 2013/4/23 Jona Christopher Sahnwaldt <[email protected]>
>>>>>
>>>>> On 23 April 2013 12:01, Julien Plu
>>>>> <[email protected]> wrote:
>>>>> > Sorry but I really don't understand how AST works (and Scala too) I
>>>>> > try to retrieve all the PropertyNode contained in a PageNode so I do :
>>>>> >
>>>>> >
>>>>> > override def extract(page: PageNode, subjectUri: String, pageContext: PageContext): Seq[Quad] = {
>>>>> >   if (page.title.namespace != Namespace.Template || page.isRedirect ||
>>>>> >       !page.title.decoded.contains("évolution population")) return Seq.empty
>>>>> >
>>>>>
>>>>> I think it would be good if you could get a picture of the structure
>>>>> of the tree. It's usually not complicated, but a bit hard to explain
>>>>> in text. Can you use a debugger? If so, set a breakpoint at the
>>>>> following line and let the debugger show the page variable. Then click
>>>>> into it, look at its children, and so on.
>>>>>
>>>>> We should add a toString() method to Node.scala (and some sub-classes)
>>>>> that shows the structure.
>>>>>
>>>>> >   for (node <- page.children) {
>>>>> >     for (property <- allPropertiesNode(node)) {
>>>>> >       println(property.toWikiText)
>>>>> >     }
>>>>> >   }
>>>>> > }
>>>>> >
>>>>> > private def allPropertiesNode(node : Node) : List[PropertyNode] = {
>>>>> >   node match {
>>>>> >     case propertyNode : PropertyNode => List(propertyNode)
>>>>> >     case _ = node.children
>>>>> >   }
>>>>>
>>>>> This is almost right. If I understand correctly, you want to walk
>>>>> through the whole tree and collect all property nodes. Change this
>>>>> line:
>>>>>
>>>>>     case _ = node.children
>>>>>
>>>>> (does that even compile? I don't understand how... :-) ) to
>>>>>
>>>>>     case _ => node.children.flatMap(allPropertiesNode)
>>>>>
>>>>> (I think that should work, I'm not 100% sure.)
>>>>>
>>>>> Oh by the way, the method name should be allPropertyNodes. :-) Or
>>>>> maybe findPropertyNodes is even better.
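The recursive collection suggested above can be tried outside the framework. This is only a minimal sketch: the node classes below are simplified stand-ins invented for illustration, not the real classes from org.dbpedia.extraction.wikiparser, and the sample tree is an assumption about what the parser might produce for a "#switch" template.

```scala
object FindPropertyNodesSketch {
  // Hypothetical, simplified AST classes (NOT the framework's real ones).
  sealed trait Node { def children: List[Node] }
  case class TextNode(text: String) extends Node { val children: List[Node] = Nil }
  case class PropertyNode(key: String, children: List[Node]) extends Node
  case class ParserFunctionNode(name: String, children: List[Node]) extends Node
  case class PageNode(title: String, children: List[Node]) extends Node

  // Depth-first walk: a PropertyNode is returned as a one-element list
  // (its own children are not searched further); every other node
  // delegates to its children and the results are concatenated by flatMap.
  def findPropertyNodes(node: Node): List[PropertyNode] = node match {
    case propertyNode: PropertyNode => List(propertyNode)
    case _                          => node.children.flatMap(findPropertyNodes)
  }

  def main(args: Array[String]): Unit = {
    // Assumed shape of the parsed template page, for illustration only.
    val page = PageNode("Données/Antony/évolution population", List(
      ParserFunctionNode("#switch", List(
        PropertyNode("an", List(TextNode("2010"))),
        PropertyNode("pop", List(TextNode("61793")))))))

    for (property <- findPropertyNodes(page)) println(property.key)
  }
}
```

Because the first case stops at a PropertyNode, properties nested inside another property would not be reported; whether that matters depends on the templates being processed.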
>>>>>
>>>>> Once the method works, you can drop the main loop in extract(). Instead
>>>>> of
>>>>>
>>>>>     for (node <- page.children) {
>>>>>       for (property <- allPropertiesNode(node)) {
>>>>>         println(property.toWikiText)
>>>>>       }
>>>>>     }
>>>>>
>>>>> you can just write
>>>>>
>>>>>     for (property <- findPropertyNodes(page)) {
>>>>>       println(property.toWikiText)
>>>>>     }
>>>>>
>>>>> But that's just cosmetic surgery, it has the same effect.
>>>>>
>>>>> Cheers,
>>>>> JC
>>>>>
>>>>> > }
>>>>> >
>>>>> >
>>>>> > And nothing is displayed on my screen :-(
>>>>> >
>>>>> > Any idea of what I'm doing wrong?
>>>>> >
>>>>> > Best.
>>>>> >
>>>>> > Julien.
>>>>> >
>>>>> >
>>>>> > 2013/4/23 Julien Plu <[email protected]>
>>>>> >>
>>>>> >> Hi,
>>>>> >>
>>>>> >> param comes from a bad copy-paste; the right variable is "pop".
>>>>> >>
>>>>> >> By the way, thank you for the hint about the AST. I will take a look
>>>>> >> at these classes and see how I can use them. I won't hesitate to ask
>>>>> >> if I'm blocked :-)
>>>>> >>
>>>>> >> Best.
>>>>> >>
>>>>> >> Julien.
>>>>> >>
>>>>> >>
>>>>> >> 2013/4/22 Jona Christopher Sahnwaldt <[email protected]>
>>>>> >>>
>>>>> >>> Hi Julien,
>>>>> >>>
>>>>> >>> On 22 April 2013 21:43, Julien Plu
>>>>> >>> <[email protected]>
>>>>> >>> wrote:
>>>>> >>> > I started the code for the extractor and I have a problem with
>>>>> >>> > the regex in Scala. The string is:
>>>>> >>> >
>>>>> >>> >
>>>>> >>> > http://fr.wikipedia.org/w/index.php?title=Mod%C3%A8le:Donn%C3%A9es/Antony/%C3%A9volution_population&action=edit
>>>>> >>> >
>>>>> >>> > And my regex is: val populationRegex = """|pop=(\d+)""".r
>>>>> >>> >
>>>>> >>> > And I use this piece of code:
>>>>> >>> >
>>>>> >>> > populationRegex findAllIn page.children.toString foreach (_ match {
>>>>> >>> >   case populationRegex (pop) => println(page.title.decoded + " : pop : " + param)
>>>>> >>>
>>>>> >>> What is param?
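A note on the regex quoted above: a leading "|" makes the pattern an alternation whose left branch is empty, so it can match the empty string at every position, and group 1 is then unset (null on the JVM). The sketch below illustrates this with plain strings; the object name, the helper method and the sample wikitext are made up for the example.

```scala
object PopulationRegexSketch {
  // Leading '|': "empty string OR pop=<digits>". The empty branch is tried
  // first, so the capture group is never filled in.
  val broken = """|pop=(\d+)""".r
  // Without the leading '|' the group captures the digits.
  val fixed = """pop=(\d+)""".r

  // Returns the captured digits of every "pop=<digits>" occurrence.
  def populations(text: String): List[String] =
    fixed.findAllMatchIn(text).map(_.group(1)).toList

  def main(args: Array[String]): Unit = {
    // Assumed sample text, for illustration only.
    val wikiText = "{{#switch: {{{1}}} | an=2010 | pop=61793 }}"

    println(populations(wikiText))

    // With the broken pattern the first match is the empty string at
    // index 0, and its group 1 is null.
    val firstBroken = broken.findFirstMatchIn(wikiText)
    println(firstBroken.map(m => (m.start, m.group(1))))

    // Used as an extractor on such an empty match, the broken pattern
    // binds pop to null.
    "" match {
      case broken(pop) => println("pop : " + pop)
      case _           => println("no match")
    }
  }
}
```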
>>>>> >>>
>>>>> >>> But more generally - did you try using the AST (abstract syntax
>>>>> >>> tree) built by the parser, i.e. the tree whose root node is the
>>>>> >>> PageNode? I'm not sure how good our parser is at dealing with stuff
>>>>> >>> like "<includeonly>" and "{{#switch ...}}", but I think it works and
>>>>> >>> page.children should contain a ParserFunctionNode [1] object for
>>>>> >>> the #switch, which in turn has a child for each branch, e.g. one
>>>>> >>> child for an=2010 and one for pop=61793. These children are
>>>>> >>> PropertyNode [2] objects, which have a key and (who would have
>>>>> >>> thought) more children. Well, in this case, just one child, which
>>>>> >>> is a TextNode. In a nutshell: Find the "#switch" node, find
>>>>> >>> children with keys "an" and "pop", and generate triples for their
>>>>> >>> values.
>>>>> >>>
>>>>> >>> >   case _ =>
>>>>> >>> > })
>>>>> >>> >
>>>>> >>> > And instead of getting "Données/Antony/évolution population :
>>>>> >>> > pop : 61793" just once,
>>>>> >>> >
>>>>> >>> > I get "Données/Antony/évolution population : pop : null" many
>>>>> >>> > times, as many as there are lines in the string.
>>>>> >>> >
>>>>> >>> > Any idea of what I'm doing wrong?
>>>>> >>> >
>>>>> >>> > I'm a total beginner in Scala :-( sorry.
>>>>> >>>
>>>>> >>> Your code excerpt looks pretty good to me. :-)
>>>>> >>>
>>>>> >>> The AST is usually much safer and cleaner than regexes. Regexes are
>>>>> >>> more suitable for unstructured strings, but here you're dealing
>>>>> >>> with pretty clean structures. So I would suggest you write some code
>>>>> >>> that walks through the PageNode tree. If you have any questions,
>>>>> >>> don't hesitate to ask. We're looking forward to your contributions.
>>>>> >>> Thanks!
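The "in a nutshell" recipe above (find the "#switch" node, read the branches keyed "an" and "pop") might look roughly like the following. It is a sketch under stated assumptions: the classes are simplified stand-ins rather than the framework's real ParserFunctionNode and PropertyNode, the field called name is hypothetical, and printing stands in for generating triples.

```scala
object SwitchBranchSketch {
  // Hypothetical, simplified AST classes (NOT the framework's real ones).
  sealed trait Node { def children: List[Node] }
  case class TextNode(text: String) extends Node { val children: List[Node] = Nil }
  case class PropertyNode(key: String, children: List[Node]) extends Node
  case class ParserFunctionNode(name: String, children: List[Node]) extends Node
  case class PageNode(title: String, children: List[Node]) extends Node

  // Concatenates the text of the TextNode children of one branch.
  private def textOf(property: PropertyNode): String =
    property.children.collect { case TextNode(t) => t }.mkString.trim

  // Finds the first "#switch" node directly under the page and returns
  // its branches as key -> value; empty if there is no such node.
  def switchBranches(page: PageNode): Map[String, String] =
    page.children.collectFirst {
      case ParserFunctionNode("#switch", branches) =>
        branches.collect { case p: PropertyNode => p.key -> textOf(p) }.toMap
    }.getOrElse(Map.empty)

  def main(args: Array[String]): Unit = {
    // Assumed shape of the parsed template page, for illustration only.
    val page = PageNode("Données/Antony/évolution population", List(
      ParserFunctionNode("#switch", List(
        PropertyNode("an", List(TextNode("2010"))),
        PropertyNode("pop", List(TextNode("61793")))))))

    val branches = switchBranches(page)
    // Only report a value when both keys are present.
    for (year <- branches.get("an"); pop <- branches.get("pop"))
      println(s"${page.title}: year=$year population=$pop")
  }
}
```

This version only looks at direct children of the page; if the "#switch" sits deeper in the tree, a recursive walk would be needed instead.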
>>>>> >>>
>>>>> >>> Cheers,
>>>>> >>> JC
>>>>> >>>
>>>>> >>> [1]
>>>>> >>> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/ParserFunctionNode.scala
>>>>> >>> [2]
>>>>> >>> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/PropertyNode.scala
>>>>> >>>
>>>>> >>> >
>>>>> >>> > Best.
>>>>> >>> >
>>>>> >>> > Julien.
>>>>> >>> >
>>>>> >>> >
>>>>> >>> > 2013/4/22 Jona Christopher Sahnwaldt <[email protected]>
>>>>> >>> >>
>>>>> >>> >> The templates where data is stored are not used directly in the
>>>>> >>> >> main pages. It's a complicated process: page Toulouse uses
>>>>> >>> >> template X, X uses Y, Y uses Z, and Z contains the data.
>>>>> >>> >> Something like that, I'm not 100% sure, but the details don't
>>>>> >>> >> matter. This means that wikiPageUsesTemplate and
>>>>> >>> >> InfoboxExtractor won't help.
>>>>> >>> >>
>>>>> >>> >> Generating a separate file is probably the best idea. We could
>>>>> >>> >> also send these new triples to the main mapping based file, but
>>>>> >>> >> that might be confusing: first, they're not mapping based;
>>>>> >>> >> second, new triples about a city would be added in a completely
>>>>> >>> >> different place in the file. (That's not a big problem though.)
>>>>> >>> >>
>>>>> >>> >> Cheers,
>>>>> >>> >> JC
>>>>> >>> >
>>>>> >>> >
>>>>> >>
>>>>> >>
>>>>> >
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Kontokostas Dimitris
>>
>>
>
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
