Cool, thanks! Your code looks good. I peppered your pull request with a few comments. None of them are major problems, but if you have time, please have a look at them. If you don't have time, please copy the comments into TODO comments in the code and we can fix them later.
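
For the archive, here is a minimal, self-contained sketch of the two things we discussed further down in this thread: collecting all property nodes with a recursive flatMap, and why the leading "|" in the regex gave you null. Please take it with a grain of salt: the node classes are simplified stand-ins I made up for this mail (the real ones in org.dbpedia.extraction.wikiparser have more fields), and the object name, the helper textOf and the sample data are assumptions for illustration, not code from the framework or from your pull request.

```scala
object PopulationSketch {
  // Hypothetical, simplified stand-ins for the wikiparser node classes.
  sealed trait Node { def children: List[Node] }
  final case class TextNode(text: String) extends Node { val children: List[Node] = Nil }
  final case class PropertyNode(key: String, children: List[Node]) extends Node
  final case class ParserFunctionNode(title: String, children: List[Node]) extends Node
  final case class PageNode(title: String, children: List[Node]) extends Node

  // Walk the whole tree and collect every PropertyNode, at any depth.
  def findPropertyNodes(node: Node): List[PropertyNode] = node match {
    case p: PropertyNode => List(p)
    case _               => node.children.flatMap(findPropertyNodes)
  }

  // Hypothetical helper: concatenated text of a property's direct TextNode children.
  def textOf(p: PropertyNode): String =
    p.children.collect { case TextNode(t) => t }.mkString.trim

  // A leading '|' means "empty string OR pop=(\d+)". The empty alternative is
  // tried first and succeeds at every position, so group 1 never captures anything.
  val brokenRegex = """|pop=(\d+)""".r
  // Without the '|' the pattern has to match the literal text "pop=" plus digits.
  val fixedRegex = """pop=(\d+)""".r

  // Assumed shape of the parsed template: a #switch with one child per branch.
  val samplePage: PageNode = PageNode("Données/Antony/évolution population", List(
    ParserFunctionNode("#switch", List(
      PropertyNode("an", List(TextNode("2010"))),
      PropertyNode("pop", List(TextNode("61793")))))))

  def main(args: Array[String]): Unit = {
    // AST approach: print key and value of every property found in the tree.
    for (p <- findPropertyNodes(samplePage)) println(p.key + " = " + textOf(p))

    // Regex approach on an assumed flat rendering of the same data.
    val wikiText = "an=2010|pop=61793"
    println(brokenRegex.findAllIn(wikiText).toList)                      // only empty matches
    println(fixedRegex.findAllMatchIn(wikiText).map(_.group(1)).toList)  // the captured digits
  }
}
```

If the literal pipe is really wanted in the pattern, it has to be escaped, as in """\|pop=(\d+)""".r. I would still prefer the tree walk, since it does not depend on how the wiki text happens to be laid out.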

On 24 April 2013 16:34, Julien Plu <[email protected]> wrote:
> You were right, Jona: the problem came from a bad path in the package.
>
> I just sent a new pull request with my extractor :-)
>
> Best.
>
> Julien.
>
>
> 2013/4/23 Jona Christopher Sahnwaldt <[email protected]>
>>
>> Hi Julien,
>>
>> On 23 April 2013 23:16, Julien Plu <[email protected]>
>> wrote:
>> > OK, I finished. I now have an extractor which works as we expected :-)
>> > I don't think what I did is well made, but it works.
>>
>> Cool! You can always improve it later. :-)
>>
>> >
>> > Anyway, only one problem remains: if I move my "PopulationExtractor.scala"
>> > file from the "mappings" folder into the "fr" folder inside the "mappings"
>> > folder, the extraction configuration file fails because it doesn't find
>> > the "PopulationExtractor" class, no matter whether I write
>> > "fr.PopulationExtractor" or
>> > "org.dbpedia.extraction.mappings.fr.PopulationExtractor". Any idea
>> > what's going on?
>>
>> Does the package declaration in the class file include the ".fr"?
>> Scala is less strict than Java here.
>>
>> If you send a pull request, we can have a look at your code and merge
>> it into the main repository, so others can run this extraction as
>> well.
>>
>> https://github.com/dbpedia/extraction-framework/wiki/Contributing
>>
>> >
>> > Last thing: I added a dataset to the file "DBpediaDatasets.scala",
>> > so that I have my own archive containing only the population information.
>>
>> Right, that's one more thing you need to add.
>>
>> Thanks!
>>
>> JC
>>
>>
>> >
>> > Best.
>> >
>> > Julien.
>> >
>> >
>> > 2013/4/23 Julien Plu <[email protected]>
>> >>
>> >> Yes, I know IDEs are really useful, but my working machine runs Windows
>> >> and I'm really not familiar with it.
>> >> So I use a Linux distribution in a virtual machine,
>> >> but this virtual machine is too slow for coding with a graphical IDE,
>> >> so I have to connect to the VM over SSH and use only the shell.
>> >>
>> >> I think I will force myself to use Windows; that will be easier than
>> >> continuing to work like that :-D
>> >>
>> >> By the way, I found the problem in my code. It came from my regex:
>> >> instead of """|pop=(\d+)""".r I now use """pop=(\d+)""".r, and I
>> >> get the value that I want :-)
>> >>
>> >> Best.
>> >>
>> >> Julien.
>> >>
>> >>
>> >> 2013/4/23 Dimitris Kontokostas <[email protected]>
>> >>>
>> >>> You should use an IDE for this, it will make your life a lot easier ;)
>> >>> I use the IntelliJ IDEA default debugger and it works pretty well. I
>> >>> could send you instructions to set it up.
>> >>>
>> >>> Best,
>> >>> Dimitris
>> >>>
>> >>>
>> >>> On Tue, Apr 23, 2013 at 3:59 PM, Julien Plu
>> >>> <[email protected]> wrote:
>> >>>>
>> >>>> No, I don't have a debugger, because I'm coding on a remote machine via
>> >>>> ssh.
>> >>>>
>> >>>> And even with this code:
>> >>>>
>> >>>>
>> >>>> override def extract(page: PageNode, subjectUri: String, pageContext: PageContext): Seq[Quad] = {
>> >>>>   if (page.title.namespace != Namespace.Template || page.isRedirect
>> >>>>       || !page.title.decoded.contains("évolution population")) return Seq.empty
>> >>>>
>> >>>>   for (property <- findPropertyNodes(page)) {
>> >>>>     println(property.toWikiText)
>> >>>>   }
>> >>>> }
>> >>>> private def findPropertyNodes(node : Node) : List[PropertyNode] = {
>> >>>>
>> >>>>   node match {
>> >>>>     case propertyNode : PropertyNode => List(propertyNode)
>> >>>>     case _ => node.children.flatMap(findPropertyNodes)
>> >>>>   }
>> >>>> }
>> >>>>
>> >>>> Absolutely nothing is displayed, because the list returned by
>> >>>> "findPropertyNodes" is empty and I don't understand why.
>> >>>> I know it's empty
>> >>>> because if I do this:
>> >>>>
>> >>>> if (findPropertyNodes(page).isEmpty) {
>> >>>>   println("empty")
>> >>>> }
>> >>>> else {
>> >>>>   println("no empty")
>> >>>> }
>> >>>>
>> >>>> then "empty" is displayed, whereas if I display "page.children" I get all
>> >>>> the template code. But the "findPropertyNodes" function doesn't find any
>> >>>> property inside this template code :-(
>> >>>>
>> >>>> Best.
>> >>>>
>> >>>> Julien.
>> >>>>
>> >>>>
>> >>>>
>> >>>> 2013/4/23 Jona Christopher Sahnwaldt <[email protected]>
>> >>>>>
>> >>>>> On 23 April 2013 12:01, Julien Plu
>> >>>>> <[email protected]> wrote:
>> >>>>> > Sorry, but I really don't understand how the AST works (or Scala either).
>> >>>>> > I am trying to retrieve all the PropertyNodes contained in a PageNode,
>> >>>>> > so I do:
>> >>>>> >
>> >>>>> >
>> >>>>> > override def extract(page: PageNode, subjectUri: String, pageContext: PageContext): Seq[Quad] = {
>> >>>>> >   if (page.title.namespace != Namespace.Template || page.isRedirect ||
>> >>>>> >       !page.title.decoded.contains("évolution population")) return Seq.empty
>> >>>>> >
>> >>>>>
>> >>>>> I think it would be good if you could get a picture of the structure
>> >>>>> of the tree. It's usually not complicated, but a bit hard to explain
>> >>>>> in text. Can you use a debugger? If so, set a breakpoint at the
>> >>>>> following line and let the debugger show the page variable. Then
>> >>>>> click into it, look at its children, and so on.
>> >>>>>
>> >>>>> We should add a toString() method to Node.scala (and some
>> >>>>> sub-classes) that shows the structure.
>> >>>>>
>> >>>>> >   for (node <- page.children) {
>> >>>>> >     for (property <- allPropertiesNode(node)) {
>> >>>>> >       println(property.toWikiText)
>> >>>>> >     }
>> >>>>> >   }
>> >>>>> > }
>> >>>>> >
>> >>>>> > private def allPropertiesNode(node : Node) : List[PropertyNode] = {
>> >>>>> >   node match {
>> >>>>> >     case propertyNode : PropertyNode => List(propertyNode)
>> >>>>> >     case _ = node.children
>> >>>>> >   }
>> >>>>>
>> >>>>> This is almost right. If I understand correctly, you want to walk
>> >>>>> through the whole tree and collect all property nodes. Change this
>> >>>>> line:
>> >>>>>
>> >>>>>     case _ = node.children
>> >>>>>
>> >>>>> (does that even compile? I don't understand how... :-) ) to
>> >>>>>
>> >>>>>     case _ => node.children.flatMap(allPropertiesNode)
>> >>>>>
>> >>>>> (I think that should work, I'm not 100% sure.)
>> >>>>>
>> >>>>> Oh by the way, the method name should be allPropertyNodes. :-) Or
>> >>>>> maybe findPropertyNodes is even better.
>> >>>>>
>> >>>>> Once the method works, you can drop the main loop in extract().
>> >>>>> Instead of
>> >>>>>
>> >>>>>     for (node <- page.children) {
>> >>>>>       for (property <- allPropertiesNode(node)) {
>> >>>>>         println(property.toWikiText)
>> >>>>>       }
>> >>>>>     }
>> >>>>>
>> >>>>> you can just write
>> >>>>>
>> >>>>>     for (property <- findPropertyNodes(page)) {
>> >>>>>       println(property.toWikiText)
>> >>>>>     }
>> >>>>>
>> >>>>> But that's just cosmetic surgery, it has the same effect.
>> >>>>>
>> >>>>> Cheers,
>> >>>>> JC
>> >>>>>
>> >>>>> > }
>> >>>>> >
>> >>>>> >
>> >>>>> > And nothing is displayed on my screen :-(
>> >>>>> >
>> >>>>> > Any idea what I am doing wrong?
>> >>>>> >
>> >>>>> > Best.
>> >>>>> >
>> >>>>> > Julien.
>> >>>>> >
>> >>>>> >
>> >>>>> > 2013/4/23 Julien Plu <[email protected]>
>> >>>>> >>
>> >>>>> >> Hi,
>> >>>>> >>
>> >>>>> >> "param" came from a bad copy-paste; "pop" is the correct variable.
>> >>>>> >>
>> >>>>> >> By the way, thank you for the hint about the AST. I will take a look at
>> >>>>> >> these classes and see how I can use them. I won't hesitate to ask if I'm
>> >>>>> >> stuck :-)
>> >>>>> >>
>> >>>>> >> Best.
>> >>>>> >>
>> >>>>> >> Julien.
>> >>>>> >>
>> >>>>> >>
>> >>>>> >> 2013/4/22 Jona Christopher Sahnwaldt <[email protected]>
>> >>>>> >>>
>> >>>>> >>> Hi Julien,
>> >>>>> >>>
>> >>>>> >>> On 22 April 2013 21:43, Julien Plu
>> >>>>> >>> <[email protected]>
>> >>>>> >>> wrote:
>> >>>>> >>> > I started the code for the extractor and I have a problem with
>> >>>>> >>> > the regex in Scala. The string is:
>> >>>>> >>> >
>> >>>>> >>> >
>> >>>>> >>> >
>> >>>>> >>> > http://fr.wikipedia.org/w/index.php?title=Mod%C3%A8le:Donn%C3%A9es/Antony/%C3%A9volution_population&action=edit
>> >>>>> >>> >
>> >>>>> >>> > And my regex is: val populationRegex = """|pop=(\d+)""".r
>> >>>>> >>> >
>> >>>>> >>> > And I use this piece of code:
>> >>>>> >>> >
>> >>>>> >>> > populationRegex findAllIn page.children.toString foreach (_ match {
>> >>>>> >>> >   case populationRegex (pop) => println(page.title.decoded + " : pop : " + param)
>> >>>>> >>>
>> >>>>> >>> What is param?
>> >>>>> >>>
>> >>>>> >>> But more generally - did you try using the AST (abstract syntax
>> >>>>> >>> tree) built by the parser, i.e. the tree whose root node is the
>> >>>>> >>> PageNode? I'm not sure how good our parser is at dealing with stuff
>> >>>>> >>> like "<includeonly>" and "{{#switch ...}}", but I think it works and
>> >>>>> >>> page.children should contain a ParserFunctionNode [1] object for
>> >>>>> >>> the #switch, which in turn has a child for each branch, e.g. one
>> >>>>> >>> child for an=2010 and one for pop=61793.
>> >>>>> >>> These children are PropertyNode [2]
>> >>>>> >>> objects, which have a key and (who would have thought) more
>> >>>>> >>> children. Well, in this case, just one child, which is a TextNode.
>> >>>>> >>> In a nutshell: Find the "#switch" node, find children with keys "an"
>> >>>>> >>> and "pop", and generate triples for their values.
>> >>>>> >>>
>> >>>>> >>> >   case _ =>
>> >>>>> >>> > })
>> >>>>> >>> >
>> >>>>> >>> > And instead of getting "Données/Antony/évolution population : pop : 61793"
>> >>>>> >>> > just once,
>> >>>>> >>> >
>> >>>>> >>> > I get "Données/Antony/évolution population : pop : null" many times,
>> >>>>> >>> > as many times as there are lines in the string.
>> >>>>> >>> >
>> >>>>> >>> > Any idea what I am doing wrong?
>> >>>>> >>> >
>> >>>>> >>> > I'm a total beginner in Scala :-( sorry.
>> >>>>> >>>
>> >>>>> >>> Your code excerpt looks pretty good to me. :-)
>> >>>>> >>>
>> >>>>> >>> The AST is usually much safer and cleaner than regexes. Regexes
>> >>>>> >>> are more suitable for unstructured strings, but here you're dealing
>> >>>>> >>> with pretty clean structures. So I would suggest you write some code
>> >>>>> >>> that walks through the PageNode tree. If you have any questions,
>> >>>>> >>> don't hesitate to ask. We're looking forward to your contributions.
>> >>>>> >>> Thanks!
>> >>>>> >>>
>> >>>>> >>> Cheers,
>> >>>>> >>> JC
>> >>>>> >>>
>> >>>>> >>> [1]
>> >>>>> >>> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/ParserFunctionNode.scala
>> >>>>> >>> [2]
>> >>>>> >>> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/PropertyNode.scala
>> >>>>> >>>
>> >>>>> >>> >
>> >>>>> >>> > Best.
>> >>>>> >>> >
>> >>>>> >>> > Julien.
>> >>>>> >>> >
>> >>>>> >>> >
>> >>>>> >>> > 2013/4/22 Jona Christopher Sahnwaldt <[email protected]>
>> >>>>> >>> >>
>> >>>>> >>> >> The templates where data is stored are not used directly in
>> >>>>> >>> >> the main pages. It's a complicated process: page Toulouse uses
>> >>>>> >>> >> template X, X uses Y, Y uses Z, and Z contains the data.
>> >>>>> >>> >> Something like that, I'm not 100% sure, but the details don't
>> >>>>> >>> >> matter. This means that wikiPageUsesTemplate and
>> >>>>> >>> >> InfoboxExtractor won't help.
>> >>>>> >>> >>
>> >>>>> >>> >> Generating a separate file is probably the best idea. We
>> >>>>> >>> >> could also send these new triples to the main mapping based
>> >>>>> >>> >> file, but that might be confusing: first, they're not mapping
>> >>>>> >>> >> based; second, new triples about a city would be added in a
>> >>>>> >>> >> completely different place in the file. (That's not a big
>> >>>>> >>> >> problem though.)
>> >>>>> >>> >>
>> >>>>> >>> >> Cheers,
>> >>>>> >>> >> JC
>> >>>>> >>> >
>> >>>>> >>> >
>> >>>>> >>
>> >>>>> >>
>> >>>>> >
>> >>>>
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Kontokostas Dimitris
>> >>
>> >>
>> >
>
>

_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
