[OSM-dev] Split osm line with perl
Does anyone have an idea (or is there already a routine) for splitting a line of an OSM file into its respective keys and values? I've tried a few things, but I'm not fluent in Perl. My problem at the moment is that splitting a line on the space character seems logical, but you run into problems if a value has a space in it. So splitting something like <tag k="name" v="foo bar"/> will also split the value "foo bar".

Regards,
Maarten

___ dev mailing list dev@openstreetmap.org http://lists.openstreetmap.org/listinfo/dev
Re: [OSM-dev] Split osm line with perl
Maarten Deen wrote:
> I've tried a few things, but I'm not fluent in perl. My problem at the moment is that splitting a line on the space character seems logical, but you run into problems if a value has a space in it. So splitting something like <tag k="name" v="foo bar"/> will split the value foo bar also.

You have to be fluent in regexes, not Perl as such. The trick is to match the opening quote, then anything that is not a quote, followed by the closing quote.

    #!/usr/bin/perl
    my $str = "<tag k=\"name\" v=\"foo bar\"/>";
    # [^"]* matches everything up to the closing quote, spaces included.
    if ($str =~ /k="([^"]*)" v="([^"]*)"/) {
        my ($k, $v) = ($1, $2);
        print "k = '$k', v = '$v'\n";
    }

Difficulty: are values with quotes allowed in k/v pairs?

--
Lennard
Re: [OSM-dev] Split osm line with perl
On Sun, Nov 29, 2009 at 11:41 AM, Lennard <l...@xs4all.nl> wrote:
> Maarten Deen wrote:
>> I've tried a few things, but I'm not fluent in perl. My problem at the moment is that splitting a line on the space character seems logical, but you run into problems if a value has a space in it. So splitting something like <tag k="name" v="foo bar"/> will split the value foo bar also.
>
> You have to be fluent in regexes, not perl as such. The trick is to match the quote, then to match anything that is not a quote, followed by a quote.

And then hope that the attributes are in the order you're expecting, and that the XML has used " rather than '. And, in the example code given, hope that only one space was used. If you know the OSM XML source, that's probably not such a massive issue.

Don't forget to unescape any XML entities in the key or value as well.

The real trick, of course, is to use an XML parser, which handles all of this for you.

Dave
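For anyone who still wants the regex route, here is a minimal sketch of what the caveats above imply. It is not from the thread: the helper names `unescape_xml` and `parse_attrs` and the sample line are made up for illustration, and it assumes each element sits on a single line. It accepts attributes in any order, either quote style, and arbitrary whitespace, and decodes only the five predefined XML entities plus numeric character references.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: decode the five predefined XML entities and
# numeric character references in a single pass.
sub unescape_xml {
    my ($s) = @_;
    my %ent = (quot => '"', apos => "'", lt => '<', gt => '>', amp => '&');
    $s =~ s/&(?:#x([0-9A-Fa-f]+)|#([0-9]+)|(quot|apos|lt|gt|amp));/
        defined $1 ? chr(hex $1) : defined $2 ? chr($2) : $ent{$3}/ge;
    return $s;
}

# Hypothetical helper: return every attribute of a single-line element
# as a hash, regardless of attribute order, quote style or spacing.
sub parse_attrs {
    my ($line) = @_;
    my %attr;
    while ($line =~ /([\w:.-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g) {
        # Copy the captures before calling a sub that runs its own regex.
        my ($name, $dq, $sq) = ($1, $2, $3);
        $attr{$name} = unescape_xml(defined $dq ? $dq : $sq);
    }
    return %attr;
}

# Made-up sample: v before k, single quotes, extra spaces, entities.
my %tag = parse_attrs(q{<tag  v='foo &quot;bar&quot; &amp; baz'   k="name" />});
print "k = '$tag{k}', v = '$tag{v}'\n";
```

This still breaks on elements that span several lines and on anything that is not plain attribute syntax (comments, CDATA), which is the point made above about using a real XML parser.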
Re: [OSM-dev] Split osm line with perl
Dave Stubbs wrote:
> And then hope that the attributes are in the order you're expecting, and that the XML has used " rather than '. And in the example code given below hope that only one space was used.

Adding that to the example code I gave would detract from the actual matching, which is what he asked about, since he seemed to be stuck on using 'split' instead of a regex. I know it isn't ideal, given the way XML works, but he'll have to work out those difficult bits for himself. ;-)

--
Lennard
Re: [OSM-dev] Split osm line with perl
On Sun, Nov 29, 2009 at 12:16, Maarten Deen <md...@xs4all.nl> wrote:
> I've tried a few things, but I'm not fluent in perl. My problem at the moment is that splitting a line on the space character seems logical, but you run into problems if a value has a space in it.

Wouldn't it be wiser to use a DOM/XML parser, which can interpret XML natively? I did something like what you are trying to accomplish using the PHP XML parser a while ago.

--
-S
Re: [OSM-dev] Split osm line with perl
Hi,

Dave Stubbs wrote:
> The real trick of course is to use an XML parser which handles all of this for you.

And decodes UTF-8 along the way, and all this in just a few hours if you're lucky. (If you happen to use the native Perl XML parsing, which is used by default and without warning unless you have the proper modules installed, then parsing a planet may also take the better part of a day.)

It all depends on what you want to do. If you have a planet file and quickly want to count how many different values of a certain key are used therein, nothing beats a quick curse on the shell command line, or a mini Perl script.

Bye
Frederik

--
Frederik Ramm ## eMail frede...@remote.org ## N49°00'09" E008°23'33"
Re: [OSM-dev] Split osm line with perl
On Sun, Nov 29, 2009 at 6:41 AM, Lennard <l...@xs4all.nl> wrote:
> Difficulty: are values with quotes allowed in k/v pairs?

Yes, but they are escaped (usually as &quot;).
Re: [OSM-dev] Split osm line with perl
On Sun, Nov 29, 2009 at 6:41 AM, Lennard <l...@xs4all.nl> wrote:
> $str =~ /k="([^"]*)" v="([^"]*)"/;

If you don't care about order or number of spaces or anything like that, a simple

    if ($line =~ /^<tag k="(.*)" v="(.*)" \/>$/)

will do. The code I gave was actually for parsing changeset and node tags, which weren't as uniformly formatted as the tag tags, so my attempts at parsing them using regular expressions got way too complicated (it might be possible with backreferences and such, but split worked a lot better).
Re: [OSM-dev] Split osm line with perl
On Sun, Nov 29, 2009 at 12:10, Simone Cortesi <sim...@cortesi.com> wrote:
> On Sun, Nov 29, 2009 at 12:16, Maarten Deen <md...@xs4all.nl> wrote:
>> I've tried a few things, but I'm not fluent in perl. My problem at the moment is that splitting a line on the space character seems logical, but you run into problems if a value has a space in it.
>
> wouldnt be wiser to use a DOM/XML parser. which is native able to interpret XML?

Yes, it would. Unfortunately, some Perl programmers seem to be unaware of the existence of CPAN and insist on solving non-trivial problems like XML parsing over and over again with the wrong tools, namely regular expressions.

If you want a Perl one-liner to get all tag values from an OSM file, here's one on the house that isn't insane:

    perl -CI -MXML::Parser -E 'my $x = XML::Parser->new(Handlers => { Start => sub { my ($p, $e, %kv) = @_; return unless $e eq "tag"; say "$kv{k} = $kv{v}" } }); $x->parse(*STDIN)' < File.osm

This could probably be done in an easier way using something higher level than XML::Parser (which is just a raw interface to expat), but I'm not that familiar with Perl XML parsing. If I were to acquaint myself with it, I'd be sure not to start by writing the millionth buggy tagsoup parser using regexes though.
Re: [OSM-dev] Split osm line with perl
Hi,

Ævar Arnfjörð Bjarmason wrote:
> If I were to acquaint myself with it I'd be sure not to start by writing the millionth buggy tagsoup parser using regexes though.

As I said, a good craftsperson will know all available tools and choose the one that best suits the job, and not disregard a whole family of tools just because he believes them to be inferior (or uncool).

If you are dealing with the kind of XML emitted by Osmosis, you can make assumptions about the structure. Assumptions that will break if you try to deal with other files, of course, but assumptions that make things faster as long as you stay within the envelope.

Assume you want to count the different values for the highway tag. The following "millionth buggy tagsoup parser" (which anyone familiar with Perl can write without looking up the details of an XML parser library) does this for Germany in 108 seconds:

    perl -e 'while(<>) {$count{$1}++ if (/<tag k="highway" v="([^"]*)"/); }; foreach (sort { $count{$b}<=>$count{$a}} keys %count) { printf "%6d %s\n",$count{$_},$_; }'

Your XML parser based code, into which I injected the same counting routine,

    perl -CI -MXML::Parser -E 'my $x = XML::Parser->new(Handlers => { Start => sub { my ($p, $e, %kv) = @_; return unless $e eq "tag"; return unless $kv{k} eq "highway"; $count{$kv{v}}++; } }); $x->parse(*STDIN); foreach (sort { $count{$b}<=>$count{$a}} keys %count) { printf "%6d %s\n",$count{$_},$_; }'

arrives at the same result in 915 seconds; that is roughly 8.5 times the run time.

Yes, the primitive version will choke if there's a line break or if someone uses ' instead of "; it doesn't decode UTF-8 properly and it will not resolve entities. Your version does all this and, precisely because it does, takes more than eight times longer. A good programmer should be aware of this, and not pay for the XML parser bells and whistles if he doesn't need them.

I may be a bit old-fashioned, but I had to take exception to the arrogance that spoke from your post. It is exactly that kind of attitude that I often see in young programmers: "I implemented this by the book and it doesn't go any faster." - "But how do you expect us to run this on a nightly basis when your code takes 28 hours to run?" - "Use more machines, dude. Never heard of map/reduce?" - and all that because they are too snotnosed to parse XML with a regex if required.

I'm not calling for premature optimisation, and nothing would be more stupid than trying to parse a 100-line user-written config file with anything other than a proper and tested XML parser. But discounting regex-based XML parsing outright, without having some knowledge about the cost incurred, is imprudent, and does not go well with the air of superiority that you gave off.

Bye
Frederik

--
Frederik Ramm ## eMail frede...@remote.org ## N49°00'09" E008°23'33"
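The regex-based counting above can also be written out as a small script, which is easier to try on a few lines before pointing it at a large extract. This is only a sketch under the same assumptions as the one-liner (Osmosis-style output: one element per line, double-quoted attributes, k before v); the function name `count_values` and the sample lines are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count the values of one key across lines of
# Osmosis-style OSM XML (one element per line, double quotes, k before v).
sub count_values {
    my ($key, @lines) = @_;
    my %count;
    for my $line (@lines) {
        # \Q...\E quotes the key so it is matched literally.
        $count{$1}++ if $line =~ /<tag k="\Q$key\E" v="([^"]*)"/;
    }
    return %count;
}

# Made-up sample data standing in for lines read from an .osm file.
my @sample = (
    '  <tag k="highway" v="residential"/>',
    '  <tag k="name" v="Main Street"/>',
    '  <tag k="highway" v="primary"/>',
    '  <tag k="highway" v="residential"/>',
);

my %count = count_values('highway', @sample);

# Most frequent value first; ties are broken alphabetically.
for my $value (sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count) {
    printf "%6d %s\n", $count{$value}, $value;
}
```

To run it over a real file, the sample array could be replaced by reading lines from standard input, at which point it does the same work as the one-liner.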
Re: [OSM-dev] Split osm line with perl
On Sun, Nov 29, 2009 at 18:43, Frederik Ramm <frede...@remote.org> wrote:
> As I said, a good craftsperson will know all available tools and choose the one that best suits the job, and not disregard a whole family of tools just because he believes them to be inferior (or uncool).
>
> [...]
>
> But discounting regex-based XML parsing outright, without having some knowledge about the cost incurred, is imprudent, and does not go well with the air of superiority that you gave off.

I think Perl's regex engine is cool; in fact, if you're using it you're using my code. However, when a self-admitted Perl newbie starts a thread saying he's already split up an XML file by lines and asks how he can parse those lines, it's worth stepping back and asking whether that's really the approach he wants to be taking. In most cases the answers given in this thread are the right answers to the wrong question.

Admittedly my response was a bit snotty, mostly because I've spent untold hours maintaining large swaths of Perl code which, for no good reason, reinvented something for which there was a perfectly good library, in a buggy manner and with no documentation. A lot of Perl programmers really do have no idea how to use CPAN, judging by the amount of code they churn out which duplicates well-known and tested CPAN modules with their own badly reinvented wheels.

Of course there are cases where the libraries aren't sufficient, as you rightly point out, but nothing about Maarten's question indicated that this was the case. Sometimes you have to dig yourself into the hole of implementing and maintaining your own tagsoup parser, but I wouldn't help a newbie dig that hole for himself unless I was certain that was what he really needed.

And by the way, your program would be slightly faster if you used (.*?) instead of ([^"]*). Minimally greedy matching is faster than using negated character classes.
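Whether (.*?) really beats ([^"]*) is easy to measure rather than argue about. The following is a minimal sketch using the core Benchmark module; the sample line is made up, and the relative speed will depend on the Perl version and on the data, so no particular result is assumed here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample line in the Osmosis style discussed in the thread.
my $line = '  <tag k="highway" v="residential"/>';

# On well-formed input both patterns should capture the same value.
my ($lazy)    = $line =~ /<tag k="highway" v="(.*?)"/;
my ($negated) = $line =~ /<tag k="highway" v="([^"]*)"/;
print "lazy: $lazy, negated: $negated\n";

# Run each pattern for at least one CPU second and print a comparison table.
cmpthese(-1, {
    lazy    => sub { my ($v) = $line =~ /<tag k="highway" v="(.*?)"/ },
    negated => sub { my ($v) = $line =~ /<tag k="highway" v="([^"]*)"/ },
});
```

A single short line is not representative of a whole planet file, so timing both one-liners on a real extract is the more convincing test.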