[OSM-dev] Split osm line with perl

2009-11-29 Thread Maarten Deen
Does anyone have an idea (or is there already a routine) how to split a line of
an osm file into its respective keys and values?

I've tried a few things, but I'm not fluent in perl. My problem at the moment
is that splitting a line on the space character seems logical, but you run into
problems if a value has a space in it.
So splitting something like <tag k="name" v="foo bar"/> will split the value
"foo bar" also.

Regards,
Maarten

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] Split osm line with perl

2009-11-29 Thread Lennard
Maarten Deen wrote:

> I've tried a few things, but I'm not fluent in perl. My problem at the moment
> is that splitting a line on the space character seems logical, but you run into
> problems if a value has a space in it.
> So splitting something like <tag k="name" v="foo bar"/> will split the value
> "foo bar" also.

You have to be fluent in regexes, not perl as such. The trick is to 
match the quote, then to match anything that is not a quote, followed by 
a quote.

#!/usr/bin/perl

my $str = "<tag k=\"name\" v=\"foo bar\"/>";

$str =~ /k="([^"]*)" v="([^"]*)"/;
my ($k, $v) = ($1, $2);

print "k = '$k', v = '$v'\n";


Difficulty: are values with quotes allowed in k/v pairs?


-- 
Lennard



Re: [OSM-dev] Split osm line with perl

2009-11-29 Thread Dave Stubbs
On Sun, Nov 29, 2009 at 11:41 AM, Lennard <l...@xs4all.nl> wrote:
> Maarten Deen wrote:
>
>> I've tried a few things, but I'm not fluent in perl. My problem at the
>> moment is
>> that splitting a line on the space character seems logical, but you run into
>> problems if a value has a space in it.
>> So splitting something like <tag k="name" v="foo bar"/> will split the value
>> "foo bar" also.
>
> You have to be fluent in regexes, not perl as such. The trick is to
> match the quote, then to match anything that is not a quote, followed by
> a quote.


And then hope that the attributes are in the order you're expecting,
and that the XML has used " rather than '. And in the example code
given below hope that only one space was used.

If you know the OSM XML source that's probably not such a massive issue.
Don't forget to unescape any XML entities in the key or value as well.
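
For illustration, the caveats above (attribute order, quote character, spacing) can be sidestepped with a slightly more general regex. This is only a sketch, assuming each element sits on a single line; parse_attrs is a hypothetical helper name, and entities in the values are left undecoded.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collect every name="value" or name='value' pair
# found on one line, regardless of attribute order or spacing.
sub parse_attrs {
    my ($line) = @_;
    my %attr;
    # \2 refers back to whichever quote character opened the value,
    # so a value may contain the other kind of quote.
    while ($line =~ /([\w:.-]+)\s*=\s*(["'])(.*?)\2/g) {
        $attr{$1} = $3;
    }
    return %attr;
}

my %a = parse_attrs(q{<tag   v='foo bar'  k="name"/>});
print "k = '$a{k}', v = '$a{v}'\n";
```

It still breaks on elements that span several lines, which is one of the things a real parser takes care of.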

The real trick of course is to use an XML parser which handles all of
this for you.

Dave



Re: [OSM-dev] Split osm line with perl

2009-11-29 Thread Lennard
Dave Stubbs wrote:

> And then hope that the attributes are in the order you're expecting,
> and that the XML has used " rather than '. And in the example code
> given below hope that only one space was used.

Adding that to the example code I gave would detract from the actual 
matching of what he asked, since he seemed to be stuck on using 'split' 
instead of a regex.

I know it isn't ideal, given the way XML works, but he'll have to work 
out those difficult bits for himself. ;-)

-- 
Lennard



Re: [OSM-dev] Split osm line with perl

2009-11-29 Thread Simone Cortesi
On Sun, Nov 29, 2009 at 12:16, Maarten Deen <md...@xs4all.nl> wrote:
> I've tried a few things, but I'm not fluent in perl. My problem at the moment
> is
> that splitting a line on the space character seems logical, but you run into
> problems if a value has a space in it.

Wouldn't it be wiser to use a DOM/XML parser, which can interpret XML natively?

I did something similar to what you are trying to accomplish using PHP's
XML parser a while ago.

-- 
-S



Re: [OSM-dev] Split osm line with perl

2009-11-29 Thread Frederik Ramm
Hi,

Dave Stubbs wrote:
> The real trick of course is to use an XML parser which handles all of
> this for you.

And decodes UTF8 along the way, and all this in just a few hours if 
you're lucky (if you happen to use the native Perl XML parsing which is 
used by default and without warning unless you have proper modules 
installed, then parsing a planet may also take the better part of a day).

It all depends on what you want to do. If you have a planet file and 
quickly want to count how many different values of a certain key are 
used therein, nothing beats a quick curse on the shell command line, or 
a mini Perl script.

Bye
Frederik

-- 
Frederik Ramm  ##  eMail frede...@remote.org  ##  N49°00'09" E008°23'33"



Re: [OSM-dev] Split osm line with perl

2009-11-29 Thread Anthony
On Sun, Nov 29, 2009 at 6:41 AM, Lennard <l...@xs4all.nl> wrote:
> Difficulty: are values with quotes allowed in k/v pairs?

Yes, but they are escaped (usually as &quot;).
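
As a rough sketch of what undoing that escaping could look like after a regex match: the code below decodes the five entities predefined by XML plus numeric character references. unescape_xml is a hypothetical helper name, and the sample string is made up.

```perl
use strict;
use warnings;

# The five entities predefined by XML.
my %entity = (quot => '"', apos => "'", lt => '<', gt => '>', amp => '&');

# Hypothetical helper: decode predefined entities and numeric character
# references in one pass, so "&amp;quot;" becomes "&quot;" and is not
# decoded a second time.
sub unescape_xml {
    my ($s) = @_;
    $s =~ s{&(?:#x([0-9A-Fa-f]+)|#(\d+)|(\w+));}{
        defined $1 ? chr(hex $1)
      : defined $2 ? chr($2)
      : exists $entity{$3} ? $entity{$3} : "&$3;"
    }ge;
    return $s;
}

print unescape_xml('Joe&apos;s &quot;Diner&quot; &amp; Bar'), "\n";
```

Unknown named entities are left untouched here; a real parser would report them as errors.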



Re: [OSM-dev] Split osm line with perl

2009-11-29 Thread Anthony
On Sun, Nov 29, 2009 at 6:41 AM, Lennard <l...@xs4all.nl> wrote:
> $str =~ /k="([^"]*)" v="([^"]*)"/;

If you don't care about order or number of spaces or anything like
that, a simple if ($line =~ m{^<tag k="(.*)" v="(.*)" />$}) will
do.  The code I gave was actually for parsing <changeset> and <node>
tags, which weren't as uniformly formatted as the <tag> tags, so my
attempts at parsing them using regular expressions got way too
complicated (it might be possible with backreferences and such, but
split worked a lot better).



Re: [OSM-dev] Split osm line with perl

2009-11-29 Thread Ævar Arnfjörð Bjarmason
On Sun, Nov 29, 2009 at 12:10, Simone Cortesi <sim...@cortesi.com> wrote:
> On Sun, Nov 29, 2009 at 12:16, Maarten Deen <md...@xs4all.nl> wrote:
>> I've tried a few things, but I'm not fluent in perl. My problem at the
>> moment is
>> that splitting a line on the space character seems logical, but you run into
>> problems if a value has a space in it.
>
> wouldnt be wiser to use a DOM/XML parser. which is native able to interpret
> XML?

Yes it would. Unfortunately some Perl programmers seem to be unaware
of the existence of CPAN and insist on solving non-trivial problems
like XML parsing over and over again with the wrong tools, namely
regular expressions.

If you want a Perl one-liner to get all tag values from an OSM file,
here's one on the house that isn't insane:

perl -CI -MXML::Parser -E 'my $x = XML::Parser->new(Handlers => {
Start => sub { my ($p, $e, %kv) = @_; return unless $e eq "tag";
say "$kv{k} = $kv{v}" } }); $x->parse(*STDIN)' < File.osm

This could probably be done in an easier way using something higher level
than XML::Parser (which is just a raw interface to expat) but I'm not
that familiar with Perl XML parsing. If I were to acquaint myself with
it I'd be sure not to start by writing the millionth buggy tagsoup
parser using regexes though.



Re: [OSM-dev] Split osm line with perl

2009-11-29 Thread Frederik Ramm
Hi,

Ævar Arnfjörð Bjarmason wrote:
> If I were to acquaint myself with
> it I'd be sure not to start by writing the millionth buggy tagsoup
> parser using regexes though.

As I said, a good craftsperson will know all available tools and choose 
the one that best suits the job, and not disregard a whole family of 
tools just because he believes them to be inferior (or uncool).

If you are dealing with the kind of XML emitted by Osmosis, you can make 
assumptions about the structure. Assumptions that will break if you try 
to deal with other files of course, but assumptions that make things 
faster as long as you stay within the envelope.

Assume you want to count the different values for the highway tag.

The following millionth buggy tagsoup parser (which anyone familiar with
Perl can write without looking up the details of an XML parser library)
does this for Germany in 108 seconds:

perl -e 'while(<>) {$count{$1}++ if (/<tag k="highway" v="([^"]*)"/); };
foreach (sort { $count{$b}<=>$count{$a}} keys %count) {
printf "%6d %s\n",$count{$_},$_; }'

Your XML parser based code into which I injected the same counting routine,

perl -CI -MXML::Parser -E 'my $x = XML::Parser->new(Handlers => {
Start => sub { my ($p, $e, %kv) = @_; return unless $e eq "tag";
return unless $kv{k} eq "highway"; $count{$kv{v}}++; } });
$x->parse(*STDIN);
foreach (sort { $count{$b}<=>$count{$a}} keys %count) {
printf "%6d %s\n",$count{$_},$_; }'

arrives at the same result in 915 seconds, roughly 8.5 times as long.

Yes, the primitive version will choke if there's a line break or if
someone uses ' instead of "; it doesn't decode UTF-8 properly and it
will not resolve entities. Your version does all this, and precisely
because it does, takes more than eight times as long.

A good programmer should be aware of this, and not pay for the XML 
parser bells and whistles if he doesn't need them.
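
Spelled out as a small script, the regex approach looks roughly like the sketch below. It rests on the same assumptions as the one-liner (Osmosis-style output: one <tag .../> per line, double quotes, k before v); count_values is a hypothetical helper name and the sample data is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: count the values of one key, assuming one
# <tag .../> element per line, double quotes, and k before v.
sub count_values {
    my ($fh, $key) = @_;
    my %count;
    while (my $line = <$fh>) {
        # \Q...\E keeps regex metacharacters in the key literal.
        $count{$1}++ if $line =~ /<tag k="\Q$key\E" v="([^"]*)"/;
    }
    return \%count;
}

# Demo on an in-memory sample; for real data, pass \*STDIN instead.
my $sample = <<'XML';
  <tag k="highway" v="residential"/>
  <tag k="name" v="foo bar"/>
  <tag k="highway" v="residential"/>
  <tag k="highway" v="primary"/>
XML
open my $fh, '<', \$sample or die "cannot open sample: $!";
my $count = count_values($fh, 'highway');

# Most frequent value first; ties are broken alphabetically.
for my $v (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
    printf "%6d %s\n", $count->{$v}, $v;
}
```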

I may be a bit old-fashioned but I had to take exception to the
arrogance that spoke from your post. It is exactly that kind of attitude
that I often see in young programmers: "I implemented this by the book
and it doesn't go any faster." - "But how do you expect us to run this
on a nightly basis when your code takes 28 hours to run?" - "Use more
machines, dude. Never heard of map/reduce?" - and all that because they
are too snotnosed to parse XML with a regex if required.

I'm not calling for premature optimisation, and nothing would be more 
stupid than trying to parse a 100-line user-written config file with 
anything else than a proper and tested XML parser. But discounting 
regex-based XML parsing outright, without having some knowledge about 
the cost incurred, is imprudent, and does not go well with the air of 
superiority that you gave off.

Bye
Frederik

-- 
Frederik Ramm  ##  eMail frede...@remote.org  ##  N49°00'09" E008°23'33"



Re: [OSM-dev] Split osm line with perl

2009-11-29 Thread Ævar Arnfjörð Bjarmason
On Sun, Nov 29, 2009 at 18:43, Frederik Ramm <frede...@remote.org> wrote:
> Ævar Arnfjörð Bjarmason wrote:
>
>> If I were to acquaint myself with
>> it I'd be sure not to start by writing the millionth buggy tagsoup
>> parser using regexes though.
>
> As I said, a good craftsperson will know all available tools and choose the
> one that best suits the job, and not disregard a whole family of tools just
> because he believes them to be inferior (or uncool).
>
> If you are dealing with the kind of XML emitted by Osmosis, you can make
> assumptions about the structure. Assumptions that will break if you try to
> deal with other files of course, but assumptions that make things faster as
> long as you stay within the envelope.
>
> Assume you want to count the different values for the highway tag.
>
> The following millionth buggy tagsoup parser (which anyone familiar with Perl
> can write without looking up the details of an XML parser library) does
> this for Germany in 108 seconds:
>
> perl -e 'while(<>) {$count{$1}++ if (/<tag k="highway" v="([^"]*)"/); };
> foreach (sort { $count{$b}<=>$count{$a}} keys %count) {
> printf "%6d %s\n",$count{$_},$_; }'
>
> Your XML parser based code into which I injected the same counting routine,
>
> perl -CI -MXML::Parser -E 'my $x = XML::Parser->new(Handlers => {
> Start => sub { my ($p, $e, %kv) = @_; return unless $e eq "tag";
> return unless $kv{k} eq "highway"; $count{$kv{v}}++; } });
> $x->parse(*STDIN);
> foreach (sort { $count{$b}<=>$count{$a}} keys %count) {
> printf "%6d %s\n",$count{$_},$_; }'
>
> arrives at the same result in 915 seconds, roughly 8.5 times as long.
>
> Yes, the primitive version will choke if there's a line break or if someone
> uses ' instead of "; it doesn't decode UTF-8 properly and it will not
> resolve entities. Your version does all this, and precisely because it does,
> takes more than eight times as long.
>
> A good programmer should be aware of this, and not pay for the XML parser
> bells and whistles if he doesn't need them.
>
> I may be a bit old-fashioned but I had to take exception to the arrogance
> that spoke from your post. It is exactly that kind of attitude that I often
> see in young programmers: "I implemented this by the book and it doesn't go
> any faster." - "But how do you expect us to run this on a nightly basis when
> your code takes 28 hours to run?" - "Use more machines, dude. Never heard of
> map/reduce?" - and all that because they are too snotnosed to parse XML with
> a regex if required.
>
> I'm not calling for premature optimisation, and nothing would be more stupid
> than trying to parse a 100-line user-written config file with anything else
> than a proper and tested XML parser. But discounting regex-based XML parsing
> outright, without having some knowledge about the cost incurred, is
> imprudent, and does not go well with the air of superiority that you gave
> off.

I think Perl's regex engine is cool, in fact if you're using it you're
using my code.

However when a self-admitted Perl newbie starts a thread saying he's
already split up an XML file by lines and inquires about how he can
parse those lines it's worth stepping back and asking if that's really
the approach he wants to be taking. In most cases the answers given in
this thread are the right answers to the wrong question.

Admittedly my response was a bit snotty mostly because I've spent
untold hours maintaining large swaths of Perl code which for no good
reason reinvented something for which there was a perfectly good
library in a buggy manner with no documentation.

A lot of Perl programmers really do have no idea how to use CPAN
judging by the amount of code they churn out which duplicates
well-known and tested CPAN modules with their own badly reinvented
wheels.

Of course there are cases where the libraries aren't sufficient as you
rightly point out but nothing about Maarten's question indicated that
this was the case. Sometimes you have to dig yourself into the hole of
implementing & maintaining your own tagsoup parser but I wouldn't help
a newbie dig that hole for himself unless I was certain that was what
he really needed.

And by the way your program would be slightly faster if you used
"(.*?)" instead of "([^"]*)". Minimally greedy matching is faster than
using negated character classes.
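
Which of the two patterns wins may well depend on the Perl version and on the data, so it is probably worth measuring on your own input. A minimal sketch using the core Benchmark module; the sample line and the sub names are made up for illustration.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample line; real timings depend on the actual data.
my $line = qq{    <tag k="highway" v="residential"/>\n};

# The two candidate patterns, each returning the captured value.
sub negated { $line =~ /<tag k="highway" v="([^"]*)"/ ? $1 : undef }
sub lazy    { $line =~ /<tag k="highway" v="(.*?)"/   ? $1 : undef }

# Both patterns should capture the same value on well-formed input.
my $same = negated() eq lazy();
print "captures ", ($same ? "match" : "differ"), "\n";

# Run each sub for at least one CPU second and print a comparison table.
cmpthese(-1, { negated => \&negated, lazy => \&lazy });
```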
