>> I am partly done with my Java version. There are a few
>> questions/problems/remarks:
>
> Is Java really up to this job from a performance point of view?
I haven't done any performance comparisons between planet.c and my
program, but I believe there won't be much of a difference. I could
"rip out" the history part and produce a "current planet only" program
to compare the speeds. I've taken care to select an XML writer that is
known for its performance. There is Aalto[1], but I've never used it
before, so I'm hesitant to rely on it. I could put the database reading
and the XML writing in two different threads, but again I'm not sure
whether that would help or hurt. I know this is quite a religious
topic, but I rarely see big differences in speed between C and Java,
especially in this case, as most of the time will probably be spent in
I/O.

>> - Is there a dump of the database available from just prior to the
>> switch from API 0.4 to 0.5? I could try to use that to merge the
>> history of the segments to the ways (as briefly discussed by
>> Frederik)
>
> There is a dump, but it's a mysql dump so not easily readable. There
> may be a planet around somewhere as well but I don't think it will be
> synchronised to the actual shutdown time or have any history.

If you'd be willing to share the mysql dump (I of course wouldn't need
user data or any other sensitive data), I'd try my best. It can't
hurt. A planet won't be as useful because the history is missing.

>> - Any information on the size (in rows) of the tables would be nice
>> (for testing purposes)
>
> It should be fairly obvious for the main tables as they just contain
> a row for each object.

For the main (current_*) tables, yes. But not for the history tables.
I have no estimate of how many versions there are. I could count the
current versions of all elements, but if you have a number, that'd be
great.

>> - What is the default_statistics_target for the columns/tables in
>> question? Are there any other options set that would affect the
>> query planner? I've seen the query planner make wildly inappropriate
>> decisions so I'll try to check if the statements I use will work.
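To make the two-thread idea concrete: the reader and the writer could be decoupled with a bounded queue, so the database read can run ahead while the XML serialisation drains. This is only a sketch under assumptions of mine; the class and the string "rows" are stand-ins for whatever element type the real program would enqueue.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: decouple database reading from XML writing with
// a bounded queue. A full queue blocks the reader, so memory stays
// bounded even if the writer is the slower side.
public class PipelineSketch {
    // Sentinel telling the writer that the reader is done.
    private static final String END = "\u0000END\u0000";

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

        Thread reader = new Thread(() -> {
            try {
                // Stand-in for the JDBC result-set loop.
                for (int id = 1; id <= 3; id++) {
                    queue.put("node " + id);
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread writer = new Thread(() -> {
            try {
                for (String row = queue.take(); !row.equals(END);
                        row = queue.take()) {
                    // Stand-in for serialising one element to XML.
                    System.out.println("wrote " + row);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        writer.start();
        reader.join();
        writer.join();
    }
}
```

Whether this helps in practice would still need measuring; if the job really is I/O-bound, the second thread mostly just overlaps the database wait with the output write.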
>> I used the same technique as planet.c and only adapted the queries
>> to versions and history tables.
>
> I'm not quite sure what you think knowing the value of that setting
> is going to help with. You only need to worry about optimising your
> queries if it turns out the planner gets them wrong but it's rarely a
> problem with Postgres especially with the kind of simple queries a
> dumper uses.

I've had problems in the past with exactly this: "easy" queries
resorting to thousands of seqscans. I just want to configure my
database as close as possible to the live one for tests. I don't think
there'll be problems either, but it doesn't hurt to check. The
statements will look something like this:

1) SELECT n.id, n.version, n.timestamp, n.changeset_id, c.user_id,
      n.visible, n.latitude, n.longitude
   FROM nodes n
   JOIN changesets c ON n.changeset_id = c.id
   ORDER BY n.id, n.version

2) SELECT id, version, k, v
   FROM node_tags
   ORDER BY id, version, k

Perhaps you could just check them?

>> - Do I have to take precautions in regards to database/machine/disk
>> load? I could do something like the Auto-Vacuum daemon[2] or
>> monitoring the load.
>
> Auto vacuum is on by default these days I believe. It's not something
> an ordinary user has any control over anyway.

That's not what I meant, but I was unclear :) I meant that I could use
the same "method" as the auto-vacuum daemon, which pauses regularly
(cost-based) to alleviate load. Just as an example.

Again: until now the program just reads from the DB and dumps to the
output stream, with no special concern for the statements or the
performance/load. I just want to pick the "low-hanging fruit" as early
as possible, and those were the questions I thought of. Some of them
(especially those about the query planner) came from problems I've
experienced with osmdoc.

In the end it is up to you (or whoever decides that) whether you want
to use my program, write one from scratch, or adapt planet.c to dump
the history.
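The cost-based pausing I have in mind could look roughly like the sketch below: count rows as they stream out of the result set and sleep briefly whenever a fixed "cost budget" is used up. The class name, budget, and delay values are all made up for illustration; the real numbers would have to be tuned against the live machine.

```java
// Hypothetical sketch of autovacuum-style cost-based pausing: after a
// fixed budget of rows, sleep briefly so the dump doesn't monopolise
// the disk. tick() would be called once per row in the result-set loop.
public class CostBasedThrottle {
    private final int rowsPerPause;
    private final long pauseMillis;
    private long rowsSincePause = 0;
    private long pausesTaken = 0;

    public CostBasedThrottle(int rowsPerPause, long pauseMillis) {
        this.rowsPerPause = rowsPerPause;
        this.pauseMillis = pauseMillis;
    }

    // Call once per row read from the result set.
    public void tick() throws InterruptedException {
        if (++rowsSincePause >= rowsPerPause) {
            rowsSincePause = 0;
            pausesTaken++;
            Thread.sleep(pauseMillis);
        }
    }

    public long pausesTaken() {
        return pausesTaken;
    }

    public static void main(String[] args) throws InterruptedException {
        CostBasedThrottle throttle = new CostBasedThrottle(1000, 1);
        for (int i = 0; i < 5000; i++) {
            throttle.tick(); // stand-in for processing one row
        }
        System.out.println("pauses taken: " + throttle.pausesTaken());
    }
}
```

One related point on the big queries themselves: the Postgres JDBC driver buffers the entire result set in memory unless autocommit is switched off and a fetch size is set on the statement, so the dump loop would need `conn.setAutoCommit(false)` plus `st.setFetchSize(...)` to stream rows through a cursor instead.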
I don't really care either way as long as the end result is that we
eventually have historical OSM data ;-)

Lars

_______________________________________________
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev