On Thu, Apr 28, 2011 at 9:02 AM, Eric Wolf <[email protected]> wrote:
> I'm writing a script to extract bits out of the OSM full-planet file.
> The full-planet differs from the regular planet file in several ways.
> One of the biggest is that it contains every version of every object
> ever. I want my script to be able to grab just the latest version up
> to a specified date (it also does extracts based on bbox). It's not hard
> to do but I want it to be as fast as possible.
>
> Right now I am making two passes through the file. The first pass, I
> build sets containing the unique ID for each feature I want to keep.
> The second pass outputs what's been selected to be kept. Two passes
> are necessary because nodes are listed before ways. I want to be able
> to grab every node in a way when the way is clipped by the bbox.
>
> The set works great but now I want to save the ID and proper version
> for each object to be kept. Sets are lightning fast, especially for
> simple membership testing across large sets. But now I have two values
> I want to cram into the set.
>
> One way I thought of is to "hash" the version into the ID. I could
> either append it to the end:
>
> ID=12345
> ver=6
>
> hash = 123456
>
> Another is to make the version a decimal:
>
> hash = 12345.6
>
> The least "cute" way to handle it would be to create pairs:
>
> hash = (12345, 6)
>
> I'm not a big fan of "cute tricks" in code. But I also want this to be
> fast. This is running over a file that's quickly approaching 500GB.
> The first option would seem to be fastest.
>
> Of course, the fastest of all would be to just chop ways at the
> bounding box and only run through the file once...
>
> Anyone have an opinion here?
>
> -Eric
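
For what it's worth, here is a rough comparison of the three encodings
above. The function names and the 16-bit width for the version field are
illustrative assumptions, not anything from the original script. It suggests
that plain concatenation and the decimal form can map different
(ID, version) pairs to the same key, while the tuple and a bit-shifted
integer cannot (as long as versions fit in the assumed width).

```python
# Sketch only: helper names and VERSION_BITS are assumptions for illustration.
VERSION_BITS = 16  # assumes no object has 65536 or more versions


def concat_key(obj_id, version):
    """Append the version digits to the ID digits (the first option)."""
    return int(f"{obj_id}{version}")


def decimal_key(obj_id, version):
    """Put the version after a decimal point (the second option)."""
    return float(f"{obj_id}.{version}")


def pair_key(obj_id, version):
    """Plain tuple (the third option)."""
    return (obj_id, version)


def packed_key(obj_id, version):
    """Integer key that stays reversible: ID in the high bits, version low."""
    return (obj_id << VERSION_BITS) | version


def unpack_key(key):
    """Recover (obj_id, version) from a packed_key value."""
    return key >> VERSION_BITS, key & ((1 << VERSION_BITS) - 1)


# Concatenation collides: ID 12345 v6 and ID 1234 v56 both give 123456.
assert concat_key(12345, 6) == concat_key(1234, 56)

# The decimal form collides too: version 6 and version 60 give the same float.
assert decimal_key(12345, 6) == decimal_key(12345, 60)

# Tuples and packed integers keep the two objects distinct.
assert pair_key(12345, 6) != pair_key(1234, 56)
assert packed_key(12345, 6) != packed_key(1234, 56)
assert unpack_key(packed_key(12345, 6)) == (12345, 6)
```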

I suspect any trick that doesn't yield integer keys won't buy you
much. Would a BTree, using id as key and version as value, help you?
The ZODB package includes very fast and efficient trees and sets and
doesn't require Zope: http://pypi.python.org/pypi/ZODB3/3.10.3. You
might profit from taking this question to Stack Overflow.
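
To make the id-as-key, version-as-value idea concrete, here is a minimal
sketch using a plain dict as a stand-in for the tree. The record layout
(id, version, timestamp) and the function names are assumptions for
illustration; a real first pass would pull these fields out of the planet
file as it parses. It also assumes version numbers start at 1.

```python
from datetime import datetime


def select_latest(records, cutoff):
    """First pass: map each object ID to its highest version at or before cutoff.

    `records` is a hypothetical iterable of (obj_id, version, timestamp) tuples.
    """
    keep = {}
    for obj_id, version, timestamp in records:
        # Default of 0 relies on the assumption that versions start at 1.
        if timestamp <= cutoff and version > keep.get(obj_id, 0):
            keep[obj_id] = version
    return keep


def is_selected(keep, obj_id, version):
    """Second pass: one integer-keyed lookup plus an integer comparison."""
    return keep.get(obj_id) == version


# Made-up sample data, only to show the shape of the two passes.
records = [
    (12345, 5, datetime(2010, 1, 1)),
    (12345, 6, datetime(2010, 6, 1)),
    (12345, 7, datetime(2011, 3, 1)),
    (777, 1, datetime(2009, 5, 1)),
]
keep = select_latest(records, cutoff=datetime(2010, 12, 31))
```

With this sample data, `keep` maps 12345 to version 6 and 777 to version 1,
so the second pass would emit only those versions. The keys stay plain
integers, and a tree that supports the same get/set mapping interface
could presumably be dropped in where the dict is.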

-- 
Sean
