Hi Bill,

Thank you for such a detailed analysis :)

With regard to how many PIDs might be seen in actual deployments,
this document has some measurement results from one specific
deployment:
  http://tools.ietf.org/html/rfc5632
Of course, that is one specific case.

Have you tried measuring how big the maps are after compression?
I'd be curious what the raw size differences look like in that case.
When we were doing the P4P trials, enabling compression on the HTTP
client and server was one of the first things our integration testing
guided us to do :)
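
(For reference, a rough sketch of what requesting gzip looks like on
the client side in plain Java; the class and method names here are
just illustrative, and the server -- or a front-end proxy -- still has
to be configured to compress responses:)

    // Rough sketch: ask for a gzip'd response, and transparently
    // unwrap it if the server actually compressed the body.
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.zip.GZIPInputStream;

    public class CompressedFetch {
        public static InputStream open(String mapUrl) throws Exception {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(mapUrl).openConnection();
            conn.setRequestProperty("Accept-Encoding", "gzip");
            InputStream in = conn.getInputStream();
            if ("gzip".equalsIgnoreCase(conn.getContentEncoding())) {
                in = new GZIPInputStream(in);
            }
            return in;
        }
    }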

I'm fine with changing the structure of the map if the WG really
wants to, but I can give a bit of rationale for the current one :)
One of the reasons we went with JSON (and a text encoding in
particular) was to make messages more human-readable to ease
debugging.  Having the map structured as "source -> { destination ->
cost }" makes it easy to determine the cost for a (source,
destination) pair by visual inspection alone.  As with your other
comments, I'd like to hear from others in the WG before making a
change to the base protocol (after all, it is a WG document :) ).

A couple of additional comments inline:

On Tue, Sep 13, 2011 at 7:35 AM, Bill Roome <[email protected]> wrote:
> How many PIDs do we expect an ALTO server to handle?
>
> I ask because I've discovered that code that works fine with a few hundred
> pids dies when I go to 5,000 pids.
>

I would expect most servers to have *some* upper bound given that
memory is not an infinite resource :)

> The biggest problem is the full cost map response message. With 5,000
> pids, that has 25 million cost entries. When I went to 5,000 pids, my
> off-the-shelf JSON parser ran out of memory. Granted I'm using java, so
> there's a big memory hit there. But I gave the JVM a 4 gigabyte heap, and
> it STILL ran out of memory.
>

The way I approached this problem was to use a SAX-style (streaming)
JSON parser, so the in-memory cost map can be populated incrementally
as the response is read, rather than building the whole document tree
first.
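
Roughly, with Jackson's streaming API (any SAX-style JSON parser
would be structured the same way; the class name is just for
illustration, and the field handling follows the current "map"
structure):

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonParser;
    import com.fasterxml.jackson.core.JsonToken;
    import java.io.InputStream;
    import java.util.HashMap;
    import java.util.Map;

    public class CostMapReader {
        // Walk the response token-by-token and fill a nested map,
        // without ever materializing the full JSON tree in memory.
        // (Sketch: assumes "map" only appears as the top-level field.)
        public static Map<String, Map<String, Double>> read(InputStream in)
                throws Exception {
            Map<String, Map<String, Double>> costs = new HashMap<>();
            JsonParser p = new JsonFactory().createParser(in);
            while (p.nextToken() != null) {
                // Skip ahead until we reach the "map" field.
                if (p.getCurrentToken() != JsonToken.FIELD_NAME
                        || !"map".equals(p.getCurrentName())) {
                    continue;
                }
                p.nextToken();                      // START_OBJECT
                while (p.nextToken() == JsonToken.FIELD_NAME) {
                    String src = p.getCurrentName();
                    Map<String, Double> row = new HashMap<>();
                    p.nextToken();                  // START_OBJECT
                    while (p.nextToken() == JsonToken.FIELD_NAME) {
                        String dst = p.getCurrentName();
                        p.nextToken();              // the numeric cost
                        row.put(dst, p.getDoubleValue());
                    }                               // ends on END_OBJECT
                    costs.put(src, row);
                }
            }
            return costs;
        }
    }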

A similar thing can be done on the server side: the response can be
generated on the fly (using HTTP chunked transfer encoding) from the
cost map stored in RAM.  Then only a small, fixed amount of memory is
used per connection, instead of buffering the full response before
sending it.
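
A matching sketch for the server side, again assuming Jackson;
writing straight to the response stream without setting a
Content-Length lets the HTTP layer fall back to chunked encoding:

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonGenerator;
    import java.io.OutputStream;
    import java.util.Map;

    public class CostMapWriter {
        // Stream the cost map out entry-by-entry; nothing larger
        // than the generator's internal buffer is held per connection.
        public static void write(Map<String, Map<String, Double>> costs,
                                 OutputStream out) throws Exception {
            JsonGenerator g = new JsonFactory().createGenerator(out);
            g.writeStartObject();
            g.writeObjectFieldStart("map");
            for (Map.Entry<String, Map<String, Double>> src
                     : costs.entrySet()) {
                g.writeObjectFieldStart(src.getKey());
                for (Map.Entry<String, Double> dst
                         : src.getValue().entrySet()) {
                    g.writeNumberField(dst.getKey(), dst.getValue());
                }
                g.writeEndObject();
            }
            g.writeEndObject();   // end "map"
            g.writeEndObject();
            g.flush();
        }
    }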

I would imagine that such things would need to be done if an
implementation or deployment really does wish to handle cost maps /
network maps of *any* size.

> Eventually I was able to parse a 5k-pid CostMap, but I had to completely
> re-write the JSON parser.
>
> So my first concern is that if we really expect servers to have 5,000
> pids, and if we expect clients to ask for full cost maps, then we need to
> warn client authors. Otherwise someone will write a simple client, test it
> with a simple ALTO server ... and then have it die when it encounters a
> production ALTO server.
>

Some of this probably boils down to testing practices :) I would hope
that integration tests make use of realistic data, and that a client
still fails gracefully when it encounters an ALTO server returning
much more data than was ever expected in testing.

It seems like we're getting quite a bit of feedback about
implementation-level aspects of the ALTO protocol. Do you think some
of these things would be better placed in the Deployment
Considerations document than in the protocol specification?

> My other concern is that the current JSON encoding of a full cost map is
> very large and inefficient. For 5,000 pids, a full cost map takes 417
> megs. And that's with relatively short pid names (pid_1 thru pid_5000).
> Granted that can be compressed, but it's still a lot of data.
>
> If we're concerned about maps with large pid counts, there is a much
> more efficient way to present the cost data. Instead of nested
> dictionaries with multiple repetitions of the pid names, we could use
> three arrays:
>
>        srcs
>        dests
>        costs
>
> "srcs" & "dests" would be arrays of pid names. The cost array would have
> nSrc x nDest costs. costs[0] would be the cost from srcs[0] to dests[0],
> costs[1] would be the cost from srcs[0] to dests[1], etc. That is, "costs"
> is the cost matrix in row-major order.
>
> For example, where the current encoding uses
>
>        "map":{
>                "pid1": {"pid3":1, "pid4":2},
>                "pid2": {"pid3":3, "pid4":4}
>        }
>
> the alternate encoding would be:
>
>        "srcs":  ["pid1","pid2"],
>        "dests": ["pid3","pid4"],
>        "costs": [1,2,3,4]
>
>
> That's not impressive for small pid counts, but it makes a big difference
> for large counts. For example, for 5,000 pids:
>
>        current encoding:   417 megs
>        alternate encoding: 148 megs
>
>
> What are your thoughts?

What I said above, and thanks again for your detailed feedback on the
protocol and implementation :)
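
(One small note on the alternate encoding: the client-side lookup
stays trivial, since the matrix is row-major. A sketch, where nDests
would be dests.length in your example:)

    // Hypothetical lookup into the proposed row-major encoding:
    // costs[i * nDests + j] is the cost from srcs[i] to dests[j].
    static double cost(double[] costs, int nDests, int i, int j) {
        return costs[i * nDests + j];
    }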

Rich

>
>        - Bill Roome