Comments inline as usual:
On 1/29/13 12:50 PM, "Andy Seaborne" <[email protected]> wrote:
There aren't many compatibility issues at the moment and preserving the
name as the interface would be better compatibility anyway. IMO The
interface name is more important then implementation name so I think
that it gets priority.
1/ shall we make the interface PrefixMap, and call current PrefixFix
PrefixFixStd, and have a factory?
Yes this is what I would like to do, I didn't do this initially since I
wanted to avoid breaking anything to start with. Since we're doing a
minor version bump I think the change is acceptable. I will aim to get
this done this afternoon.
Other future implementations include one for XML-valid prefixing for
example as Turtle and XML are diverging.
The cost of finding a perfect Turtle one vs fast may become significant
as well. There are escapes in the local par tof a prefix name in
Turtle-1.1.
Yes that's an interesting challenge, I may refactor the code slightly to
make it easier to swap in and out different local name validation
routines.
The abbrevKey used in FastPrefixMap is the URI upto the last '#' or
failing that '/'.
This is a prefix look up only when two prefix URI abbrevKeys share a
prefix.
This is rare (I claim!). I added some tests to
AbstractTestLightweightPrefixMap to investigate ...
and the two implementations differ here. The FastPrefixMap does not
abbreviate where the old code does.
This was the case when the IRI was precisely the namespace IRI so you
expect an abbreviation of prefix: ?
I added a fix for that already after I saw your extra tests.
I have an idea for a better fix which is to add a couple of additional
methods to the Trie class which will find either the shortest or longest
match possible. That would give behaviour more akin to the current
PrefixMap implementation.
Although to be honest the fact that the Map based implementation picks the
shorter over the longer abbreviation is going to be down to luck and the
underlying Map implementation.
2/ In the general purpose PrefixMapStd class, we could keep a Map from
registered URI string to prefix to speed up that common case.
A Map is smaller than the Trie (which has a map at each level, so
http:// is 7 maps). It does not optimize cases where one prefix
possibility is a prefix of another, and it falls back to brute force
here. Does this case matter? (Using the abbrevKey and stripping back
the possibilities would be a possibility.)
My proposed additions to Trie and modifying the code to use that should
take care of that. I suspect that the brute force approach is likely not
necessary once the other changes are in place.
This all gets a lot more complicated for RDF 1.1 where the local part
can have escaped / and # in it.
Yes, that'll be a bridge to cross when we get to it.
Over-engineering 1:
Delay the Tries creation until the first call of abbrev or at the end of
parsing (the case of many small files concatenated might have had a lot
of prefixes)
Or build the Trie during output only? if there is an abbrevKey, find
the abbreviate old style, and install it. This is using in a cache
fashion.
I will likely do some refactoring to make Trie sparse where possible.
Otherwise the suggestion of OutputPrefixMap looks good.
Over-engineering 2:
I thought a bit about two interfaces - one for parsers, one for writers
but it seem to be quite complicated for little or no value. What's more
a prefix map may start by being used for parsing and then used for
writing (if and when old PrefixMapping gets sorted which would be good
at Jena3 - it's XML centric).
Yes I think the only real difference is that one would have expand() and
the other abbreviate() but otherwise they would be identical so seems odd
to separate.
I'm happy to put time in to help with changes.
Let me have a go at refactoring a bit based on your suggestions and my
ideas and then you can review the changes and see what further suggestions
or room for improvements you have.
Rob
Andy