Re: General questions

Alexandre Bertails Mon, 11 May 2015 22:20:40 -0700

On Fri, May 8, 2015 at 6:20 AM, Stian Soiland-Reyes <[email protected]> wrote:
> I would also be happy with the createBlankNode(UUID) approach, which
> bypasses the issue of state, but hammers down the uniqueness
> constraint.

I actually didn't understand that we were discussing a
`createBlankNode(UUID)`. I think we just need to be able to create a
fresh blank node. Requiring the caller to provide an explicit UUID
means that the freshness is happening *outside* of the factory, so I
don't see the point. Also, it's forcing the strategy (UUID), which
might not be the best one for everybody, e.g. UUID is known to be
slow, at least for some notion of slow, and that could become a
bottleneck for parsing large files. I think that communicating with
the API using `String`s is fine as long as we focus on the invariants,
and that we have a test suite for checking those invariants.

> Assuming there is also a UUID uniqueReference() returned (or a
> contract-way to ascertain the String actually is a UUID, e.g.
> requiring a "urn:uuid:" prefix) then you can consistently "clone"
> foreign BlankNode impementations if you for some reason need to do so
> (there's been some discussion is this is really needer or not - see
> https://issues.apache.org/jira/browse/COMMONSRDF-15).

That is an interesting use-case, indeed.

> (We can still show the hashing pattern that results in a
> String-derived UUID in the user guide.)
>
>
> I am however not quite buying into the argument of why we need the
> factory to be "fully stateless", except that this is the clean and
> "functional way" in Scala.

I am not arguing for stateless vs stateful. I am just pointing at some
design issues which do not allow it. Currently, there is just no way
for an immutable implementation to be used with such a factory.

> As a hobby Clojure user I would welcome
> functional alternatives like add(Triple) that returns a new Graph -
> but I don't expect Commons-RDF to necessarily be able to play along
> with those paradigms at the same time as it is a useful "bog standard"
> object-oriented Java API.

Having `add` returning a `Graph` does not mean that `Graph` is
immutable. It just means that it *enables* `Graph` to be immutable.

> But perhaps this is relevant to the
> https://issues.apache.org/jira/browse/COMMONSRDF-20 discussions as
> there would be one argument less for needing multiple factory
> instances.  (You might still want multiple if they are directly
> related to the underlying storage mechanism, but then they would also
> need to be configured, thus instantiated through other means like an
> explicit constructor)
>
>
> I think adjustments for Scala are welcome, but should be without
> causing too much a cost for Java users

Well, Scala is just a language. Immutability and referential
transparency, are just principles, but they are becoming more and more
important in many areas (Spark, concurrency, etc.).

It is of course totally fine if this group decides to target only Java
people, and make choices that appeal to that community. But again,
that means closing the door to an entire world...

> (implementers can take a tiny bit more pain)

I totally agree! See below for a discussion.

> - so I think  createBlankNode(UUID) is a very clean
> approach, while your suggested
> interface RDF<Graph, Triple, RDFTerm, BlankNodeOrIRI, IRI, BlankNode, Literal>
> looks very contrived in Java and potentially difficult to use.
>
> Alexandre, your gist approach is functionally clean - in a Clojure
> implementation it would also be natural to take shortcuts straight to
> String/URI/UUID like that.

There is no shortcut at all. The RDF model only resolves around some
types (Graph, Triple, RDFTerm, BlankNodeOrIRI, IRI, BlankNode,
Literal) which can be left abstract, as opposed to being concrete when
using Java's interfaces. (it's "concrete" in the sense it's using
nominal subtyping)

If you look at what I did, you have a *direct* translation of the
existing interfaces+methods+factory into simple functions.

> ..but I am not sure I understand the implications for Java-land. Could
> you perhaps draft out what you suggest the Commons-RDF interfaces
> would look like when defined in Java?

My gist *was* a draft of what Commons-RDF could be. The translations
works as follows:
* the Java interfaces becomes abstract types
* the methods on those interfaces become functions on the abstract types
* the methods on the interfaces in the factory becomes simple
functions on the abstract types
* operating on a node happens with a visitor (as in visitor pattern)
implemented as the `visit` function, taking 3 functions for the 3
possible cases (I believe the current API asks for checking the class
at runtime...)

Now, let's say I am implementing a Turtle parser. The only thing I
care about is how I can [use case 1] create/inject elements into some
existing RDF model. If I am writing a Turtle serializer, I only care
about how to [use case 2] traverse that type hierarchy. In none of
those cases did I care about having the types defined in the
class/interface hierarchy and I want anybody to use their own RDF
model.

Now let's say I am a user. I care about

* [use case 3] type safety: if I see a function/method, I assume it is
defined, not that it will blow up in my face with a
`UnsupportedOperationException`
* [use case 4] knowing about the actual types: even if
Jena/Sesame/Plantain all comply with the interfaces in commons-rdf,
they most certainly differ in some other areas. If users have to use
instanceof+cast, then it's a design failure.
* some times, I am perfectly happy with the common interfaces, and I
want the _convenience_ of [use case 5] calling methods defined on the
common interfaces

Finally, as a power-user, I might want to define my own RDF model
which would be [use case 6] highly-customized and mutable, or [use
case 7] immutable and persistent.

I would argue that the current approach in commons-rdf only addresses
[use case 5]. [use case 3] could be addressed with new interfaces and
different API design but well, it's more interfaces.

The approach I propose can do all of that.

Abstract types (aka. generics) leave the implementation open. Only
want `String`s? Here is what you do:

```
class StringRDF extends RDF<Set<Tuple3<String, String, String>>,
Tuple3<String, String, String>, String, String, String, String,
String> { /* implementation goes here */ }
```

Note that the current API is actually just a weaker form of my
proposal (in the mathematical sense), as you can always write
something like that:

```
import org.apache.commons.rdf.api.*;
class RDFTermFactory extends RDF<Graph, Triple, RDFTerm,
BlankNodeOrIRI, IRI, BlankNode, Literal> { /* implementation goes here
*/ }
```

So you do *not* loose what's in the current API.

Here is how a library author would write a Turtle parser:

```
class TurtleParser<Graph, Triple, RDFTerm, BlankNodeOrIRI, IRI,
BlankNode, Literal> {
  RDF<Graph, Triple, RDFTerm, BlankNodeOrIRI, IRI, BlankNode, Literal> rdf
  Graph parse(String input) { /* can call rdf.createLiteral("foo"), or
anything in rdf.* */ }
}
```

Now the user can build a parser using the model she wants e.g. with Jena:

```
import com.hp.hpl.jena.graph.*;
class JenaTurtleParser extends TurtleParser<Graph, Triple, Node, Node,
Node_URI, Node_Blank, Node_Literal> {
  RDF<Graph, Triple, Node, Node, Node_URI, Node_Blank, Node_Literal>
rdf = JenaRDF.singleton()
}
```

Note that `JenaTurtleParser.parse(String)` now returns a
`com.hp.hpl.jena.graph.Graph`, not just some random Graph!

"Small" remark: I still don't think that `createBlankNode(String)`
belongs to the RDF model. I would really like to see a use case that
shows why it has to be present.

Finally, I will admit that writing all those types parameters can be a
bit cumbersome, even if it happens only in a very few places (as a
user: only once when you build what you need e.g. a Turtle parser).
But please let's not sacrifice correctness and functionality to (a
little) convenience...

Alexandre

> On 7 May 2015 at 21:03, Andy Seaborne <[email protected]> wrote:
>> On 07/05/15 20:04, Alexandre Bertails wrote:
>>>
>>> Hi Andy,
>>>
>>> On Thu, May 7, 2015 at 3:27 AM, Andy Seaborne <[email protected]> wrote:
>>>>
>>>> On 07/05/15 00:24, Stian Soiland-Reyes wrote:
>>>>>>
>>>>>>
>>>>>> `RDFTermFactory` is stateful just to accommodate
>>>>>>>
>>>>>>> `createBlankNode(String)`. It's stateless otherwise. This is really an
>>>>>>> issue for banana-rdf as everything is defined as pure function (the
>>>>>>> output only depends on the input).
>>>>>
>>>>>
>>>>> It does not need to be stateful.
>>>>>
>>>>> In simple we implemented this using a final UUID "salt" that is
>>>>> created per instance of the factory. Do you consider this state?
>>>>>
>>>>>
>>>>>
>>>>> https://github.com/apache/incubator-commonsrdf/blob/master/simple/src/main/java/org/apache/commons/rdf/simple/SimpleRDFTermFactory.java#L51
>>>>>
>>>>>
>>>>> This is then used by
>>>>>
>>>>>
>>>>> https://github.com/apache/incubator-commonsrdf/blob/master/simple/src/main/java/org/apache/commons/rdf/simple/BlankNodeImpl.java#L41
>>>>> as part of a hashing to generate the new uniqueReference().  Thus a
>>>>> second call on the same factory with the same name will be hashed with
>>>>> the same salt, and produce the same uniqueReference(), which makes the
>>>>> second BlankNode equal to the first.
>>>>>
>>>>>
>>>>> But you can achieve the contract by other non-stateful means, for
>>>>> instance a random UUID that is static final (and hence no state at all
>>>>> per factory instance), and you can create a uniqueReference() by
>>>>> concatenating that UUID with the System.identityHashCode() of the
>>>>> factory and concatenate the provided name.
>>>>>
>>>>>
>>>>> Also you are not required to implement createBlankNode(String) - you
>>>>> can simply throw UnsupportedOperationException and only support
>>>>> createBlankNode().
>>>>>
>>>>>
>>>>> This should probably be noted in the (yet so far just imagined)
>>>>> Implementors Guide on the Commons RDF website.
>>>>>
>>>>> https://issues.apache.org/jira/browse/COMMONSRDF-24
>>>>
>>>>
>>>>
>>>> To add experience about createBlankNode(String): this string is not the
>>>> unique reference.
>>>>
>>>> This is a purely practical matter that arises when parsing at scale. The
>>>> external Map<String, BlankNode> approach fails because that Map grows
>>>> huge.
>>>> N-Triples is worse because it does not have unlabelled blank nodes i.e.
>>>> [].
>>>>
>>>> When the Map style mapping has provided in the parser, there have been
>>>> reports of parsing taking up too much space, coupled with the fact that
>>>> Java's HashMap isn't very good at going from small to verge large
>>>> (significant internal resizing, and also GC issues).
>>>>
>>>> To avoid that, the factory, having direct access to the BlankNode
>>>> implementation can do better using a "salt+combine with the string"
>>>> approach.  No growing space is needed for the syntax label tracking.
>>>>
>>>> The other design I know of is to expose a factory operation to make a
>>>> blank
>>>> node from a unique reference.  On balance, people seemed to prefer the
>>>> current design.
>>>>
>>>> It is a practical consideration - not a requirement of RDF.
>>>
>>>
>>> I agree. I actually do not care if it's a `java.util.Map` or something
>>> else. It's just "something" that gives you a `BlankNode` from a label,
>>> consistently. It could be an actual Map, or something else based on
>>> UUID, it doesn't matter. What matters is that there is an implicit
>>> state, and I argue that this is not a requirement. Passing that
>>> "something" as an argument, and returning the new state along with the
>>> `BlankNode` (whether it's stateful of stateless), would achieve your
>>> requirements the same way.
>>
>>
>> Function<String, BlankNode>
>>
>> If I follow your argument, that would require either state itself elsewhere
>> or private access to make blank nodes reproducibly because createBlankNode()
>> is the only factory way to create a bnode.
>>
>> So either state still exists, just moved, and if this is a /commons/RDF we
>> have to define that. Or there is another route to making bNodes which
>> defeats the purpose of the factory as it currently is.
>>
>> A stateless factory version is createBlankNode(UUID) but that idea didn't
>> get much traction (I'd be happy with it - at the abstract synatx level, a
>> 1-1 correspondence with UUIDs is just the "arbitrary set" the spec talks
>> about).
>>
>>
>> Actually, better would be String in UUID format or at least align with
>> uniqueReference - details.
>>
>>>
>>> I actually think that `createBlankNode(String)` *should not* belong to
>>> the factory interface. That is something external to the factory in
>>> its essence, which is there just because it makes things "more
>>> convenient". For example, a parser would have it's own state anyway,
>>> whether it lives *in* the factory or *outside* of it.
>>
>>
>> The parser does not have long term state.  See Jena RIOT - they emit a
>> stream of triples.  No state over the run.
>>
>> (except RDF/XML because of the "no reuse" of bnodeid but let's skip over
>> RDF/XML :-).
>>
>>         Andy
>>
>>>
>>> Alexandre
>>>
>>>>
>>>>          Andy
>>
>>
>
>
>
> --
> Stian Soiland-Reyes
> Apache Taverna (incubating), Apache Commons RDF (incubating)
> http://orcid.org/0000-0001-9842-9718

Re: General questions

Reply via email to