On 9 February 2017 at 13:38, Stian Soiland-Reyes <[email protected]> wrote:
> Peter Ansell raised a valid question about why in Commons RDF I made the
> RDFParser interface as an immutable factory-like builder pattern.
>
> https://commons.apache.org/proper/commons-rdf/apidocs/org/apache/commons/rdf/experimental/RDFParser.html
>
>
> Here is how it can be used today:
>
>  RDF rdf = new JenaRDF();
>  Graph g1 = rdf.createGraph();
>  Future future = new JenaRDFParser()
>          .source(Paths.get("/tmp/graph.ttl"))
>          .contentType(RDFSyntax.TURTLE)
>          .target(g1).parse();
>  future.get(30, TimeUnit.Seconds); // block
>
>
> You may notice it is not a factory as-such, as you only get a Future
> of the parser result, rather than the actual "parser". Also it is
> currently an interface, with multiple implementations (one per RDF
> implementation).
>
>
>
> Although the javadoc only suggests for it to be immutable and
> threadsafe, the abstract class AbstractRDFParser
> https://commons.apache.org/proper/commons-rdf/apidocs/org/apache/commons/rdf/simple/experimental/AbstractRDFParser.html
> https://github.com/apache/commons-rdf/blob/master/simple/src/main/java/org/apache/commons/rdf/simple/experimental/AbstractRDFParser.java
> is both, and thus every "option setter" method therefore return a cloned
> copy of the factory with the mutated fields set there. This simply uses
> .clone() at the moment.
>
> This abstract class and immutability pattern is used by each of the
> three RDF implementations of the interface.
>
>
> This means it's safe to parse many files like this:
>
>  JenaRDFParser parser = new JenaRDFParser()
>          .contentType(RDFSyntax.TURTLE)
>          .target(g1);
>
>  // Parse multiple files in parallell
>  Future future1 = parser.source(Paths.get("/tmp/graph1.ttl")).parse();
>  Future future2 = parser.source(Paths.get("/tmp/graph2.ttl")).parse();
>
>  // Wait for both to finish
>  future1.get(30, TimeUnit.Seconds);
>  future2.get(30, TimeUnit.Seconds);
>
>
> Above the second call to parser.source(..) is not a problem (even if
> it happens in another thread) as both .source() calls return separate
> cloned JenaRDFParser instances and don't mutate the "parser" instance.
>
>
> Now a valid critique of such a pattern is that it's wasteful to make
> clones that are immediately thrown away, for instance the above
> future1/future2 example would cause 7 instantiations of JenaRDFParser()
> (.parse() does some finalization in a final clone).
>
> I have not done any benchmarking, but envision that as parsing RDF files
> generally creates many thousands of RDFTerm instances, a couple of
> extra parser instantiations shouldn't make a big dent in the bill.
>
>
> The typical builder pattern is rather to return "this" and mutate fields
> directly. This is obviously not thread-safe but generally achives the
> same. Multiple threads would however instead have to create a brand new
> parser from start and set all the options themselves.  (Or they could a
> similar .clone() method from a carefully locked down "proto" instance
> that everyone has to be careful not to call any mutating methods on)
>
>
> Now multiple threads is now very common because of Java 8 streams, and
> for instance it would not be unthinkable that a parallelStream() is
> mapped as a .source() or .target() on such a parser builder. The Futures
> fit somewhat into this picture as well.   I think this would get
> complicated with having to set everything again rather than just pass
> "parser::source" as a method reference.
>
>
>
>
> As this was just thrown together as an experiment (and we've not yet
> tried writing the corresponding RDFWriter interface), I think we could
> revisit how to do this pattern, hopefully drawing on the wider
> experience of the Apache Commons developers on this list.
>
>
>
> Roughly here are my requirements:
>
> * Ability to set various options (e.g. path of file to parse, format,
>   destination) in any order

If setting each of those independently in single-arg methods is a
requirement you have already defined the API, as there is only one
possible user-visible way to do that. If you are flexible on this,
then there are a few different options.

> * Avoid mega-constructors/multi-arg methods

Mega-constructors are definitely out, but by avoiding all multi-arg
methods you are pushing a lot of state into the object relating solely
to the subject of the parse operation. Hence why you are getting into
knots trying to decide how to isolate the options that are independent
from a specific parse operation, such as "PRESERVE_BNODE_IDS", to use
an example from RDF4J, from those that are directly dependent on each
parse, namely, the concrete RDFSyntax, the Base URI for the document,
and the source.

Setting the directly dependent options (those specific to a single
parse operation) in the same fashion as the global options is the
source of the confusion about when to allow mutation to allow the
dependent options to be set before freezing them.

Are you okay with having the independent settings done upfront, and
then all three of the directly dependent parse options done as
arguments to the parse method, as it would then be a multi-arg method?

> * Some options take multiple types, e.g a source can be given as a Path,
>   InputStream or IRI (URL) -- but would override each other
>
> Nice to have:
>
> * Reusable (e.g. set options that are common, but modify
>   source() to parse multiple files)
> * Thread-safe (e.g. immediate reuse of builder before a parsing session
>   has started or finished)
> * Easy to serialize / keep around (should not keep references to big
>   objects in memory uneccessarily)
> * Parsing in the background, e.g. return Future. (implementation manages
>   threads/queues as it pleases)
>
> Unclear:
>
> * How to select between the multiple RDFParser instances?
>   I was thinking the RDF interface could have a .makeParser() method,
>   but not all parsers can take all formats, so possibly some kind of
>   registry or "what can you handle" mechanism might be needed.

I would prefer if the RDFSyntax (one-to-one mapping generally to an
RDFParser implementation), was either selected upfront
"RDFParserFactory.for(RDFSyntax)", or done as late as possible
"RDFParserFactory.create().set(...).set(...).parse(RDFSyntax, String,
InputStream)". In both cases you can setup the actual internal parser
at the last second, so it is more for stylistic purposes than
usability. The RDFSyntax in both is a static constant and would simply
be stored in a field somewhere.

In both cases, a registry is indeed required as otherwise you would
need to directly reference a particular parser implementation, which
we are trying to get away from with Commons RDF. Whether the registry
is inside of CommonsRDF or just a wrapper to those already provided by
Jena/RDF4J/JSONLD-Java/etc. is a different question and that can be
worked on once the base API is setup.

>
> How do you think we should proceed? Mutable or immutable? How should the
> settings be kept? Fields, map, or what?  Does it make sense with an
> interface, abstract class (keeps settings) and implementations
> (processess settings), or should we have a single ParserFactory class
> and have a new internal interface below?

Personally I don't mind mutable (with state copied to independent
variables/maps inside of the terminal parse method) as long as
settings are kept in threadsafe structures like ConcurrentMap rather
than Map or hacky protected variables for each case.

For reference, my preference is for one of the following two patterns:

RDFParser parser = RDFParserFactory.for(RDFSyntax).set(RDFSetting<T>,
T).set(RDFSetting<X>, X);

parser.parse(String, InputStream);
ForkJoinPool.commonPool().submit(() -> parser.parse(String, IRI));
//etc.

or:

RDFParserFactory factory =
RDFParserFactory.create().set(RDFSetting<T>, T).set(RDFSetting<X>,
X).target(RDFSink);

factory.parse(RDFSyntax, String, InputStream);
ForkJoinPool.commonPool().submit(() -> factory.parse(RDFSyntax, String, IRI));

In the both cases the parse method could be fairly similar, with the
actual internal parser implementation lazily created whether the user
sees RDFParserFactory or RDFParser as the Commons RDF type:

    public void parse(String baseIRI, InputStream source) {
        // Only the first call here is subject to threading changes,
as we take a defensive copy anyway.
        // You could lock on it but if users are silly enough to punch
in new global settings after calling parse for the first time, then
they should expect the unexpected
        ConcurrentMap<RDFSetting<?>, Object> clonedSettings = new
ConcurrentHashMap<>(mySettings);

        // Then create the implementation specific parser for the syntax
        RDFParser myParser = Rio.createParser(
                    Rio.getFormatForMimeType(mySyntax.getContentType())

.orElseThrow(Rio.unsupportedFormat(mySyntax.getContentType())));

        // Propagate settings down into the internal parser from our
defensive copy
        clonedSettings.entrySet().forEach(s ->
myParser.set(s.getKey(), s.getValue()));

        // Use the sink as the RDFHandler/etc.
        myParser.setRDFHandler(mySink);

        // Then actually initiate the parse
        myParser.parse(source, baseIRI);
    }

As you say, there are only three source types, which only represents
three "multi-arg" parse methods, and dramatically simplifies
discussions about the state, as it is very simple to (shallow) clone a
"ConcurrentMap<RDFSetting<?>, Object>" and presume that users are not
silly enough to put mutable values into that map, than to deal with
questions about when to lock down the pattern outside of a parse
method just to support user-visible single-arg methods for idealistic
purposes.

You could make it a lot more complex with discussions about handling
Future/CompleteableFuture in the RDFParser/RDFParserFactory API.
However, in my opinion that is much better done in either user code
with their preferred ExecutorService, or in a helper, without forcing
everyone to deal with asynchronous programming and the extra threading
costs that come with it.

Cheers,

Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to