[
https://issues.apache.org/jira/browse/ANY23-396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hans Brende reassigned ANY23-396:
---------------------------------
Assignee: Hans Brende (was: Jacek Grzebyta)
Priority: Major (was: Minor)
Fix Version/s: 2.3
Description:
This issue began with Jacek's observation that, in Rover, it is impossible to
specify a *delegating writer factory*, i.e., one that maps/filters/reduces the
preliminary extraction output before passing it on to the final output stream
writer. Without this ability, we have had to add numerous configuration flags
to Rover, e.g., "--notrivial", which filters the extractor's output by removing
trivial CSS triples before the triples are written in their final format. Many
of these flags could simply be replaced by the ids of *delegating writer
factories*, if we had such a capability. An added advantage is that users could
then specify the *order* in which these modifications take place. E.g., a
*logging* decorator could be applied before or after the "notrivial" decorator
(or both before *and* after!). Which? If we can, we should really let the user
decide.
The most obvious solution to this problem was to extend the {{WriterFactory}}
interface with a new {{DelegatingWriterFactory}} interface that accepts an
arbitrary {{TripleHandler}} rather than an {{OutputStream}} as input.
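To make the idea concrete, here is a minimal sketch of how delegating factories could be chained in a user-chosen order. All type and method names below ({{SketchTripleHandler}}, {{SketchDelegatingWriterFactory}}, {{chain}}, the example CSS namespace) are hypothetical, simplified stand-ins for illustration only; they are not the interfaces from the PR, which carry more methods.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, heavily simplified stand-in for a triple handler. */
interface SketchTripleHandler {
    void receiveTriple(String s, String p, String o);
}

/** Terminal factory: creates a handler that writes to an output stream. */
interface SketchTripleWriterFactory {
    SketchTripleHandler getTripleWriter(OutputStream out);
}

/** Delegating factory: wraps another handler instead of an output stream. */
interface SketchDelegatingWriterFactory {
    SketchTripleHandler getTripleWriter(SketchTripleHandler delegate);
}

final class DelegatingWriterSketch {

    /** Filter that drops triples whose predicate is in an (assumed) CSS namespace. */
    static final SketchDelegatingWriterFactory NO_TRIVIAL = delegate -> (s, p, o) -> {
        if (!p.startsWith("http://example.org/css#")) {
            delegate.receiveTriple(s, p, o);
        }
    };

    /** Terminal writer that prints one N-Triples-like line per triple. */
    static final SketchTripleWriterFactory NTRIPLES_LIKE = out -> {
        PrintStream ps = new PrintStream(out, true);
        return (s, p, o) -> ps.println("<" + s + "> <" + p + "> <" + o + "> .");
    };

    /** Decorator that records each predicate it sees, then forwards the triple. */
    static SketchDelegatingWriterFactory logger(List<String> log, String label) {
        return delegate -> (s, p, o) -> {
            log.add(label + ": " + p);
            delegate.receiveTriple(s, p, o);
        };
    }

    /** Builds the chain so that the first listed decorator sees each triple first. */
    static SketchTripleHandler chain(OutputStream out,
                                     SketchTripleWriterFactory terminal,
                                     SketchDelegatingWriterFactory... decorators) {
        SketchTripleHandler handler = terminal.getTripleWriter(out);
        for (int i = decorators.length - 1; i >= 0; i--) {
            handler = decorators[i].getTripleWriter(handler);
        }
        return handler;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        // Log both before and after the filter; the order is the caller's choice.
        SketchTripleHandler handler = chain(System.out, NTRIPLES_LIKE,
                logger(log, "before"), NO_TRIVIAL, logger(log, "after"));
        handler.receiveTriple("http://example.org/page",
                "http://example.org/css#stylesheet", "http://example.org/style.css");
        handler.receiveTriple("http://example.org/page",
                "http://example.org/title", "http://example.org/t");
        log.forEach(System.out::println);
    }
}
```

In this sketch, the order of the decorators passed to {{chain}} plays the role that an ordered list of writer factory ids would play on the Rover command line.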
In doing so, it was also necessary to deprecate a few methods in
{{WriterFactory}} and un-deprecate them in an extending {{TripleWriterFactory}}
class (which takes the place of {{WriterFactory}} by creating a
{{TripleHandler}} from an {{OutputStream}}). This deprecation was actually not
too painful, first, because some of the methods were redundant in the first
place (e.g., {{getMimeType()}}), and second, because it presented us with a
perfect opportunity to add some much-needed improvements to the new interface.
The biggest improvement is the addition of {{Settings}} as a parameter to the
{{TripleHandler}} constructor, which will allow users to configure writers as
they see fit, rather than forcing, e.g., {{prettyprint=true}} on them.
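As a rough illustration of why a settings parameter helps, the sketch below passes a settings object to a writer factory method and lets the caller switch pretty-printing off. {{SketchSettings}}, the {{PRETTY_PRINT}} key, and the output layout are all assumptions made up for this example; the actual {{Settings}} class in the PR has its own API.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a settings object: immutable name/value pairs. */
final class SketchSettings {
    private final Map<String, Boolean> values;

    private SketchSettings(Map<String, Boolean> values) {
        this.values = Collections.unmodifiableMap(values);
    }

    static SketchSettings empty() {
        return new SketchSettings(new HashMap<>());
    }

    /** Returns a copy with one more setting; the original is left unchanged. */
    SketchSettings with(String key, boolean value) {
        Map<String, Boolean> copy = new HashMap<>(values);
        copy.put(key, value);
        return new SketchSettings(copy);
    }

    /** Looks up a setting, falling back to the writer's own default. */
    boolean getBoolean(String key, boolean defaultValue) {
        return values.getOrDefault(key, defaultValue);
    }
}

final class SettingsWriterSketch {

    /** Hypothetical setting key; not an identifier taken from Any23. */
    static final String PRETTY_PRINT = "sketch.writer.prettyprint";

    interface Handler {
        void receiveTriple(String s, String p, String o);
    }

    /** The writer reads its configuration from the settings instead of hard-coding it. */
    static Handler getTripleWriter(OutputStream out, SketchSettings settings) {
        boolean pretty = settings.getBoolean(PRETTY_PRINT, true);
        String separator = pretty ? "\n    " : " ";
        PrintStream ps = new PrintStream(out, true);
        return (s, p, o) -> ps.print(s + separator + p + separator + o + " .\n");
    }

    public static void main(String[] args) {
        SketchSettings compact = SketchSettings.empty().with(PRETTY_PRINT, false);
        getTripleWriter(System.out, compact).receiveTriple("a", "b", "c");
        getTripleWriter(System.out, SketchSettings.empty()).receiveTriple("a", "b", "c");
    }
}
```

New options can be added later as new keys without changing the factory method's signature, which is the backwards-compatibility property the description is after.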
ANY23-388 perfectly illustrates the current lack of configurability. We fixed
that issue by simply giving users {{protected}} access to the
underlying {{RDFWriter}} instances so that they could configure them manually.
However, in hindsight, this was a bad idea, as it could lead to backwards
compatibility issues down the line if we decide to change the underlying
implementation of {{RDFWriterTripleHandler}} instances. Luckily, the solution
to ANY23-388 was only implemented recently and is still only present in the
snapshot version of Any23. In my PR, I've removed that hack and replaced it
with {{Settings}}, which is extensible ad infinitum and won't pose the same
threat to backwards compatibility.
Another improvement is the removal of RDF4J classes from the public
WriterFactory API. (I replaced {{RDFFormat}} with our own {{TripleFormat}}
class.) As I noted in my PR, it's probably better for us to use our own classes
in public-facing interfaces rather than RDF4J's so that we can maintain
stability in the event that RDF4J changes its API, or (heaven forbid) ceases
to exist, or we simply want to modify the implementation. A good rule of thumb
for us would probably be to limit usage of RDF4J in our public-facing API to
the ubiquitous interfaces found in the {{org.eclipse.rdf4j:rdf4j-model}}
artifact (e.g. {{IRI}} and {{Literal}}), since removing those would be
virtually impossible without enormous backwards compatibility issues.
Since this PR is quite large and there are a multitude of new classes and new
behaviors (while managing to remain fully backwards-compatible with previous
behavior), I'm looking for feedback! Please comment with any concerns,
questions, or suggestions you have for improvement.
PR can be viewed here: https://github.com/apache/any23/pull/122
was:
Currently, extractors do not work in flows, i.e., the next extractor has no
access to the triples produced by the previous one.
It would be useful if an extractor could modify the triples created by
another extractor.
Summary: Overhaul WriterFactory API (was: Add ability to run
extractors in flow)
> Overhaul WriterFactory API
> --------------------------
>
> Key: ANY23-396
> URL: https://issues.apache.org/jira/browse/ANY23-396
> Project: Apache Any23
> Issue Type: Improvement
> Components: core
> Affects Versions: 2.2
> Reporter: Jacek Grzebyta
> Assignee: Hans Brende
> Priority: Major
> Fix For: 2.3
>
>