[ 
https://issues.apache.org/jira/browse/ANY23-396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende reassigned ANY23-396:
---------------------------------

         Assignee: Hans Brende  (was: Jacek Grzebyta)
         Priority: Major  (was: Minor)
    Fix Version/s: 2.3
      Description: 
This issue began with Jacek's observation that, in Rover, it is impossible to 
specify a *delegating writer factory*, i.e., one that maps/filters/reduces the 
preliminary extraction output before passing it on to the final outputstream 
writer. Lack of this ability caused us to have to specify numerous 
configuration flags in Rover, e.g., "--notrivial", which filters the output of 
the extractor by removing trivial css triples prior to writing the triples to 
their final format. Many of these flags could simply be replaced by the ids of 
*delegating writer factories*, if we had such a capability. One added advantage 
of that would be that then, users could specify the *order* in which these 
modifications take place. E.g., adding a *logging* decorator could take place 
before or after the "notrivial" decorator has been applied (or both before 
*and* after!). Which? If we can, we should really let the user decide. 

The most obvious solution to this problem was to extend the {{WriterFactory}} 
interface with a new {{DelegatingWriterFactory}} interface that accepts an 
arbitrary {{TripleHandler}} rather than an {{OutputStream}} as input. 

In doing so, it was also necessary to deprecate a few methods in 
{{WriterFactory}} and un-deprecate them in an extending {{TripleWriterFactory}} 
class (which takes the place of {{WriterFactory}} by creating a 
{{TripleHandler}} from an {{OutputStream}}). This deprecation was actually not 
too painful, first, because some of the methods were redundant in the first 
place (e.g., {{getMimeType()}}), and second, because it presented us with a 
perfect opportunity to add some much-needed improvements to the new interface.

The biggest improvement is the addition of {{Settings}} as a parameter to the 
{{TripleHandler}} constructor, which will allow users to configure writers as 
they see fit, rather than forcing, e.g., {{prettyprint=true}} on them.

ANY23-388 perfectly illustrates this current lack of configuration ability. And 
we fixed that issue by simply giving users {{protected}} access to the 
underlying {{RDFWriter}} instances so that they could configure them manually. 
However, in hindsight, this was a bad idea, as it could lead to backwards 
compatibility issues down the line if we decide to change the underlying 
implementation of {{RDFWriterTripleHandler}} instances. Luckily, the solution 
to ANY23-388 was only implemented recently and is still only present in the 
snapshot version of Any23. In my PR, I've removed that hack and replaced it 
with {{Settings}}, which is extensible ad infinitum and won't pose the same 
threat to backwards compatibility. 

Another improvement is the removal of RDF4J classes from the public 
WriterFactory API. (I replaced {{RDFFormat}} with our own {{TripleFormat}} 
class.) As I noted in my PR, it's probably better for us to use our own classes 
in public-facing interfaces rather than RDF4J's so that we can maintain 
stability in the event that RDF4J changes their API, or (heaven forbid) ceases 
to exist, or we simply want to modify the implementation. A good rule of thumb 
for us would probably be to limit usage of RDF4J in our public-facing API to 
the ubiquitous interfaces found in the {{org.eclipse.rdf4j:rdf4j-model}} 
artifact (e.g. {{IRI}} and {{Literal}}), since removing those would be 
virtually impossible without enormous backwards compatibility issues.

Since this PR is quite large and there are a multitude of new classes and new 
behaviors (while managing to remain fully backwards-compatible with previous 
behavior), I'm looking for feedback! Please comment with any concerns, 
questions, or suggestions you have for improvement. 

PR can be viewed here: https://github.com/apache/any23/pull/122
 

  was:
Currently extractors do not work in flows. I.E. Next extractor has no any 
access to triples made by previous one.

It would be useful if an extractor has possibility to modify triples created by 
another extractor.

 

          Summary: Overhaul WriterFactory API  (was: Add ability to run 
extractors in flow)

> Overhaul WriterFactory API
> --------------------------
>
>                 Key: ANY23-396
>                 URL: https://issues.apache.org/jira/browse/ANY23-396
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 2.2
>            Reporter: Jacek Grzebyta
>            Assignee: Hans Brende
>            Priority: Major
>             Fix For: 2.3
>
>
> This issue began with Jacek's observation that, in Rover, it is impossible to 
> specify a *delegating writer factory*, i.e., one that maps/filters/reduces 
> the preliminary extraction output before passing it on to the final 
> outputstream writer. Lack of this ability caused us to have to specify 
> numerous configuration flags in Rover, e.g., "--notrivial", which filters the 
> output of the extractor by removing trivial css triples prior to writing the 
> triples to their final format. Many of these flags could simply be replaced 
> by the ids of *delegating writer factories*, if we had such a capability. One 
> added advantage of that would be that then, users could specify the *order* 
> in which these modifications take place. E.g., adding a *logging* decorator 
> could take place before or after the "notrivial" decorator has been applied 
> (or both before *and* after!). Which? If we can, we should really let the 
> user decide. 
> The most obvious solution to this problem was to extend the {{WriterFactory}} 
> interface with a new {{DelegatingWriterFactory}} interface that accepts an 
> arbitrary {{TripleHandler}} rather than an {{OutputStream}} as input. 
> In doing so, it was also necessary to deprecate a few methods in 
> {{WriterFactory}} and un-deprecate them in an extending 
> {{TripleWriterFactory}} class (which takes the place of {{WriterFactory}} by 
> creating a {{TripleHandler}} from an {{OutputStream}}). This deprecation was 
> actually not too painful, first, because some of the methods were redundant 
> in the first place (e.g., {{getMimeType()}}), and second, because it 
> presented us with a perfect opportunity to add some much-needed improvements 
> to the new interface.
> The biggest improvement is the addition of {{Settings}} as a parameter to the 
> {{TripleHandler}} constructor, which will allow users to configure writers as 
> they see fit, rather than forcing, e.g., {{prettyprint=true}} on them.
> ANY23-388 perfectly illustrates this current lack of configuration ability. 
> And we fixed that issue by simply giving users {{protected}} access to the 
> underlying {{RDFWriter}} instances so that they could configure them 
> manually. However, in hindsight, this was a bad idea, as it could lead to 
> backwards compatibility issues down the line if we decide to change the 
> underlying implementation of {{RDFWriterTripleHandler}} instances. Luckily, 
> the solution to ANY23-388 was only implemented recently and is still only 
> present in the snapshot version of Any23. In my PR, I've removed that hack 
> and replaced it with {{Settings}}, which is extensible ad infinitum and won't 
> pose the same threat to backwards compatibility. 
> Another improvement is the removal of RDF4J classes from the public 
> WriterFactory API. (I replaced {{RDFFormat}} with our own {{TripleFormat}} 
> class.) As I noted in my PR, it's probably better for us to use our own 
> classes in public-facing interfaces rather than RDF4J's so that we can 
> maintain stability in the event that RDF4J changes their API, or (heaven 
> forbid) ceases to exist, or we simply want to modify the implementation. A 
> good rule of thumb for us would probably be to limit usage of RDF4J in our 
> public-facing API to the ubiquitous interfaces found in the 
> {{org.eclipse.rdf4j:rdf4j-model}} artifact (e.g. {{IRI}} and {{Literal}}), 
> since removing those would be virtually impossible without enormous backwards 
> compatibility issues.
> Since this PR is quite large and there are a multitude of new classes and new 
> behaviors (while managing to remain fully backwards-compatible with previous 
> behavior), I'm looking for feedback! Please comment with any concerns, 
> questions, or suggestions you have for improvement. 
> PR can be viewed here: https://github.com/apache/any23/pull/122
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to