Re: Jena build (some thoughts)

Andy Seaborne Tue, 06 Sep 2011 09:49:12 -0700


On 06/09/11 16:09, Paolo Castagna wrote:

Hi Andy
thanks for sharing some thoughts.

Apologies in advance for my long reply, full of questions.


Not full answers ... it's some thoughts.


Andy Seaborne wrote:

This message is a collection of thoughts on reworking the build process.
  It's not a complete proposal.

== Current Status

Subsystems (to avoid the "module" word)
   Jena, IRI, ARQ, LARQ, TDB, SDB, Fuseki


Joseki is not on the list. I imagine because Fuseki replaces it. I also imagine 
there will be no Apache release for Joseki ever.
Correct?


Maybe - it's about timing.  Fuseki with configuration files = Joseki4.

The current build system is one Maven project per subsystem. Each
subsystem produces a single download zip file and also deploys artifacts.

The builds are linked by version dependencies in the POM files.  There
is a hack to get ARQ into the Jena download to break the circular
dependency.


I have a simple test to decide how to break dependencies: can you use X without 
Y?

Currently, you can use Jena without ARQ.
However, you cannot use ARQ without Jena.
Therefore, ARQ depends on Jena.

Rightly so, Jena's pom.xml file does not have a dependency on ARQ. However, 
Jena distribution (i.e. the current .zip file) includes ARQ since most of the 
time people want to run SPARQL queries as well.


== Goals

These are my take on desirable features, not necessarily absolute
requirements, so if it isn't practical to achieve it in the overall
system, then a goal can be modified or removed.


  + Creating an Apache Release

I didn't put that because the message was about the maven structure, not the files in each bit etc.

Does Apache Release force/suggest a particular maven layout? I'd be surprised if it did.

+ Balance cost of change and benefit (we don't have to start with a
clean sheet - we can leave some things as they are, because they are).

+ A single download zip file  for using Jena as a library


I imagine this is not very different from the current Jena distribution as
.zip file. Am I right?

I would expect to find a lib directory with all the dependent jars as well
as jena-x.y.z.jar, arq-x.z.y.jar, iri, etc.

No. I'm floating the idea there is also one jar in addition to the maven artifacts.

Would that include TDB jar as well?


"that" = lib/ ?

No. One jar.


Would that include SDB jar as well?

No. This is the set of things you might want for "normal" use as a library in an application. Adding TDB, now it's got transactions, seems to give a useful package of functionality. SDB would be separate - you have to config SQL DBs to use it.


It's not expert use or fine tuning.

Does this imply there is no need for a .zip distribution of ARQ, SDB or TDB?


Correct.

A single download zip file is good: less confusion for people, less work
for us (i.e. we just manage a single zip file).

+ A single jar file for using Jena as a library


Could you be more precise on what this jar would include?


<assembly>
  <format>jar
  <dependencySets>
    <dependencySet>
      <includes>jena-core.jar, arq.jar, tdb.jar etc etc

Does it include all the necessary runtime dependencies or is it just code we 
write?


Just Jena.
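
A minimal sketch of what a complete descriptor along the lines of the fragment
above might look like, assuming the maven-assembly-plugin and the current
com.hp.hpl.jena coordinates. The id "all", the file location and the exact
include list are illustrative assumptions, not a settled design:

```xml
<!-- Hypothetical descriptor, e.g. src/main/assembly/jena-all.xml (path is an assumption) -->
<assembly>
  <id>all</id>
  <formats>
    <format>jar</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>
  <dependencySets>
    <dependencySet>
      <!-- Unpack each jar so that the classes are merged into one jar -->
      <unpack>true</unpack>
      <!-- Do not add the artifact of the packaging module itself -->
      <useProjectArtifact>false</useProjectArtifact>
      <!-- "Just Jena": only our own artifacts, no third-party dependencies -->
      <includes>
        <include>com.hp.hpl.jena:jena</include>
        <include>com.hp.hpl.jena:iri</include>
        <include>com.hp.hpl.jena:arq</include>
        <include>com.hp.hpl.jena:tdb</include>
      </includes>
    </dependencySet>
  </dependencySets>
</assembly>
```

With a descriptor like this, third-party jars (Xerces, ICU4J, SLF4J, etc.) would
still have to come from lib/ in the zip or from Maven.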

Does it include ARQ, SDB and TDB?

I am not sure who this single jar file is targeted at.
Expert developers/users probably wouldn't like a single jar if that
includes all the runtime dependencies as well.
Expert developers/users probably wouldn't use that single jar, since
sometimes they want to use or test a patched version of just one of the
components (i.e. ARQ, IRI, SDB, TDB, etc.)
New users would probably download the .zip distribution to use Jena for their 
first time.


Or use from maven.

Add one jar to the classpath, all the right versions checked and merged.

If we think of Jena as a library, we should focus on modularity and ease of use
with tools such as Ivy, Maven, etc., document this well, and provide
simple examples to start with (as we are already trying to do).
Often, I just want to parse a simple Turtle file and I would find it annoying 
to include a ~20MB jar file just to do that.

  "When we started working on Any23, the Sesame library was more
   modularized and documented and there was also a full Maven support.
   Today much of these reasons are no longer valid."
   (from [email protected] mailing list)

I can relate to that sort of comment.

Now, Jena offers good|full(?) Maven support.
However, I would argue we are still lacking in terms of modularization.

If someone does not need an inference engine, or support for RDF/XML parsing, 
or OWL APIs, it would be good if he/she could use Jena without those parts.
Even more so for people wanting to run Jena on "constrained" environments, just 
to make an example: Android.

No reason you can't go get the pieces using maven. It's not either-or -- it's as-well-as.


One jar has been good in Fuseki.

A single jar file seems to me going against some of these things.
So, I'd like to understand more why you propose a "single jar file for using Jena as 
a library".

The main download should be complete - everything you need to write a
Jena application using Jena as a library.  I'd like to change to having
a single zip that is current Jena + ARQ + LARQ + TDB (maybe?).


I think this is a good idea.

I would also add SDB to the single zip file. Why not?

And, I would remove zip distribution files from ARQ, LARQ (which hasn't got one 
at the moment), SDB and TDB.

That puts datasets and quads into Jena core.

+1

(Some API changes could also happen to make this feel more integrated.)

It also seems easier to deliver a single jar for this.


See above.

And a single obviously-named Maven artifact - at the moment, a single
dependency to pull is e.g. TDB because that pulls in the rest, which
isn't exactly obvious.


I am not sure I follow you here, probably because I don't understand exactly
what you mean by "a single obviously-named Maven artifact".

If someone wants to use TDB, the obvious dependency to pull into their project
is TDB (which depends on ARQ, which depends on Jena), letting Maven, or any other
tool which can download artifacts from a repository, resolve the rest of the
dependencies (with the right version numbers). Currently we have:

[INFO] com.hp.hpl.jena:tdb:jar:0.8.11-SNAPSHOT
[INFO] +- com.hp.hpl.jena:arq:jar:2.8.9-SNAPSHOT:compile
[INFO] |  +- org.codehaus.woodstox:wstx-asl:jar:3.2.9:compile
[INFO] |  |  \- stax:stax-api:jar:1.0.1:compile
[INFO] |  +- org.apache.lucene:lucene-core:jar:2.3.1:compile
[INFO] |  +- org.apache.httpcomponents:httpclient:jar:4.1.2:compile
[INFO] |  |  \- commons-codec:commons-codec:jar:1.4:compile
[INFO] |  \- org.apache.httpcomponents:httpcore:jar:4.1.2:compile
[INFO] +- com.hp.hpl.jena:arq:jar:tests:2.8.9-SNAPSHOT:test
[INFO] +- com.hp.hpl.jena:jena:jar:2.6.4:compile
[INFO] |  +- com.ibm.icu:icu4j:jar:3.4.4:compile
[INFO] |  \- xerces:xercesImpl:jar:2.7.1:compile
[INFO] +- com.hp.hpl.jena:jena:test-jar:tests:2.6.4:test
[INFO] +- com.hp.hpl.jena:iri:jar:0.8:compile
[INFO] +- junit:junit:jar:4.8.2:test
[INFO] +- org.slf4j:slf4j-api:jar:1.6.1:compile
[INFO] +- org.slf4j:slf4j-log4j12:jar:1.6.1:compile
[INFO] \- log4j:log4j:jar:1.2.16:compile

Is something like this what you are proposing? :

[INFO] com.hp.hpl.jena:jena-all:jar:x.y.z
[INFO] +- org.codehaus.woodstox:wstx-asl:jar:3.2.9:compile
[INFO] |  \- stax:stax-api:jar:1.0.1:compile
[INFO] +- org.apache.lucene:lucene-core:jar:2.3.1:compile
[INFO] +- org.apache.httpcomponents:httpclient:jar:4.1.2:compile
[INFO] |  \- commons-codec:commons-codec:jar:1.4:compile
[INFO] |     \- org.apache.httpcomponents:httpcore:jar:4.1.2:compile
[INFO] +- com.ibm.icu:icu4j:jar:3.4.4:compile
[INFO] +- xerces:xercesImpl:jar:2.7.1:compile
[INFO] +- junit:junit:jar:4.8.2:test
[INFO] +- org.slf4j:slf4j-api:jar:1.6.1:compile
[INFO] +- org.slf4j:slf4j-log4j12:jar:1.6.1:compile
[INFO] \- log4j:log4j:jar:1.2.16:compile

A jena-all-x.y.z.jar uber jar?


Don't think "TDB" - think "Jena"
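
To make the contrast concrete, here is a sketch of the two declarations an
application's pom.xml might carry. The first uses the coordinates from the
dependency tree quoted above; the second uses the hypothetical jena-all
artifact from Paolo's question, with x.y.z left as a placeholder because no
such artifact exists yet:

```xml
<!-- Today: depend on TDB and let Maven pull in ARQ, Jena and IRI transitively -->
<dependency>
  <groupId>com.hp.hpl.jena</groupId>
  <artifactId>tdb</artifactId>
  <version>0.8.11-SNAPSHOT</version>
</dependency>

<!-- Hypothetical: one obviously-named artifact (name and version are placeholders) -->
<dependency>
  <groupId>com.hp.hpl.jena</groupId>
  <artifactId>jena-all</artifactId>
  <version>x.y.z</version>
</dependency>
```

An application would declare one or the other, not both.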

Other goals or considerations?


As I've said above: creating Apache releases should be IMHO the number one
priority.


Does this affect the maven layout?


As well as "release early, release often", at least at the beginning while we
get used to the Apache processes and how to create Apache releases. This is also one of
the objectives of the incubation phase and, more importantly, a requirement for graduation.

How can we demonstrate our ability to create Apache releases if we release 
every 6 or 12 months?
Or, how long do we expect to be in "incubation"? ;-)

Initially, while we get used to the Apache way to cut releases, we should
release more often and not be afraid of quickly fixing problems or applying
improvements to the release process using minor version numbers _._.x. Nowadays,
if a new jar is a drop-in replacement people are willing to (and can easily) upgrade.

== Possible build layout.

Divide the overall project into a number of maven modules for building
parts of the system and a number of projects for making deliverables.
Just the code modules would mean you can work from a set of many jars
and mix and match for development etc.


For others, we went down this route already... without using Maven (i.e.
using Ant + Ivy, and it has been painful). The result is here:
https://svn.apache.org/repos/asf/incubator/jena/Import/Jena-SVN/Experimental/Jena3/trunk/

It was a good experience to see what's necessary to separate a "core" from 
RDF/XML parsing, etc. and to think about what a minimal system would include.

Application writers can get a single consolidated jar.


Here, again, I don't understand who you mean by "application writers" (is it
us? are other companies using Jena? are University students? are other Apache
committers?) and why they would benefit from
a "single consolidated jar".

Jena-top-POM -- common declarations, a lot of properties getting set.


+1

I think having a Jena specific parent pom.xml file is a good idea (and a best
practice).
We usually do this for our internal projects @ Talis.

Large organizations have a corporate parent pom as well (which for us should be 
this:
http://repo1.maven.org/maven2/org/apache/apache/).
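
A minimal sketch of such a Jena parent pom.xml, inheriting from the Apache
parent. Version 9 of org.apache:apache is the one quoted from the LARQ pom.xml
later in this message; the groupId, artifactId, version and module directory
names are assumptions made up for illustration:

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>

  <!-- Inherit the Apache-wide plugin versions and release settings -->
  <parent>
    <groupId>org.apache</groupId>
    <artifactId>apache</artifactId>
    <version>9</version>
  </parent>

  <!-- Hypothetical coordinates for the Jena-top-POM -->
  <groupId>org.apache.jena</groupId>
  <artifactId>jena-top</artifactId>
  <version>0-SNAPSHOT</version>
  <packaging>pom</packaging>

  <!-- Hypothetical module directories, one per code module -->
  <modules>
    <module>jena-iri</module>
    <module>jena-core</module>
    <module>jena-arq</module>
    <module>jena-tdb</module>
  </modules>
</project>
```

Each module would then name this POM in its own <parent> element, so common
declarations live in one place.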

Code modules:

   JenaSys
     -- This is the current Jena2.
       How much do we want to split it up?
       Is it worth the effort?
         core = graph + datatypes


+1 (== I think it is a good idea and I am prepared to help here) on a
small/minimal Jena core module.

I often need just this (+ RIOT below) and I imagine it would make a lot of 
people and projects (such as Any23, just to make an example) happy.
Another use case: launching MapReduce jobs with ~20MB jar files is a bit of a 
pain, you often need just core + RIOT (to parse N-Triples|N-Quads files) there.

         RDF API (inc enhanced?)
         owlapi
         rules
         Assembler? Here? Module of its own?
   IRI
   Atlas -- Non RDF specific stuff.


+1 (== ditto) on separating out Atlas from ARQ.

   RIOT
     -- ideally Jena-core+RIOT is a useful set

+1

     -- Move ARP and XML output here or separate module again.


As a separate module, it changes much less often than RIOT.
Another use case: apparently you cannot run with Xerces on Android. This caused
problems for people wanting to use Jena on Android.

   ARQ
    -- minus atlas, and RIOT

+1

   TDB -- transactional
   SDB
   RDB? Legacy or remove once and for all.


+1 on having RDB as a deprecated separate module.

   Documentation = website only.


+1

The disadvantage is that on the website you only have the most recent
documentation, not the one corresponding to the (maybe obsolete) version of
Jena you might be using.
However, since Jena is quite stable now, I don't think this will be a problem 
(and we can always revisit/change this in future).

Deliver modules:

   Jena  -- the deliverable: one jena-the-jar and zip file.
   JenaCmd -- Command line things: Jena+ARQ+TDB commands


Maven artifacts, IMHO, should be included in the list of "deliverables" of a 
Jena release (although only what we will be putting here http://www.apache.org/dist/jena/ 
has 'legal' value in Apache).

Clearly - a maven module produces maven artifacts. Each of jena, arq, tdb etc etc still produces and deploys its own jar.


It gets repacked *as well* into convenient forms.

The rationale for also considering Maven artifacts as first-class deliverables of a
Jena release is: Jena is 'mostly' a library which people use to write
applications, and modern build tools/systems (such as Maven, but not only that)
have dependency engines to transitively resolve dependencies as well as on-line
repositories where developers can easily find artifacts (including sources and test
packages). Once you have that, you rarely download a .zip or .tar.gz manually
as a developer.

... and we all felt the pain of failing to find an artifact of a library we 
want to use.

Fuseki is a separate module and deliverable.  It uses combined Jena as a
dependency but does not need to be part of the library build.


I agree.

Fuseki is something an end-user wants to: download, unzip, (load data) and run.

Eyeball is a separate module and deliverable.  It uses combined Jena as
a dependency but does not need to be part of the library build.


I mostly agree.

I've never used Eyeball much, however someone might want to include/use Eyeball 
in their application (with additional/custom extensions/checks).
For this reason, I would argue it would not be a bad idea to have Eyeball artifacts
(i.e. an eyeball-x.y.z.jar) published as Maven artifacts on the Apache Maven
Repository.


=== Questions and notes.

1/ We currently make some attempt to deliver the test suite in the zip
so people can locally run it to check an installation.  From memory, the
only thing this seems to catch is problems running the test suite, not
problems with installation.  Maybe it's not worth the effort.


+1 on removing testing.zip

Rationale: if you want to run the test suite you should be able to check out a
tagged source tree and type mvn test. For example:

   svn co https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/tags/ARQ-2.8.8/ arq
   cd arq
   mvn test

This is a much better way to let people run the test suite on their system 
(i.e. different OS, different JVM, etc.)

I do agree that it's not exactly the same as running the test suite against
arq-x.y.z.jar, but how many other Apache projects do you know who are doing
this? ;-)

However, it is sometimes useful to publish the test suite as a Maven artifact.
This way people can specify a dependency on that and reuse tests or utilities
we have in our test suites elsewhere. This is
the reason why, for example, we have
http://repo1.maven.org/maven2/com/hp/hpl/jena/arq/2.8.8/arq-2.8.8-tests.jar (as 
well as:
http://repo1.maven.org/maven2/com/hp/hpl/jena/arq/2.8.8/arq-2.8.8-test-sources.jar).
 I consider this a good practice and, if possible, I'd like to keep it.
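
As an illustration, a downstream project could reuse the published ARQ tests
jar with a declaration like the one below. The coordinates come from the URLs
above; the test scope is an assumption about how such a jar would typically
be used:

```xml
<dependency>
  <groupId>com.hp.hpl.jena</groupId>
  <artifactId>arq</artifactId>
  <version>2.8.8</version>
  <!-- type test-jar resolves to arq-2.8.8-tests.jar -->
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
```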

The ideal situation (and best practice) would be to have the files necessary to 
run the test suite included in that jar (i.e. arq-x.y.z-tests.jar). Maven has 
support for that, but people need to use
getSystemResourceAsStream() to read test files (as I am sure you know). At 
development time, those files must be in src/test/resources (for example, LARQ 
does this:
https://svn.apache.org/repos/asf/incubator/jena/Jena2/LARQ/trunk/src/test/resources/).
 This would be my favorite option, but it requires some changes.
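
A sketch of the build configuration this would need in each module, assuming
the standard maven-jar-plugin (no version is given here; it would normally be
inherited from a parent pom):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-jar-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <!-- Packages target/test-classes as <artifactId>-<version>-tests.jar -->
        <goal>test-jar</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```

Files under src/test/resources are copied into target/test-classes during the
build, so they end up inside the tests jar next to the compiled test classes.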

2/ The Apache top level POM has a list of versioned plugins in it which
we'd inherit.  Hopefully it helps with an Apache release but it does seem
quite a lot.  The default compilation is Java 1.4 -- we need to check
details.


LARQ pom.xml file, for example, has this:
https://svn.apache.org/repos/asf/incubator/jena/Jena2/LARQ/trunk/pom.xml

   <parent>
     <groupId>org.apache</groupId>
     <artifactId>apache</artifactId>
     <version>9</version>
   </parent>

However, it specifies Java 1.6 for compiling:

       <plugin>
         <groupId>org.apache.maven.plugins</groupId>
         <artifactId>maven-compiler-plugin</artifactId>
         <configuration>
           <source>${jdk.version}</source>
           <target>${jdk.version}</target>
           <encoding>${project.build.sourceEncoding}</encoding>
         </configuration>
       </plugin>

You can verify the effective pom.xml file using: mvn help:effective-pom
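
For the ${jdk.version} reference above to resolve, the same pom.xml (or a
parent) needs a properties block along these lines. This is a sketch, not a
copy of the LARQ file: the value 1.6 follows from the text above, while the
UTF-8 encoding is an assumption:

```xml
<properties>
  <!-- Used by the maven-compiler-plugin configuration for source and target -->
  <jdk.version>1.6</jdk.version>
  <!-- Assumed value; referenced as ${project.build.sourceEncoding} -->
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
```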


My point was we need to be careful.

So, technically, the fact that the Apache parent pom.xml has Java 1.4 as default
compilation isn't an issue.
I've not found problems with it, so far. This does not mean there aren't any...
but we should be able to override any behavior we don't like, and we control
if/when we upgrade from one version to another.

I think we will be better off having org.apache:apache:9 as parent pom
(directly or via our own parent pom), as suggested here:
http://www.apache.org/dev/publishing-maven-artifacts.html#inherit-parent

3/ For RDB, I propose creating a maven module and putting the code here
with a dependency of whatever version of Jena it is at the time then
leaving it frozen.  Alternatively, zip up the code and dump somewhere in
case anyone wants to port it.


+1 on having RDB as separate module (depending on Jena).

4/ Shall we leave the documentation out of the build and just have it on
the website?


What about javadocs?


All maven artifacts should have javadocs and source available.

I really don't understand projects that don't put -sources up as well. But then, I strongly prefer to attach the sources to the javadocs.
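
One way to get -sources and -javadoc jars attached to every artifact is the
pair of standard plugins below. This is a sketch: the execution ids are
arbitrary names, and versions are assumed to come from a parent pom:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-source-plugin</artifactId>
  <executions>
    <execution>
      <id>attach-sources</id>
      <goals>
        <!-- Builds <artifactId>-<version>-sources.jar -->
        <goal>jar-no-fork</goal>
      </goals>
    </execution>
  </executions>
</plugin>
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-javadoc-plugin</artifactId>
  <executions>
    <execution>
      <id>attach-javadocs</id>
      <goals>
        <!-- Builds <artifactId>-<version>-javadoc.jar -->
        <goal>jar</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```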

5/ Jump to maven 3?


Not sure why you are asking this.

I am still using Maven v2.x.y on my desktop (without problems) but we are using 
Maven v3.0.3 with some of our modules on Jenkins 
(https://builds.apache.org/view/G-L/view/Jena/) currently (let's cross
fingers) with no problems (and it should be more stable in the future).

"""

While Maven 3 aims to be backward-compatible with Maven 2.x to the extent possible, there are still a few significant changes.

"""


Paolo
