Hi everyone!

My name is Brad and I'm based in Australia. I've been working on Rya 
full-time for a few months now as part of our organisation's 
comprehensive evaluation of Semantic Web technologies, and Rya in particular. 
We're experienced users of Accumulo. We had some policy issues to 
overcome with regard to contributing to the Apache project, but those 
are now resolved. I've been in contact with Adina along the way.

Rya seems pretty awesome, but it is held back by a lack of 
documentation, some unclean code and a few rough edges in getting 
started. For example, we could hook it up with Fluo Muchos to make it 
super easy for new people to spin up a working Rya cluster on an AWS or 
Azure cloud. My impression of Rya is that it is quite feature complete, 
but needs some work to be much more friendly to new adopters.

I put up a pull request last week that updated the Maven dependencies of 
the project. Any help reviewing that would be appreciated. I know you're 
all busy so there is no great rush, but I'd love to collaborate and hear 
your priorities too.

I'm about 70 commits deep into my work on Rya in our organisation's code 
repository, so I've been pretty busy. I'm now trying to finalise some 
changes. I've been testing the performance against the original code in 
a small test cluster, and for some queries I've made Rya much faster, 
and for others, slower. I'm working on more changes which I think should 
improve it further. I've started testing against the LUBM 5000 dataset, 
DBpedia and OpenPermID.

I'm new to the world of Semantic Web but fortunately I have some 
experienced colleagues helping me along the way. I've been marking 
tickets in Jira as I work on them, and I'm trying to publish my pull 
requests onto GitHub faster. Hopefully a bunch will start appearing soon.

Please expect a large pull request soon that changes Rya to use data 
types that align better with RDF4J, but otherwise doesn't change 
functionality. I have a refactor of the Accumulo DAO that is cleaner 
and, once finished, hopefully much faster. I have fixed a number of other 
tickets and improved some of the doco and configuration files. I'll try 
to make the pull requests clean and reviewable, but unfortunately many 
of the improvements I'm making depend on other improvements I've made, 
so it's a bit tricky to disentangle.

Some improvements I'll be putting up shortly also include:
- Enhance accumulo.rya to support the use of bloom filter
- Make timeout for SPARQL query configurable
- Add an IPAddressRyaTypeResolver
- NumberFormatException for large integers
- Tomcat configuration for indexers
- etc.

If anyone with more Rya experience wants to request particular features 
or functionality to be worked on, I'd love to hear from you. We're 
particularly interested in scaling Rya to very large data sets (thus 
performance is very important to us) and making Rya more generic in 
reading from other (pre-existing) Accumulo table layouts. I also want to 
fix reliability issues around indexing configuration and consistency of 
tables (for example, is there a MapReduce job that repairs the indexes 
if data is written from a misconfigured client?).

I hope to hear from you, and to hear your thoughts on the future direction of Rya.

Brad
