Re: Extending Jena Text to Support ElasticSearch as Indexing/Querying Engine

Osma Suominen Fri, 03 Mar 2017 07:52:18 -0800

Hi Anuj,

Did you see my earlier message to the dev list? Are you subscribed to that list? I will Cc: you this time just to be sure. See http://jena.markmail.org/thread/uhs6cuhotzj4tjrj for the actual message in case you missed it (including some replies).

I see what you mean by deprecating Solr first before removing it, but I can't figure out how that would work in practice. If you're right about Solr 4.9.1 requiring Lucene 4.9.1, then we can't have Solr and ES support in Jena at the same time - unless we upgrade the Solr side as well, which seems a bit of a waste of time if you're going to remove it anyway.

Like I explained in JENA-1301, there are many problems with the Solr implementation and I doubt there are many users, quite possibly nobody at all.

In any case, switching indexing technologies for jena-text should be rather easy, as the text index itself doesn't need to be migrated - it can simply be rebuilt from the RDF data. So if someone runs, say, Fuseki 2.5.0 with a Solr index, then upgrading to (as yet hypothetical) Fuseki 2.6.0 with an ES index instead is just a matter of setting up ES, changing the text index configuration slightly and running jena.textindexer (or reloading the data, whichever is easier). There is no technical benefit from having support for both Solr and ES in the same Jena release as it doesn't make migration any easier, but of course, advance warning might help with planning the move to ES.


-Osma


03.03.2017, 16:43, anuj kumar wrote:

Hey,
 I just saw https://issues.apache.org/jira/browse/JENA-1301
Should we not first officially deprecate it and give any users of Solr a
chance to move to a different indexing technology?

BTW, I don't know yet how to log in to Apache JIRA.

Thanks,
Anuj Kumar

On Fri, Mar 3, 2017 at 1:23 PM, anuj kumar <[email protected]> wrote:

Hi Osma,
 I briefly looked at the pull request. I believe we need to upgrade Lucene
and Solr in one go, don't we? The reason is that Solr 4.9.1 depends on
Lucene 4.9.1.

Also, how do I log in to issues.apache.org, and where do I file this bug?

Thanks,
Anuj Kumar

On Fri, Mar 3, 2017 at 11:22 AM, Osma Suominen <[email protected]> wrote:

Hi Anuj,

It's great that we found agreement over this!

I've restarted the Lucene upgrade effort (JENA-1250) that had stalled and
made a PR [1] that implements the upgrade up to version 6.4.1 (with 5.5.4
as an intermediate step). I'll wait for comments on the PR and if people
think it's OK I will merge it soon to Jena master. Meanwhile, you can
already base your ES implementation on that branch [2] if you like.

Could you please open a JIRA issue on issues.apache.org explaining the
Elasticsearch support feature, so that we have a place for tracking this
work, requesting comments etc.?

Also I suggest we move the discussion around this to the developers' list
([email protected]) where it's more appropriate.

-Osma

[1] https://github.com/apache/jena/pull/219

[2] https://github.com/osma/jena/tree/jena-1250-lucene6


03.03.2017, 02:45, anuj kumar wrote:

I second that. I am now finalising the integration of ES and should have
a good production-quality implementation ready in a week's time. At that
time I would like you guys to have a look at the implementation and
provide feedback. Once you have upgraded Lucene to 6.4.1, I can merge
the code into the jena-text module and do a round of testing.

Thanks,
Anuj Kumar

On 2 Mar 2017 22:28, "A. Soroka" <[email protected]> wrote:

I do agree that trying to juggle different versions of Lucene libraries
is probably not a realistic option right now. Luckily (if I understand
the conversation thus far correctly) we have a solid alternative:
getting our current Lucene dependency upgraded should allow us to
(eventually) merge Anuj's work into the mainstream of development.
Someone please tell me if I have that wrong! :grin:

Let me reiterate that this seems like very good work and, speaking for
myself, I certainly want to get it included in Jena. It's just a
question of fitting it in correctly, which might take a bit of time.

---
A. Soroka
The University of Virginia Library

On Mar 1, 2017, at 1:27 PM, Osma Suominen <[email protected]> wrote:


Hi Anuj!

I have nothing against modularity in general. However, I cannot see how
your proposal could work in practice for the Fuseki build, for the
reasons I mentioned in my previous message (and Adam seemed to concur).


In any case, I'll see what I can do to get the Lucene upgrade moving
again. If all current Jena modules (i.e. jena-text and jena-spatial)
were upgraded to Lucene 6.4.1, then you could just add your ES classes
to jena-text, right? I think that would be better for everyone than
having to maintain your own separate module.


-Osma

01.03.2017, 16:59, anuj kumar wrote:

I personally have no preference as to how the code in Jena should be
structured, as long as I am able to use it :).
I do have a personal preference for doing it in a specific way because,
IMO, it is modular, which makes it much easier to maintain in the long
run. But again, it may not be the quickest one.


I have already been given a deadline by the company to have the ES
extension implemented in the next 15 days :). This means that I will be
maintaining the ES code extension to Jena Text at least locally for the
time being. I would be more than happy to contribute to the Jena
community whatever is required to have a proper Elasticsearch
implementation in place, whether within the jena-text module or as a
separate module. Until Lucene and Solr are upgraded to the latest
version, I will have to maintain a separate module for jena-text-es.

Cheers!
Anuj Kumar


On Wed, Mar 1, 2017 at 3:36 PM, A. Soroka <[email protected]> wrote:

Osma--


The short answer is that yes, given the right tools you _can_ have
different versions of code accessible in different ways. The longer
answer is that it's probably not a viable alternative for Jena for this
problem, at least not without a lot of other change.


You are right to point to the classloader mechanism as being at the
heart of this question, but I must alter your remark just slightly. From
"the Java classloader only sees a single, flat package/class namespace
and a set of compiled classes" to "ANY GIVEN Java classloader only sees
a single, flat package/class namespace and a set of compiled classes".

This is the fact that OSGi uses to make it possible to maintain strict
module boundaries (and even dynamic module relationships at run-time).
Each OSGi bundle sees its own classloader, and the framework is
responsible for connecting bundles up to ensure that every bundle has
what it needs in the way of types to function, based on metadata that
the bundles provide to the framework. It's an incredibly powerful system
(I use it every day and enjoy it enormously) but it's also very "heavy"
and requires a good deal of investment to use. In particular, it's
probably too large to put _inside_ Jena. (I frequently put Jena inside
an OSGi instance, on the other hand.)
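
As a rough illustration of the "ANY GIVEN classloader" point, the Java
sketch below loads the same class file through two sibling classloaders
and gets two distinct Class objects with the same name. It is only a
minimal demonstration under stated assumptions, not how OSGi is
implemented: LoaderIsolationDemo, Probe and IsolatingLoader are
hypothetical names, Probe merely stands in for a library class such as a
Lucene type, and the sketch assumes the compiled Probe class file is
reachable as a resource through the parent loader (Java 9 or later for
readAllBytes).

```java
import java.io.IOException;
import java.io.InputStream;

final class LoaderIsolationDemo {

    /** Hypothetical stand-in for a library class, e.g. some Lucene type. */
    static class Probe {}

    /** Defines the target class itself; delegates every other name to its parent. */
    static final class IsolatingLoader extends ClassLoader {
        private final String target;

        IsolatingLoader(ClassLoader parent, String target) {
            super(parent);
            this.target = target;
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve)
                throws ClassNotFoundException {
            if (!name.equals(target)) {
                // Normal parent-first delegation for everything else.
                return super.loadClass(name, resolve);
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    byte[] bytes = readBytes(name);
                    c = defineClass(name, bytes, 0, bytes.length);
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }

        /** Reads the class file bytes via the parent loader's resources. */
        private byte[] readBytes(String name) throws ClassNotFoundException {
            String path = name.replace('.', '/') + ".class";
            try (InputStream in = getParent().getResourceAsStream(path)) {
                if (in == null) {
                    throw new ClassNotFoundException(name);
                }
                return in.readAllBytes();
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    /** Loads Probe through two sibling loaders and returns both Class objects. */
    static Class<?>[] loadTwice() {
        String name = Probe.class.getName();
        ClassLoader parent = LoaderIsolationDemo.class.getClassLoader();
        try {
            Class<?> a = new IsolatingLoader(parent, name).loadClass(name);
            Class<?> b = new IsolatingLoader(parent, name).loadClass(name);
            return new Class<?>[] {a, b};
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Class<?>[] loaded = loadTwice();
        System.out.println("same name:  "
                + loaded[0].getName().equals(loaded[1].getName()));
        System.out.println("same class: " + (loaded[0] == loaded[1]));
    }
}
```

Each loader has its own flat namespace, so the two Probe classes coexist
without clashing. Frameworks built on this idea add the wiring between
loaders, which is where the "heavy" part comes from.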

Java 9 Jigsaw [1] offers some possibility for strong modularization of
this kind, but it's really meant for the JDK itself, not application
libraries. In theory, we could "roll our own" classloader management for
this problem. That sounds like more than a bit of a rabbit hole to me.
There might be another, more lightweight, toolkit out there for this
purpose, but I'm not aware of any myself.

Otherwise, yes, you get into shading and the like. We have to do that
for Guava for now because of HADOOP-10101 (grumble grumble) but it's
hardly a thing we want to do any more of than needed, I don't think.


---
A. Soroka
The University of Virginia Library

[1] http://openjdk.java.net/projects/jigsaw/

On Mar 1, 2017, at 9:03 AM, Osma Suominen <[email protected]> wrote:


Hi Anuj!

Thanks for the clarification.

However, I'm still not sure I understand the situation completely. I
know Maven can perform a lot of tricks, but Maven modules are just
convenient ways to structure a Java project. Maven cannot change the
fact that at runtime, module divisions don't really matter (except that
they usually correspond to package sub-namespaces). The Java classloader
only sees a single, flat package/class namespace and a set of compiled
classes (usually within JARs) in the classpath that it needs to check to
find the right classes, and if there are two versions of the same
library (e.g. Lucene) with overlapping class names, that's going to
cause trouble. The only way around that is to shade some of the
libraries, i.e. rename them so that they end up in another,
non-conflicting namespace. Apparently Elasticsearch also did some of
that in the past [1] but nowadays tries to avoid it.
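
The trouble with overlapping class names can be sketched with a toy
model of a flat classpath, where the first entry containing a class name
wins. This is only a simplified illustration in Java, not the JVM's
actual implementation: FlatClasspathModel is a hypothetical name, the
JAR contents are made up, and the two org.example class names are
invented to stand for classes that exist in only one of the versions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

final class FlatClasspathModel {

    /** Returns the first classpath entry, in order, that contains the class name. */
    static Optional<String> resolve(Map<String, Set<String>> classpath,
                                    String className) {
        for (Map.Entry<String, Set<String>> jar : classpath.entrySet()) {
            if (jar.getValue().contains(className)) {
                return Optional.of(jar.getKey());
            }
        }
        return Optional.empty();
    }

    /** Two Lucene versions on one classpath (illustrative contents only). */
    static Map<String, Set<String>> exampleClasspath() {
        Map<String, Set<String>> cp = new LinkedHashMap<>();
        cp.put("lucene-core-4.9.1.jar", Set.of(
                "org.apache.lucene.index.IndexWriter",
                "org.example.OnlyInOldVersion"));   // invented name
        cp.put("lucene-core-6.4.1.jar", Set.of(
                "org.apache.lucene.index.IndexWriter",
                "org.example.OnlyInNewVersion"));   // invented name
        return cp;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> cp = exampleClasspath();
        // The shared name always resolves to the first JAR on the classpath.
        System.out.println(resolve(cp, "org.apache.lucene.index.IndexWriter"));
        // A name found only in the second JAR still resolves there.
        System.out.println(resolve(cp, "org.example.OnlyInNewVersion"));
    }
}
```

In this model, code written against 6.4.1 would get the 4.9.1
IndexWriter together with 6.4.1-only classes, i.e. a mix of two
versions. Shading avoids that by renaming one copy's packages so the
names no longer overlap.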


Does your assumption 1 ("At a given point in time, only a single
Indexing Technology is used") imply that in the assembler configuration,
you cannot have ja:loadClass declarations for both Lucene and ES
backends? Or how do you run something like Fuseki that contains (in a
single big JAR) both the jena-text and jena-text-es modules with all
their dependencies, one of which requires the Lucene 4.x classes and the
other one the Lucene 6.4.1 classes? How do you ensure that only one of
them is used at a time, and that the Java classloader, even though it
has access to both versions of Lucene, only loads classes from the
single, correct one and not the other? Or do you need to have separate
"Fuseki-Lucene" and "Fuseki-ES" packages, so that you don't end up with
two Lucene versions within the same Fuseki JAR?


-Osma

[1] https://www.elastic.co/blog/to-shade-or-not-to-shade

01.03.2017, 11:03, anuj kumar wrote:

Hi Osma,

I understand what you are saying. There are ways to mitigate risks and
balance the refactoring without affecting the existing modules, but I
will not delve into those now. I am not enough of an expert in Jena to
say convincingly that it is possible without any hiccups. But I can take
a guess and say that it is indeed possible :)


For the question: "is it even possible to mix modules that depend on
different versions of the Lucene libraries within the same project?"

I actually do not understand what you mean by mixing modules. I assume
you mean having jena-text and jena-text-es as dependencies in a build
without causing the build to conflict. If that is what you mean, then
the answer is yes, it is possible and quite simple as well. Let me
explain how it is possible. But before that, some assumptions which I
want to call out explicitly.

*Assumptions:*
1. At a given point in time, only a single Indexing Technology is used
for text based indexing and searching via Jena. What this means is that
we will either use Lucene Implementation OR Solr Implementation OR ES
Implementation at any given point in time.
2. Fuseki build does not depend on any Lucene 4.9.1 specific classes but
only on jena-text classes, if at all.


Based on these assumptions it is possible to create a build that
contains jena-text based common classes + ES specific classes without
any compatibility issues. And it is in fact quite simple. I did it in
the current jena-text-es module and ran the entire build, which
succeeded. The key is to include the latest Lucene dependencies at the
very beginning in the pom and then include the jena-text dependency.
Maven will then automatically resolve the dependency issues by including
the Lucene libraries that we included in our ES-specific pom. Have a
look at the pom of the jena-text-es module here to see how it can be
done:

https://github.com/EaseTech/jena/blob/master/jena-text-es/pom.xml
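
The version selection described above can be sketched as a toy model of
Maven's dependency mediation, in which the declaration nearest to the
project wins and, at equal depth, the one declared first wins. This is a
simplified Java sketch on a flattened list rather than Maven's real
tree-based resolver: NearestWinsModel and Dep are hypothetical names,
and the jena-text version shown is only illustrative.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NearestWinsModel {

    /** One entry of a flattened dependency tree; depth 1 = declared in the pom itself. */
    static final class Dep {
        final String artifact;
        final String version;
        final int depth;

        Dep(String artifact, String version, int depth) {
            this.artifact = artifact;
            this.version = version;
            this.depth = depth;
        }
    }

    /** Picks one version per artifact: nearest wins, earlier declaration wins ties. */
    static Map<String, String> mediate(List<Dep> depsInDeclarationOrder) {
        Map<String, Dep> chosen = new LinkedHashMap<>();
        for (Dep d : depsInDeclarationOrder) {
            Dep current = chosen.get(d.artifact);
            // Replace only when strictly nearer, so ties keep the first entry.
            if (current == null || d.depth < current.depth) {
                chosen.put(d.artifact, d);
            }
        }
        Map<String, String> versions = new LinkedHashMap<>();
        for (Dep d : chosen.values()) {
            versions.put(d.artifact, d.version);
        }
        return versions;
    }

    /** A pom that declares Lucene 6.4.1 directly and pulls 4.9.1 in via jena-text. */
    static List<Dep> exampleDeps() {
        List<Dep> deps = new ArrayList<>();
        deps.add(new Dep("lucene-core", "6.4.1", 1)); // declared directly
        deps.add(new Dep("jena-text", "3.2.0", 1));   // illustrative version
        deps.add(new Dep("lucene-core", "4.9.1", 2)); // transitive via jena-text
        return deps;
    }

    public static void main(String[] args) {
        System.out.println(mediate(exampleDeps()));
    }
}
```

In this model lucene-core resolves to 6.4.1 because the direct
declaration is nearer than the transitive one; declaration order only
matters between entries at the same depth. The model says nothing about
whether jena-text code compiled against 4.9.1 still works with the
selected 6.4.1 classes at runtime.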


Thanks,
Anuj Kumar


On Wed, Mar 1, 2017 at 7:27 AM, Osma Suominen <[email protected]> wrote:


Hi Anuj,


I understand your concerns. However, we also need to balance between the
needs of individual modules/features and the whole codebase. I'm willing
to put in the effort to keep the other modules up to date with newer
Lucene versions. Lucene upgrade requirements are well documented; the
only hitches seen in JENA-1250 were related to how jena-text (ab)used
some Lucene features that were dropped from newer versions.

A perhaps stupid question to more experienced Java developers: is it
even possible to mix modules that depend on different versions of the
Lucene libraries within the same project? In my (quite limited)
understanding of Java projects and libraries, this requires special
arrangements (e.g. shading) as the Java package/class namespace is
shared by all the code running within the same JVM.


So can you create, say, a Fuseki build that contains the current
jena-text module (depending on Lucene 4.x) and the new jena-text-es
module (depending on Lucene 6.4.1) without any compatibility issues?


-Osma




01.03.2017, 00:47, anuj kumar wrote:

Hi,


My 2 cents:

The reason I proposed to have separate modules for Lucene, Solr and ES
is exactly to avoid the "All or Nothing" approach we need to take if we
club them all together. If they stay together and in the near future I
want to upgrade ES to another version, I also need to upgrade Lucene and
Solr again, and possibly another implementation that may have been added
in the meantime. As we all know, this means weeks of work, if not
months, to get the changes released. This will personally demotivate me
from doing anything, and I will probably start maintaining my own
version of Jena-Text, as that would be much simpler than upgrading,
testing and, in the process, owning (read: fixing bugs in) the upgrade
for each and every technology.

If they are developed as separate modules, they can evolve independently
of each other and we can avoid situations where we can't upgrade to the
latest version of Lucene because we do not know what effect it will have
on the Solr implementation.


We can start with having a separate module for Jena Text ES and see how
things go. If they go well, we could extract Solr and Lucene out of
Jena Text.


Again, this is just a suggestion based on my limited industry
experience.

Thanks,
Anuj Kumar



On Tue, Feb 28, 2017 at 5:23 PM, Osma Suominen <[email protected]> wrote:


28.02.2017, 17:12, A. Soroka wrote:


https://lists.apache.org/thread.html/dce0d502b11891c28e57bbcbb0cdef27d8374d58d9634076b8ef4cd7@1431107516@%3Cdev.jena.apache.org%3E
? In other words, might it be better to factor out between -text and
-spatial and _then_ try to upgrade the Lucene version?



I certainly wouldn't object to that, but somebody has to volunteer to do
the actual work!


I don't use the Solr component now, but I could easily see so doing...
that's pretty vague, I know, and I'm not in a position to do any work to
maintain it, so consider that just a very small and blurry data point.
:)



Last time I tried it (it was a while ago) I couldn't figure out how to
get it running... If you could just try that with some toy data, then
your data point would be a lot less blurry :) I haven't used Solr for
anything, so I'm not very familiar with how to set it up, and the
jena-text instructions are pretty vague unfortunately.


-Osma


--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi

--
*Anuj Kumar*


