Dear JRuby hackers,

In a recent project, we are running JRuby scripts through Redbridge on top
of the Kettle ETL server, using this fantastic plugin:
https://github.com/type-exit/Ruby-Scripting-for-Kettle

So basically, in each thread of ETL data transformation, we instantiate a
new org.jruby.embed.ScriptingContainer, provide it with some context data
from the ETL dataflow, and give it a customizable scriptlet to interpret for
each data flow line to transform the data.

So far so good.
The issue is that some of those customizable scripts tend to use Ruby
goodness to generate lots of classes. Imagine calling a Rails script console
in that flow, for instance: classes would be generated for your models.
In our case we use the 'ooor' gem to connect to OpenERP and push/pull
information from it, but the principle is the same.

You might say it would be stupid to generate those classes for each new data
flow line. Of course we don't do that: we are able to generate those
classes only once per run of a Kettle transformation (using the "start
script" tab).

Still, we use Kettle as a long-running server (it's called "Carte"; the
purpose is to skip the slow Java startup time and let the JVM warm up),
and we execute those transformations over and over (for example,
importing orders from Amazon into OpenERP every hour).

As a result, this repeated class generation ends up bloating the PermGen
memory space, turning the server into a memory hog and finally failing with
an out-of-memory error.


Today, in the Ruby-Scripting-for-Kettle plugin, a ScriptingContainer is
instantiated this way:
container = new ScriptingContainer(LocalContextScope.SINGLETHREAD,
    LocalVariableBehavior.PERSISTENT);

What we want is for the generated proxy classes like OpenERPProduct and
OpenERPCustomer (generated by the ooor gem) to persist from one
transformation execution to another.
Playing with the several LocalContextScope options, the only way I found was
to create a single ScriptingContainer instance, which I then share (using
a Singleton pattern) between the various Ruby Kettle steps (i.e. living in
different threads), and to use the LocalContextScope.SINGLETON option. Then
I can avoid regenerating my proxy classes because they persist from one
scriptlet execution to another.
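
To make the "generate once, then reuse" idea concrete, here is a minimal
Ruby sketch of what the start script would ideally do inside a shared
runtime. ProxyRegistry and proxy_for are names made up for illustration,
and Class.new stands in for the real generation work; this is not how ooor
actually builds its proxies.

```ruby
# Hypothetical sketch: build a proxy class only once per Ruby runtime.
# ProxyRegistry and proxy_for are illustrative names, not part of ooor.
module ProxyRegistry
  LOCK = Mutex.new

  # Returns the class registered under `name`, generating it on first use.
  def self.proxy_for(name)
    LOCK.synchronize do
      if const_defined?(name, false)
        const_get(name, false)        # already generated: reuse it
      else
        const_set(name, Class.new)    # stand-in for the costly generation
      end
    end
  end
end

first  = ProxyRegistry.proxy_for(:OpenERPProduct)
second = ProxyRegistry.proxy_for(:OpenERPProduct)
```

Both calls hand back the very same class object, so nothing new is
generated on the second transformation run as long as the runtime survives.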

The issue is that with LocalContextScope.SINGLETON, each thread alters
the same global variables, and our several Ruby Kettle steps are very likely
to conflict with one another (this happens in my tests)!
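
The conflict can be reproduced in plain Ruby, outside of Kettle. In this
sketch two Ruby threads stand in for two Kettle steps sharing one runtime;
the variable name $current_row and the queues used to force the ordering
are assumptions for the sake of the example, not code from the plugin.

```ruby
require 'thread'

# Illustrative only: Ruby threads stand in for the Kettle step threads.
a_wrote = Queue.new
b_wrote = Queue.new
seen = {}

step_a = Thread.new do
  $current_row = "row from step A"                  # runtime-wide global
  Thread.current[:current_row] = "row from step A"  # visible to this thread only
  a_wrote << true
  b_wrote.pop                                       # wait until step B has written
  seen[:global] = $current_row
  seen[:thread_local] = Thread.current[:current_row]
end

step_b = Thread.new do
  a_wrote.pop                                       # run after step A's writes
  $current_row = "row from step B"
  Thread.current[:current_row] = "row from step B"
  b_wrote << true
end

[step_a, step_b].each(&:join)
```

After both threads finish, step A finds step B's value in the global, while
the value it stored under Thread.current is untouched.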

So we would need an option where global variables are not shared between
threads, and only the generated classes are shared.
Supposing I keep a single ScriptingContainer (with a non-SINGLETON
LocalContextScope, then), isn't there some Ruby placeholder where I could
put what I want to persist across executions, namely my generated class
proxies?


The alternative I see would be the -XX:+CMSClassUnloadingEnabled JVM option
(I haven't tested it yet). But this doesn't look like the optimal solution.
Also I wonder: if I use Ruby methods from, say, the OpenERPProduct proxy and
this class is regenerated at each data transformation run, would the JVM JIT
and inlining perform properly?


As far as I can see, the issue is much the same as with the jruby-rack gem.
In jruby-rack I see that you are using a pool of Ruby runtimes and that you
are not using Redbridge at all.
Would there be a solution with Redbridge for that kind of issue? What would
you advise?


Thank you very much for any help. Again, kudos and many thanks for the
fantastic work you have done on JRuby.


Raphaël Valyi
www.akretion.com
