Adam Heath wrote:
Adrian Crum wrote:
I ran my patch against your recent changes and the errors went away. I
guess we can consider that issue resolved.

Yeah, I did make some changes to SequenceUtil a while back.  The biggest
functional change was to move some variables from the inner class to
the outer one, so they don't have to be accessed all the time.

As far as the approach I took to multi-threading the data load - here is
an overview:

I was able to run certain tasks in parallel - creating entities and
creating primary keys, for example. The number of threads allocated is
configured in a properties file. By tweaking that number I was able to
increase CPU utilization and reduce the creation time. Of course, there
was a threshold beyond which CPU utilization kept rising but creation
time stopped decreasing - due to thread thrash.
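
In rough outline, the plumbing looks like this (a minimal sketch - the
properties path and the "dataload.threads" key here are made up for
illustration, not the actual names from my patch):

    import java.io.FileInputStream;
    import java.util.Properties;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class LoaderPool {
        // Build a fixed-size pool from a thread count kept in a
        // properties file, so the number can be tuned by hand.
        public static ExecutorService fromProperties(String path) throws Exception {
            Properties props = new Properties();
            FileInputStream in = new FileInputStream(path);
            try {
                props.load(in);
            } finally {
                in.close();
            }
            // "dataload.threads" is a hypothetical key; default to 4.
            int threads = Integer.parseInt(props.getProperty("dataload.threads", "4"));
            return Executors.newFixedThreadPool(threads);
        }
    }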

So each entity creation itself was a separate work unit.  Once an
entity was created, you could submit the primary key creation as well.
 That's simple enough to implement (in theory, anyways).  This design
is starting to go towards the Sandstorm (1) approach.
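
Something like this, I mean (just a sketch - createEntity() and
createPrimaryKey() are placeholders standing in for the real entity
engine calls):

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class EntityLoad {
        // Placeholders for the real entity engine calls.
        static void createEntity(String name) { /* ... */ }
        static void createPrimaryKey(String name) { /* ... */ }

        static void load(List<String> entityNames) {
            final ExecutorService pool = Executors.newFixedThreadPool(8);
            for (final String name : entityNames) {
                pool.submit(new Runnable() {
                    public void run() {
                        createEntity(name);
                        // The entity now exists, so its primary key
                        // creation can go back into the pool as its
                        // own work unit.
                        pool.submit(new Runnable() {
                            public void run() {
                                createPrimaryKey(name);
                            }
                        });
                    }
                });
            }
        }
    }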

There are ways to find out how many cpus are available.  Look at
org.ofbiz.base.concurrent.ExecutionPool.getNewOptimalExecutor(); it
calls into ManagementFactory.
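
For reference, both of these standard calls report the CPU count:

    import java.lang.management.ManagementFactory;

    public class CpuCount {
        public static void main(String[] args) {
            // Two standard ways to ask the JVM how many CPUs it sees.
            System.out.println(Runtime.getRuntime().availableProcessors());
            System.out.println(ManagementFactory.getOperatingSystemMXBean()
                    .getAvailableProcessors());
        }
    }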

I don't think the number of CPUs is useful information. Even a single CPU system might benefit. From my perspective, the best approach is to have a human tweak the settings to get the result they want. I might be wrong, but I don't think you can do that automatically.

Creating foreign keys must be run on a single thread to prevent database
deadlocks.

Maybe.  If the entities and primary keys have already been created for
both sides of the foreign key, then shouldn't it be possible to submit
that work unit to the pool?

I don't know - I didn't spend a lot of time thinking about it. I just separated out the create-foreign-keys loop and executed it in a single thread. It would be fun to go back, analyze the code more, and come up with a multi-threaded solution.
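
One possible shape for it, sketched with plain futures (createForeignKey()
is a placeholder): the FK work unit waits on the futures for the
primary-key work on both sides of the relation, so unrelated FKs can
still run in parallel. The usual caveat applies - blocking inside a
bounded pool can itself deadlock if the pool fills up with waiting
tasks, so the pool sizing would need care.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Future;

    public class FkLoad {
        static void createForeignKey(String fk) { /* placeholder */ }

        // pkA and pkB are the futures returned when the primary-key
        // work units for the two sides of the FK were submitted.
        static Future<?> submitFk(final ExecutorService pool,
                final Future<?> pkA, final Future<?> pkB, final String fk) {
            return pool.submit(new Runnable() {
                public void run() {
                    try {
                        // Block until both sides of the relation exist.
                        pkA.get();
                        pkB.get();
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                    createForeignKey(fk);
                }
            });
        }
    }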

I multi-threaded the data load by having one thread parse the XML files
and put the results in a queue. Another thread services the queue and
loads the data. I also multi-threaded the EECAs - but that has an issue
I need to solve.
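
It's the classic producer/consumer split - a BlockingQueue does most of
the work (a sketch; Record is a made-up stand-in for whatever the
parser produces):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class ParseLoadPipeline {
        static class Record { }                 // stand-in for parsed data
        static final Record EOF = new Record(); // poison pill

        public static void main(String[] args) throws InterruptedException {
            final BlockingQueue<Record> queue = new ArrayBlockingQueue<Record>(1024);

            Thread parser = new Thread(new Runnable() {
                public void run() {
                    // ... parse the XML files, queue.put(record) for each ...
                    try {
                        queue.put(EOF); // tell the loader we're done
                    } catch (InterruptedException e) { }
                }
            });
            Thread loader = new Thread(new Runnable() {
                public void run() {
                    try {
                        Record r;
                        while ((r = queue.take()) != EOF) {
                            // ... load r into the database ...
                        }
                    } catch (InterruptedException e) { }
                }
            });
            parser.start();
            loader.start();
            parser.join();
            loader.join();
        }
    }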

Hmm.  You dug deeper, splitting the work up into separate calls.  I
hadn't done that yet, and just dumped each XML file onto a separate
thread.  My approach is obviously wrong.

My original goal was to reduce the ant clean-all + ant run-install cycle
time. I recently purchased a much faster development machine that
completes the cycle in about 2 minutes - only slightly longer than the
multi-threaded code takes - so I don't have much of an incentive to
develop the patch further.

I've reduced the time it takes to do a run-tests loop.  The changes
I've made to log4j.xml reduce the *extreme* debug logging produced by
several classes.  log4j would create a new exception so that it could
get the correct class and line number to print to the log.  This is a
heavy-weight operation.  It mostly showed up as slowness when
Catalina would start up, so this set of changes doesn't directly
affect the run-install cycle.
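
For anyone who wants to check their own configuration: in log4j 1.x the
exception-based caller lookup is triggered by the %C, %F, %L and %M
conversion characters in PatternLayout, so a cheap pattern simply
avoids them. Something along these lines (an illustrative appender, not
the exact one from my changes):

    <appender name="main" class="org.apache.log4j.ConsoleAppender">
      <layout class="org.apache.log4j.PatternLayout">
        <!-- %c (category name) is cheap; %C/%F/%L/%M make log4j create
             a Throwable to locate the caller, which is heavy-weight. -->
        <param name="ConversionPattern" value="%d{ISO8601} %-5p [%c] %m%n"/>
      </layout>
    </appender>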

I had to disable logging entirely in the patch. The logger would get swamped and throw an exception - bringing everything to a stop.

The whole experience was an educational one. There is a possibility the
techniques I developed could be used to speed up import/export of large
datasets. If anyone is interested in that, I am available for hire.

We have a site where users could upload original images (6 of them),
then fill out a bunch of form data, then some PDFs would be generated.
I would submit a bunch of image resize operations (we had to make 2
reduced-size images for each of the originals).  All of those are able
to run in parallel.  Then, once all the images were done, the 2 PDF
jobs would be submitted.  This entire pipeline might itself be run in
parallel too, as the user could have multiple such records that needed
to be updated.
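
As a sketch with futures (resize() and buildPdf() are placeholders, and
the two target sizes and PDF names are invented): submit every resize
at once, wait on all of them, then submit the PDFs.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Future;

    public class ImagePipeline {
        static void resize(String image, int size) { /* placeholder */ }
        static void buildPdf(String name) { /* placeholder */ }

        static void process(ExecutorService pool, List<String> originals)
                throws Exception {
            List<Future<?>> resizes = new ArrayList<Future<?>>();
            for (final String img : originals) {
                // 2 reduced-size copies per original, all in parallel.
                for (final int size : new int[] {800, 200}) {
                    resizes.add(pool.submit(new Runnable() {
                        public void run() { resize(img, size); }
                    }));
                }
            }
            // The PDFs depend on every resized image, so wait first.
            for (Future<?> f : resizes) {
                f.get();
            }
            buildPdf("summary.pdf");
            buildPdf("detail.pdf");
        }
    }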

1: http://www.eecs.harvard.edu/~mdw/proj/seda/

