[hibernate-dev] Re: Hibernate Search: massive batch indexing

Emmanuel Bernard Mon, 09 Jun 2008 11:11:29 -0700


On  Jun 7, 2008, at 20:14, Sanne Grinovero wrote:

thanks for your insights :-)
I'll try to explain myself better inline:

2008/6/7 Emmanuel Bernard <[EMAIL PROTECTED]>:
This sounds very promising.
I don't quite understand why you talk about loading lazy objects though? One of the recommendations is to load the object and all its related objects before indexing. No lazy triggering should happen.
eg "from User u left join fetch u.address a left join fetch a.country"
if Address and Country are embedded in the User index.
I am talking about the lazy object loading because it is not always possible to
load the complete object graph eagerly because of the cartesian problem;
the "hints" I mention in point A are mainly (but not limited to) the left join
fetch instruction needed to load the root entity.
However if I put all needed collections in the fetch join I kill the DB
performance and am flooded by data; I have made many experiments to
find the "gold balance" between eager and lazy and know for sure it is much
faster keeping most stuff out of the initial "fetch join".
My current "rule of thumb" is to load no more than two additional collections,
the rest goes lazy.
Also we should keep in mind that the eager/lazy/subselect strategies
chosen for the entities will probably be selected for
"normal" business operations fine-tuning and not for indexing performance;
I had to fight somewhat with other devs needing some settings for
other use cases in a different way than what I needed to bring indexing
timings down.

I understand. You could use Hibernate.initialize and batch-size upfront to help in this area *before* passing it to Hibernate Search.
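
To put rough numbers on the trade-off discussed here, the following is a minimal sketch in plain Java (no Hibernate dependency) of why join-fetching several collections multiplies the rows returned per root entity, while lazy loading with a batch size costs extra round trips instead. The class name, method names and sample sizes are made up for illustration; the only assumption about Hibernate is that batch fetching initializes up to batch-size lazy collections of one role per query.

```java
import java.util.List;

// Back-of-the-envelope model only; names and numbers are illustrative.
class FetchCostSketch {

    // Rows returned per root entity when every collection is join-fetched
    // in a single query: the collection sizes multiply (cartesian product).
    static long joinedRowsPerRoot(List<Integer> collectionSizes) {
        long rows = 1;
        for (int size : collectionSizes) {
            rows *= Math.max(size, 1); // with an outer join an empty collection still yields one row
        }
        return rows;
    }

    // Queries needed to initialize one lazy collection role for `roots`
    // entities when up to `batchSize` of them are loaded per query.
    static long lazyQueries(long roots, int batchSize) {
        return (roots + batchSize - 1) / batchSize; // ceiling division
    }

    public static void main(String[] args) {
        List<Integer> sizes = List.of(5, 8, 10); // three collections on one root entity
        System.out.println("rows per root, all join-fetched: " + joinedRowsPerRoot(sizes));
        System.out.println("queries for 100000 roots, batch size 1:  " + lazyQueries(100_000, 1));
        System.out.println("queries for 100000 roots, batch size 50: " + lazyQueries(100_000, 50));
    }
}
```

With these made-up sizes, join-fetching all three collections returns 400 rows per root entity, against 23 rows (5 + 8 + 10) if each collection is loaded separately; the price of the lazy route is the number of extra queries, which the batch size divides.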

I think the E limitation is fine, we can position this API as offline indexing of the data. That's fair enough for a start. I don't like your block approach unless you couple it with JMS. I am uncomfortable in keeping work to be done for a few hours in a VM without a persistence mechanism.
I am glad to hear it's fine to position it as "offline" API, as a start.
Do you think we should enforce or check it somehow?


Let's add a modal box "Are you sure?" ;)
I don't think you can really enforce that (especially on a cluster).


For later improvements the batching IndexWriter could be "borrowed" by
the committing transactions to synchronously write their data away,
we just need to avoid the need for an IndexReader for deletions;
I've been searching for a solution in my other post... if that could be fixed
and a single IndexWriter per index could be available you could
have batch indexing and normal operation available together.


I will answer on the second post.

"this pool is usually the slowest as it has to initialize many lazy fields,
so there are more threads here."
I don't quite understand why this happens.
I suppose I should show you an ER diagram of our model; in our case, but I believe
in most cases, people will search for an object basing their "fulltext" idea on many different
fields which are external to the main entity: intersecting e.g. author nickname with historic period,
considering book series, categories and collections, or by a special code in one of
30 other legacy library encoding schemes.
The use case actually shows that very few fields are read from the root entity, but most
are derived from linked many-to-many entities, sometimes going to a second or third level
of linked information. I don't think this is just my case, IMHO it is very likely most
real world applications will have a similar problem, we have to encode in the root
object many helper fields to make most external links searchable; I believe this is part of the
"dealing with the mismatch between the index structure and the domain model"
which is Search's slogan (pasted from homepage).
So what is the impact of your code on the current code base? Do you need to change a lot of things? How fast do you think you could have a beta in the codebase?
I still have not completely understood the locks around the indexes; I believe the impact on current code is not so huge, I would need to know
how I should "freeze" other activity on the indexes: indexing could just start but other threads will be waiting a long time; should other
methods check and throw an exception when mass indexing is busy?


Let's not envision an exception for the moment.

The locks must be acquired in a specific order; aside from that, this should be straightforward.
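
As an illustration of why a fixed acquisition order matters (two threads that lock the same indexes in opposite orders can deadlock), here is a minimal sketch using plain java.util.concurrent locks. The class is hypothetical and the alphabetical-by-index-name order is an assumption chosen for the example; the thread does not say which order Hibernate Search actually uses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical helper: one lock per index name, always taken in name order.
class OrderedIndexLocks {
    private final Map<String, ReentrantLock> locks = new TreeMap<>();

    OrderedIndexLocks(List<String> indexNames) {
        for (String name : indexNames) {
            locks.put(name, new ReentrantLock());
        }
    }

    // Locks the requested (known) indexes in sorted order, whatever order the
    // caller listed them in, and returns that order for the later release.
    List<String> lockAll(List<String> wanted) {
        List<String> sorted = new ArrayList<>(wanted);
        Collections.sort(sorted);
        for (String name : sorted) {
            locks.get(name).lock();
        }
        return sorted;
    }

    // Releases in reverse acquisition order.
    void unlockAll(List<String> lockedInOrder) {
        for (int i = lockedInOrder.size() - 1; i >= 0; i--) {
            locks.get(lockedInOrder.get(i)).unlock();
        }
    }

    public static void main(String[] args) {
        OrderedIndexLocks locks = new OrderedIndexLocks(List.of("User", "Modern", "Operator"));
        List<String> held = locks.lockAll(List.of("User", "Modern"));
        try {
            System.out.println("locked in order: " + held);
        } finally {
            locks.unlockAll(held);
        }
    }
}
```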


Is it ok for one method to spawn 40 threads?

It's OK if there is only one call per VM doing that. If every client does that, then that's not good :)


What should the "management / progress monitor API" look like?

Maybe like the Hibernate Statistics. It depends on what the API should do.
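
One possible reading of "like the Hibernate Statistics" is a read-only object with counters that can be polled at any time while the worker threads update them. The sketch below is only that idea in plain Java; the class and its getters are hypothetical and do not exist in Hibernate Search.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical progress view: worker threads bump the counters,
// a management client polls the getters.
class IndexingProgressSketch {
    private final long totalEntities;
    private final AtomicLong loaded = new AtomicLong();
    private final AtomicLong documentsWritten = new AtomicLong();

    IndexingProgressSketch(long totalEntities) {
        this.totalEntities = totalEntities;
    }

    // Called by the loading and writing threads respectively.
    void entityLoaded() { loaded.incrementAndGet(); }
    void documentWritten() { documentsWritten.incrementAndGet(); }

    long getLoadedCount() { return loaded.get(); }
    long getDocumentsWrittenCount() { return documentsWritten.get(); }

    // Percentage of documents written, 0 to 100; 100 when there is nothing to index.
    int getPercentComplete() {
        if (totalEntities == 0) {
            return 100;
        }
        return (int) (documentsWritten.get() * 100 / totalEntities);
    }
}
```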

I didn't look at similarity and sharding, is it ok for a first beta to avoid these features? I don't think it should be difficult to figure out, but I would like
to show working code prototypes asap to have early feedback.


no problem

I think that if the answers to the above questions don't complicate my current code the effort to integrate it is less than a week of work; unfortunately this translates into 4-6 weeks of time as I have other jobs and deadlines, maybe less with some luck.
How should this be managed? a branch? one commit when done?

If you don't disrupt the rest of the features, then you can apply them in trunk; if you are afraid, then do a branch. But branches are a pain to merge back in SVN.

Let's spin a different thread for the "in transaction" pool, I am not entirely convinced it actually will speed up things.
Yes I agree there probably is not a huge advantage, if any; the main reason would be to have "normal operation" available
even during mass reindexing, performance improvements would be limited
to special cases such as a single thread committing several entities: the "several" would benefit from batch behavior.
The other thread I had already started is linked to this: IMHO we should improve the deletion of entities first.

On  Jun 6, 2008, at 18:51, Sanne Grinovero wrote:

Hello list,
I've finally finished some performance tests on stuff I wanted to double-check
before writing stupid ideas to this list, so I feel I can at last propose
some code for (re)building the index for Hibernate Search.

The present API of Hibernate Search provides a nice and safe
transactional "index(entity)",
but even when trying several optimizations it doesn't reach the speed
of an unsafe (out of transaction) indexer we use in our current
production environment.
Also, reading the forum, it appears that many people are having difficulties using
the current API; even with a good example in the reference documentation
some difficulties arise with Seam's transactions and with huge datasets.
(I'm NOT saying something is broken, just that you need a lot of expertise
to get it going)

SCENARIO
=======

* Developers change an entity and want to test the effect on the index structure;
  they want to do search experiments with the new fields.
* A production system is up(down)graded to a new(old) release, involving index changes.
  (the system is "down for maintenance" but the speed is crucial)
* Existing index is corrupted/lost. (Again, speed to recover is critical)
* A database backup is restored, or data is changed by other jobs.
* Some crazy developer like me prefers to disable H.Search's event listeners for some reason.
  (I wouldn't generally recommend it, but have met other people who have a reasonable
  argument to do this. Also in our case it is a feature as newly entered books will be
  available for loans only from the next day :D)
* A Lucene update breaks the index format (not so irrational as they just did on trunk).

PERFORMANCE
=======

In simple use cases, such as fewer than 1000 entities and not too many relationships,
the existing API outperforms my prototype, as I have some costly setup.
In more massive tests the setup costs are easily recovered by a much faster indexing speed;
I have a lot of data I could send, I'll just show some and keep the details simple:
entity "Operator": standard complexity, involves loading of +4 objs, 7 fields affect index data
entity "User": moderate complexity, involves loading of +- 20 objs, 12 affect index data
entity "Modern": high complexity, loading of 44 entities, many are "manyToMany", 25 affect index data

On my laptop (dual core, local MySQL db):
type            Operator                User            Modern
number          560                     100.000         100.000
time-current    0,23 secs               45''            270.3''
time-new        0,43 secs               30''            190''
On a staging server (4 core Xeon with lots of ram and dedicated DB server):
type            Operator                User            Modern
number          560                     200.000         4.000.000
time-current    0,09 secs               130''           5h20'
time-new        0,25 secs               22''            19'

[benchmark disclaimer:
These timings are meant to be relative to each other for my particular
code version, I'm not an expert of Java benchmarking at all.
Also unfortunately I can't really access the same hardware for each test.
I used all possible tweaks I am aware of in Hibernate Search, actually
enabling new needed params to make the test as fair as possible.]

Examining the numbers:
with the current recommended H.Search strategy I can index 560 simple entities
in 0,23 seconds; quite fast and newbie users will be impressed.
At the other extreme, we index 4 million complex items, but I need more
than 5 hours to do that; this is more like a real use case and it could
scare several developers.
 Unfortunately I don't have a complete copy of the DB on my laptop,
but looking at the numbers it looks like my laptop could finish
in 3 hours, nearly double the speed of our more-than-twice-as-fast server.
(yes I've had several memory leaks :-) but they're solved now)
 The real advantage is the round-trip to database: without multiple
threading each lazy loaded collection somehow annotated to be indexed
massively slows down the whole process; if you look at both DB and AS
servers, they have very low resource usage confirming this, while my laptop
stays at 70% cpu (and killing my harddrive) because it has data available
locally, producing a constant feed of strings to my index.
 When using the new prototype (about 20 threads in 4 different pools)
I get the 5 hours down to less than 20 minutes; also I can start the
indexing of all 7 indexable types in parallel and it will stay around 20 minutes.
The "User" entity is not as complex as Modern (less lazy loaded data)
but confirms the same numbers.
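
For readers who want the ratios rather than the raw timings, the small sketch below redoes the arithmetic on the staging-server figures from the table above (5h20' against 19' for 4.000.000 "Modern" entities, 130'' against 22'' for 200.000 "User" entities). The class and method names are illustrative; only the input numbers come from the table.

```java
// Plain arithmetic on the staging-server timings quoted in the table.
class BenchmarkRatios {

    // How many times faster the new run is than the current one.
    static double speedup(double currentSeconds, double newSeconds) {
        return currentSeconds / newSeconds;
    }

    // Average indexing throughput over the whole run.
    static double entitiesPerSecond(long entities, double seconds) {
        return entities / seconds;
    }

    public static void main(String[] args) {
        double modernCurrent = 5 * 3600 + 20 * 60; // 5h20' = 19200 seconds
        double modernNew = 19 * 60;                // 19'   = 1140 seconds
        System.out.println("Modern speedup: " + speedup(modernCurrent, modernNew));
        System.out.println("User speedup:   " + speedup(130, 22));
        System.out.println("Modern entities/s, current: " + entitiesPerSecond(4_000_000, modernCurrent));
        System.out.println("Modern entities/s, new:     " + entitiesPerSecond(4_000_000, modernNew));
    }
}
```

That works out to roughly a 17x speedup for "Modern" and about 6x for "User" on the staging server, or about 3500 entities per second against about 200.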

ISSUES
=======
About the current version I have ready:
it is not a complete substitute for the current one and is far from perfect;
currently these limitations apply but could be easily solved
(others I am not aware of are not listed :-)

A) I need to "read" some hints for each entity; I tinkered with a new annotation,
 configuration properties should work but are likely to be quite verbose (HQL);
 basically I need some hints about fetch strategies appropriate
 for batch indexing, which are often different than normal use cases.

B) Hibernate Search's indexing of related entities was not available when I designed it.
 I think this change will probably not affect my code, but I still need to
 verify the functionality of IndexEmbedded.
C) It is fine-tuned for our entities and DB, many variables are configurable but
 some stuff should be made more flexible.
D) Also index sharding didn't exist at the time, I'll need to change some stuff
 to send the entities to the correct index and acquire the appropriate locks.
The next limitation is not easy to solve, I have some ideas but none I liked.
E) It is not completely safe to use it during other data modification;
 it's not a problem in our current production but needs much warning in case other people
 want to use it.
 The best solution I could think of is to lock the current work queue of H.Search,
 so as to block execution of work objects in the queue and resume the execution of
 these work objects after batch indexing is complete.
 If some entity disappears (removed from DB but a reference is in the queue) it
 can easily be skipped, if I index an "old version" of some other data it will be
 fixed when scheduled updates from H.S. event listeners are resumed
 (and the same for new entities).
 It would be nice to share the same database transaction during the whole process,
 but as I use several threads and many separate sessions I think this is not possible
 (this is the best place to ask I think ;-)
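
The idea in point E (park incoming work while the batch indexer runs, then replay it and skip entities that have vanished) can be sketched in a few lines of plain Java. Everything here is hypothetical: the class stands in for the Search work queue, entity ids stand in for work objects, and the set of ids still in the database is passed in by the caller rather than looked up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the work queue; ids stand in for work objects.
class PausableWorkQueueSketch {
    private final List<Long> held = new ArrayList<>();
    private final List<Long> executed = new ArrayList<>();
    private boolean paused;

    // Called when batch indexing starts.
    synchronized void pause() {
        paused = true;
    }

    // While paused, work is parked instead of executed.
    synchronized void submit(long entityId) {
        if (paused) {
            held.add(entityId);
        } else {
            executed.add(entityId);
        }
    }

    // Called when batch indexing is complete: replays parked work, skipping
    // entities deleted in the meantime. Returns how many were skipped.
    synchronized int resume(Set<Long> idsStillInDatabase) {
        int skipped = 0;
        for (long id : held) {
            if (idsStillInDatabase.contains(id)) {
                executed.add(id);
            } else {
                skipped++;
            }
        }
        held.clear();
        paused = false;
        return skipped;
    }

    // Snapshot of the work executed so far, in order.
    synchronized List<Long> executed() {
        return new ArrayList<>(executed);
    }
}
```

Note that the parked work lives only in memory in this sketch, so it would be lost if the VM went down during a long indexing run.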

GOING PRACTICAL
===============
if (cheater) goto :top

A nice evictAll(class) exists, I would like to add indexAll(class).
It would be nice to provide non-blocking versions, maybe overloading:
indexAll(Class clazz, boolean block)
or provide a Future as return object, so people could wait for one
or more indexAll requests if they want to.
There are many parameters to tweak the indexing process, so I'm
not sure if we should put them in the properties, or have a
parameters-wrapper object indexAll(Class clazz, Properties prop), or
something like makeIndexer(Class clazz) returning a complex object
with several setters for fine-tuning and start() and awaitTermination()
methods.
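
To make the last option more concrete, here is a minimal sketch of what a makeIndexer-style object could look like: chained setters for tuning, a non-blocking start() that hands back a Future, and a blocking awaitTermination(). The class name, the loaderThreads knob and the placeholder indexAll body are all assumptions made for illustration; this is not an existing Hibernate Search API, and the object is one-shot (start() may be called once).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical API shape only; the indexing work itself is a placeholder.
class MassIndexerSketch {
    private final Class<?> type;
    private int loaderThreads = 2; // example of a tuning knob
    private final ExecutorService coordinator = Executors.newSingleThreadExecutor();

    MassIndexerSketch(Class<?> type) {
        this.type = type;
    }

    // Setter-style tuning that returns this, so calls can be chained.
    MassIndexerSketch loaderThreads(int n) {
        this.loaderThreads = n;
        return this;
    }

    // Non-blocking: returns at once with a Future for the number of entities indexed.
    Future<Integer> start() {
        Callable<Integer> job = () -> indexAll(type, loaderThreads);
        Future<Integer> result = coordinator.submit(job);
        coordinator.shutdown(); // no further jobs; the submitted one still runs to completion
        return result;
    }

    // Blocking: true if indexing finished within the timeout.
    boolean awaitTermination(long timeout, TimeUnit unit) {
        try {
            return coordinator.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Placeholder for the real work; a real version would drive the indexing pools.
    private static int indexAll(Class<?> type, int loaderThreads) {
        return 0;
    }

    public static void main(String[] args) {
        MassIndexerSketch indexer = new MassIndexerSketch(String.class).loaderThreads(4);
        Future<Integer> pending = indexer.start(); // the caller may go on with other work
        boolean done = indexer.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("finished: " + (done && pending.isDone()));
    }
}
```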

the easy part
--------------
This part is easy to do as I have it working well, it is a pattern
involving several executors; the size of each thread pool and of the
linking queues between them gives the good balance to achieve the
high throughput.
First the entities are counted and divided in blocks, these ranges are fed to
N scrollables opened in N threads, each thread begins iterating on the
list of entities and feeds detached entities to the next pool using
BlockingQueues.
In the next pool the entities are re-attached using Lock.none, readonly, etc.
(and many others you may want to tell me) and we get an appropriate
DocumentBuilder from the SearchFactory to transform it into a Lucene Document;
this pool is usually the slowest as it has to initialize many lazy fields,
so there are more threads here.
Produced documents go to a smaller pool (best I found was 2-3 threads)
where data is concurrently written to the IndexWriter.
There's an additional thread for resource monitoring to produce some hints
about queue sizing and idle threads, to do some fine-tuning and to see instant
speed reports in logs when enabled.
For shutdown I use the "poison pill" pattern, and I usually get rid of all
threads and executors when I'm finished.
It needs some adaptation to take into account the latest Search features
such as similarity, but is mostly beta-ready.
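
The staged pattern described above can be sketched with nothing but java.util.concurrent. This is a toy, not the prototype from this mail: integers stand in for entities, strings for Lucene documents and a concurrent set for the IndexWriter; the queue capacity of 100 and all names are arbitrary choices for the example, and the resource-monitoring thread is left out.

```java
import java.util.Set;
import java.util.concurrent.*;

// Toy three-stage pipeline: load -> transform -> write, linked by bounded queues.
class IndexingPipelineSketch {
    private static final Integer ENTITY_PILL = Integer.MIN_VALUE; // poison pill, stage 1 -> 2
    private static final String DOC_PILL = "<poison>";            // poison pill, stage 2 -> 3

    static Set<String> run(int total, int loaders, int transformers, int writers) {
        BlockingQueue<Integer> entities = new ArrayBlockingQueue<>(100);
        BlockingQueue<String> documents = new ArrayBlockingQueue<>(100);
        Set<String> index = ConcurrentHashMap.newKeySet();
        ExecutorService loadPool = Executors.newFixedThreadPool(loaders);
        ExecutorService transformPool = Executors.newFixedThreadPool(transformers);
        ExecutorService writePool = Executors.newFixedThreadPool(writers);
        try {
            // Stage 1: split [0, total) into one contiguous block per loader thread.
            int block = (total + loaders - 1) / loaders;
            for (int i = 0; i < loaders; i++) {
                int from = Math.min(i * block, total), to = Math.min(from + block, total);
                loadPool.submit(() -> {
                    for (int id = from; id < to; id++) entities.put(id);
                    return null;
                });
            }
            // Stage 2: turn each entity into a document (the slow stage in the real case).
            for (int i = 0; i < transformers; i++) {
                transformPool.submit(() -> {
                    for (Integer e = entities.take(); !e.equals(ENTITY_PILL); e = entities.take()) {
                        documents.put("doc-" + e);
                    }
                    return null;
                });
            }
            // Stage 3: a few threads write the documents to the shared index.
            for (int i = 0; i < writers; i++) {
                writePool.submit(() -> {
                    for (String d = documents.take(); !d.equals(DOC_PILL); d = documents.take()) {
                        index.add(d);
                    }
                    return null;
                });
            }
            // Shut down stage by stage: one pill per consumer once its producers are done.
            drain(loadPool);
            for (int i = 0; i < transformers; i++) entities.put(ENTITY_PILL);
            drain(transformPool);
            for (int i = 0; i < writers; i++) documents.put(DOC_PILL);
            drain(writePool);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        } finally {
            loadPool.shutdownNow();
            transformPool.shutdownNow();
            writePool.shutdownNow();
        }
        return index;
    }

    // Stops accepting tasks and waits for the running ones to finish.
    private static void drain(ExecutorService pool) throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

A call such as IndexingPipelineSketch.run(1000, 4, 8, 2) gives the middle stage the most threads, mirroring the sizing described above; the bounded queues make fast stages wait for slow ones instead of filling memory.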

the difficult part
-------------------
Integrating it with the current locking scheme is not really difficult,
also because the goal is to minimize downtime so I think some downtime
should be acceptable.
It would be very nice however to integrate this pattern as the default
writer for indexes, even "in transaction"; I think it could be possible
even in synchronous mode to split the work of a single transaction across
the executors and wait for all the work to be done at commit.
You probably don't want to see the "lots of threads" meant for batch indexing,
but the pools scale quite well to adapt themselves to the load,
and it's easy (as in clean and maintainable code) to enforce resource limits.
When integrating at this level the system wouldn't need to stop regular
Search activity.

Any questions? If someone wants to reproduce my benchmarks I'll
be glad to send my current code and DB.

kind regards,
Sanne

_______________________________________________
hibernate-dev mailing list
hibernate-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/hibernate-dev
