Hello,

because of HSEARCH-268 (optimize indexes in parallel), but also for other purposes, I need to define a new thread pool in Hibernate Search's Lucene backend. The final effect will be that changes to different indexes are performed in parallel. I consider this a major improvement, and it is currently easy to implement, provided we solve the following problems.
The question is how to size the pool properly, and how the parallel workers should interact, especially regarding commit failures and rollbacks.

about the size
==============

I've considered some options:

1) "steal" the configuration setting from BatchedQueueingProcessor, making that implementation single-threaded and reusing the parameter internally in the Lucene backend only (JMS doesn't need it AFAIK). I'm afraid this could break the configuration parsing of custom-made backends.

2) add a new parameter to the environment.

3) use a size equal to the number of DirectoryProviders (this is the optimal value; it could be the default and be overridden by a parameter).

4) change the contract of BackendQueueProcessorFactory: instead of returning one Runnable it returns a list of Runnables, so it's possible to use the existing Executor. This needs some consideration about how different Runnables have to "join the same TX". The JMS implementation could return just one Runnable, so no worry about that.

about transactions
==================

As you know, Search is not using a two-phase commit between DB and index, but Emmanuel has a very cool vision about that: we could add it later.

The problem is what to do if a Lucene index update fails (e.g. index A is corrupted): should we cancel the tasks that are going to make changes to the other indexes, B and C? That would be possible, but I don't think you would like that: after all, the database changes are committed already, so I should actually make a "best effort" to update all indexes which are still working correctly.

Another option would be to make the changes to all indexes, and call IndexWriter.commit() on each of them only after all the changes are done. This is the opposite of the previous approach, and also more complex to implement. I personally don't like this, but would like to hear more voices as it is an important matter.

I think Search should work on a "best effort" basis for the next release: update all indexes it is able to.
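To make the "best effort" idea concrete, here is a minimal sketch combining it with option 3 (pool size equal to the number of indexes). It uses only java.util.concurrent; the class name BestEffortIndexUpdater, the method applyAll and the use of plain Runnables as stand-ins for per-index Lucene work are all hypothetical and not part of Hibernate Search or Lucene.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: names are illustrative, not Hibernate Search API.
class BestEffortIndexUpdater {

    /**
     * Runs one task per index in parallel and returns the names of the
     * indexes whose task failed. A failure on one index does not cancel
     * the work submitted for the other indexes.
     */
    static List<String> applyAll(Map<String, Runnable> workPerIndex) {
        List<String> failed = new ArrayList<>();
        if (workPerIndex.isEmpty()) {
            return failed; // a fixed pool of size 0 is not allowed
        }
        // Assumption: one thread per index, mirroring option 3.
        ExecutorService pool = Executors.newFixedThreadPool(workPerIndex.size());
        try {
            Map<String, Future<?>> pending = new LinkedHashMap<>();
            for (Map.Entry<String, Runnable> e : workPerIndex.entrySet()) {
                pending.put(e.getKey(), pool.submit(e.getValue()));
            }
            for (Map.Entry<String, Future<?>> e : pending.entrySet()) {
                try {
                    e.getValue().get(); // wait for this index to finish
                } catch (ExecutionException ex) {
                    failed.add(e.getKey()); // record the failure, keep going
                } catch (InterruptedException ex) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting", ex);
                }
            }
        } finally {
            pool.shutdown();
        }
        return failed;
    }

    public static void main(String[] args) {
        Map<String, Runnable> work = new LinkedHashMap<>();
        work.put("A", () -> { throw new IllegalStateException("index A is corrupted"); });
        work.put("B", () -> System.out.println("index B updated"));
        work.put("C", () -> System.out.println("index C updated"));
        System.out.println("failed indexes: " + applyAll(work));
    }
}
```

In this sketch the failure of A is only reported back to the caller; B and C are still updated. Cancelling the remaining tasks, or delaying all commits until every task has succeeded, would be the alternative policies discussed above.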
In a future release we could add an option to make it "two phase" by using the new Lucene commit() capabilities, but this would only make sense if you actually wanted to roll back the database changes in case of an index failure.

sharing IndexWriter in batch mode
=================================

This is not needed for HSEARCH-268 (optimize indexes in parallel), but it is needed to get a major boost in indexing performance.

Currently the IndexWriter lifecycle is coupled to the operations done in a transaction (also, Emmanuel reminded me that we need to release the file lock ASAP, as a supported configuration is to use two Search instances sharing the same FS-based index). We already have the concepts of "batch operation" and "transactional operation"; currently the only difference is which tuning settings are applied to the IndexWriter.

My idea is to extend the semantics of "batch mode" to mean a state which globally affects the way IndexWriters are acquired and released: when in batch mode, the IndexWriter is not closed at the end of each work queue and the locks are not used, so the IndexWriter could be shared across different threads. This is not transactionally safe of course, but that's why it is called "batch mode" as opposed to "transactional mode": nobody would expect transactional behaviour.

Care must be taken to revert the status to "transactional mode" and close the IndexWriter at the end, but this API would let me reindex the database using the "parallel scrollableresults" in the most efficient way, and nicely integrated.

This isn't as complicated to implement as it is to explain ;-)

Sanne

_______________________________________________
hibernate-dev mailing list
hibernate-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/hibernate-dev