[Hibernate] Plan for a full text search facility built on top of Hibernate Lucene annotations

2006-06-01 Thread Sylvain Vieujot




After chatting with Emmanuel, here is a draft plan for a closer integration between Hibernate and Lucene for performing full text queries.
The Hibernate annotations for Lucene help keep the Lucene indexes up to date, but don't provide a query facility.
They also lack converters that would, for example, store a Date in a format whose alphabetic order matches the object's natural order.
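
A minimal sketch of what such a Date converter could look like, assuming a fixed-width UTC timestamp string is acceptable in the index. The class name SortableDateConverter and its methods are hypothetical and are not part of the Hibernate Lucene annotations:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical converter, not an existing Hibernate or Lucene class.
final class SortableDateConverter {

    // Fixed-width pattern with the most significant field first, so that
    // comparing two formatted strings alphabetically gives the same result
    // as comparing the dates (for four-digit years).
    private static final String PATTERN = "yyyyMMddHHmmssSSS";

    private static SimpleDateFormat newFormat() {
        // SimpleDateFormat is not thread-safe, so build a fresh one per call.
        SimpleDateFormat format = new SimpleDateFormat(PATTERN, Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format;
    }

    // Date -> string stored in the index.
    static String toIndexString(Date date) {
        return newFormat().format(date);
    }

    // String read from the index -> Date.
    static Date fromIndexString(String text) throws ParseException {
        return newFormat().parse(text);
    }

    public static void main(String[] args) throws ParseException {
        Date earlier = new Date(0L);          // 1970-01-01T00:00:00.000Z
        Date later = new Date(86_400_000L);   // exactly one day later
        String a = toIndexString(earlier);
        String b = toIndexString(later);
        System.out.println(a + " sorts before " + b + ": " + (a.compareTo(b) < 0));
        System.out.println("round trip ok: " + fromIndexString(a).equals(earlier));
    }
}
```

Formatting in UTC keeps the string order independent of the server's time zone.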

A framework like Compass ( http://www.opensymphony.com/compass ) is meant to fix this problem by implementing its own OSEM (Object/Search Engine Mapping) and providing a query facility that mimics what Hibernate does on the database side.
Compass can even reuse Hibernate's mapping, thus minimizing the configuration effort.

One shortcoming I've found with Compass, though, is that the objects you get when you query the full text engine aren't connected to the ones in the database.
So if you manipulate them, the changes aren't persisted, or can actually erase some of the information in the database.

The best way to have a simple and risk-free integration is to build a full text query facility that is closely integrated with Hibernate and the Hibernate Lucene annotations.

So, querying the full text indexes would return objects, like Compass does, but those objects would be fetched from the database.
Actually, for performance reasons, they could be initialized with the information from the full text index and, through bytecode enhancement, if an uninitialized property is read or any property is set, the real object could be fetched from the database and read or set accordingly.
Here are a few examples:

1) Just make a full text search:
The query "toto" would fetch from the Lucene index all the objects with an indexed field containing "toto".
If the objects are initialized from the Lucene index, just one query to the Lucene index is done, and the search results can be displayed.
=> Best performance.
Loading the objects from the database is useless here, and would only lead to poorer performance.

2) Make a full text search AND manipulate the objects:
You want to query all the objects containing "toto" and increment their searchHits property.
You do the query with a Load.EAGER parameter.
Only the objects' ids are retrieved from Lucene, and the real objects are retrieved from Hibernate.

3) Mix both approaches:
This requires bytecode enhancement.
It can be useful when, for some types of objects, you don't want to store in the index all the properties required to display the search results view.
Only those objects will be loaded from Hibernate.

All 3 modes should work, but we can always begin by implementing mode 2 only (retrieving only the ids from Lucene, and initializing the objects from Hibernate).
Everything will still work, but performance will not be optimal.
Later on we can implement mode 3 (which would also cover situation 1), and the change will be transparent to the user.
Only the performance will be better.
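
Mode 2 can be sketched as follows. This is only an illustration under stated assumptions: FullTextIndex and EntityLoader are hypothetical stand-ins for the Lucene index and the Hibernate session, and the "database" is an in-memory map:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IdThenLoadSearch {

    // Stand-in for a persistent entity.
    static final class Document {
        final long id;
        final String report;
        int searchHits;

        Document(long id, String report) {
            this.id = id;
            this.report = report;
        }
    }

    // Hypothetical stand-in for the full text index: it returns ids only.
    interface FullTextIndex {
        List<Long> findIds(String word);
    }

    // Hypothetical stand-in for the Hibernate session: loads the real object.
    interface EntityLoader {
        Document load(long id);
    }

    // Mode 2: ask the index for ids, then load every object from the database.
    static List<Document> search(String word, FullTextIndex index, EntityLoader loader) {
        List<Document> results = new ArrayList<>();
        for (long id : index.findIds(word)) {
            Document document = loader.load(id);
            if (document != null) {   // the index may be ahead of or behind the database
                results.add(document);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        Map<Long, Document> database = new HashMap<>();
        database.put(1L, new Document(1L, "toto was here"));
        database.put(2L, new Document(2L, "nothing to see"));
        database.put(3L, new Document(3L, "hello toto"));

        FullTextIndex index = word ->
                "toto".equals(word) ? Arrays.asList(3L, 1L) : Collections.<Long>emptyList();

        for (Document document : search("toto", index, id -> database.get(id))) {
            document.searchHits++;   // safe: this is the object owned by the "database"
            System.out.println(document.id + " -> searchHits=" + document.searchHits);
        }
    }
}
```

Because the returned objects come from the loader rather than from the index, changes such as the searchHits increment apply to the persisted instances.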


Another advantage of integrating the full text query closely with Hibernate is that, in cases where a field isn't indexed but the query is still simple (field x like 'toto%'), the Lucene index would not be needed, and the query can be performed directly via Hibernate, transparently for the user.
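
One way to picture this routing decision is the sketch below. It assumes that "simple" means a single trailing % wildcard; the QueryRouter class and the Route names are hypothetical and only illustrate the idea:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class QueryRouter {

    enum Route { LUCENE_INDEX, DATABASE_LIKE }

    private final Set<String> indexedFields;

    QueryRouter(Set<String> indexedFields) {
        this.indexedFields = new HashSet<>(indexedFields);
    }

    // Decide where a single-field query should run.
    Route routeFor(String field, String pattern) {
        if (indexedFields.contains(field)) {
            return Route.LUCENE_INDEX;
        }
        if (isSimplePrefix(pattern)) {
            return Route.DATABASE_LIKE;   // e.g. "where x like 'toto%'"
        }
        throw new IllegalArgumentException(
                "Field '" + field + "' is not indexed and the pattern is not a simple prefix");
    }

    // Assumption: a pattern is "simple" when its only wildcard is one trailing %.
    static boolean isSimplePrefix(String pattern) {
        return pattern.length() > 1
                && pattern.indexOf('%') == pattern.length() - 1
                && pattern.indexOf('_') < 0;
    }

    public static void main(String[] args) {
        QueryRouter router = new QueryRouter(new HashSet<>(Arrays.asList("report")));
        System.out.println(router.routeFor("report", "toto"));
        System.out.println(router.routeFor("x", "toto%"));
    }
}
```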

To summarize, the biggest changes would be:

- Add converters to the Hibernate Lucene annotations, like what Compass is doing: http://www.opensymphony.com/compass/versions/0.9.1/html/core-settings.html#config-converter

- Build a full text query facility similar to HQL / Criteria but focused on full text search (like the one in Compass: http://www.opensymphony.com/compass/versions/0.9.1/html/core-workingwithobjects.html#Searching ), which would make sure that the objects retrieved from the Lucene index behave as if they were retrieved from the database.


I would be glad to hear your feedback on this.

Thanks,

Sylvain.


___
hibernate-devel mailing list
hibernate-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/hibernate-devel


Re: [Hibernate] Plan for a full text search facility built on top of Hibernate Lucene annotations

2006-06-01 Thread Max Rydahl Andersen

All sounds cool ;)

I can see the advantage of converters, which can put elements into
Lucene in a better, more human-readable manner.

The loading of objects from Lucene + yet another QL I'm a bit more  
critical about.

Would it not be better to do the following:

1. Use whatever QL Lucene supports to express the query. (What does
another QL help with here?)

2. Do the query against the Lucene index and return ids, which are then
resolved via Hibernate and possibly the 2nd-level cache. (We could maybe
optimize the id lookups via some targeted queries.)

3. If you really want to, look into having Lucene be a 2nd-level cache
provider? (This would probably require a chainable cache provider to have
both Lucene and Ehcache queries in the same app... but that is sugar.)
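
One possible reading of "targeted queries" for the id lookups in step 2 is to load the ids in batches rather than one by one. The sketch below only shows the batching; IdBatcher is a hypothetical helper, and the HQL in the comment is an assumption about how each batch might be used:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class IdBatcher {

    // Split the ids returned by the index into batches of at most batchSize,
    // keeping the order in which the index returned them.
    static List<List<Long>> partition(List<Long> ids, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<Long>> batches = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ids.size());
            batches.add(new ArrayList<>(ids.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Long> idsFromIndex = Arrays.asList(7L, 3L, 9L, 4L, 1L);
        for (List<Long> batch : partition(idsFromIndex, 2)) {
            // Assumption: each batch would feed one query such as
            // "from Document d where d.id in (:ids)".
            System.out.println("load batch " + batch);
        }
    }
}
```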

...maybe there is something I miss, because I don't understand what the
mixed mode means and why you want bytecode enhancement mixed in here?

/max


Re: [Hibernate] Plan for a full text search facility built on top of Hibernate Lucene annotations

2006-06-01 Thread Sylvain Vieujot




About the QL:
You're right, the goal isn't to provide yet another QL, and Lucene's one should be used.
I meant having a Criteria type of QL, like what Compass does:

CompassQueryBuilder queryBuilder = session.createQueryBuilder();

CompassHits hits = queryBuilder.bool()
    .addMust( queryBuilder.term("name", "jack") )
    .addMustNot( queryBuilder.term("familyName", "london") )
    .toQuery()
    .addSort("familyName", CompassQuery.SortPropertyType.STRING)
    .addSort("birthdate", CompassQuery.SortPropertyType.INT)
    .hits();

More details here :
http://www.opensymphony.com/compass/versions/0.9.1/html/core-workingwithobjects.html#Searching
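
To show how a Criteria-style builder can sit on top of Lucene's own query language rather than replace it, here is a minimal sketch that renders must / must-not clauses into Lucene query parser syntax (+ for required, - for prohibited). BoolQuerySketch is a hypothetical class, not Compass or Hibernate API, and it does not escape special characters in terms:

```java
import java.util.ArrayList;
import java.util.List;

final class BoolQuerySketch {

    private final List<String> clauses = new ArrayList<>();

    // The term must be present in the field: rendered as +field:term.
    BoolQuerySketch addMust(String field, String term) {
        clauses.add("+" + field + ":" + term);
        return this;
    }

    // The term must be absent from the field: rendered as -field:term.
    BoolQuerySketch addMustNot(String field, String term) {
        clauses.add("-" + field + ":" + term);
        return this;
    }

    // The resulting string is meant to be handed to Lucene's query parser.
    String toQueryString() {
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        String query = new BoolQuerySketch()
                .addMust("name", "jack")
                .addMustNot("familyName", "london")
                .toQueryString();
        System.out.println(query);
    }
}
```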

About the cache :
You're probably right, but I don't know enough about this.
I only know Compass also provides some cache.

About the bytecode enhancement :
This one is quite important.
Suppose you have several types of objects that have a report property, and you want to show all the documents containing the word "toto" in their report property.
The best way is for the query facility to return a collection of those documents with their id and report properties set (which can be done using only the result from Lucene), without ever touching the SQL database. Forcing all those objects, which might be persisted in different tables, to be loaded by Hibernate would be both a performance killer and useless.
But then, if you ever decide to do more than access one of the Lucene-initialized properties, you will need those documents to be loaded from Hibernate. This can only be done through some kind of wrapper / mock / bytecode enhancement, whatever you call it. This is what mixed mode means: you initialize the objects from the Lucene index, but later fetch the real persisted object from the database as needed, transparently for the user.
As I said, in a first implementation we can always fetch eagerly from Hibernate, but some provision should be made to avoid loading from the database when it's not necessary.
If you use the full text search mostly to display search result pages, then most of the time you'll never need to hit the database.
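
The mixed mode can be illustrated with a hand-written wrapper standing in for bytecode enhancement. This is a sketch under assumptions: Report, LazyReport and the loader function are hypothetical, and a real implementation would generate this behaviour rather than code it by hand:

```java
import java.util.function.LongFunction;

final class MixedModeSketch {

    // Stand-in for the persistent entity.
    static final class Report {
        final long id;
        String text;     // also stored in the index
        String author;   // only available from the database

        Report(long id, String text, String author) {
            this.id = id;
            this.text = text;
            this.author = author;
        }
    }

    // Initialized from the index; loads the real object only when needed.
    static final class LazyReport {
        private final long id;
        private final String indexedText;
        private final LongFunction<Report> databaseLoader;
        private Report loaded;
        int databaseLoads;   // demo counter: how often the database was hit

        LazyReport(long id, String indexedText, LongFunction<Report> databaseLoader) {
            this.id = id;
            this.indexedText = indexedText;
            this.databaseLoader = databaseLoader;
        }

        long getId() { return id; }

        // Served from the index unless the real object was already loaded.
        String getText() { return loaded != null ? loaded.text : indexedText; }

        // Not in the index: forces a load.
        String getAuthor() { return real().author; }

        // Any write goes to the real object so the change can be persisted.
        void setText(String text) { real().text = text; }

        private Report real() {
            if (loaded == null) {
                loaded = databaseLoader.apply(id);
                databaseLoads++;
            }
            return loaded;
        }
    }

    public static void main(String[] args) {
        Report stored = new Report(42L, "toto report", "sylvain");
        LazyReport hit = new LazyReport(42L, "toto report", id -> stored);

        System.out.println(hit.getText() + " (loads: " + hit.databaseLoads + ")");
        System.out.println(hit.getAuthor() + " (loads: " + hit.databaseLoads + ")");
    }
}
```

Displaying a result page only calls getId() and getText(), so the loader is never invoked in that case.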

Sylvain.


Re: [Hibernate] Plan for a full text search facility built on top of Hibernate Lucene annotations

2006-06-01 Thread Christian Bauer

On Jun 1, 2006, at 12:31 PM, Sylvain Vieujot wrote:

 CompassHits hits = queryBuilder.bool()
     .addMust( queryBuilder.term("name", "jack") )
     .addMustNot( queryBuilder.term("familyName", "london") )
     .toQuery()
     .addSort("familyName", CompassQuery.SortPropertyType.STRING)
     .addSort("birthdate", CompassQuery.SortPropertyType.INT)
     .hits();

(Note: Compass is a fork of the Hibernate codebase)

This looks to me like a new Criterion implementation, the rest is  
regular Criteria API.





Re: [Hibernate] Plan for a full text search facility built on top of Hibernate Lucene annotations

2006-06-01 Thread Max Rydahl Andersen
On Thu, 01 Jun 2006 12:45:49 +0200, Christian Bauer wrote:


 On Jun 1, 2006, at 12:31 PM, Sylvain Vieujot wrote:

 CompassHits hits = queryBuilder.bool()
     .addMust( queryBuilder.term("name", "jack") )
     .addMustNot( queryBuilder.term("familyName", "london") )
     .toQuery()
     .addSort("familyName", CompassQuery.SortPropertyType.STRING)
     .addSort("birthdate", CompassQuery.SortPropertyType.INT)
     .hits();

 (Note: Compass is a fork of the Hibernate codebase)

 This looks to me like a new Criterion implementation, the rest is
 regular Criteria API.

But it needs an alternative rendering + loader. Hence my ref to exposing  
CustomQuery.

-- 
Max Rydahl Andersen
callto://max.rydahl.andersen

Hibernate
http://hibernate.org

JBoss Inc



Re: [Hibernate] Plan for a full text search facility built on top of Hibernate Lucene annotations

2006-06-01 Thread Max Rydahl Andersen

 I meant having a Criteria type of QL, like what Compass does:

 CompassQueryBuilder queryBuilder = session.createQueryBuilder();

 CompassHits hits = queryBuilder.bool()
     .addMust( queryBuilder.term("name", "jack") )
     .addMustNot( queryBuilder.term("familyName", "london") )
     .toQuery()
     .addSort("familyName", CompassQuery.SortPropertyType.STRING)
     .addSort("birthdate", CompassQuery.SortPropertyType.INT)
     .hits();

Well... isn't there already some object version of the Lucene query
API or something?

 About the cache :
 You're probably right, but I don't know enough about this.
 I only know Compass also provides some cache.

 About the bytecode enhancement :
 This one is quite important.

OK, here you/we should probably use the lazy properties support we
already have, but it might require more customization than we have now.

Actually, all this stuff might be a good use case for a lot of things
we have talked about doing at some point:

1) Expose CustomQuery to provide hooks for alternative querying.

2) Fetch profiles to allow you to define a Lucene fetch plan for your
partial objects.

3) Lucene as a 2nd-level cache.

/max
