Re: [Hibernate] Hibernate Lucene integration

2006-07-17 Thread Max Rydahl Andersen

 Any Default
 Good question, I don't think we should index all properties by default:
 I guess we should ask that to the Lucene community.

I think no-index is an appropriate default, but maybe an @IndexAll would be
relevant? (not sure though)

 As for the bridges
 (ie types), they are defaulted, there is a heuristic guess mechanism.

sounds good.

 The default is FSDirectory with the base directory being ".".
 RAMDirectory is of little use except for some specific use cases and for
 unit testing.

Makes sense.

 I'm not a big fan of exposing the Lucene result itself but the relevance
 is something useful, I need to think about that: the main problem is
 that I currently hide some of the plumbing from the user, esp. the
 searcher opening and closing; by doing so, there is no way to give the
 Hits (Lucene results).
 The ordering is preserved when returned by Hibernate.

How about turning this upside-down and letting the user execute the query,
so that he has access to the Lucene query API and could do something like:

session.createLuceneQuery(luceneApiQuery).list() ?

 why session.index() a specific operation?
 Here is my reasoning:
  - using a lucene query to index a non-indexed object is going to be hard
 since the lucene query will not return the object in the first place ;-)

details ;)

  - using a regular Hibernate query + a flag to index the objects suffers
 the OOME issue unless we use the stateless session. If I use the
 stateless session, I can't use the event system...

Why does this give OOME? If you query for one object + flag it should be
just as heavy/light as index(), right?

  - From what I've seen and guessed, what you want to (re)index is very
 business specific  and can be way more complex than just a query

mkay.

 Session delegation and callbacks
 Yes but Event Listeners are the current way to have a callback to the
 session. Event Listeners are stateless, the state being part of the
 events.
 What we need is a way to push / keep some information at the event /
 PersistenceContext level. The SessionDelegate would be another way to
 keep some state but makes it hard to push the info to the event listeners.

Yes, that would allow you to do the #3 I talked about: having a
SessionDelegate that the underlying Session could call back to with enough
context to allow you to maintain it.

/max

Re: [Hibernate] Hibernate Lucene integration

2006-07-16 Thread Emmanuel Bernard

Hi,

Changes are propagated right after commit time. session.index() is a
different beast, but the use case is different: index() will be applied
at flush time so that you can flush() and clear().

"Any Default"
Good question, I don't think we should index all properties by default:
I guess we should ask that to the Lucene community. As for the bridges
(ie types), they are defaulted, there is a heuristic guess mechanism.

The default is FSDirectory with the base directory being ".".
RAMDirectory is of little use except for some specific use cases and for
unit testing.

I'm not a big fan of exposing the Lucene result itself but the
relevance is something useful, I need to think about that: the main
problem is that I currently hide some of the plumbing from the user, esp.
the searcher opening and closing; by doing so, there is no way to give
the Hits (Lucene results).
The ordering is preserved when returned by Hibernate.

"why session.index() a specific operation?"
Here is my reasoning:
 - using a lucene query to index a non-indexed object is going to be hard
since the lucene query will not return the object in the first place ;-)
 - using a regular Hibernate query + a flag to index the objects
suffers the OOME issue unless we use the stateless session. If I use
the stateless session, I can't use the event system...
 - From what I've seen and guessed, what you want to (re)index is very
business specific  and can be way more complex than just a query 

"Session delegation and callbacks"
Yes but Event Listeners are the current way to have a callback to the
session. Event Listeners are stateless, the state being part of the
events.
What we need is a way to push / keep some information at the event /
PersistenceContext level. The SessionDelegate would be another way to
keep some state but makes it hard to push the info to the event listeners.

"massive update == very non-strict rw strategy"
could be but that's not the main problem. The main problem is to keep
the changes to apply somewhere, even if the VM crashes.

"implements additional strategies to load object on
query.list()"
Currently I do
  for all results: session.load (creates the proxies)
  for all results: session.get (initializes them)
That way I benefit from the batch-size.

Another solution would be to use an HQL query with an IN clause
containing the list of ids to load.
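
A rough sketch of what the IN-clause variant could look like. An IN query does not guarantee the order of the returned rows, so the loaded entities have to be put back into the order Lucene ranked them. The class and method names are made up for the illustration, and the "id" property name in the HQL is an assumption.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of Hibernate or of the Lucene integration.
final class InClauseLoading {
    private InClauseLoading() {}

    // HQL with a named list parameter; in Hibernate it would be bound with
    // Query.setParameterList("ids", ids). The "id" property name is assumed.
    static String hqlFor(String entityName) {
        return "from " + entityName + " e where e.id in (:ids)";
    }

    // Puts the loaded entities back into the order of the ranked ids and
    // skips ids that no longer have a matching row.
    static <I, E> List<E> reorder(List<I> rankedIds, Collection<E> loaded, Function<E, I> idOf) {
        Map<I, E> byId = new HashMap<>();
        for (E entity : loaded) {
            byId.put(idOf.apply(entity), entity);
        }
        List<E> ordered = new ArrayList<>();
        for (I id : rankedIds) {
            E entity = byId.get(id);
            if (entity != null) {
                ordered.add(entity);
            }
        }
        return ordered;
    }
}
```

The trade-off compared to load + get is one query per page instead of relying on batch-size, at the cost of the extra reordering step.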

Re: [Hibernate] Hibernate Lucene integration

2006-07-14 Thread Max Rydahl Andersen

Hi Emmanuel,

Here are my comments (sorry if something is obvious from looking at the
code, but I haven't had time to look into the details yet)

 *Concepts*
 Each time you change an object state, the lucene index is updated and
 kept in sync. This is done through the Hibernate event system.

Ok - sounds cool. The index is updated at flush or commit time? (I assume
commit)

 Whether an entity is indexed or not and whether a property is indexed or
 not is defined through annotations.

Any defaults?

 You can also search through your domain model using Lucene and retrieve
 managed objects. The whole idea here is to do a nice integration between
 the search engine and the ORM without losing the search engine power,
 hence most of the API remains. To sum up, query Lucene, get managed
 object back.

Cool.

 *Mapping*
 A given entity is mapped to an index. A lucene index is stored in a
 Directory, a Directory is a Lucene abstract concept for index storage
 system. It can be a memory directory (RAMDirectory), a file system
 directory (FSDirectory) or any other kind of backend. Hibernate Lucene
 introduces the notion of DirectoryProvider that you can configure and
 define on a per entity basis (and which is defaulted defaulted). The
 concept is very similar to ConnectionProvider.

defaulted defaulted ? (defaulted to RAMDirectory maybe ?)

 Lucene only works with Strings, so you can define a @FieldBridge which
 transforms a java property into a Lucene Field (and potentially
 vice-versa). A simpler (useful?) version handles the transformation
 of a java property into a String.
 Some built-in FieldBridges exist. @FieldBridge is very much like a
 Hibernate Type. Esp. I introduced the notion of precision in dates (year,
 month, .. second, millisecond). FieldBridge and StringBridge give
 a lot of flexibility in the way to design the property indexing.

Sounds like a good thing.

 *Querying*
 I've introduced the notion of LuceneSession which implements Session and
 actually delegates to a regular Hibernate Session. This lucene session
 has a /createLuceneQuery()/ method and a /index()/ method.

 /session.createLuceneQuery(lucene.Query, Class[])/ takes a Lucene query
 as a parameter and the list of targeted entities. Using a Lucene query
 as a parameter gives the full Lucene flexibility (no abstraction on top
 of it). An /org.hibernate.Query/ object is returned.
 You can (must) use pagination. A Lucene query also returns the number of
 matching results (regardless of the pagination): query.resultSize(), a
 sort of count(*).

Is there any way to get to the underlying lucene result?
As far as I remember Lucene also has some notion of result relevance and
ordering which could be relevant to reach?

 Having the dynamic fetch profile would definitely be a killer pair
 (searching the lucene index, and fetching the appropriate object graph)

+1000 ;)

 /session.index(Object)/ is currently not implemented; it requires some
 modifications of SessionImpl or of LuceneSession. This feature is useful
 to initialize / refresh the index in a batch way (ie loading the data
 and applying the indexing process on this set of data).
 Basically the object is added to the index queue. At flush() time, the
 queue is processed.

hmm...why is this specific operation needed if it is done automatically
on object changes?

And if it is something you want in order to allow users to index
not-yet-indexed objects, couldn't it be a flag or something on the
LuceneQuery?

e.g. s.createLuceneQuery(from X as x where x).setIndex(true)
or maybe .setIndex(IndexMode.ONLY_NEW);

 design considerations:
 The delegation vs subclassing strategy for LuceneSession (ie
 LuceneSession delegating to a regular Session allowing simple wrapping
 or the LuceneSessionImpl being a subclass of SessionImpl) is an ongoing
 discussion.

 Using a subclassing model would allow the LuceneSession to keep
 operation queues (for batch indexing either through object changes or
 through session.index() ), but it does not allow a potential Hibernate -
 XXX integration on the same subclassing model. Batching is essential in
 Lucene for performance reasons.
 Using the delegation model requires some SessionImpl modifications to be
 able to keep track of a generic context. This context will keep the
 operation queues.


 *ToDo*
 Argue on the LuceneSession design and pick one (Steve/Emmanuel/feel
 free to join the dance)

I vote for an impl that will allow an existing Session to be the basis of
extension, thus not having the Lucene integration be a hardcoded subclass;
we did the same for Configuration and that is smelly/inflexible.

We should open enough of the session up to allow such delegation.

This might be extremely hard and close to impossible, but that is what I
wish for ;)

 Find a way to keep the DocumentBuilder (sort of EntityPersister) at the
 SessionFactory level rather than the EventListener level (Steve/Emmanuel)

Finding a way of storing structured info/data 

Re: [Hibernate] Hibernate Lucene integration

2006-07-13 Thread Emmanuel Bernard

 I've worked a lot recently on the Hibernate Lucene integration. Here 
 are the concepts, the new features and the todo list.
 Please comment and give feedback.

 My work is committed in branches/Lucene_Integration because we'll
 probably need to be based on Hibernate 3.3

 *Concepts*
 Each time you change an object state, the lucene index is updated and 
 kept in sync. This is done through the Hibernate event system.
 Whether an entity is indexed or not and whether a property is indexed 
 or not is defined through annotations.
 You can also search through your domain model using Lucene and 
 retrieve managed objects. The whole idea here is to do a nice 
 integration between the search engine and the ORM without losing the
 search engine power, hence most of the API remains. To sum up, query 
 Lucene, get managed object back.

 *Mapping*
 A given entity is mapped to an index. A lucene index is stored in a 
 Directory, a Directory is a Lucene abstract concept for index storage
 system. It can be a memory directory (RAMDirectory), a file system
 directory (FSDirectory) or any other kind of backend. Hibernate Lucene
 introduces the notion of DirectoryProvider that you can configure and
 define on a per entity basis (and which is defaulted defaulted). The
 concept is very similar to ConnectionProvider.
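
 To make the per-entity idea concrete, here is a minimal sketch of a provider lookup with a default. It deliberately avoids the Lucene classes: the interface, the two providers and the registry are hypothetical stand-ins, and a real provider would hand back a Lucene Directory rather than a description string. The file-system default follows what is said elsewhere in this thread.

```java
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the DirectoryProvider idea; a real provider
// would return a Lucene Directory instead of a description string.
interface DirectoryProvider {
    String describe();
}

// File-system backed storage, one sub-directory per index name.
final class FileSystemProvider implements DirectoryProvider {
    private final Path indexPath;

    FileSystemProvider(Path baseDir, String indexName) {
        this.indexPath = baseDir.resolve(indexName);
    }

    @Override
    public String describe() {
        return "fs:" + indexPath;
    }
}

// In-memory storage, mostly useful for unit tests.
final class InMemoryProvider implements DirectoryProvider {
    @Override
    public String describe() {
        return "ram";
    }
}

// Resolves the provider for an entity: an explicit per-entity choice wins,
// otherwise a file-system provider under baseDir is used as the default.
final class DirectoryProviderRegistry {
    private final Map<Class<?>, DirectoryProvider> perEntity = new HashMap<>();
    private final Path baseDir;

    DirectoryProviderRegistry(Path baseDir) {
        this.baseDir = baseDir;
    }

    void register(Class<?> entity, DirectoryProvider provider) {
        perEntity.put(entity, provider);
    }

    DirectoryProvider providerFor(Class<?> entity) {
        DirectoryProvider configured = perEntity.get(entity);
        return configured != null
                ? configured
                : new FileSystemProvider(baseDir, entity.getName());
    }
}
```

 The parallel with ConnectionProvider is that the caller only asks for "the storage for this entity" and never needs to know which backend sits behind it.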

 Lucene only works with Strings, so you can define a @FieldBridge which
 transforms a java property into a Lucene Field (and potentially
 vice-versa). A simpler (useful?) version handles the transformation
 of a java property into a String.
 Some built-in FieldBridges exist. @FieldBridge is very much like a
 Hibernate Type. Esp. I introduced the notion of precision in dates
 (year, month, .. second, millisecond). FieldBridge and
 StringBridge give a lot of flexibility in the way to design the
 property indexing.
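
 As an illustration of the String-only variant with date precision, something along these lines would do. This is a sketch under assumptions, not the committed code: the interface shape, the enum constants between month and second, and the use of java.time are all choices made for the example.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical shape of the simple bridge: java property in, String out.
interface StringBridge<T> {
    String objectToString(T value);
}

// Each precision truncates the date to a fixed-width, sortable pattern.
enum Precision {
    YEAR("yyyy"),
    MONTH("yyyyMM"),
    DAY("yyyyMMdd"),
    HOUR("yyyyMMddHH"),
    MINUTE("yyyyMMddHHmm"),
    SECOND("yyyyMMddHHmmss"),
    MILLISECOND("yyyyMMddHHmmssSSS");

    final DateTimeFormatter formatter;

    Precision(String pattern) {
        this.formatter = DateTimeFormatter.ofPattern(pattern);
    }
}

// Converts a date into the String that would be stored in the index field.
final class DateBridge implements StringBridge<LocalDateTime> {
    private final Precision precision;

    DateBridge(Precision precision) {
        this.precision = precision;
    }

    @Override
    public String objectToString(LocalDateTime value) {
        return value == null ? null : precision.formatter.format(value);
    }
}
```

 A coarser precision means fewer distinct terms in the index, and the fixed-width patterns keep the lexicographic order of the Strings in line with the chronological order of the dates.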


 *Querying*
 I've introduced the notion of LuceneSession which implements Session 
 and actually delegates to a regular Hibernate Session. This lucene 
 session has a /createLuceneQuery()/ method and a /index()/ method.

 /session.createLuceneQuery(lucene.Query, Class[])/ takes a Lucene 
 query as a parameter and the list of targeted entities. Using a Lucene 
 query as a parameter gives the full Lucene flexibility (no abstraction 
 on top of it). An /org.hibernate.Query/ object is returned.
 You can (must) use pagination. A Lucene query also returns the number
 of matching results (regardless of the pagination): query.resultSize(),
 a sort of count(*).
 /list()/ returns the list of matching objects. It heavily depends
 on batch-size to be efficient (ie the proxies are created for all the
 results and then we initialize them).
 There might be alternative strategies here (select ... where id in ( ,
 , , ) ), but the real benefit would come if combined with the dynamic
 fetching profile we talked about a while ago.
 /iterate()/ has the same semantics as the regular method in
 hibernate, meaning it initializes the objects one by one.
 /scroll()/ allows an efficient navigation into the resultset
 (objects are loaded one by one though).
 Having the dynamic fetch profile would definitely be a killer pair 
 (searching the lucene index, and fetching the appropriate object graph)
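
 The pagination and resultSize() behaviour described above can be sketched as follows. PagedHits is a hypothetical class operating on a plain list of matching ids; only the method names setFirstResult / setMaxResults (from org.hibernate.Query) and resultSize() come from the real API and from this proposal.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for the ids matched by a search, in relevance order.
final class PagedHits<I> {
    private final List<I> matches;
    private int firstResult = 0;
    private int maxResults = Integer.MAX_VALUE;

    PagedHits(List<I> matchesInRelevanceOrder) {
        this.matches = new ArrayList<>(matchesInRelevanceOrder);
    }

    PagedHits<I> setFirstResult(int firstResult) {
        if (firstResult < 0) {
            throw new IllegalArgumentException("firstResult must be >= 0");
        }
        this.firstResult = firstResult;
        return this;
    }

    PagedHits<I> setMaxResults(int maxResults) {
        if (maxResults < 0) {
            throw new IllegalArgumentException("maxResults must be >= 0");
        }
        this.maxResults = maxResults;
        return this;
    }

    // Total number of matches, independent of the pagination window.
    int resultSize() {
        return matches.size();
    }

    // Only the ids inside the window; these are the ones worth loading.
    List<I> page() {
        int from = Math.min(firstResult, matches.size());
        long end = (long) from + maxResults;
        int to = (int) Math.min(end, matches.size());
        return new ArrayList<>(matches.subList(from, to));
    }
}
```

 Only the ids in the current window need to be turned into managed objects, which is why pagination matters so much for list().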

 /session.index(Object)/ is currently not implemented; it requires
 some modifications of SessionImpl or of LuceneSession. This feature is 
 useful to initialize / refresh the index in a batch way (ie loading 
 the data and applying the indexing process on this set of data).
 Basically the object is added to the index queue. At flush() time, the 
 queue is processed.
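
 A minimal sketch of that queue, assuming the actual index writing is handed in as a callback. IndexQueue and its callback are hypothetical names; where such a queue would live (SessionImpl, LuceneSession or a generic context) is exactly the open design question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical per-session queue: index() only records the object,
// flush() hands the whole batch to the index writer in one go.
final class IndexQueue {
    private final List<Object> pending = new ArrayList<>();
    private final Consumer<List<Object>> batchWriter;

    IndexQueue(Consumer<List<Object>> batchWriter) {
        this.batchWriter = batchWriter;
    }

    void index(Object entity) {
        pending.add(entity);
    }

    int size() {
        return pending.size();
    }

    // Processes the queued objects as a single batch, then empties the queue
    // so the session can be cleared without indexing anything twice.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        batchWriter.accept(new ArrayList<>(pending));
        pending.clear();
    }
}
```

 With this shape a batch job can loop over chunks of data, call index() for each object and then flush() and clear() per chunk, so memory stays bounded while Lucene still gets batched writes.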

 design considerations:
 The delegation vs subclassing strategy for LuceneSession (ie
 LuceneSession delegating to a regular Session allowing simple wrapping
 or the LuceneSessionImpl being a subclass of SessionImpl) is an ongoing
 discussion.
 Using a subclassing model would allow the LuceneSession to keep 
 operation queues (for batch indexing either through object changes or 
 through session.index() ), but it does not allow a potential Hibernate 
 - XXX integration on the same subclassing model. Batching is essential 
 in Lucene for performance reasons.
 Using the delegation model requires some SessionImpl modifications to 
 be able to keep track of a generic context. This context will keep the 
 operation queues.


 *ToDo*
 Argue on the LuceneSession design and pick one (Steve/Emmanuel/feel
 free to join the dance)

 Find a way to keep the DocumentBuilder (sort of EntityPersister) at 
 the SessionFactory level rather than the EventListener level 
 (Steve/Emmanuel)

 Implement the use of FieldBridge for all properties. It is currently 
 used for the id property only (trivial).

 Batch changes: to do that I need to be able to keep a session related 
 queue of all insert/update changes. I can't in the current design 
 because SessionImpl does not have such concept and because the 
 LuceneSession is built on the delegation model. We