Re: [Hibernate] Hibernate Lucene integration
> "Any Default"
> Good question, I don't think we should index all properties by default: I guess we should ask that to the Lucene community.

I think no-index is an appropriate default; but maybe an @IndexAll would be relevant? (not sure though)

> As for the bridges (ie types), they are defaulted, there is a heuristic guess mechanism.

Sounds good.

> The default is FSDirectory with the base directory being ".". RAMDirectory is of little use except for some specific use cases and for unit testing.

Makes sense.

> I'm not a big fan of exposing the Lucene result itself, but the relevance is something useful. I need to think about that: the main problem is that I currently hide some of the plumbing from the user, esp the searcher opening and closing; by doing so, there is no way to give the Hits (Lucene results). The ordering is preserved when returned by Hibernate.

How about turning this upside-down and letting the user execute the query? He would then have access to the Lucene query API and could do something like: session.createLuceneQuery(luceneapiquery q).list() ?

> "why session.index() a specific operation?"
> Here is my reasoning:
> - using a Lucene query to index a non-indexed object is going to be hard since the Lucene query will not return the object in the first place ;-)

details ;)

> - using a regular Hibernate query + a flag to index the objects suffers the OOME issue unless we use the stateless session. If I use the stateless session, I can't use the event system...

Why does this give OOME? If you query for one object + flag it should be just as heavy/light as index(), right?

> - From what I've seen and guessed, what you want to (re)index is very business specific and can be way more complex than just a query

mkay.

> "Session delegation and callbacks"
> Yes, but Event Listeners are the current way to have a callback to the session. Event Listeners are stateless, the state being part of the events. What we need is a way to push / keep some information at the event / PersistenceContext level.
> The SessionDelegate would be another way to keep some state, but it makes it hard to push the info to the event listeners.

Yes, that would allow you to do the #3 I talked about: having a SessionDelegate that the underlying Session could call back to with enough context to allow you to maintain it.

/max
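The OOME concern in this message comes from loading many objects into one session while (re)indexing. Here is a minimal sketch of the flush() and clear() pattern that a queue-based index() would allow. Everything in it is hypothetical: SketchLuceneSession is an in-memory stand-in, not the real LuceneSession or Hibernate Session, and load(), index(), flush() and clear() only simulate the behaviour described in the thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the proposed LuceneSession; not the real Hibernate API. */
class SketchLuceneSession {
    private final List<Object> persistenceContext = new ArrayList<>(); // objects held in memory
    private final List<Object> indexQueue = new ArrayList<>();         // pending index work
    int indexedCount = 0; // how many objects have been "written" to the index
    int maxHeld = 0;      // peak number of objects held at once

    /** Simulates loading an entity: it stays in the persistence context until clear(). */
    Object load(long id) {
        Object entity = "entity-" + id;
        persistenceContext.add(entity);
        maxHeld = Math.max(maxHeld, persistenceContext.size());
        return entity;
    }

    /** Queues the object for indexing; nothing is written until flush(). */
    void index(Object entity) {
        indexQueue.add(entity);
    }

    /** Processes the index queue, as described for flush time. */
    void flush() {
        indexedCount += indexQueue.size();
        indexQueue.clear();
    }

    /** Drops the loaded objects so memory stays bounded. */
    void clear() {
        persistenceContext.clear();
    }
}

class BatchIndexer {
    /** Indexes ids 1..total, flushing and clearing every batchSize objects. */
    static void reindex(SketchLuceneSession session, long total, int batchSize) {
        for (long id = 1; id <= total; id++) {
            session.index(session.load(id));
            if (id % batchSize == 0) {
                session.flush();
                session.clear();
            }
        }
        session.flush(); // process the last, possibly partial, batch
        session.clear();
    }
}

class BatchIndexingDemo {
    public static void main(String[] args) {
        SketchLuceneSession session = new SketchLuceneSession();
        BatchIndexer.reindex(session, 1005, 50);
        System.out.println("indexed=" + session.indexedCount + " maxHeld=" + session.maxHeld);
    }
}
```

With a batch size of 50 the session never holds more than 50 loaded objects, whatever the total. That is the property a query + flag approach would need some other way to guarantee.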
Re: [Hibernate] Hibernate Lucene integration
Hi,

Changes are propagated right after commit time. session.index() is a different beast, but the use case is different: index() will be applied at flush time so that you can flush() and clear().

"Any Default"
Good question, I don't think we should index all properties by default: I guess we should ask that to the Lucene community. As for the bridges (ie types), they are defaulted, there is a heuristic guess mechanism.

The default is FSDirectory with the base directory being ".". RAMDirectory is of little use except for some specific use cases and for unit testing.

I'm not a big fan of exposing the Lucene result itself, but the relevance is something useful. I need to think about that: the main problem is that I currently hide some of the plumbing from the user, esp the searcher opening and closing; by doing so, there is no way to give the Hits (Lucene results). The ordering is preserved when returned by Hibernate.

"why session.index() a specific operation?"
Here is my reasoning:
- using a Lucene query to index a non-indexed object is going to be hard since the Lucene query will not return the object in the first place ;-)
- using a regular Hibernate query + a flag to index the objects suffers the OOME issue unless we use the stateless session. If I use the stateless session, I can't use the event system...
- From what I've seen and guessed, what you want to (re)index is very business specific and can be way more complex than just a query

"Session delegation and callbacks"
Yes, but Event Listeners are the current way to have a callback to the session. Event Listeners are stateless, the state being part of the events. What we need is a way to push / keep some information at the event / PersistenceContext level. The SessionDelegate would be another way to keep some state, but it makes it hard to push the info to the event listeners.

"massive update == very non-strict rw strategy"
Could be, but that's not the main problem. The main problem is to keep somewhere the changes to apply even if the VM crashes.

"implements additional strategies to load object on query.list()"
Currently I do:
  for all results: session.load
  for all results: session.get
That way I benefit from the batch-size. Another solution would be to use an HQL query with an IN clause containing the list of ids to load.
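A note on the IN-clause alternative mentioned in this message: a query of the form "where id in (...)" gives no guarantee about row order, so the Lucene ranking would have to be restored after loading. A minimal sketch of that reordering step follows, assuming the ids come back from Lucene in hit order and the entities have already been fetched into a map. RankedLoader and inHitOrder are hypothetical names, not part of Hibernate or Lucene.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RankedLoader {
    /**
     * Puts entities that were fetched in arbitrary order back into Lucene hit order.
     * rankedIds: entity ids in the order Lucene returned them.
     * loadedById: entities fetched by a single IN-clause query, keyed by id.
     */
    static <T> List<T> inHitOrder(List<Long> rankedIds, Map<Long, T> loadedById) {
        List<T> ordered = new ArrayList<>(rankedIds.size());
        for (Long id : rankedIds) {
            T entity = loadedById.get(id);
            if (entity != null) { // skip ids with no matching row (e.g. a stale index entry)
                ordered.add(entity);
            }
        }
        return ordered;
    }
}

class RankedLoaderDemo {
    public static void main(String[] args) {
        List<Long> rankedIds = List.of(42L, 7L, 19L); // order given by the search engine
        Map<Long, String> loaded = new HashMap<>();   // order given by the database: unspecified
        loaded.put(7L, "second hit");
        loaded.put(19L, "third hit");
        loaded.put(42L, "first hit");
        System.out.println(RankedLoader.inHitOrder(rankedIds, loaded));
    }
}
```

The load-then-get approach described above keeps the ordering for free because it walks the hits one by one. The IN-clause approach trades that for a single round trip.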
Re: [Hibernate] Hibernate Lucene integration
Hi Emmanuel,

Here are my comments (sorry if something is obvious from looking at the code, but I haven't had time to look into the details yet).

> *Concepts*
> Each time you change an object state, the Lucene index is updated and kept in sync. This is done through the Hibernate event system.

Ok - sounds cool. The index is updated at flush or commit time? (I assume commit)

> Whether an entity is indexed or not and whether a property is indexed or not is defined through annotations.

Any defaults?

> You can also search through your domain model using Lucene and retrieve managed objects. The whole idea here is to do a nice integration between the search engine and the ORM without losing the search engine power, hence most of the API remains. To sum up, query Lucene, get managed objects back.

Cool.

> *Mapping*
> A given entity is mapped to an index. A Lucene index is stored in a Directory; a Directory is a Lucene abstract concept for an index storage system. It can be a memory directory (RAMDirectory), a file system directory (FSDirectory) or any other kind of backend. Hibernate Lucene introduces the notion of DirectoryProvider that you can configure and define on a per entity basis (and which is defaulted defaulted). The concept is very similar to ConnectionProvider.

defaulted defaulted? (defaulted to RAMDirectory maybe?)

> Lucene only works with Strings, so you can define a @FieldBridge which transforms a Java property into a Lucene Field (and potentially vice-versa). A simpler (more useful?) version handles the transformation of a Java property into a String. Some built-in FieldBridges exist. @FieldBridge is very much like a Hibernate Type. Esp I introduced the notion of precision in dates (year, month, .. second, millisecond). FieldBridge and StringBridge give a lot of flexibility in the way to design the property indexing.

Sounds like a good thing.
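To make the date precision idea concrete, here is a minimal sketch of a string bridge that truncates a date to a chosen precision before it is indexed. The thread names StringBridge but does not show its signature, so the interface below (SketchStringBridge with an objectToString method), the DatePrecision enum and the compact "yyyyMMdd..." patterns are all assumptions made for illustration.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Assumed shape of a property-to-String bridge; the real signature is not given in the thread. */
interface SketchStringBridge<T> {
    String objectToString(T value);
}

/** Hypothetical precision levels, each mapped to a sortable date pattern. */
enum DatePrecision {
    YEAR("yyyy"),
    MONTH("yyyyMM"),
    DAY("yyyyMMdd"),
    HOUR("yyyyMMddHH"),
    MINUTE("yyyyMMddHHmm"),
    SECOND("yyyyMMddHHmmss"),
    MILLISECOND("yyyyMMddHHmmssSSS");

    final String pattern;

    DatePrecision(String pattern) {
        this.pattern = pattern;
    }
}

/** Turns a Date into a String that keeps only the requested precision. */
class DateStringBridge implements SketchStringBridge<Date> {
    private final DatePrecision precision;

    DateStringBridge(DatePrecision precision) {
        this.precision = precision;
    }

    @Override
    public String objectToString(Date value) {
        if (value == null) {
            return null;
        }
        // SimpleDateFormat is not thread-safe, so a new instance is created per call.
        SimpleDateFormat format = new SimpleDateFormat(precision.pattern, Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC")); // fixed zone so index terms are stable
        return format.format(value);
    }
}

class DateBridgeDemo {
    public static void main(String[] args) {
        Date epoch = new Date(0L);
        System.out.println(new DateStringBridge(DatePrecision.YEAR).objectToString(epoch)); // 1970
        System.out.println(new DateStringBridge(DatePrecision.DAY).objectToString(epoch));  // 19700101
    }
}
```

Since Lucene compares terms as strings, a fixed-width, most-significant-first pattern keeps lexicographic order equal to chronological order. A coarser precision also means fewer distinct terms in the index.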
> *Querying*
> I've introduced the notion of LuceneSession, which implements Session and actually delegates to a regular Hibernate Session. This Lucene session has a /createLuceneQuery()/ method and an /index()/ method.
> /session.createLuceneQuery(lucene.Query, Class[])/ takes a Lucene query as a parameter and the list of targeted entities. Using a Lucene query as a parameter gives the full Lucene flexibility (no abstraction on top of it). An /org.hibernate.Query/ object is returned. You can (must) use pagination. A Lucene query also returns the number of matching results (regardless of the pagination): query.resultSize(), sort of count(*).

Is there any way to get to the underlying Lucene result? As far as I remember, Lucene also has some notion of result relevance and ordering which could be relevant to reach?

> Having the dynamic fetch profile would definitely be a killer pair (searching the Lucene index, and fetching the appropriate object graph)

+1000 ;)

> /session.index(Object)/ is currently not implemented; it requires some modifications of SessionImpl or of LuceneSession. This feature is useful to initialize / refresh the index in a batch way (ie loading the data and applying the indexing process on this set of data). Basically the object is added to the index queue. At flush() time, the queue is processed.

hmm...why is this specific operation needed if it is done automatically on object changes? And if it is something you want so that users can index not-yet-indexed objects, couldn't it be a flag or something on the LuceneQuery? e.g. s.createLuceneQuery(from X as x where x).setIndex(true) or maybe .setIndex(IndexMode.ONLY_NEW);

> design considerations:
> The delegation vs subclassing strategy for LuceneSession (ie LuceneSession delegating to a regular Session allowing simple wrapping, or the LuceneSessionImpl being a subclass of SessionImpl) is an ongoing discussion. Using a subclassing model would allow the LuceneSession to keep operation queues (for batch indexing either through object changes or through session.index()), but it does not allow a potential Hibernate - XXX integration on the same subclassing model. Batching is essential in Lucene for performance reasons. Using the delegation model requires some SessionImpl modifications to be able to keep track of a generic context. This context will keep the operation queues.
>
> *ToDo*
> - Argue on the LuceneSession design and pick one (Steve/Emmanuel/Feel free to join the dance)

I vote for an impl that will allow an existing Session to be the basis of extension, thus not having the Lucene integration be a hardcoded subclass; we did the same for Configuration and that is smelly/inflexible. We should open enough of the session up to allow such delegation. This might be extremely hard and close to impossible, but that is what I wish for ;)

> - Find a way to keep the DocumentBuilder (sort of EntityPersister) at the SessionFactory level rather than the EventListener level (Steve/Emmanuel)

Finding a way of storing structured info/data
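The delegation model with an operation queue, as discussed above, can be sketched roughly as follows. This is only an illustration under heavy assumptions: SketchSession is a two-method stand-in for the Hibernate Session interface, and IndexingSessionDelegate keeps the queue inside the wrapper itself. A wrapper like this only sees calls made through it, which is exactly why the thread talks about a generic context kept by the underlying session instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Tiny stand-in for the Session interface; the real one is far larger. */
interface SketchSession {
    void save(Object entity);
    void flush();
}

/** Plays the role of the regular session that does the persistence work. */
class PlainSession implements SketchSession {
    final List<Object> saved = new ArrayList<>();

    @Override
    public void save(Object entity) {
        saved.add(entity);
    }

    @Override
    public void flush() {
        // a real session would synchronize with the database here
    }
}

/** Wraps any SketchSession and batches index operations until flush(). */
class IndexingSessionDelegate implements SketchSession {
    private final SketchSession delegate;
    private final List<Object> indexQueue = new ArrayList<>(); // operations waiting for flush
    final List<Object> indexed = new ArrayList<>();            // stands in for the Lucene index

    IndexingSessionDelegate(SketchSession delegate) {
        this.delegate = delegate;
    }

    @Override
    public void save(Object entity) {
        delegate.save(entity);
        indexQueue.add(entity); // a change made through the wrapper is queued for indexing
    }

    /** Explicit (re)indexing request: the object is only added to the queue. */
    void index(Object entity) {
        indexQueue.add(entity);
    }

    @Override
    public void flush() {
        delegate.flush();
        indexed.addAll(indexQueue); // process the whole queue in one batch
        indexQueue.clear();
    }
}

class DelegationDemo {
    public static void main(String[] args) {
        PlainSession plain = new PlainSession();
        IndexingSessionDelegate session = new IndexingSessionDelegate(plain);
        session.save("a");
        session.index("b");
        System.out.println("before flush: " + session.indexed);
        session.flush();
        System.out.println("after flush: " + session.indexed);
    }
}
```

Because the wrapper depends only on the interface, another integration could wrap it in turn. The subclassing model cannot offer that.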
Re: [Hibernate] Hibernate Lucene integration
I've worked a lot recently on the Hibernate Lucene integration. Here are the concepts, the new features and the todo list. Please comment and give feedback. My work is committed in branches/Lucene_Integration because we'll probably need to be based on Hibernate 3.3.

*Concepts*

Each time you change an object state, the Lucene index is updated and kept in sync. This is done through the Hibernate event system. Whether an entity is indexed or not and whether a property is indexed or not is defined through annotations. You can also search through your domain model using Lucene and retrieve managed objects. The whole idea here is to do a nice integration between the search engine and the ORM without losing the search engine power, hence most of the API remains. To sum up, query Lucene, get managed objects back.

*Mapping*

A given entity is mapped to an index. A Lucene index is stored in a Directory; a Directory is a Lucene abstract concept for an index storage system. It can be a memory directory (RAMDirectory), a file system directory (FSDirectory) or any other kind of backend. Hibernate Lucene introduces the notion of DirectoryProvider that you can configure and define on a per entity basis (and which is defaulted defaulted). The concept is very similar to ConnectionProvider.

Lucene only works with Strings, so you can define a @FieldBridge which transforms a Java property into a Lucene Field (and potentially vice-versa). A simpler (more useful?) version handles the transformation of a Java property into a String. Some built-in FieldBridges exist. @FieldBridge is very much like a Hibernate Type. Esp I introduced the notion of precision in dates (year, month, .. second, millisecond). FieldBridge and StringBridge give a lot of flexibility in the way to design the property indexing.

*Querying*

I've introduced the notion of LuceneSession, which implements Session and actually delegates to a regular Hibernate Session. This Lucene session has a /createLuceneQuery()/ method and an /index()/ method.

/session.createLuceneQuery(lucene.Query, Class[])/ takes a Lucene query as a parameter and the list of targeted entities. Using a Lucene query as a parameter gives the full Lucene flexibility (no abstraction on top of it). An /org.hibernate.Query/ object is returned. You can (must) use pagination. A Lucene query also returns the number of matching results (regardless of the pagination): query.resultSize(), sort of count(*).

/list()/ returns the list of matching objects. It heavily depends on batch-size to be efficient (ie the proxies are created for all the results and then we initialize them). There might be alternative strategies here (select ... where id in ( , , , ) ), but the real benefit would come if combined with the dynamic fetching profile we talked about a while ago.

/iterate()/ has the same semantics as the regular method in Hibernate, meaning the objects are initialized one by one.

/scroll()/ allows an efficient navigation into the resultset (objects are loaded one by one though).

Having the dynamic fetch profile would definitely be a killer pair (searching the Lucene index, and fetching the appropriate object graph).

/session.index(Object)/ is currently not implemented; it requires some modifications of SessionImpl or of LuceneSession. This feature is useful to initialize / refresh the index in a batch way (ie loading the data and applying the indexing process on this set of data). Basically the object is added to the index queue. At flush() time, the queue is processed.

design considerations:

The delegation vs subclassing strategy for LuceneSession (ie LuceneSession delegating to a regular Session allowing simple wrapping, or the LuceneSessionImpl being a subclass of SessionImpl) is an ongoing discussion. Using a subclassing model would allow the LuceneSession to keep operation queues (for batch indexing either through object changes or through session.index()), but it does not allow a potential Hibernate - XXX integration on the same subclassing model. Batching is essential in Lucene for performance reasons. Using the delegation model requires some SessionImpl modifications to be able to keep track of a generic context. This context will keep the operation queues.

*ToDo*

- Argue on the LuceneSession design and pick one (Steve/Emmanuel/Feel free to join the dance)
- Find a way to keep the DocumentBuilder (sort of EntityPersister) at the SessionFactory level rather than the EventListener level (Steve/Emmanuel)
- Implement the use of FieldBridge for all properties. It is currently used for the id property only (trivial).
- Batch changes: to do that I need to be able to keep a session-related queue of all insert/update changes. I can't in the current design because SessionImpl does not have such a concept and because the LuceneSession is built on the delegation model. We