If you are looking for another example along the same principles
(but considerably less sophisticated :) ), see
https://source.sakaiproject.org/svn//search/trunk/search-impl/
This manages a queue of change events on items to be indexed. That
queue is processed by a cluster of search indexer machines that
creates a stream of transaction logs containing segments. The search
server nodes take that stream of transaction logs and merge them into
their local indexes, optimizing periodically.
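
To make the flow above concrete, here is a minimal sketch of the idea in Python. It is not the Sakai code: the names (Segment, TransactionLog, run_indexer, SearchNode) are made up for illustration, a plain dict stands in for a Lucene index, and "optimizing" is reduced to a counter.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Segment:
    """A small batch of indexed items (hypothetical shape)."""
    docs: dict  # item id -> indexed text; None marks a deletion


@dataclass
class TransactionLog:
    """Append-only stream of segments shared by indexers and search nodes."""
    entries: list = field(default_factory=list)

    def append(self, segment):
        self.entries.append(segment)


def run_indexer(queue, log, batch_size):
    """Drain up to batch_size distinct items from the queue into one segment."""
    docs = {}
    while queue and len(docs) < batch_size:
        item_id, text = queue.popleft()
        docs[item_id] = text  # a later event for the same item wins
    if docs:
        log.append(Segment(docs))


class SearchNode:
    """Keeps a local index up to date by replaying the transaction log."""

    def __init__(self, optimize_every=10):
        self.index = {}            # stand-in for the local Lucene index
        self.applied = 0           # number of log entries merged so far
        self.optimize_every = optimize_every
        self.optimize_count = 0

    def catch_up(self, log):
        # Merge only the entries this node has not seen yet.
        for segment in log.entries[self.applied:]:
            for item_id, text in segment.docs.items():
                if text is None:
                    self.index.pop(item_id, None)
                else:
                    self.index[item_id] = text
            self.applied += 1
            if self.applied % self.optimize_every == 0:
                self.optimize()

    def optimize(self):
        # Stand-in for the periodic (expensive) index optimization.
        self.optimize_count += 1


# Change events: (item id, new text), with None meaning "item deleted".
queue = deque([("a", "first draft"), ("b", "hello"),
               ("a", "second draft"), ("b", None)])
log = TransactionLog()
while queue:
    run_indexer(queue, log, batch_size=2)

node = SearchNode(optimize_every=2)
node.catch_up(log)
```

Each search node only tracks how far into the log it has merged, so any number of indexers can append segments without the search nodes coordinating with each other.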
The code is based on Lucene 1.9.1 (waiting for an upgrade) and so
performs its own transactions above the Lucene layer (no transactions
in 1.9.1 :( ).
It also performs consolidation of the transaction log of segments to
reduce the cost of adding a new search node, although this is rather
expensive.
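
As a rough illustration of why consolidation helps a new node, here is a small Python sketch. It is only an assumption about the general shape, not the actual implementation: segments are modelled as dicts of item id to text (None marking a deletion), and apply_segment, replay and consolidate are hypothetical names.

```python
def apply_segment(index, segment):
    """Merge one segment into an index in place (None marks a deletion)."""
    for item_id, text in segment.items():
        if text is None:
            index.pop(item_id, None)
        else:
            index[item_id] = text


def replay(log):
    """Build a fresh node's index by applying every log entry in order."""
    index = {}
    for segment in log:
        apply_segment(index, segment)
    return index


def consolidate(log):
    """Collapse the whole log into a one-entry log holding a snapshot.

    This has to read every segment, which is where the cost comes from,
    but a node added afterwards only fetches and applies a single entry.
    """
    return [replay(log)]


log = [
    {"a": "v1", "b": "v1"},
    {"a": "v2"},
    {"b": None, "c": "v1"},
]
compact = consolidate(log)

# A new node sees the same index either way, from far fewer entries.
assert replay(compact) == replay(log)
```

Note that the snapshot drops deletion markers, so in this sketch it is only safe for a node starting from an empty index; a node that has already merged part of the log would need the original entries.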
The main difference between this and Jackrabbit (which we also use as
a JCR) is that Jackrabbit performs indexing on each node, injecting
directly into the Lucene index, whereas this parallelizes the
indexing operation. So the lag before an item appears in the
Jackrabbit index is very low, typically < 1s, but the CPU load of
indexing is not scalable with the number of indexing nodes. The
downside of parallel indexing is the delay, as documents need to be
batched to avoid excessive merge activity, and the network bandwidth
consumed by the transaction log and snapshots.
The method we used would never work for Jackrabbit as it uses the
search index for query parsing (JCR-SQL and JCR-XQuery)...... and
IMHO, the Jackrabbit approach is more elegant... but it would be nice
to have it parallelize the indexing operation.
Hope that gives some contrast.
Ian
BTW, I understand Lucene 2.3 is much faster than 1.9, so I should
upgrade?
On 14 Jul 2008, at 22:05, Jason Rutherglen wrote:
I took a look at Jackrabbit, which is a very cool animal, and
there are similar ideas in the Lucene portion. I will try to take
a look at the source to get a better understanding.
On Fri, Jul 11, 2008 at 9:09 AM, Ard Schrijvers
<[EMAIL PROTECTED]> wrote:
Hello Jason et al,
Indeed there are plenty of use cases that need instantly updated
searches, for example the JSR-170 (JCR) compliant Jackrabbit
implementation: it relies heavily on Lucene for searching and hierarchy
resolving, and according to the JSR-170 spec, after a save(), changes
need to be visible instantly.
Also, I think a very similar solution to yours is implemented
there: See
[1] if you like
Regards Ard
[1] http://jackrabbit.apache.org/index-readers.html
> I started a wiki name at
> http://wiki.apache.org/lucene-java/OceanRealtimeSearch linked
> from http://wiki.apache.org/lucene-java/LuceneResources.
>
> Perhaps I should add some background on the wiki. I can add
> a little bit here. I was an early Solr developer/user at a
> social networking company when Google's GData came out. It
> looked similar to Solr so I took a look at it. The one thing
> it had over Solr was realtime updates or the ability to add,
> delete, or update a document and be able to see the update in
> search results immediately. With Solr the company had
> decided on a 10 minute interval of updating the index with
> delta updates from an Oracle database. I wanted to see if it
> was possible with Lucene to create an approximation of what
> GData does. The result is Ocean.
>
> The use case it was designed for is websites with dynamic
> data, some of which are social networking sites, photo sites,
> discussion boards, blogs, wikis, and such. More broadly it
> is possible to use Ocean with any application that requires
> the database-like feature of immediate updates. Probably the
> best example of this is that all of Google's web applications,
> outside of web search, use a GData interface, meaning the
> primary datastore is not MySQL or some equivalent; it is a
> proprietary search-based database. The best example of this
> is Gmail. If I receive an email through Gmail I can also
> search on it immediately; there is no 10 minute delay. Also
> in Gmail I can change labels, a common example being changing
> unread emails to read in bulk. Presumably Gmail is not
> reindexing the entire email for each label change.
>
> Most highly trafficked web applications do not use the
> relational facilities like joins because they are too
> expensive. Lucene does not offer joins so this is fine. The
> only area Lucene is currently weak in is range queries.
> MySQL uses a B-tree index whereas Lucene uses the
> time-consuming TermEnum and TermDocs combination. This is an
> area Tag Index addresses.
>
> The way Ocean is designed there should be no limitations to
> using it compared to using Lucene IndexWriter. It offers the
> same functionality. If one does not want to use the
> transaction log Ocean offers because one simply wants to
> index 1 million documents at once, Ocean offers what is
> called a LargeBatch. It is a way to perform a large number
> of updates taking advantage of the new IndexWriter speedup,
> combined with transactional semantics.
>
> Karl, does this answer your question or are there areas that
> could use more explanation?
>
>
> On Fri, Jul 11, 2008 at 6:20 AM, Karl Wettin
> <[EMAIL PROTECTED]> wrote:
>
>
>
> 10 jul 2008 kl. 22.08 skrev Jason Rutherglen:
>
>
>
> Is there a good place to put Ocean
> https://issues.apache.org/jira/browse/LUCENE-1313
> documentation? Is there a place on the wiki that is good?
>
>
>
> Hi Jason,
>
> the wiki is just fine.
>
> I've been reading the docs and looked at your patch.
> There is a lot of text about how it does what it does, but it
> says nothing about the intended use. I honestly
> don't even know what you mean by "real time search". You will
> probably get more attention if the documentation starts out
> with some use cases or thoughts on when and why it might make
> sense to use your code.
>
>
> karl
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: java-dev-[EMAIL PROTECTED]