Re: Ocean Documentation

Ian Boston Tue, 15 Jul 2008 03:12:33 -0700

If you are looking for another example along the same principles (but considerably less sophisticated :) ), see

https://source.sakaiproject.org/svn//search/trunk/search-impl/

This manages a queue of change events on items to be indexed. That queue is processed by a cluster of search indexer machines that creates a stream of transaction logs containing segments. The search server nodes take that stream of transaction logs and merge them into their local indexes, optimizing periodically.
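
A minimal sketch of that flow, assuming an in-memory queue, a list standing in for the transaction log, and a dict standing in for a node's local Lucene index. The names Segment, Indexer and SearchNode are hypothetical and are not taken from the Sakai search-impl code; this only illustrates the queue -> transaction log -> merge idea.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Segment:
    """One transaction-log entry: a batch of updates (hypothetical shape)."""
    txn_id: int
    docs: dict  # doc_id -> text, or None to mark a delete


class Indexer:
    """Drains the shared change queue and appends numbered segments to the log."""

    def __init__(self, queue, log):
        self.queue = queue
        self.log = log

    def run_once(self, batch_size):
        docs = {}
        while self.queue and len(docs) < batch_size:
            doc_id, text = self.queue.popleft()
            docs[doc_id] = text  # a later event for the same item wins
        if docs:
            self.log.append(Segment(txn_id=len(self.log) + 1, docs=docs))


class SearchNode:
    """Replays segments it has not yet seen into its local index."""

    def __init__(self):
        self.index = {}
        self.last_txn = 0

    def catch_up(self, log):
        for seg in log:
            if seg.txn_id <= self.last_txn:
                continue  # already merged
            for doc_id, text in seg.docs.items():
                if text is None:
                    self.index.pop(doc_id, None)
                else:
                    self.index[doc_id] = text
            self.last_txn = seg.txn_id


# Change events: (item id, new content), with None meaning "deleted".
queue = deque([("a", "hello"), ("b", "world"), ("a", "hello again"), ("b", None)])
log = []
indexer = Indexer(queue, log)
indexer.run_once(batch_size=2)
indexer.run_once(batch_size=2)

node = SearchNode()
node.catch_up(log)
```

In a real cluster the log would live on shared storage and several indexers would write to it concurrently, which is where the transaction handling comes in; the sketch leaves all of that out.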

The code is based on Lucene 1.9.1 (waiting for an upgrade) and so performs its own transactions above the Lucene layer (no transactions in 1.9.1 :( ).

It also performs consolidation of the transaction log of segments to reduce the cost of adding a new search node, although this is rather expensive.
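
As a rough illustration of why consolidation helps a new node, the sketch below folds an ordered run of log segments into a single snapshot, so a node starting from an empty index replays one entry instead of many. The (txn_id, docs) tuple shape and the consolidate function are assumptions made for the example, not the format the Sakai code uses.

```python
def consolidate(segments):
    """Fold ordered (txn_id, docs) segments into one snapshot segment.

    docs maps doc_id -> text, with None marking a delete. Deletes are
    dropped from the result, so the snapshot is only valid for a node
    that starts from an empty index.
    """
    merged = {}
    last_txn = 0
    for txn_id, docs in segments:
        for doc_id, text in docs.items():
            if text is None:
                merged.pop(doc_id, None)
            else:
                merged[doc_id] = text
        last_txn = txn_id
    return (last_txn, merged)


segments = [
    (1, {"a": "x", "b": "y"}),
    (2, {"b": None, "c": "z"}),
]
snapshot = consolidate(segments)
```

The expense mentioned above shows up here as well: every document in every segment is touched once per consolidation, so it only pays off when done occasionally.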

The main difference between this and Jackrabbit (which we also use as a JCR) is that Jackrabbit performs indexing on each node, injecting directly into the Lucene index, whereas this parallelizes the indexing operation. So the lag before an item appears in the Jackrabbit index is very low, typically < 1s, but the CPU load of indexing does not scale with the number of indexing nodes. The downsides of parallel indexing are the delay, as documents need to be batched to avoid excessive merge activity, and the network bandwidth consumed by the transaction log and snapshots.
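
The batching trade-off can be sketched as a buffer that flushes when it is either full or too old. The Batcher class and its max_docs and max_age_s thresholds are hypothetical, chosen only to show how batch size bounds merge activity while batch age bounds the indexing delay.

```python
class Batcher:
    """Buffers documents and releases them as a batch on size or age."""

    def __init__(self, max_docs, max_age_s):
        self.max_docs = max_docs
        self.max_age_s = max_age_s
        self.pending = []
        self.oldest = None  # arrival time of the first buffered doc

    def add(self, doc, now):
        """Buffer a doc; return the batch to flush, or None to keep waiting."""
        if not self.pending:
            self.oldest = now
        self.pending.append(doc)
        too_big = len(self.pending) >= self.max_docs
        too_old = now - self.oldest >= self.max_age_s
        if too_big or too_old:
            batch, self.pending = self.pending, []
            return batch
        return None


b = Batcher(max_docs=3, max_age_s=5.0)
r1 = b.add("d1", now=0.0)   # buffered
r2 = b.add("d2", now=1.0)   # buffered
r3 = b.add("d3", now=2.0)   # size threshold reached, batch released
r4 = b.add("d4", now=10.0)  # starts a new batch
r5 = b.add("d5", now=15.0)  # age threshold reached, batch released
```

Larger values for either threshold mean fewer, bigger segments to merge but a longer wait before a change becomes searchable.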

The method we used would never work for Jackrabbit, as it uses the search index for query parsing (JCR-SQL and JCR-XQuery)... and IMHO, the Jackrabbit approach is more elegant... but it would be nice to have it parallelize the indexing operation.


Hope that gives some contrast.
Ian

BTW, I understand Lucene 2.3 is much faster than 1.9, so I should upgrade?



On 14 Jul 2008, at 22:05, Jason Rutherglen wrote:

I took a look at Jackrabbit, which is a very cool animal, and there are similar ideas in the Lucene portion. I will try to take a look at the source to get a better understanding.

On Fri, Jul 11, 2008 at 9:09 AM, Ard Schrijvers <[EMAIL PROTECTED]> wrote:

Hello Jason et al,

Indeed there are plenty of use cases that need instantly updated
searches, for example the JSR-170 (JCR) compliant Jackrabbit
implementation: it relies heavily on Lucene for searching and hierarchy
resolving, and according to the JSR-170 spec, after a save(), changes need to be
visible instantly.

Also, I think a very similar solution to yours is implemented there: see
[1] if you like.

Regards Ard

[1] http://jackrabbit.apache.org/index-readers.html



> I started a wiki name at
> http://wiki.apache.org/lucene-java/OceanRealtimeSearch linked
> from http://wiki.apache.org/lucene-java/LuceneResources.
>
> Perhaps I should add some background on the wiki.  I can add
> a little bit here.  I was an early Solr developer/user at a
> social networking company when Google's GData came out.  It
> looked similar to Solr so I took a look at it.  The one thing
> it had over Solr was realtime updates or the ability to add,
> delete, or update a document and be able to see the update in
> search results immediately.  With Solr the company had
> decided on a 10 minute interval of updating the index with
> delta updates from an Oracle database.  I wanted to see if it
> was possible with Lucene to create an approximation of what
> GData does.  The result is Ocean.
>
> The use case it was designed for is websites with dynamic
> data, some of which are social networking, photo sites,
> discussion boards, blogs, wikis, and such.  More broadly it
> is possible to use Ocean with any application that requires
> the database-like feature of immediate updates.  Probably the
> best example of this is that all of Google's web applications,
> outside of web search, use a GData interface, meaning the
> primary datastore is not MySQL or some equivalent; it is a
> proprietary search-based database.  The best example of this
> is Gmail.  If I receive an email through Gmail I can also
> search on it immediately, there is no 10 minute delay.  Also
> in Gmail I can change labels, a common example being changing
> unread emails to read in bulk.  Presumably Gmail is not
> reindexing the entire email for each label change.
>
> Most highly trafficked web applications do not use the
> relational facilities like joins because they are too
> expensive.  Lucene does not offer joins so this is fine.  The
> only area Lucene is currently weak in is range queries.
> MySQL uses a B-tree index whereas Lucene uses the
> time-consuming TermEnum and TermDocs combination.  This is an area
> Tag Index addresses.
>
> The way Ocean is designed there should be no limitations to
> using it compared to using Lucene IndexWriter.  It offers the
> same functionality.  If one does not want to use the
> transaction log Ocean offers because one simply wants to
> index 1 million documents at once, Ocean offers what is
> called a LargeBatch.  It is a way to perform a large number
> of updates taking advantage of the new IndexWriter speedup,
> combined with transactional semantics.
>
> Karl, does this answer your question or are there areas that
> could use more explanation?
>
>
> On Fri, Jul 11, 2008 at 6:20 AM, Karl Wettin
> <[EMAIL PROTECTED]> wrote:
>
>
>
>       On 10 Jul 2008, at 22:08, Jason Rutherglen wrote:
>
>
>
>               Is there a good place to put Ocean
> https://issues.apache.org/jira/browse/LUCENE-1313
> documentation?  Is there a place on the wiki that is good?
>
>
>
>       Hi Jason,
>
>       the wiki is just fine.
>
>       I've been reading the docs and looked at your patch.
> There is a lot of text about how it does what it does, but it
> says nothing about the intended use. I honestly
> don't even know what you mean by "real time search". You will
> probably get more attention if the documentation starts out
> with some use cases or thoughts on when and why it might make
> sense to use your code.
>
>
>             karl
>
>

>---------------------------------------------------------------------

>       To unsubscribe, e-mail: [EMAIL PROTECTED]

> For additional commands, e-mail: java-dev-[EMAIL PROTECTED]

