Re: Realtime Search for Social Networks Collaboration

Jason Rutherglen Sun, 21 Sep 2008 10:33:42 -0700

Agreed, it's a system that is of value to a subset of cases.


On Sat, Sep 20, 2008 at 4:04 PM, Noble Paul നോബിള്‍ नोब्ळ्
<[EMAIL PROTECTED]> wrote:
> Moving back to an RDBMS model will be a big step backwards, where we lose
> multivalued fields and arbitrary fields.
>
> On Tue, Sep 9, 2008 at 4:17 AM, Jason Rutherglen
> <[EMAIL PROTECTED]> wrote:
>> Cool.  I mention H2 because it does have some Lucene code in it yes.
>> Also according to some benchmarks it's the fastest of the open source
>> databases.  I think it's possible to integrate realtime search for H2.
>>  I suppose there is no need to store the data in Lucene in this case?
>> One loses the multiple values per field Lucene offers, and the schema
>> becomes static.  Perhaps it's a trade off?
>>
>> On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado <[EMAIL PROTECTED]> wrote:
>>> Yes, both Marcelo and I would be interested.
>>>
>>> We looked into H2 and it looks like something similar to Oracle's ODCI can
>>> be implemented. Plus the primitive full-text implementation is based on
>>> Lucene.
>>> I say primitive because looking at the code I saw that one cannot define an
>>> Analyzer and for each scan corresponding to a where clause a searcher is
>>> open and closed, instead of having a pool, plus it does not have any way to
>>> queue changes to reduce the use of the IndexWriter, etc.
>>>
>>> But it's open source and that is a great starting point!
>>>
>>> -- Joaquin
>>>
>>> On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen
>>> <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Perhaps an interesting project would be to integrate Ocean with H2
>>>> www.h2database.com to take advantage of both models.  I'm not sure how
>>>> exactly that would work, but it seems like it would not be too
>>>> difficult.  Perhaps this would solve being able to perform faster
>>>> hierarchical queries and perhaps other types of queries that Lucene is
>>>> not capable of.
>>>>
>>>> Is this something Joaquin you are interested in collaborating on?  I
>>>> am definitely interested in it.
>>>>
>>>> On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado <[EMAIL PROTECTED]>
>>>> wrote:
>>>> > On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic
>>>> > <[EMAIL PROTECTED]> wrote:
>>>> >>
>>>> >> Regarding real-time search and Solr, my feeling is the focus should be on
>>>> >> first adding real-time search to Lucene, and then we'll figure out how to
>>>> >> incorporate that into Solr later.
>>>> >
>>>> >
>>>> > Otis, what do you mean exactly by "adding real-time search to Lucene"?
>>>> > Note that Lucene, being an indexing/search library (and not a full-blown
>>>> > search engine), is by definition "real-time": once you add/write a
>>>> > document to the index it becomes immediately searchable, and once a
>>>> > document is logically deleted it is no longer returned in a search,
>>>> > though physical deletion happens during an index optimization.
>>>> >
>>>> > Now, the problem of adding/deleting documents in bulk, as part of a
>>>> > transaction, and making these documents available for search immediately
>>>> > after the transaction is committed sounds more like a search engine
>>>> > problem (i.e. SOLR, Nutch, Ocean), especially if these transactions are
>>>> > known to be I/O expensive and thus are usually implemented as batched
>>>> > processes with some kind of sync mechanism, which makes them non
>>>> > real-time.
>>>> >
>>>> > For example, in my previous life, I designed and helped implement a
>>>> > quasi-realtime enterprise search engine using Lucene, having a set of
>>>> > multi-threaded indexers hitting a set of multiple indexes allocated
>>>> > across different search services which powered a broker-based
>>>> > distributed search interface. The most recent documents provided to the
>>>> > indexers were always added to the smaller in-memory (RAM) indexes, which
>>>> > usually could absorb the load of a bulk "add" transaction and later
>>>> > would be merged into larger disk-based indexes and then flushed to make
>>>> > them ready to absorb new fresh docs.
>>>> > We even had further partitioning of the indexes that reflected time
>>>> > periods, with caps on size for them to be merged into older, more
>>>> > archive-based indexes which were used less (yes, the search engine
>>>> > default search was on data no more than 1 month old, though the user
>>>> > could open the time window by including archives).
>>>> >
>>>> > As for SOLR and OCEAN, I would argue that these semi-structured search
>>>> > engines are becoming more and more like relational databases with
>>>> > full-text search capabilities (without the benefit of full relational
>>>> > algebra -- for example, joins are not possible using SOLR). Notice that
>>>> > "real-time" CRUD operations and transactionality are core DB concepts
>>>> > and have been studied and developed by database communities for quite a
>>>> > long time. There have been recent efforts on how to efficiently
>>>> > integrate Lucene into relational databases (see Lucene JVM ORACLE
>>>> > integration, see
>>>> >
>>>> > http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html)
>>>> >
>>>> > I think we should seriously look at joining efforts with open-source
>>>> > database engine projects written in Java (see
>>>> > http://java-source.net/open-source/database-engines) in order to blend
>>>> > IR and ORM once and for all.
>>>> >
>>>> > -- Joaquin
>>>> >
>>>> >
>>>> >>
>>>> >> I've read Jason's Wiki as well.  Actually, I had to read it a number of
>>>> >> times to understand bits and pieces of it.  I have to admit there is
>>>> >> still some fuzziness about the whole thing in my head - is "Ocean"
>>>> >> something that already works, a separate project on googlecode.com?  I
>>>> >> think so.  If so, and if you are working on getting it integrated into
>>>> >> Lucene, would it make it less confusing to just refer to it as
>>>> >> "real-time search", so there is no confusion?
>>>> >>
>>>> >> If this is to be initially integrated into Lucene, why are things like
>>>> >> replication, crowding/field collapsing, locallucene, name service, tag
>>>> >> index, etc. all mentioned there on the Wiki and bundled with the
>>>> >> description of how real-time search works and is to be implemented?  I
>>>> >> suppose mentioning replication kind-of makes sense because the
>>>> >> replication approach is closely tied to real-time search - all query
>>>> >> nodes need to see index changes fast.  But Lucene itself offers no
>>>> >> replication mechanism, so maybe the replication is something to figure
>>>> >> out separately, say on the Solr level, later on "once we get there".  I
>>>> >> think even just the essential real-time search requires substantial
>>>> >> changes to Lucene (I remember seeing large patches in JIRA), which makes
>>>> >> it hard to digest, understand, comment on, and ultimately commit (hence
>>>> >> the lukewarm response, I think).  Bringing other non-essential elements
>>>> >> into discussion at the same time makes it more difficult to process all
>>>> >> this new stuff, at least for me.  Am I the only one who finds this hard?
>>>> >>
>>>> >> That said, it sounds like we have some discussion going (Karl...), so I
>>>> >> look forward to understanding more! :)
>>>> >>
>>>> >>
>>>> >> Otis
>>>> >> --
>>>> >> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>> >>
>>>> >>
>>>> >>
>>>> >> ----- Original Message ----
>>>> >> > From: Yonik Seeley <[EMAIL PROTECTED]>
>>>> >> > To: [email protected]
>>>> >> > Sent: Thursday, September 4, 2008 10:13:32 AM
>>>> >> > Subject: Re: Realtime Search for Social Networks Collaboration
>>>> >> >
>>>> >> > On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen
>>>> >> > wrote:
>>>> >> > > I also think it's got a
>>>> >> > > lot of things now which makes integration difficult to do properly.
>>>> >> >
>>>> >> > I agree, and that's why the major bump in version number rather than
>>>> >> > minor - we recognize that some features will need some amount of
>>>> >> > rearchitecture.
>>>> >> >
>>>> >> > > I think the problem with integration with SOLR is it was designed
>>>> >> > > with a different problem set in mind than Ocean, originally the
>>>> >> > > CNET shopping application.
>>>> >> >
>>>> >> > That was the first use of Solr, but it actually existed before that
>>>> >> > w/o any defined use other than to be a "plan B" alternative to MySQL
>>>> >> > based search servers (that's actually where some of the parameter
>>>> >> > names come from... the default /select URL instead of /search, the
>>>> >> > "rows" parameter, etc).
>>>> >> >
>>>> >> > But you're right... some things like the replication strategy were
>>>> >> > designed (well, borrowed from Doug to be exact) with the idea that it
>>>> >> > would be OK to have slightly "stale" views of the data in the range
>>>> >> > of minutes.  It just made things easier/possible at the time.  But
>>>> >> > tons of Solr and Lucene users want almost instantaneous visibility of
>>>> >> > added documents, if they can get it.  It's hardly restricted to
>>>> >> > social network applications.
>>>> >> >
>>>> >> > Bottom line is that Solr aims to be a general enterprise search
>>>> >> > platform, and getting as real-time as we can get, and as scalable as
>>>> >> > we can get are some of the top priorities going forward.
>>>> >> >
>>>> >> > -Yonik
>>>> >> >
>>>> >> > ---------------------------------------------------------------------
>>>> >> > To unsubscribe, e-mail: [EMAIL PROTECTED]
>>>> >> > For additional commands, e-mail: [EMAIL PROTECTED]
>>>> >>
>>>> >>
>>>> >>
>>>> >
>>>> >
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>
>
> --
> --Noble Paul
>
>
>
