Re: Lucene Scalability Question

J. Delgado Wed, 10 Jan 2007 13:04:04 -0800

This sounds very interesting... I'll definitely have a look into it.
However, I have the feeling that, like Oracle Text, this keeps the
underlying data structures used for evaluating full-text conditions
separate from those used for conditions over other data types, which
brings up other issues when trying to do full-blown mixed queries.
Things get worse when doing joins and other relational algebra
operations.


I'm still wondering if the basic data structures should be revised to
achieve better performance...
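
As a rough illustration of the kind of combined structure in question,
here is a toy sketch, not a proposal for Lucene's internals. The
MixedIndex class and its methods are made up for this example: a
TreeMap stands in for a b-tree over one numeric field, a HashMap of bit
sets stands in for the inverted lists, and both share the same document
ids so a mixed query is a single intersection.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: one structure holding an inverted list per term and a
// sorted (b-tree-like) map per numeric value, both over the same doc ids.
class MixedIndex {
    private final Map<String, BitSet> postings = new HashMap<>();    // term -> doc ids
    private final TreeMap<Double, BitSet> byValue = new TreeMap<>(); // value -> doc ids

    void add(int docId, String text, double value) {
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new BitSet()).set(docId);
            }
        }
        byValue.computeIfAbsent(value, v -> new BitSet()).set(docId);
    }

    // Documents containing `term` whose value lies in [low, high], both inclusive.
    BitSet search(String term, double low, double high) {
        BitSet result = new BitSet();
        BitSet termDocs = postings.get(term.toLowerCase());
        if (termDocs == null || low > high) {
            return result; // unknown term or empty range: no matches
        }
        // Union the doc ids of every value in the range, then intersect with the term.
        for (BitSet docs : byValue.subMap(low, true, high, true).values()) {
            result.or(docs);
        }
        result.and(termDocs);
        return result;
    }
}
```

Even in this toy form the range side and the text side are still two
separate lookups that get merged afterwards, which is exactly the part
that gets harder once joins are involved.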

-- Joaquin

2007/1/10, robert engels <[EMAIL PROTECTED]>:

There is a module in Lucene contrib that changes that! It loads
Lucene into the Oracle database (it has a JVM), and allows Lucene
syntax to perform full-text searching.

On Jan 10, 2007, at 2:37 PM, J. Delgado wrote:

> No, Oracle Text does not use Lucene. It has its own proprietary
> full-text engine. It represents documents, the inverted index and
> relationships in a DB schema and it depends heavily on the SQL layer.
> This has some severe limitations though...
>
> Of course, you can push structured data into full-text based indexes.
> We have seen how in Lucene we can represent some structured data types
> (e.g. dates, numbers) as fields and perform some type of mixed queries
> but the Lucene index, as some of you have pointed out, is not meant
> for this and does not scale like a DB would.
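
A minimal sketch of how a number can be represented as a text field so
that range comparisons on terms still work, assuming non-negative
integers only. The PaddedNumber helper is hypothetical and not part of
Lucene; it only shows why fixed-width padding (as in the "+0001" style
values quoted further down in this thread) is needed when terms are
compared as strings.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene: encodes non-negative ints as
// fixed-width strings so that String order equals numeric order.
class PaddedNumber {
    static final int WIDTH = 10; // enough digits for any non-negative int

    static String encode(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("this sketch only handles non-negative values");
        }
        return String.format(Locale.ROOT, "%0" + WIDTH + "d", n);
    }

    // Inclusive range test on the encoded form, comparing terms as strings.
    static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }
}
```

Without the padding, "9" sorts after "10" as a string, so a range over
unpadded terms would return the wrong documents.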
>
> I'm looking to hear new ideas people may have to solve this very
> hard problem.
>
> -- Joaquin
>
> 2007/1/10, robert engels <[EMAIL PROTECTED]>:
>> I think the contrib 'Oracle Full Text' does this (although in the
>> reverse).
>>
>> It uses Lucene for full-text queries (embedded into the db), and the
>> query analyzer works.
>>
>> It is really a great piece of software. Too bad it can't be done in a
>> standard way so that it would work with all dbs.
>>
>> I think it may be possible to embed Apache Derby to do
>> something like this, although this might be overkill. A simple b-tree
>> db might work best.
>>
>> It would be interesting if the documents could be stored in a btree,
>> and a GUID used to access them (since the lucene docid is constantly
>> changing). The only stored field in a lucene Document would be the
>> GUID.
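
A minimal sketch of the GUID idea above, assuming a TreeMap as a
stand-in for the b-tree and java.util.UUID as the GUID source. The
GuidDocumentStore class is hypothetical; a real setup would put the
returned GUID into the only stored field of the indexed document.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

// Hypothetical sketch: documents live in a sorted map (standing in for a
// b-tree) keyed by a stable GUID; the search index would store only the GUID.
class GuidDocumentStore {
    private final Map<String, String> documents = new TreeMap<>();

    // Stores the document body and returns the GUID to index alongside it.
    String put(String body) {
        String guid = UUID.randomUUID().toString();
        documents.put(guid, body);
        return guid;
    }

    // Looks up a document by GUID; returns null if the GUID is unknown.
    String get(String guid) {
        return documents.get(guid);
    }
}
```

Because the key never changes, it stays valid even when the index is
rebuilt and the internal doc ids move.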
>>
>> On Jan 10, 2007, at 2:21 PM, J. Delgado wrote:
>>
>> > This is a more general question:
>> >
>> > Given the fact that most applications require querying a combination
>> > of full-text and structured data, has anyone looked into building data
>> > structures at the most fundamental level (e.g. a combination of b-tree
>> > and inverted lists) that would enable scalable and performant
>> > structured (e.g. SQL or XQuery) + full-text queries?
>> >
>> > Can Lucene be taken as basis for this or do you recommend exploring
>> > other routes?
>> >
>> > -- Joaquin
>> >
>> > 2007/1/10, Chris Hostetter <[EMAIL PROTECTED]>:
>> >>
>> >> : So you mean lucene can't do better than this ?
>> >>
>> >> robert's point is that based on what you've told us, there is no
>> >> reason to think Lucene makes sense for you -- if *all* you are doing
>> >> is finding documents based on numeric ranges, then a relational
>> >> database is better suited to your task.  if you actually care about
>> >> the textual IR features of Lucene, then there are probably ways to
>> >> make your searches faster, but you aren't giving us enough
>> >> information.
>> >>
>> >> you said the example code you gave was in a loop ... but a loop over
>> >> what? .. what changes with each iteration of the loop? ... if there
>> >> are RangeFilters that get reused more than once,
>> >> CachingWrapperFilter can come in handy to ensure that work isn't
>> >> done more often than it needs to be.
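
To show the effect of caching a filter that repeats across loop
iterations, here is a minimal sketch in plain Java. It is not Lucene's
CachingWrapperFilter and does not use the Lucene API; the
FilterBitsCache class, its string key and the Supplier argument are
assumptions made for illustration.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Illustration only: this is not Lucene's CachingWrapperFilter, just the idea
// of computing a filter's bit set once and reusing it on later iterations.
class FilterBitsCache {
    private final Map<String, BitSet> cache = new HashMap<>();
    private int computations = 0;

    // `key` identifies the filter (e.g. field plus bounds);
    // `compute` performs the expensive scan when the key is not cached yet.
    BitSet bitsFor(String key, Supplier<BitSet> compute) {
        BitSet bits = cache.get(key);
        if (bits == null) {
            bits = compute.get();
            computations++;
            cache.put(key, bits);
        }
        return bits;
    }

    // Number of times an expensive scan actually ran.
    int computations() {
        return computations;
    }
}
```

If the "precision" range is the same on every one of the 50 iterations,
the scan for it runs once instead of 50 times; only the filter whose
bounds change needs recomputing.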
>> >>
>> >> it's also not clear whether your query on "type:0" is just a
>> >> placeholder, or indicative of what you actually want to do in the
>> >> long run ... if all of your queries are this simple, and all you
>> >> care about is getting a count of things that have type:0 and are in
>> >> your numeric ranges, then don't use the "search" method at all, just
>> >> put "type:0" in your ChainedFilter and call the "bits" method
>> >> directly.
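
The counting approach can be pictured as a plain intersection of bit
sets, one per condition, with no scoring involved. The sketch below uses
only java.util.BitSet; the FilterCount class is hypothetical, and the
bit sets stand in for whatever each filter would produce for "type:0"
and the two ranges.

```java
import java.util.BitSet;

// Illustration only: counting matches by AND-ing per-condition bit sets,
// with no scoring and no result objects.
class FilterCount {
    static int countAll(BitSet... conditions) {
        if (conditions.length == 0) {
            return 0;
        }
        // Work on a copy so the (possibly cached) inputs stay untouched.
        BitSet result = (BitSet) conditions[0].clone();
        for (int i = 1; i < conditions.length; i++) {
            result.and(conditions[i]);
        }
        return result.cardinality();
    }
}
```

Cloning the first bit set matters if the inputs are cached and shared
between queries, since BitSet.and modifies its receiver in place.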
>> >>
>> >> you also haven't given us any information about whether or not you
>> >> are opening a new IndexSearcher/IndexReader every time you execute a
>> >> query, or reusing the same instance -- reuse makes the performance
>> >> much better because it can reuse underlying resources.
>> >>
>> >> In short: if you state some performance numbers from timing some
>> >> code, and want to know how to make that code faster, you have to
>> >> actually show people *all* of the code for them to be able to help
>> >> you.
>> >>
>> >>
>> >> : >>  I still have the search problem I had before, now search takes
>> >> : >> around 750 msecs for a small set of documents.
>> >> : >>
>> >> : >>     [java] Total Query Processing time (msec) : 38745
>> >> : >>     [java] Total No. of Documents : 7,500,000
>> >> : >>     [java] Total No. of Executed queries : 50.0
>> >> : >>     [java] Execution time per query : 774.9 msec
>> >> : >>
>> >> : >>  The index is optimized and its size is 830 MB.
>> >> : >>  Each document has the following terms :
>> >> : >>     VSID(integer), data(float), type(short int), precision(byte).
>> >> : >>   The queries are generated in a loop similar to the one below :
>> >> : >> loop ...
>> >> : >>     RangeFilter rq1 = new RangeFilter(
>> >> : >>         "data", "+5.43243243440000", "+5.43243243449999", true, true);
>> >> : >>     RangeFilter rq2 = new RangeFilter(
>> >> : >>         "precision", "+0001", "+0002", true, true);
>> >> : >>     ChainedFilter cf = new ChainedFilter(
>> >> : >>         new Filter[]{rq2, rq1}, ChainedFilter.AND);
>> >> : >>     Query query = qp.parse("type:0");
>> >> : >>     Hits hits = searcher.search(query,cf);
>> >> : >> end loop
>> >> : >>
>> >> : >>  I would like to know if there exists any solution to improve the
>> >> : >> search time ?  (I need to insert more than 500 million of these
>> >> : >> data pages into lucene)
>> >>
>> >>
>> >>
>> >>
>> >> -Hoss
>> >>
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> >> For additional commands, e-mail: [EMAIL PROTECTED]