Re: Lucene Scalability Question

J. Delgado Wed, 10 Jan 2007 12:37:59 -0800

No, Oracle Text does not use Lucene. It has its own proprietary
full-text engine. It represents documents, the inverted index and
relationships in a DB schema and it depends heavily on the SQL layer.
This has some severe limitations though...


Of course, you can push structured data into full-text based indexes.
We have seen how in Lucene we can represent some structured data types
(e.g. dates, numbers) as fields and perform some type of mixed queries
but the Lucene index, as some of you have pointed out, is not meant
for this and does not scale like a DB would.
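
A minimal sketch of the field-encoding idea, assuming numbers are indexed as plain string terms and range checks compare those terms lexicographically (which is why the quoted example pads its values, e.g. "+0001"). The class PaddedNumbers and its methods are hypothetical helpers, not part of Lucene:

```java
import java.util.Locale;

// Hypothetical helper (not a Lucene class): encodes non-negative ints as
// fixed-width, zero-padded strings so that String order matches numeric order.
final class PaddedNumbers {
    static final int WIDTH = 10; // enough digits for any non-negative int

    static String encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need a different scheme");
        }
        return String.format(Locale.ROOT, "%0" + WIDTH + "d", value);
    }

    // Inclusive range test done purely on the string form, the way a
    // term-based range filter would see the values.
    static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        // Unpadded: "9" sorts after "10", so a string range 2..10 misses 9.
        System.out.println(inRange("9", "2", "10"));
        // Padded: the same range check behaves numerically.
        System.out.println(inRange(encode(9), encode(2), encode(10)));
    }
}
```

This only covers non-negative integers; signed values, floats and dates would each need their own order-preserving encoding.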

I'm looking to hear new ideas people may have to solve this very hard problem.

-- Joaquin

2007/1/10, robert engels <[EMAIL PROTECTED]>:

I think the contrib 'Oracle Full Text' does this (although in the
reverse).

It uses Lucene for full-text queries (embedded into the db), and the
query analyzer works.

It is really a great piece of software. Too bad it can't be done in a
standard way so that it would work with all dbs.

I think it may be possible to embed Apache Derby to do
something like this, although this might be overkill. A simple b-tree
db might work best.

It would be interesting if the documents could be stored in a btree,
and a GUID used to access them (since the lucene docid is constantly
changing). The only stored field in a lucene Document would be the GUID.
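
A rough sketch of the GUID idea above, using java.util.TreeMap as an in-memory stand-in for an on-disk b-tree. GuidDocumentStore and its methods are hypothetical names chosen for illustration, and the Lucene side (a Document whose only stored field is the GUID) is assumed rather than shown:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.UUID;

// Hypothetical store: document bodies live in a sorted map keyed by GUID,
// so they can be fetched by a stable id instead of a changing Lucene docid.
final class GuidDocumentStore {
    // TreeMap is only a stand-in here for a persistent b-tree.
    private final TreeMap<String, String> bodies = new TreeMap<>();

    // Stores a body and returns the GUID that the search index would keep.
    String put(String body) {
        String guid = UUID.randomUUID().toString();
        bodies.put(guid, body);
        return guid;
    }

    // Returns the body for a GUID, or null if the GUID is unknown.
    String get(String guid) {
        return bodies.get(guid);
    }

    // Resolves the GUIDs returned by a search into document bodies,
    // skipping any GUID that is no longer in the store.
    List<String> resolve(List<String> guids) {
        List<String> result = new ArrayList<>();
        for (String guid : guids) {
            String body = bodies.get(guid);
            if (body != null) {
                result.add(body);
            }
        }
        return result;
    }
}
```

A search would then return GUIDs only, and resolve(...) would do the document fetch against the b-tree.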

On Jan 10, 2007, at 2:21 PM, J. Delgado wrote:

> This is a more general question:
>
> Given the fact that most applications require querying a combination
> of full-text and structured data, has anyone looked into building data
> structures at the most fundamental level (e.g. a combination of b-trees
> and inverted lists) that would enable scalable and performant
> structured (e.g. SQL or XQuery) + full-text queries?
>
> Can Lucene be taken as a basis for this, or do you recommend exploring
> other routes?
>
> -- Joaquin
>
> 2007/1/10, Chris Hostetter <[EMAIL PROTECTED]>:
>>
>> : So you mean lucene can't do better than this ?
>>
>> robert's point is that based on what you've told us, there is no reason to
>> think Lucene makes sense for you -- if *all* you are doing is finding
>> documents based on numeric ranges, then a relational database is better
>> suited to your task.  if you actually care about the textual IR features
>> of Lucene, then there are probably ways to make your searches faster, but
>> you aren't giving us enough information.
>>
>> you said the example code you gave was in a loop ... but a loop over what?
>> .. what changes with each iteration of the loop? ... if there are
>> RangeFilters that get reused more than once, CachingWrapperFilter can come
>> in handy to ensure that work isn't done more often than it needs to be.
>>
>> it's also not clear whether your query on "type:0" is just a placeholder,
>> or indicative of what you actually want to do in the long run ... if all
>> of your queries are this simple, and all you care about is getting a count
>> of things that have type:0 and are in your numeric ranges, then don't use
>> the "search" method at all, just put "type:0" in your ChainedFilter and
>> call the "bits" method directly.
>>
>> you also haven't given us any information about whether or not you are
>> opening a new IndexSearcher/IndexReader every time you execute a query, or
>> reusing the same instance -- reuse makes the performance much better
>> because it can reuse underlying resources.
>>
>> In short: if you state some performance numbers from timing some code, and
>> want to know how to make that code faster, you have to actually show people
>> *all* of the code for them to be able to help you.
>>
>>
>> : >>  I still have the search problem I had before, now search takes around
>> : >> 750 msecs for a small set of documents.
>> : >>
>> : >>     [java] Total Query Processing time (msec) : 38745
>> : >>     [java] Total No. of Documents : 7,500,000
>> : >>     [java] Total No. of Executed queries : 50.0
>> : >>     [java] Execution time per query : 774.9 msec
>> : >>
>> : >>  The index is optimized and its size is 830 MB.
>> : >>  Each document has the following terms :
>> : >>     VSID(integer), data(float), type(short int) , precision (byte).
>> : >>   The queries are generated in a loop similar to the one below :
>> : >> loop ...
>> : >>     RangeFilter rq1 = new RangeFilter("data","+5.43243243440000","+5.43243243449999",true,true);
>> : >>     RangeFilter rq2 = new RangeFilter("precision","+0001","+0002",true,true);
>> : >>     ChainedFilter cf = new ChainedFilter(new Filter[]{rq2,rq1},ChainedFilter.AND);
>> : >>     Query query = qp.parse("type:0");
>> : >>     Hits hits = searcher.search(query,cf);
>> : >> end loop
>> : >>
>> : >>  I would like to know if there exists any solution to improve the search
>> : >> time ?  (I need to insert more than 500 million of these data pages into
>> : >> lucene)
>>
>>
>>
>>
>> -Hoss
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>