Tatu Saloranta writes:
> On Wednesday 19 March 2003 01:44, Morus Walter wrote:
> 
> This might still be a feasible thing to do, except if number of collections 
> changes very frequently (as you need to reindex all docs, not just 
> incremental).
> 
Well, the number is slowly growing.

> Another possibility would be to have a new kind of Query; one to use with 
> numeric field values (probably would be easiest to use hex numbers). In a way 
> it'd be a specialized/optimized version of WildcardQuery.
> 
> For example, one could define required bit pattern after ORing field value 
> with mask (in your case you'd use one bit per type, and require 
> non-interesting type flags to be zeroes, knowing that then at least one other 
> bit, matching interesting type, is one).

Actually, that's what we are doing right now with our current fulltext engine.
This engine supports bitmasks: you can apply a bitwise AND between a 32-bit
index field and a query term. It counts as a match when the result is not
zero. And a bitwise AND can be computed very fast.
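
To make the matching rule concrete, here is a minimal sketch in Python. It is
not code from our engine; the collection names and bit assignments are made up
for illustration, and only the "AND is non-zero" test comes from the
description above.

```python
# Hypothetical bit assignment: one bit per collection (names are invented).
COLLECTION_BITS = {"news": 1 << 0, "manuals": 1 << 1, "faq": 1 << 2}


def mask_for(collections):
    """OR together the bits of the given collections into one 32-bit mask."""
    mask = 0
    for name in collections:
        mask |= COLLECTION_BITS[name]
    return mask & 0xFFFFFFFF


def matches(field_value, query_mask):
    """A document matches when (field AND mask) is not zero."""
    return (field_value & query_mask) != 0


# A document that belongs to two collections.
doc_field = mask_for(["news", "faq"])
```

A query for "faq" matches this document, while a query for "manuals" alone
does not, because the two bit sets do not overlap.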

> Implementing this would be fairly easy; first find the range (like RangeQuery 
> does), and iterate over all existing terms in that range, and for each match 
> against bit pattern, and add term if it matches the pattern.
> 
> Actual search would then search pretty much like prefix, wildcard or range 
> query, as Terms at that point have been expanded and search part need not 
> care how they were obtained.
> 
> This would make representation more compact (4 bits in a char instead of one), 
> potentially making index bit smaller (which usually also means faster). And 
> of course if you really want to push the limit, you could use even more 
> efficient encoding (although, assuming indexes use UTF-8, base64 might be 
> almost as efficient as it gets, as ascii chars only take one byte whereas 
> upper chars take anywhere from 2 to 7 [for unicode-3? 4 for UC2] bytes).
> 
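
The expansion step described above (walk the existing terms and keep those
that match the bit pattern) might look roughly like this. It is only a sketch
in plain Python, not Lucene code: the term list is invented, terms are assumed
to be 8 hex digits per 32-bit value, the match test is the simple "AND is
non-zero" rule rather than the mask-plus-required-pattern variant, and the
initial range restriction is left out.

```python
def expand_terms(indexed_terms, mask):
    """Return the hex-encoded terms that share at least one bit with mask."""
    return [t for t in sorted(indexed_terms) if int(t, 16) & mask]


# Hypothetical terms that actually occur in the index.
terms = {"00000001", "00000002", "00000005", "00000006"}

# Only the terms with bit 2 (value 0x4) set survive the expansion.
expanded = expand_terms(terms, 0x4)
```

The surviving terms would then be searched like the expanded terms of a
prefix or wildcard query, that is, as alternatives.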
Ok. The advantage would be a shorter index. But there isn't an advantage
in how the query is executed.
So I could get a similar advantage from my approach 1 by simply using
two-character codes for the collection name and creating a list of
OR-combined queries. If I use 2 characters from 0-9 and a-z, I get
36 * 36 = 1296 possible combinations, which should be sufficient for quite
some time.
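
As a rough sketch of that idea (plain Python, nothing Lucene-specific): the
field name "coll" and the numbering of collections are assumptions for
illustration. With 36 symbols and two positions there are 36 * 36 = 1296
distinct codes.

```python
import string

# 0-9 followed by a-z: 36 symbols.
ALPHABET = string.digits + string.ascii_lowercase


def code_for(index):
    """Map a collection number (0..1295) to a two-character code."""
    if not 0 <= index < len(ALPHABET) ** 2:
        raise ValueError("collection number out of range")
    high, low = divmod(index, len(ALPHABET))
    return ALPHABET[high] + ALPHABET[low]


def or_query(field, indexes):
    """Build an OR-combined query string over the given collection numbers."""
    return " OR ".join(f"{field}:{code_for(i)}" for i in indexes)


# Hypothetical field name "coll"; collections 0 and 36 are arbitrary examples.
query = or_query("coll", [0, 36])
```

Each collection costs two characters per document in the index, and the query
is a plain list of alternatives.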
> 
> Anyway, just an idea I thought might be worth sharing,
> 
Well, that's what I was looking for :-)

Thanks a lot,
       Morus
