Re: Payloads and TrieRangeQuery

Michael McCandless Thu, 11 Jun 2009 12:34:12 -0700

On Thu, Jun 11, 2009 at 8:46 AM, Yonik Seeley wrote:


>>> Really goes into Solr land... my pref for Lucene is to remain a core
>>> expert-level full-text search library and keep out things that are
>>> easy to do in an application or at another level.
>>
>> I think this must be the crux of our disagreement.
>
> Indeed.  The itch to scratch w.r.t Solr in Lucene is increased core
> functionality, not more magic (that duplicates what Solr already does,
> but just in a different way and thus makes the lives of Solr
> developers harder).

But... Solr's needs are very different from those of direct users of
Lucene.

I completely agree that Solr needs & wants only the low-level APIs in
Lucene, a raw engine, that doesn't bother with good defaults,
consumability, etc.  Just the raw stuff.  If Lucene existed only for
Solr, we'd be done here.

But Lucene is used by many direct users, and those users benefit from
good defaults & consumability.

For example, Solr would presumably prefer that trie* remain in contrib?

There's a single set of Solr developers, but a very wide range of
direct Lucene users.  I don't see how Lucene having good consumability
actually makes Solr's life harder.  Those raw APIs would still be
accessible to Solr...  simple things should be simple (direct Lucene
users) and complex things should be possible (Solr).

BTW, I don't mean to "pick" on trie*; I think there are many other
examples where we could improve Lucene's consumability.  EG, for
highlighter, you should pretty much always use its SpanScorer; yet,
it's completely non-obvious (even having read the javadocs) how to do
so.  Why isn't this the default scorer for Highlighter?

Such a situation doesn't affect Solr: you all are experts on all
aspects of Lucene, and you can figure it out.  But your average user
will do the obvious thing, then notice that highlighting for phrase
searches is buggy, conclude that Lucene is buggy, and go with the
other search engine they are testing.  It's a trap.

Lucene's consumability is important.

> If we asked on java-user about people's priorities/wishes, I bet
> column stride fields, near real time indexing, and better performance
> would dominate stuff like not having to specify how to sort a field.

I think all of the above are important :)

>> I feel, instead, that Lucene should stand on its own, as a useful
>> search library, with a consumable API, good defaults, etc.  Lucene is
>> more than "the expert level search API that's embedded in
>> Solr". Lucene is consumed directly by apps other than Solr.
>>
>> In fact, I think there are many things in Solr that naturally belong
>> in Lucene (and over time we've been gradually slurping them down).
>> The line/criteria has always been rather blurry...
>
> And conversely, Solr isn't just a wrapper around Lucene and an
> incubator for Lucene technology.

Of course not: there's lots of good stuff in Solr that should stay in
Solr.

But eg the neat analyzers/tokenizers, search filters, faceted nav,
custom collectors, function queries (now diverged), CharFilter (in
progress), improvements to highlighter, etc., should really all be in
Lucene instead (as "modules")?

> Ask Lucene users if they would like pretty much any substantial piece
> of functionality in Solr moved to Lucene as a module and you'll
> probably get an affirmative answer.  But moving something from Solr to
> Lucene can have a lot of negative effects for Solr, including taking
> it out of the hands of Solr committers who aren't Lucene committers,

We should simply make such Solr committers Lucene committers, if they
are indeed working on stuff that should be in Lucene?

> and taking it out of Solr's release cycle and easy ability to change -
> if Solr needs to make a change to one of the moved classes, it's
> necessary to get it through the Lucene change process and then upgrade
> to the latest Lucene trunk - all or nothing.

"Getting through Lucene's change process" should be real simple for
you all :)  And, Solr upgrades Lucene's JAR fairly often already?

> It's also the case that the goals of Lucene classes and Solr classes
> are often very different.  Lucene is more concerned with Java APIs (as
> should be the case), while they are a bit more secondary in Solr...
> the external APIs are of primary importance and one doesn't worry as
> much (or at all) about the classes implementing that interface or its
> Java API back compatibility (as a generalization... it depends on the
> class).

Different, yes, but not incompatible?

>> In Lucene, we should be able to add a NumericField to a document,
>> index it, and then create RangeFilter or Sort on that field and have
>> things "just work".
>
> That feels like a false sense of simplicity, and Lucene isn't for
> dummies ;-)  One needs to understand how things work under the hood to
> avoid shooting oneself in the foot.  You need to understand the memory
> implications of sorting on different fields, and you need to
> understand that to sort on a text field, there really needs to be just
> one token per field.  You need to understand the way Trie is
> indexed, and that multiple values per field won't work if you use a
> precision step less than the word size.

I don't think it's a false simplicity; it's a true one.  NumericField
would enforce one token, during indexing.  SortField("X") would do the
right thing.  The default precisionStep would be good enough.  Yes,
sorting uses memory, and we can't help that (for now) so we have to
document it.  The vast majority of cases where Lucene is used for
numeric sorting/range filtering would be well served by these simple
defaults.
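
To make the precisionStep trade-off concrete, here is a minimal sketch
of the general idea behind trie-style indexing: each numeric value is
indexed as several terms of decreasing precision, one per shift.  This
is a simplified illustration, not Lucene's actual term encoding; the
class name TrieSketch and the method prefixTerms are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class TrieSketch {
    /**
     * Returns one "shift:prefix" label per precision level of a 64-bit value.
     * Hypothetical helper for illustration only; real trie terms are encoded
     * differently.
     */
    static List<String> prefixTerms(long value, int precisionStep) {
        if (precisionStep < 1 || precisionStep > 64) {
            throw new IllegalArgumentException("precisionStep must be in 1..64");
        }
        List<String> terms = new ArrayList<String>();
        // Each level drops precisionStep more low-order bits than the last.
        for (int shift = 0; shift < 64; shift += precisionStep) {
            long prefix = value >>> shift;
            terms.add(shift + ":" + Long.toHexString(prefix));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Smaller steps produce more terms per value; a step of 64
        // produces a single full-precision term.
        for (int step : new int[] {4, 8, 16, 64}) {
            System.out.println("step " + step + " -> "
                    + prefixTerms(1234L, step).size() + " terms");
        }
    }
}
```

Under these assumptions a smaller step means more terms per value in
the index but fewer terms to visit per range, which is why a sensible
default matters for users who never tune it.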

Admittedly, since Lucene doesn't have any means of recording "this was
a trie field" in the index today, such a change would be a big one,
and I don't think we should hold up trie's migration into core because
of this.

Mike
