Re: Minimizing the number of stored fields for Solr

Andrzej Bialecki Sat, 03 Jul 2010 03:13:53 -0700

On 2010-07-03 10:00, Doğacan Güney wrote:

Hey everyone,


This is not really a proposition but rather something I have been wondering
for a while so I wanted to see what everyone is
thinking.

Currently in our solr backend, we have "stored=true indexed=false" fields
and "stored=true indexed=true" fields. The former
class of fields are mostly used for storing digest, caching information etc.
I suggest that we get rid of all "indexed=false" fields and
read all such data from storage backend.

For the latter class of fields (i.e., stored=true indexed=true), I suggest
that we set them to stored=false for everything but "id" field. As an
example currently title is stored/indexed in solr while text is only indexed
(thus, will need to be fetched from storage backend). But for hbase
backend, title and text are already stored close together (in the same
column family) so performance hit of reading just text or reading both
will likely be same. And removing storage from solr may lead to better
caching of indexed fields and may lead to better example.

What does everyone think?

The issue is not as simple as it looks. If you want to have a good performance for searching & snippet generation then you still need to store some data in stored fields - at least url, title, and plain text (not to mention the option to use term vectors in order to speed up the snippet generation). Solr functionality can be also impaired by a lack of data available directly from Lucene storage (field cache, faceting, term vector highlighting).

Some fields of course are not useful for display, but are used for searching only (e.g. anchors). These should be indexed but not stored in Solr. And it's ok to get them from non-solr storage if requested, because it's a rare event. The same goes for the full raw content, if you want to offer a "cached" view - this should not be stored in Solr but instead it should come from a separate layer (note that sometimes cached view might not be in the original format - pdf, office, etc - and instead an html representation may be more suitable, so in general the cached view shouldn't automatically equal the original raw content).

But for other fields I would argue that for now they should remain stored in Solr, *even the full text*, until we figure out how they affect the ability and performance of common search operations. E.g. if we remove the stored "title" field then we need to reach to the storage layer in order to display each page of results... not to mention issues like highlighting, faceting, function queries and a host of other functionalities that Solr can offer just because a field is stored in its index.
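
To make the split concrete, a schema.xml fragment along these lines is roughly what I have in mind. This is only a sketch: the field names and the "string" / "text" type names are assumptions for illustration, not copied from our actual schema, and the term vector settings on "content" are the optional speed-up for snippet generation, not a requirement.

```xml
<!-- Sketch only: field and type names are assumed, not the real schema. -->
<fields>
  <!-- Needed to display every page of results: keep stored in Solr. -->
  <field name="id"      type="string" indexed="true" stored="true"/>
  <field name="url"     type="string" indexed="true" stored="true"/>
  <field name="title"   type="text"   indexed="true" stored="true"/>

  <!-- Plain text stays stored for now; term vectors are optional and
       only there to speed up highlighting / snippet generation. -->
  <field name="content" type="text"   indexed="true" stored="true"
         termVectors="true" termPositions="true" termOffsets="true"/>

  <!-- Used for searching only: indexed but not stored. If someone asks
       for it, fetch it from the storage layer. -->
  <field name="anchor"  type="text"   indexed="true" stored="false"
         multiValued="true"/>

  <!-- Raw content for a "cached" view is deliberately absent here;
       it should come from a separate layer. -->
</fields>
```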

So I'm -0 to this proposal - of course we should review our schema, and of course we should have a mechanism to get data from the storage layer, but what you propose is IMHO a premature optimization at this point.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
