On 2010-07-03 10:00, Doğacan Güney wrote:
Hey everyone,

This is not really a proposal but rather something I have been wondering
about for a while, so I wanted to see what everyone is thinking.

Currently in our solr backend, we have "stored=true indexed=false" fields
and "stored=true indexed=true" fields. The former class of fields is mostly
used for storing digests, caching information, etc. I suggest that we get
rid of all "indexed=false" fields and read all such data from the storage
backend.

For the latter class of fields (i.e., stored=true indexed=true), I suggest
that we set them to stored=false for everything but the "id" field. As an
example, currently title is stored/indexed in solr while text is only indexed
(thus, it will need to be fetched from the storage backend). But for the hbase
backend, title and text are already stored close together (in the same
column family), so the performance hit of reading just text or reading both
will likely be the same. And removing storage from solr may lead to better
caching of indexed fields and may lead to better example.

What does everyone think?


The issue is not as simple as it looks. If you want good performance for searching & snippet generation then you still need to keep some data in stored fields - at least url, title, and plain text (not to mention the option to use term vectors in order to speed up snippet generation). Solr functionality can also be impaired by a lack of data available directly from Lucene storage (field cache, faceting, term vector highlighting).

Some fields of course are not useful for display, but are used for searching only (e.g. anchors). These should be indexed but not stored in Solr. And it's ok to get them from non-solr storage if requested, because it's a rare event. The same goes for the full raw content, if you want to offer a "cached" view - this should not be stored in Solr but instead it should come from a separate layer (note that sometimes cached view might not be in the original format - pdf, office, etc - and instead an html representation may be more suitable, so in general the cached view shouldn't automatically equal the original raw content).

But for other fields I would argue that for now they should remain stored in Solr, *even the full text*, until we figure out how they affect the ability and performance of common search operations. E.g. if we remove the stored "title" field then we need to reach into the storage layer in order to display each page of results... not to mention issues like highlighting, faceting, function queries and a host of other functionality that Solr can offer just because a field is stored in its index.
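
To make the extra round trip concrete, here is a minimal Python sketch of the display path. It is only an illustration under stated assumptions: FakeIndex, FakeStorage and render_page are hypothetical in-memory stand-ins, not Solr, HBase or Nutch APIs. It counts how many storage lookups one page of results needs when "title" is stored in the index versus when only "id" is.

```python
from dataclasses import dataclass


@dataclass
class FakeIndex:
    """Hypothetical stand-in for a search index: doc id -> stored fields."""
    stored: dict

    def search(self, ids):
        # Each hit carries the id plus whatever the index itself has stored.
        return [{"id": i, **self.stored.get(i, {})} for i in ids]


@dataclass
class FakeStorage:
    """Hypothetical stand-in for the storage backend; counts lookups."""
    rows: dict
    lookups: int = 0

    def get(self, doc_id, column):
        self.lookups += 1
        return self.rows[doc_id][column]


def render_page(hits, storage):
    """Build (id, title) rows, falling back to storage when a hit has no title."""
    page = []
    for hit in hits:
        title = hit.get("title")
        if title is None:
            # Title is not stored in the index: one extra lookup for this hit.
            title = storage.get(hit["id"], "title")
        page.append((hit["id"], title))
    return page


rows = {"a": {"title": "Alpha"}, "b": {"title": "Beta"}}
storage = FakeStorage(rows)

# Title stored in the index: the page is built from the hits alone.
render_page(FakeIndex(rows).search(["a", "b"]), storage)
lookups_with_stored_title = storage.lookups

# Only "id" stored: every displayed hit costs a storage lookup.
render_page(FakeIndex({}).search(["a", "b"]), storage)
lookups_with_id_only = storage.lookups - lookups_with_stored_title
```

In this toy model the second case needs one storage lookup per displayed hit. Whether that cost matters in practice depends on the backend and on batching, which the sketch does not model.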

So I'm -0 to this proposal - of course we should review our schema, and of course we should have a mechanism to get data from the storage layer, but what you propose is IMHO a premature optimization at this point.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
