On 2010-07-03 10:00, Doğacan Güney wrote:
Hey everyone,
This is not really a proposal but rather something I have been wondering
about for a while, so I wanted to see what everyone
thinks.
Currently in our Solr backend, we have "stored=true indexed=false" fields
and "stored=true indexed=true" fields. The former
class of fields is mostly used for storing the digest, caching information, etc.
I suggest that we get rid of all "indexed=false" fields and
read all such data from the storage backend.
For the latter class of fields (i.e., stored=true indexed=true), I suggest
that we set them to stored=false for everything but the "id" field. As an
example, title is currently stored and indexed in Solr while text is only
indexed (and thus will need to be fetched from the storage backend). But for
the HBase backend, title and text are already stored close together (in the
same column family), so the performance hit of reading just text or reading
both will likely be the same. And removing storage from Solr may lead to better
caching of indexed fields and may lead to better performance.
What does everyone think?
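
As a rough illustration of the proposal above, here is a minimal Python
sketch of the two steps applied to a field table. The table is hypothetical:
only the field names (id, title, text, digest) and their flags as described
above come from this thread, and apply_proposal is an illustrative helper,
not part of Nutch or Solr.

```python
# Hypothetical field table. Flags follow the description in this thread:
# title is stored and indexed, text is only indexed, digest is only stored.
FIELDS = {
    "id":     {"indexed": True,  "stored": True},
    "title":  {"indexed": True,  "stored": True},
    "text":   {"indexed": True,  "stored": False},
    "digest": {"indexed": False, "stored": True},
}


def apply_proposal(fields, keep_stored=("id",)):
    """Return the field table the proposal would produce.

    1. Drop every indexed=false field (it would be read from the
       storage backend instead).
    2. Set stored=false on every remaining field except keep_stored.
    """
    result = {}
    for name, flags in fields.items():
        if not flags["indexed"]:
            continue  # served by the storage backend, not by Solr
        result[name] = {"indexed": True, "stored": name in keep_stored}
    return result


proposed = apply_proposal(FIELDS)
for name, flags in proposed.items():
    print(name, flags)
```

After the change, only "id" would come back from Solr with a hit; every
other value would have to be read from the storage backend.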
The issue is not as simple as it looks. If you want good performance
for searching and snippet generation, then you still need to store some
data in stored fields: at least url, title, and plain text (not to
mention the option to use term vectors in order to speed up snippet
generation). Solr functionality can also be impaired by a lack of data
available directly from Lucene storage (field cache, faceting, term
vector highlighting).
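
To make the snippet point concrete, here is a deliberately naive Python
sketch. It is not how the Solr or Lucene highlighters work; make_snippet is
a hypothetical helper that only shows that snippet generation needs the
plain text at hand when results are rendered.

```python
def make_snippet(text, term, width=40):
    """Return a window of `text` around the first occurrence of `term`.

    Returns None when the term does not occur. Matching is a plain
    case-insensitive substring search, far simpler than a real highlighter.
    """
    pos = text.lower().find(term.lower())
    if pos < 0:
        return None
    start = max(0, pos - width)
    end = min(len(text), pos + len(term) + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix


# The plain text must be available to the search layer for this to work
# without a second lookup; here it is simply a local string.
stored_text = (
    "Nutch fetches pages, parses them, and sends the plain text "
    "to Solr for indexing."
)
print(make_snippet(stored_text, "plain text", width=15))
```

If the text is not stored in the index, the same call first requires
fetching it from the storage layer for every hit on the page.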
Some fields of course are not useful for display, but are used for
searching only (e.g. anchors). These should be indexed but not stored in
Solr. And it's ok to get them from non-solr storage if requested,
because it's a rare event. The same goes for the full raw content, if
you want to offer a "cached" view - this should not be stored in Solr
but instead it should come from a separate layer (note that sometimes
the cached view might not be in the original format - pdf, office, etc. - and
instead an html representation may be more suitable, so in general the
cached view shouldn't automatically equal the original raw content).
But for other fields I would argue that for now they should remain
stored in Solr, *even the full text*, until we figure out how they
affect the ability and performance of common search operations. E.g. if
we remove the stored "title" field then we need to reach into the storage
layer in order to display each page of results... not to mention issues
like highlighting, faceting, function queries and a host of other
functionalities that Solr can offer just because a field is stored in
its index.
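
A small Python sketch of the display cost, under stated assumptions:
StorageBackend and render_page are hypothetical stand-ins, the documents are
made up, and the only thing measured is how many storage-layer reads one
page of results needs.

```python
class StorageBackend:
    """Stand-in for the non-Solr storage layer (e.g. HBase); counts reads."""

    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def get(self, doc_id, field):
        self.reads += 1
        return self.rows[doc_id][field]


def render_page(hits, backend, display_fields=("url", "title")):
    """Build display rows for one page of results.

    Each hit is a dict of the fields the index returned (its stored
    fields). Any display field missing from the hit costs one
    storage-layer read.
    """
    page = []
    for hit in hits:
        row = {"id": hit["id"]}
        for field in display_fields:
            if field in hit:
                row[field] = hit[field]
            else:
                row[field] = backend.get(hit["id"], field)
        page.append(row)
    return page


rows = {
    "doc1": {"url": "http://example.com/a", "title": "Page A"},
    "doc2": {"url": "http://example.com/b", "title": "Page B"},
}

# Current schema: url and title come back with each hit.
with_stored = StorageBackend(rows)
hits_stored = [{"id": doc_id, **fields} for doc_id, fields in rows.items()]
page_a = render_page(hits_stored, with_stored)

# Proposed schema: only id is stored in the index.
id_only = StorageBackend(rows)
hits_id_only = [{"id": doc_id} for doc_id in rows]
page_b = render_page(hits_id_only, id_only)

print(with_stored.reads, id_only.reads)  # 0 4
```

Both pages are identical, but the id-only variant pays one read per display
field per hit. Whether that is acceptable depends on the backend and on
batching, which is exactly what would need to be measured first.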
So I'm -0 to this proposal - of course we should review our schema, and
of course we should have a mechanism to get data from the storage layer,
but what you propose is IMHO a premature optimization at this point.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com