Erick Erickson created SOLR-6888:
------------------------------------

             Summary: Find a way to avoid decompressing entire blocks for just the docId/uniqueKey
                 Key: SOLR-6888
                 URL: https://issues.apache.org/jira/browse/SOLR-6888
             Project: Solr
          Issue Type: Improvement
    Affects Versions: 5.0, Trunk
            Reporter: Erick Erickson
            Assignee: Erick Erickson


Assigning this to myself just so I don't lose track of it, but I won't be working 
on it in the near term; anyone feeling ambitious should feel free to grab it.

Note: docId as used here means whatever field is defined as the <uniqueKey>...

Since Solr 4.1, stored-field compression/decompression has been based on 16K 
blocks, and it is automatic and not configurable. So to get a single stored 
value, one must decompress at least one entire 16K block.

For SolrCloud (and distributed processing in general), we make two trips: one 
to get the doc ID and score (or other sort criteria), and one to return the 
actual data.

The first pass requires that each sub-request return its top N docIDs and sort 
criteria, which means each and every sub-request has to unpack at least one 16K 
block (and often more) just to get the doc IDs. So if we have 20 shards and 
only want 20 rows, the shards collectively decompress blocks for 400 candidate 
documents while only 20 are ever used; 95% of the decompression cycles are 
wasted. Not to mention all the disk reads.
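
As a rough sketch of where that waste occurs (this is a simplification, not the 
actual Solr code path; the "id" field name and rows value are illustrative), 
pass one on a single shard effectively does this:

{code:java}
import java.io.IOException;
import java.util.Collections;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

class FirstPassSketch {
  // Pass one only needs (uniqueKey, score) per hit, yet every doc(...)
  // call below decompresses the entire stored-fields block that happens
  // to contain that document.
  static void firstPass(IndexSearcher searcher, Query query, int rows) throws IOException {
    TopDocs top = searcher.search(query, rows);
    for (ScoreDoc sd : top.scoreDocs) {
      Document d = searcher.doc(sd.doc, Collections.singleton("id")); // block decompressed here
      String uniqueKey = d.get("id");
      // Ship (uniqueKey, sd.score) back to the coordinator; with 20 shards
      // and rows=20 that's 400 decompressions for the 20 docs actually kept.
    }
  }
}
{code}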

It seems like we should be able to do better than that. Can we argue that doc 
ids are 'special' and should be cached somehow? Let's discuss what this would 
look like. I can think of a couple of approaches:

1> Since doc IDs are "special", can we say that for this purpose returning the 
indexed version is OK? We'd need to return the actual stored value when the 
full doc was requested, but for the sub-request alone, what about returning the 
indexed value instead of the stored one? On the surface I don't see a problem 
here, but what do I know? Storing these as DocValues seems useful in this case 
(see the sketch after this list).

1a> A variant is treating numeric docIds specially, since the indexed value and 
the stored value should be identical. DocValues would again be useful here, it 
seems. But this seems an unnecessary specialization if <1> is implemented well.

2> We could cache individual doc IDs, although I'm not sure how much use that 
really is. Would maintaining the cache overwhelm the savings of not 
decompressing? I really don't like this idea, but am throwing it out there. 
Populating such a cache from stored data up front would essentially mean 
decompressing every doc, so that seems untenable.

3> We could maintain an array[maxDoc] that held document IDs, perhaps lazily 
initializing it. I'm not particularly a fan of this either; it doesn't seem 
like a Good Thing. I can see lazy loading being almost, but not quite, totally 
useless, i.e. a hit ratio near 0, especially since the array would be thrown 
out on every openSearcher.
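
To make <1>/<1a> concrete, here's a minimal sketch of what the sub-request 
could do instead, assuming the uniqueKey field also carries DocValues (the 
"id" field name is illustrative, and this uses the Lucene 5.x random-access 
DocValues API). No stored-fields block is touched, so nothing is decompressed:

{code:java}
import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.ReaderUtil;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.util.BytesRef;

class DocValuesIdSketch {
  // <1>: resolve the uniqueKey for a (global) docID from SortedDocValues;
  // the stored value is never read, so no block decompression happens.
  static String idFromDocValues(IndexSearcher searcher, int globalDoc) throws IOException {
    List<LeafReaderContext> leaves = searcher.getIndexReader().leaves();
    LeafReaderContext leaf = leaves.get(ReaderUtil.subIndex(globalDoc, leaves));
    SortedDocValues ids = leaf.reader().getSortedDocValues("id"); // assumes docValues on "id"
    BytesRef id = ids.get(globalDoc - leaf.docBase);
    return id.utf8ToString();
  }

  // <1a>: for a numeric uniqueKey the indexed and stored values should be
  // identical, so NumericDocValues can stand in for the stored id directly.
  static long numericId(LeafReader reader, int localDoc) throws IOException {
    NumericDocValues ids = reader.getNumericDocValues("id");
    return ids.get(localDoc);
  }
}
{code}

If the indexed/stored interchangeability holds, the first pass never needs to 
open the stored-fields files at all.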


Really, the only one of these that seems viable is <1>/<1a>. The others would 
all involve decompressing the docs anyway to get the ID, and I suspect that 
caching would be of very limited usefulness. I guess <1>'s viability hinges on 
whether, for internal use, the indexed form of the docId is interchangeable 
with the stored value.

Or are there other ways to approach this? Or isn't it something to really worry 
about?


