[jira] [Updated] (SOLR-6888) Find a way to avoid decompressing entire blocks for just the docId/uniqueKey

Erick Erickson (JIRA) Fri, 26 Dec 2014 13:03:33 -0800

     [ 
https://issues.apache.org/jira/browse/SOLR-6888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Erick Erickson updated SOLR-6888:
---------------------------------
    Attachment: SOLR-6888-hacktiming.patch

I did some crude hacking to get timings and it may cost a node about 10% [1] of 
its processing time to decompress in order to get the ID for the first pass of 
a distributed response. Pretty much driving my system at max CPU.

tl;dr

This isn't even a sketch of a patch, just a crude hack that allowed me to 
collect some timings. Preserving here for posterity, it shows where I _think_ 
we'd need to start looking if we change this.

I used DocValues for the <uniqueKey> field here just to make it easy to pull 
the values from the index rather than the stored value, certainly not something 
we'd want to require. Not to mention that I figured out whether to load from 
the index or the stored value based on whether the only two fields being 
requested were "id" and "score"........ Like I said, not even the sketch of a 
patch.

Anyway, details:
Ran tests on my macbook pro with an SSD. 4 shards, 11+M Wikipedia docs. Created 
about 10K queries that are just 5 random terms collected during indexing and 
OR'd together. Divided these amongst 5 threads running from SolrJ to collect 
some stats for 2 minutes. I create 5 lists of 2K queries each and pass one list 
to each worker thread. The algorithm is deterministic in that list 1 during run 
1 will have the same queries as list 1 during run 2. list1 will be different 
from list2 (or 3 or 4 or 5) _within_ a single run.

The numbers below are from the 2nd run to get past low-level cache filling and 
the like.

I did record the total number of hits in each thread for the first 1K queries, 
and they're identical for both spoofing and not spoofing so I'm encouraged that 
this approach isn't totally insane.

documentCache, filterCache and queryResultCache all have size==0

The interesting things I think are the nano times recorded for decompressing as 
opposed to spoofing by getting the ID from the index.

When spoofing, my 2 minute run (4 shards remember) spent about 15B nanoseconds 
decompressing and about 4B nanoseconds filling in from the indexed ids. [1]

When running normally, the time spent in decompressing was about 71B 
nanoseconds and, of course, 0 filling in from indexed IDs. [1]

So in round numbers, on this test decompressing for reading just the ID on the 
first pass cost each of my shards about 13 (52/4) seconds of a 2 minute run, or 
roughly 10%. My CPUs are pretty well pegged during querying...

All this within the usual caveats about how much you can trust 
System.nanoTime() and whether I collected the right things.

Additional notes:

1) This was probably best-case decompression, as I think each and every 
document fits in a 16K buffer, so only one buffer needs to be decompressed.

2) I didn't pursue it very thoroughly, but the mechanics of this might be 
accomplished by abusing the "onlyPseudoFields" bits of the code and treating 
index lookups as "super-extra-special pseudo fields".

3) I only used this for the "id" field. Could a similar argument be made for 
all the field-based sort criteria since there's only supposed to be 1 per doc? 

[1] Just noticed that my log files roll over every 4M, and I don't particularly 
want to untangle that now. But on a quick glance, it looks like they have about 
90 of 120 seconds of data, so these numbers may be low by about 1/3, i.e. it 
may be costing the shards 17-18 seconds of a 120 second run or approx. 14%.

Anyway, I'm not going to pursue this in the near term but thought it was worth 
preserving what I think I'm seeing.

> Find a way to avoid decompressing entire blocks for just the docId/uniqueKey
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-6888
>                 URL: https://issues.apache.org/jira/browse/SOLR-6888
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 5.0, Trunk
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>         Attachments: SOLR-6888-hacktiming.patch
>
>
> Assigning this to myself to just not lose track of it, but I won't be working 
> on this in the near term; anyone feeling ambitious should feel free to grab 
> it.
> Note, docId used here is whatever is defined for <uniqueKey>...
> Since Solr 4.1, the compression/decompression process is based on 16K blocks 
> and is automatic, and not configurable. So, to get a single stored value one 
> must decompress an entire 16K block. At least.
> For SolrCloud (and distributed processing in general), we make two trips, one 
> to get the doc id and score (or other sort criteria) and one to return the 
> actual data.
> The first pass here requires that we return the top N docIDs and sort 
> criteria, which means that each and every sub-request has to unpack at least 
> one 16K block (and sometimes more) to get just the doc ID. So if we have 20 
> shards and only want 20 rows, 95% of the decompression cycles will be wasted. 
> Not to mention all the disk reads.
> It seems like we should be able to do better than that. Can we argue that doc 
> ids are 'special' and should be cached somehow? Let's discuss what this would 
> look like. I can think of a couple of approaches:
> 1> Since doc IDs are "special", can we say that for this purpose returning 
> the indexed version is OK? We'd need to return the actual stored value when 
> the full doc was requested, but for the sub-request only what about returning 
> the indexed value instead of the stored one? On the surface I don't see a 
> problem here, but what do I know? Storing these as DocValues seems useful in 
> this case.
> 1a> A variant is treating numeric docIds specially since the indexed value 
> and the stored value should be identical. And DocValues here would be useful 
> it seems. But this seems an unnecessary specialization if <1> is implemented 
> well.
> 2> We could cache individual doc IDs, although I'm not sure what use that 
> really is. Would maintaining the cache overwhelm the savings of not 
> decompressing? I really don't like this idea, but am throwing it out there. 
> Doing this from stored data up front would essentially mean decompressing 
> every doc so that seems untenable to try up-front.
> 3> We could maintain an array[maxDoc] that held document IDs, perhaps lazily 
> initializing it. I'm not particularly a fan of this either, doesn't seem like 
> a Good Thing. I can see lazy loading being almost, but not quite totally, 
> useless, i.e. a hit ratio near 0, especially since it'd be thrown out on 
> every openSearcher.
> Really, the only one of these that seems viable is <1>/<1a>. The others would 
> all involve decompressing the docs anyway to get the ID, and I suspect that 
> caching would be of very limited usefulness. I guess <1>'s viability hinges 
> on whether, for internal use, the indexed form of DocId is interchangeable 
> with the stored value.
> Or are there other ways to approach this? Or isn't it something to really 
> worry about?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SOLR-6888) Find a way to avoid decompressing entire blocks for just the docId/uniqueKey

Reply via email to