It seems like there might be several simple "internalizing" speedups available, even before tackling the view server protocol or the couchjs view server, hinted at by Alexander's suggestion:
On Fri, Aug 16, 2013 at 3:58 PM, Alexander Shorin <[email protected]> wrote:
> Idea: move document metadata into separate object.
...
> Case 2: Large docs. Profit in case when you have set right fields into
> metadata (like doc type, authorship, tags etc.) and filter first by
> this metadata - you have minimal memory footprint, you have less CPU
> load, rule "fast accept - fast reject" works perfectly.

For the simple case of filtering which fields are passed to the map fn, you don't need full-blown chained views; you only need a simple way to define field filters (describing which fields are the relevant "metadata" fields).

> Side effect: it's possible to autoindex metadata on fly on document
> update without asking user to write (meta/by_type, meta/by_author,
> meta/by_update_time etc. views). Sure, as much metadata you have as
> large base index will be. In 80% cases it will be no more than 4KB.

Similarly to how couch's internals already optimize away the case where multiple views in the same design doc share the same map function (but different reduce functions), we should also be able to optimize away the case where multiple views share the same fields filter.

> Resume: probably, I'd just described chained views feature with
> autoindexing by certain fields (:

One lesson I learned when I looked into implementing chained map/reduce views is that they need to live in different design docs from their parent views, in order to play nicely with BigCouch. Keeping them in the same design doc just doesn't work with parallel view builds (at least, not without breaking normal design doc considerations). So although I really like the simplicity of the "keep chained views in one design doc" approach, it's probably a dead end.
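To make the "fields" idea concrete, here's a minimal sketch in plain JavaScript (not actual CouchDB internals; `projectFields` is a hypothetical name) of stripping a doc down to its declared metadata fields before it ever crosses the view server boundary:

```javascript
// Hypothetical sketch: project a doc down to its declared "fields" so a
// large doc costs only its metadata in serialization and map-fn memory.
function projectFields(doc, fields) {
  // _id and _rev always come along, like in a normal map call.
  var slim = { _id: doc._id, _rev: doc._rev };
  fields.forEach(function (field) {
    if (field in doc) {
      slim[field] = doc[field];
    }
  });
  return slim;
}

// A doc with a huge body becomes a few bytes of metadata for the map fn:
projectFields(
  { _id: 'x', _rev: '1-a', type: 'post', author: 'nick', body: '...huge...' },
  ['type', 'author']
);
// -> { _id: 'x', _rev: '1-a', type: 'post', author: 'nick' }
```

In the real thing this projection would happen on the Erlang side, before the JSON is shipped to couchjs, which is where the memory and CPU savings come from.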
> Removing autoindexing feature and we could make views building process
> much more faster if we make right views chain which will use set
> algebra operations to calculate target doc ids to pass to final view:
> reduce docs before map results:
>
> {
>   "views": {
>     "posts": {"map": "...", "reduce": "..."},
>     "chain": [
>       ["by_type", {"key": "post"}],
>       ["hidden", {"key": false}],
>       ["by_domain", {"keys": ["public", "wiki"]}]
>     ]
>   }
> }

I was inspired by your view syntax and thought I'd put forward my own similar proposal:

    {
      "_id": "plain_old_views_for_comparison",
      "views": {
        "single_emit": {
          "map": "function(doc) { if (!doc.foo) { emit([doc.bar, doc.baz], doc.quux); } }",
          "reduce": "_count"
        },
        "multiple_emits": {
          "map": "function(doc) { if (!doc.foo) { emit([0, doc.bar], doc.quux); emit(['baz', doc.baz], doc.quux); } }",
          "reduce": "_count"
        }
      }
    }

    {
      "_id": "internalized",
      "options": {
        "filter": "!foo",
        "fields": ["bar", "baz", "quux"]
      },
      "views": {
        "single_emit_1": {
          "map": "function(doc) { emit([doc.bar, doc.baz], doc.quux); }",
          "reduce": "_count"
        },
        "single_emit_2": {
          "map": { "key": ["bar", "baz"], "value": "quux" },
          "reduce": "_count"
        },
        "multiple_emits": {
          "map": { "emits": [[[0, "bar"], "quux"], [["'baz'", "baz"], "quux"]] },
          "reduce": "_count"
        }
      }
    }

The above views should all behave the same way. The view options would support "filter" as a guard clause and "fields" to strip out all but the relevant metadata. These would be defined at the design document level, to simplify working with the current view server protocol. And a view's "map" could optionally be an object describing the emitted keys and values instead of a function string.

The filter string should be simple but powerful: I'd suggest supporting !, &&, ||, (), "foo.bar.baz", >, <, >=, <=, ==, !=, numbers, and strings (for "type == 'foo'"). But even if all it supported were "foo" and "!foo", it would still be useful. In some cases, this would prevent most docs from ever needing to be evaluated by the view server.
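For flavor, the simplest useful subset of that filter language ("foo", "!foo", and dotted paths) could be evaluated with something like the following sketch. These are hypothetical helper names, and the real thing would presumably live on the Erlang side; this is just to show how little machinery the minimal version needs:

```javascript
// Sketch of the minimal proposed filter forms: "foo", "!foo", and dotted
// paths like "type.name", evaluated as truthiness tests against a doc.
function getPath(doc, path) {
  // Walk "a.b.c" into nested objects; undefined if any step is missing.
  return path.split('.').reduce(
    function (obj, key) { return obj == null ? undefined : obj[key]; },
    doc
  );
}

function passesFilter(doc, filter) {
  // "!foo" negates the truthiness test; "foo" is a plain truthiness test.
  if (filter.charAt(0) === '!') {
    return !getPath(doc, filter.slice(1));
  }
  return Boolean(getPath(doc, filter));
}

// Docs that fail the guard clause never reach the map function at all:
passesFilter({ foo: true, bar: 1 }, '!foo');           // rejected early
passesFilter({ bar: 1, baz: 2 }, '!foo');              // passed to map
passesFilter({ type: { name: 'post' } }, 'type.name'); // passed to map
```

The richer syntax (&&, ||, comparisons) would need a tiny parser, but the "fast accept - fast reject" payoff is already there in this subset.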
The "fields" array might also support filtering nested fields, like "foo.bar.baz". The "filter" and the internal map ("key", "value", "emits") should support the same values that "fields" supports, plus numbers and strings; or they could support the same syntax as "filter", to allow things like "key": ["!!deleted_at", "deleted_at"]. The "filter" and internal map would be able to use all of a doc's fields, not just the ones defined in the options.

Another odd case where I've personally noticed indexing speed get immensely bogged down is when the reduce function merges the map objects together. I've seen views with this problem grow to 5GB during the initial build and compact back down to 20MB. I've documented this problem and my workaround here: https://gist.github.com/nevans/5512593. The hideous reduce pattern in that gist has resulted in 2-5x faster view builds for me (small DBs infinitesimally slower, huge speedup for big DBs). But it would be *much* better to simply add a "minimum_reduced_group_level" option to the view, and let the Erlang side handle it without unnecessary view server round trips and hideously complicated reduce functions. Any group_level below the minimum_reduced_group_level would simply return "null" for all of the values.

This isn't a trivial proposal, but it can be implemented completely independently of any view server protocol or couchjs changes. And even a simplified version could still yield a major speedup for some of the most common map patterns, just as "_sum" and "_count" speed up the most common reduce functions. Also, the individual pieces can be implemented independently: if I were to work on this myself (probably not going to happen in the next month or two), I'd do "minimum_reduced_group_level" first and "filter" second, since I think that's where *my* biggest bang for the buck would be. But other people's datasets (e.g. large docs) might get the biggest improvement from "fields".
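For anyone who hasn't hit the reduce-bloat problem, the pattern looks roughly like this (an illustrative sketch, not the code from the gist): a reduce that merges its values into one ever-growing object, which gets re-serialized into the btree at every rereduce and inflates the index until compaction.

```javascript
// Sketch of an object-merging reduce (standard CouchDB reduce signature).
// Each rereduce pass writes the ever-larger merged object back into the
// view btree's inner nodes -- which is what bloats the index file.
function mergingReduce(keys, values, rereduce) {
  var merged = {};
  values.forEach(function (value) {
    Object.keys(value).forEach(function (field) {
      merged[field] = (merged[field] || 0) + value[field];
    });
  });
  return merged;
}

mergingReduce(null, [{ a: 1, b: 2 }, { a: 3 }], false);
// -> { a: 4, b: 2 }
```

With a "minimum_reduced_group_level" option, CouchDB could store "null" instead of these merged objects at btree levels below the configured group_level, which is exactly the work the gist's workaround does by hand in JavaScript.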
And if you have lots of simple map functions, you might get the biggest speedup from the internal map "key"/"value"/"emits".

What do you think? Ugly and untenable? Or a shot in the right direction?

Also, I know that Jason already yielded on the O(N) argument, but I got here late and wanted to add my $0.02: obviously anything better than O(N) is impossible when you need to map N documents. Changing to O(N/Q) (where Q = the parallelism of view indexing, i.e. throw hardware at it) is still essentially O(N), but it's very useful and something that BigCouch does nicely. A 9x speedup might be the difference between a rollout taking 90 hours (barely finishes over the weekend) and 10 hours (you can do it overnight during the week). The longer the view rollout period, the slower and more cautious the development/deployment cycle becomes. More importantly, it might be the difference between loading a large user in 9 hours vs. 60 minutes, which will feel like a qualitative improvement to that user, and is especially important when that user is e.g. Walt Mossberg and load time is one of the two nitpicks in his review. Or when you have a hundred similarly jumbo-sized users sign up the next day.

Sorry for piling on after the argument is over. :)

-- 
Nick
