When should include_docs be used?

Tito Ciuro Sun, 15 Jun 2014 21:36:07 -0700

Hi,

I've been using CouchDB for a while and now I'm evaluating Couchbase. I'm
wondering what the best way is to decide when to emit data vs. null. I
typically avoid emitting the whole document if it's too "large" (e.g. 1 MB
or so) because the index would grow way too much. In this case, I tend to
emit null and then collect the documents via include_docs. However, if the
data set is small (or all I need is a subset of the document), then I emit
this subset, as it's faster and puts less strain on the storage system.
There is also the potential for a race condition. As per CouchDB's
documentation (http://wiki.apache.org/couchdb/HTTP%5Fview%5FAPI):


> The include_docs option will include the associated document. However, the
> user should keep in mind that there is a race condition when using this 
> option. It is possible that between reading the view data and fetching the 
> corresponding document that the document has changed. If you want to 
> alleviate such concerns you should emit an object with a _rev attribute as 
> in emit(key, {"_rev": doc._rev}). This alleviates the race condition but 
> leaves the possibility that the returned document has been deleted (in 
> which case, it includes the "_deleted": true attribute). Note: include_docs 
> will cause a single document lookup per returned view result row. This adds 
> significant strain on the storage system if you are under high load or 
> return a lot of rows per request. If you are concerned about this, you can 
> emit the full doc in each row; this will increase view index time and space 
> requirements, but will make view reads optimally fast.
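
To make the options in the quoted passage concrete, here is a minimal sketch of
the three emit styles as map functions. It rests on a few assumptions: `emit`
is passed in as a parameter only so the code can run outside a view server (in
CouchDB it is a global), the `title` field is a made-up example of a "subset",
and `runMap` and `isStale` are hypothetical helpers, not part of any CouchDB or
Couchbase API.

```javascript
// Variant 1: emit null; the document is fetched later via include_docs.
function mapNull(doc, emit) { emit(doc._id, null); }

// Variant 2: emit the revision seen at index time, as in the quoted docs.
function mapRev(doc, emit) { emit(doc._id, { _rev: doc._rev }); }

// Variant 3: emit only the subset that is needed (`title` is an assumed field).
function mapSubset(doc, emit) { emit(doc._id, { title: doc.title }); }

// Hypothetical harness: runs a map function over docs and collects the rows.
function runMap(mapFn, docs) {
  const rows = [];
  for (const doc of docs) {
    mapFn(doc, (key, value) => rows.push({ id: doc._id, key, value }));
  }
  return rows;
}

// Hypothetical client-side check: true when the fetched document's revision
// differs from the one emitted at index time. Returns false when no _rev was
// emitted, because there is nothing to compare against.
function isStale(row, fetchedDoc) {
  return row.value !== null && row.value._rev !== fetchedDoc._rev;
}
```

With variant 1 a client cannot tell whether the document changed between the
view read and the fetch; with variant 2 it can at least detect the mismatch.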


Since Couchbase utilizes memcache, storing and retrieving data is a whole
different game: while in general a CouchDB document should not be split into
separate, related documents (it's not an RDBMS!), doing so seems to be
perfectly fine in Couchbase. Because get/set/multiget are cheap operations,
it's perfectly feasible to "break" a document into smaller pieces and retrieve
them piecemeal. It seems this would be great for memcache because it would
allow caching just the pieces that are used the most. On the other hand,
keeping a document "monolithic" not only makes the index larger, but also
makes it less efficient to cache (it's an all-or-nothing proposition).

So it seems that a valid approach in Couchbase would be to:

1) break "large" documents into smaller, more manageable ones. Retrieve 
them via get/multiget (cheap op) and let memcache cache them as efficiently 
as possible.
2) emit small data subsets as needed, as opposed to the entire document 
where possible.
3) for those queries where the entire document needs to be retrieved... 
what then?:

    3.1) should we emit null and include_docs=true?
    3.2) should we emit the entire document instead?
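
As a rough illustration of point 1, the sketch below splits a document into a
small header plus one piece per section and reads back only the pieces that are
wanted. Everything here is an assumption for illustration: `store` is a plain
in-memory Map standing in for a key-value store (not a Couchbase client), the
`id::section` key scheme is made up, and `splitDocument`, `save`, `multiGet`
and `loadParts` are hypothetical names.

```javascript
// Hypothetical in-memory stand-in for a key-value store.
const store = new Map();

// Split a document into a small header and one piece per named section.
function splitDocument(doc, sectionNames) {
  const header = { ...doc, parts: [] };
  const pieces = [];
  for (const name of sectionNames) {
    if (!(name in doc)) continue;           // skip sections the doc lacks
    pieces.push({ key: `${doc.id}::${name}`, value: doc[name] });
    header.parts.push(name);                // remember which pieces exist
    delete header[name];                    // keep the header small
  }
  return { header, pieces };
}

// Store the header under the document id and each piece under its own key.
function save(doc, sectionNames) {
  const { header, pieces } = splitDocument(doc, sectionNames);
  store.set(doc.id, header);
  for (const piece of pieces) store.set(piece.key, piece.value);
}

// Stand-in for a multiget: one call, many keys.
function multiGet(keys) {
  return keys.map((key) => store.get(key));
}

// Fetch only the requested sections of a document.
function loadParts(id, wanted) {
  const values = multiGet(wanted.map((name) => `${id}::${name}`));
  const result = {};
  wanted.forEach((name, i) => { result[name] = values[i]; });
  return result;
}
```

For example, after `save({ id: "u1", name: "Tito", history: [1, 2] }, ["history"])`,
a call to `loadParts("u1", ["history"])` touches only the `u1::history` key and
leaves the header alone.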

It's clear that always emitting null in CouchDB puts a lot of pressure on 
the storage system. But what about Couchbase? Are there any best practices 
to be followed?

Thanks,

-- Tito
