Re: When should include_docs be used?

Volker Mische Mon, 16 Jun 2014 00:53:18 -0700

Hi Tito,

On 06/16/2014 06:35 AM, Tito Ciuro wrote:

Hi,


I've been using CouchDB for a while and now I'm evaluating Couchbase.
I'm wondering what's the best way to determine when to emit data vs
null. I typically avoid emitting the whole document if it's too "large"
(i.e. 1 MB or so) because the index would grow way too much. In this
case, I tend to emit null and then collect the documents via
include_docs. However, if the data set is small (or all I need is a
subset of the document), then I emit this subset, as it's faster and puts
less strain on the storage system. There is also the potential for a
race condition. As per CouchDB's documentation
(http://wiki.apache.org/couchdb/HTTP%5Fview%5FAPI)

    The include_docs option will include the associated document.
    However, the user should keep in mind that there is a race condition
    when using this option. It is possible that between reading the view
    data and fetching the corresponding document that the document has
    changed. If you want to alleviate such concerns you should emit an
    object with a _rev attribute as in emit(key, {"_rev": doc._rev}).
    This alleviates the race condition but leaves the possibility that
    the returned document has been deleted (in which case, it includes
    the "_deleted": true attribute). Note: include_docs will cause a
    single document lookup per returned view result row. This adds
    significant strain on the storage system if you are under high load
    or return a lot of rows per request. If you are concerned about
    this, you can emit the full doc in each row; this will increase view
    index time and space requirements, but will make view reads
    optimally fast.

The Couchbase implementation of include_docs is different. If you use
an SDK, it queries the view to get all the IDs and then fetches the
full docs via a memcache GET. In the upcoming version of Couchbase (3.0),
the original include_docs of the views will go away completely, and it
will only be supported through the SDKs (don't worry, the API won't
change when you use the SDKs).
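
To make that two-step flow concrete, here is a minimal sketch in plain
JavaScript. It does not use any Couchbase SDK: viewRows and kvStore are
in-memory stand-ins for a view result and the key-value store, and
fetchWithDocs is a hypothetical helper name, not an SDK function.

```javascript
// Stand-in for the key-value (memcache) layer: document ID -> document.
const kvStore = new Map([
  ["user::1", { name: "Ann", city: "Berlin" }],
  ["user::2", { name: "Bob", city: "Madrid" }],
]);

// Stand-in for a view result whose map function emitted null values.
const viewRows = [
  { id: "user::1", key: "Berlin", value: null },
  { id: "user::2", key: "Madrid", value: null },
  { id: "user::3", key: "Oslo", value: null }, // removed after indexing
];

// Hypothetical helper: take the IDs from the view rows, then look each
// ID up in the key-value store and attach the result as row.doc.
function fetchWithDocs(rows, store) {
  return rows.map((row) => ({
    ...row,
    // The document may be gone by the time the lookup happens.
    doc: store.has(row.id) ? store.get(row.id) : null,
  }));
}

const results = fetchWithDocs(viewRows, kvStore);
console.log(results.map((r) => [r.id, r.doc && r.doc.name]));
```

The third row is there to show that a client should still expect the
occasional row whose document can no longer be fetched.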

Since Couchbase utilizes memcache, storing and retrieving data is a
whole different game: while in general a CouchDB document should not be
split and related into other documents (it's not an RDBMS!), it seems to
be perfectly fine in Couchbase. Because get/set/multiget are cheap
operations, it's perfectly feasible to "break" a document into smaller
pieces and retrieve them piecemeal. It seems this would be great for
memcache because it'd allow caching of the documents that are used the
most. On the other hand, keeping a document "monolithic" not only makes
the index larger, but it makes it less efficient to cache (it's an all
or nothing proposition.)
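
The idea of splitting a document can be sketched roughly as follows. This
is only an illustration under assumptions: store is an in-memory Map
standing in for the key-value store, and the "<id>::<part>" key scheme,
splitAndStore and multiGet are hypothetical names, not Couchbase APIs.

```javascript
// In-memory stand-in for the key-value store.
const store = new Map();

// Hypothetical key scheme: one entry per top-level field, "<id>::<part>".
function splitAndStore(id, doc) {
  const keys = [];
  for (const [part, value] of Object.entries(doc)) {
    const key = `${id}::${part}`;
    store.set(key, value);
    keys.push(key);
  }
  return keys;
}

// Stand-in for a multiget: fetch only the requested parts.
function multiGet(keys) {
  return keys.map((key) => store.get(key));
}

const partKeys = splitAndStore("profile::42", {
  summary: { name: "Ann" },
  history: ["login", "logout"],
  avatar: "<large binary blob>",
});

// A reader that only needs the summary never touches the large parts.
const [summary] = multiGet(["profile::42::summary"]);
console.log(partKeys, summary.name);
```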

So it seems that a valid approach in Couchbase would be to:

1) break "large" documents into smaller, more manageable ones. Retrieve
them via get/multiget (cheap op) and let memcache cache them as
efficiently as possible.
2) emit small data subsets as needed, as opposed to the entire document
where possible.
3) for those queries where the entire document needs to be retrieved...
what then?:

     3.1) should we emit null and include_docs=true?
     3.2) should we emit the entire document instead?


You would emit null and let the SDK do the rest.
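
As a rough sketch of cases 2) and 3), the two map functions below use the
(doc, meta) form of Couchbase view map functions. The emit() defined here
is only a local stub so the snippet runs on its own, and the document
fields (name, city, bio) are made up for the example.

```javascript
// Local stub that collects emitted rows; stands in for the view engine.
const rows = [];
function emit(key, value) {
  rows.push({ key, value });
}

// Case 3: the whole document is needed, so emit null and fetch the
// document by its ID afterwards.
function mapWholeDoc(doc, meta) {
  emit(meta.id, null);
}

// Case 2: only a small subset is needed, so emit just that subset.
function mapSubset(doc, meta) {
  emit(doc.city, { name: doc.name });
}

const doc = { name: "Ann", city: "Berlin", bio: "a long biography" };
const meta = { id: "user::1" };
mapWholeDoc(doc, meta);
mapSubset(doc, meta);
console.log(rows);
```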

It's clear that always emitting null in CouchDB puts a lot of pressure
on the storage system. But what about Couchbase? Are there any best
practices to be followed?


Do you mean "emitting the full document ...."?

Cheers,
  Volker
