File format for views is space and time inefficient - use a better one
----------------------------------------------------------------------
Key: COUCHDB-623
URL: https://issues.apache.org/jira/browse/COUCHDB-623
Project: CouchDB
Issue Type: Improvement
Components: Database Core
Affects Versions: 0.10
Reporter: Roger Binns
This was discussed on the dev mailing list over the last few days and is noted
here so it isn't forgotten.
The main database file format is optimised for data integrity - not losing or
mangling documents - and rightly so.
That same append-only format is also used for views, where it is a poor fit.
The more random the order in which data is supplied, the larger the btree. The
larger the keys (in bytes), the larger the btree. As an example, my 2GB of raw
JSON data turns into a 3.9GB CouchDB database but a 27GB view file (which
compacts down to 900MB). Since views are not replicated, every receiving
server has to build its own, which requires a disproportionate amount of disk
space (not to mention I/O load) on each one. The format also affects view
generation performance. By loading my documents into CouchDB sorted by the
value most often emitted in views, I was able to reduce load time from 75
minutes to 40 minutes, with a view file of 15GB instead of 27GB, but that is
still very far from the 900MB after compaction.
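The effect of insertion order on an append-only btree can be illustrated with
a toy model. This is only a sketch under stated assumptions, not CouchDB's
actual btree code: it assumes a fixed-shape tree, that keys are written in
batches, and that each batch appends a fresh copy of every node on the path
from a touched leaf to the root. The names appended_nodes and compare_orders,
and the batch size, fanout and level count, are all made up for illustration.

```python
import random


def appended_nodes(keys, batch_size=100, fanout=10, levels=4):
    """Count nodes appended to a copy-on-write tree (toy model).

    Assumption: key k lives in leaf k // fanout, that leaf's parent is
    (k // fanout) // fanout, and so on up to the root. Every batch
    appends one new copy of each distinct node it touches per level.
    """
    total = 0
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        touched = {k // fanout for k in batch}  # distinct leaves hit
        for _ in range(levels):
            total += len(touched)               # rewritten at this level
            touched = {n // fanout for n in touched}  # their parents
    return total


def compare_orders(n_keys=10_000, seed=0):
    """Return (nodes appended for sorted input, for shuffled input)."""
    sorted_keys = list(range(n_keys))
    shuffled = sorted_keys[:]
    random.Random(seed).shuffle(shuffled)
    return appended_nodes(sorted_keys), appended_nodes(shuffled)


if __name__ == "__main__":
    in_order, at_random = compare_orders()
    print("sorted input:  ", in_order, "nodes appended")
    print("shuffled input:", at_random, "nodes appended")
```

In this model, sorted input lets neighbouring keys in a batch share leaves and
interior nodes, so far fewer node copies are appended than with shuffled
input. The real numbers depend on CouchDB's actual node sizes and batching,
which this sketch does not attempt to reproduce.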
Views are a performance enhancement. They save you from having to visit every
document when doing some queries. The data within a view is derived from the
documents, so the only consequence of losing view data is a performance one:
the view can always be regenerated. Consequently the file format should be one
that is optimised for performance and size. The only integrity feature needed
is the ability to tell that the view is potentially corrupt (e.g. the power
failed while it was being generated/updated).
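One way such a "potentially corrupt" check could work is a dirty flag at the
start of the file. The following is a minimal sketch of that idea, not a
proposal for the actual format: the one-byte header, the function names
update_view and is_trustworthy, and the flat payload are all hypothetical.

```python
import os

DIRTY, CLEAN = b"D", b"C"  # hypothetical one-byte header values


def update_view(path, payload):
    """Rewrite the view payload, marking the file dirty while doing so."""
    mode = "r+b" if os.path.exists(path) else "w+b"
    with open(path, mode) as f:
        # 1. Mark the file dirty and force the marker to disk first.
        f.seek(0)
        f.write(DIRTY)
        f.flush()
        os.fsync(f.fileno())
        # 2. Write the new payload after the header byte.
        f.seek(1)
        f.write(payload)
        f.truncate()
        f.flush()
        os.fsync(f.fileno())
        # 3. Only now mark the file clean again.
        f.seek(0)
        f.write(CLEAN)
        f.flush()
        os.fsync(f.fileno())


def is_trustworthy(path):
    """Return True only if the last update ran to completion."""
    try:
        with open(path, "rb") as f:
            return f.read(1) == CLEAN
    except FileNotFoundError:
        return False
```

If the power fails anywhere between steps 1 and 3, the header still reads as
dirty on the next open, and the view can simply be thrown away and rebuilt.
Nothing in the payload itself needs to be append-only for this to work.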