[jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one

Roger Binns (JIRA) Wed, 13 Jan 2010 12:38:19 -0800

    [ https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799916#action_12799916 ]


Roger Binns commented on COUCHDB-623:
-------------------------------------

Not again Damien :-)

Simple criteria - the size of the view file should be proportionate to the data 
in a view on initial generation.  If you want raw numbers, the view file should 
be no larger than double the sum of JSON encoded key, value and _id for each 
row.
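
As a rough illustration of that criterion, the bound can be computed from the 
emitted rows.  This is only a sketch: the function names and the 
(_id, key, value) tuple layout are assumptions made for the example, not 
anything CouchDB provides.

```python
import json


def encoded_len(obj):
    """Bytes needed for the compact JSON encoding of obj."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


def view_size_budget(rows, factor=2):
    """Upper bound in bytes for a view file under the criterion above.

    rows is an iterable of (doc_id, key, value) tuples (hypothetical layout);
    factor=2 is the "no larger than double" multiplier.
    """
    total = sum(
        encoded_len(doc_id) + encoded_len(key) + encoded_len(value)
        for doc_id, key, value in rows
    )
    return factor * total


def within_budget(view_file_bytes, rows, factor=2):
    """True if a view file of the given size meets the criterion."""
    return view_file_bytes <= view_size_budget(rows, factor)


# Two sample rows, as a map function might emit them.
sample_rows = [("a1", "apple", 1), ("b2", ["x", 2], None)]
budget = view_size_budget(sample_rows)
```

Comparing the actual size of the view file on disk against that budget gives 
a simple pass/fail check.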

The current multiplier is 15 to 27 times as much, which is ludicrous.  Even 
post compaction the file is a little on the large side.  And because the view 
results are not replicated, the overhead has to be incurred on every machine 
that replication happens to.

Or put another way, if people are planning on deploying CouchDB how much space 
would you advise them to provision?  

When I started, the answer for 10 million documents/2.5GB of raw JSON was 72GB:

  23GB for DB, another 21GB for the compacted version, 27+GB for view file, 
another 1+GB for compacted view file

By shortening ids to 4 bytes instead of 16 we get:

  4GB for DB, another 4GB for compacted, 27GB for view file, another 1GB for 
compacted view file

By being able to sort my documents to be ordered by the most commonly emitted 
view key:
 
  4GB for DB, another 4GB for compacted, 15GB for view file, another 1GB for 
compacted view file

Since the original view/DB file coexists with its compacted copy while 
compaction runs, you need space for both simultaneously. 10 million 
documents/2GB of data is not something that makes any existing database 
system sweat.
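
The totals for the three scenarios above can be checked with a short sketch.  
The sizes are the rounded GB figures quoted in this comment ("27+" and "1+" 
are treated as 27 and 1), and the scenario labels are mine:

```python
def peak_disk_gb(db, db_compacted, view, view_compacted):
    """Space to provision, summing all four files as the comment does."""
    return db + db_compacted + view + view_compacted


# (DB, compacted DB, view file, compacted view file) in GB, from the comment.
scenarios = {
    "16-byte ids": (23, 21, 27, 1),
    "4-byte ids": (4, 4, 27, 1),
    "4-byte ids, documents sorted by view key": (4, 4, 15, 1),
}

totals = {name: peak_disk_gb(*sizes) for name, sizes in scenarios.items()}

for name, total in totals.items():
    print(f"{name}: {total}GB")
```

Even the best case works out to 24GB of provisioned disk for roughly 2GB of 
raw JSON.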

> File format for views is space and time inefficient - use a better one
> ----------------------------------------------------------------------
>
>                 Key: COUCHDB-623
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-623
>             Project: CouchDB
>          Issue Type: Improvement
>          Components: Database Core
>    Affects Versions: 0.10
>            Reporter: Roger Binns
>            Assignee: Damien Katz
>
> This was discussed on the dev mailing list over the last few days and noted 
> here so it isn't forgotten.
> The main database file format is optimised for data integrity - not losing or 
> mangling documents - and rightly so.
> That same append-only format is also used for views where it is a poor fit.  
> The more random the ordering of data supplied, the larger the btree.  The 
> larger the keys (in bytes) the larger the btree.  As an example my 2GB of raw 
> JSON data turns into a 3.9GB CouchDB database but a 27GB view file (before 
> compacting to 900MB).  Since views are not replicated, this requires a 
> disproportionate amount of disk space on each receiving server (not to 
> mention I/O load).  The format also affects view generation performance.  By 
> loading my documents into CouchDB in an order by the most emitted value in 
> views I was able to reduce load time from 75 minutes to 40 minutes with the 
> view file size being 15GB instead of 27GB, but still very distant from the 
> 900MB post compaction.
> Views are a performance enhancement.  They save you from having to visit 
> every document when doing some queries.  The data within a view is 
> generated and hence the only consequence of losing view data is a performance 
> one and the view can be regenerated anyway.  Consequently the file format 
> should be one that is optimised for performance and size.  The only integrity 
> feature needed is the ability to tell that the view is potentially corrupt 
> (eg the power failed while it was being generated/updated).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
