Thanks for your reply, Jan. I do remember the discussion on the mailing list, but at the time I didn't understand the argument, maybe because I really didn't have time to dive into the matter back then. It has seriously puzzled me since, though. Then this post appeared and I jumped at the chance to get this cleared up (sorry for being slow, which makes me the opposite of arrogant, I guess :D).
But, I don't have a solution. I guess you are right in that sense. I just fail to see how making new docs makes life easier. I believe it makes the single-node case worse, and probably equally difficult (or worse) for a distributed multi-node architecture.

Reading from what you say, there is "evil" lurking in the replication process no matter which way we handle this. For multiple nodes, replication would probably be slower than the round-trip needed to inform two users changing the same doc on two different nodes. This would result in multiple versions of the same doc being around, at least until replication, when CouchDB would find out that two competing versions exist. I might be wrong about this, but the users can't be left waiting for an "ok-saved" reply from CouchDB forever, right? So CouchDB would have to decide which version "wins" during replication, right?

Considering the effects you are hinting at, I'd personally want a single CouchDB node for writes, with extra nodes for reading and serving views. Maybe additional write nodes for different doc types (one write node per doc type), just to "ensure" that two or more docs cannot be updated on two or more nodes simultaneously. That is, in the beginning I'd really rather go for a single node with a replicated backup/failover. As (and if) system stress increases, I'd opt for splitting writes and reads across nodes and/or creating write nodes designated for different doc types.

This is still not perfect, but distributed never will be, really. Unless... if the CouchDB data were stored on a distributed file system (NAS or SAN), each copy of the CouchDB process would be operating on the same disk. This doesn't mean more data reliability, and it also imposes delays on reads and writes. But it would mean that CouchDB would be scalable (multiple "virtual" nodes working on the same physical disk). Other "physical" nodes could be created that would replicate as CouchDB is already set up to do.
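For what it's worth, CouchDB does decide this deterministically: after replication both versions are kept, every node picks the same "winner" (roughly, the revision with the longer edit history, with ties broken by comparing the revision strings), and the loser remains retrievable as a conflict. A rough sketch of that winner-picking rule, assuming revision ids of the form "N-hash" — this is my own simplification for illustration, not CouchDB's actual internal code:

```javascript
// Pick a deterministic winner between two conflicting revision ids of
// the form "N-hash" (e.g. "3-a1b2c3"). The higher edit count wins; ties
// are broken by plain string comparison, so every node independently
// agrees on the same winner without any coordination. This mimics the
// observable behaviour but simplifies the real revision-tree algorithm.
function pickWinner(revA, revB) {
  var numA = parseInt(revA.split("-")[0], 10);
  var numB = parseInt(revB.split("-")[0], 10);
  if (numA !== numB) return numA > numB ? revA : revB;
  return revA > revB ? revA : revB;
}
```

The point is that "who wins" never depends on which node you ask, so no single designated write node is strictly required for correctness — only for avoiding conflicts in the first place.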
So, allowing "virtual" nodes could work out as a nice addition, I think. But then again, my knowledge of distributed file systems (NAS or SAN) is really limited... And I might have missed out on a lot more than that, so all this might of course just be stupid :) Thanks for reading.

~Ronny

2008/9/14 Jan Lehnardt <[EMAIL PROTECTED]>

> Hi Ronny,
>
> On Sep 14, 2008, at 11:45, Ronny Hanssen wrote:
>
>> Or have I seriously missed out on some vital information? Because, based
>> on the above I still feel very confused about why we cannot use the
>> built-in rev-control mechanism.
>
> You correctly identify that adding revision control to a single node
> instance of CouchDB is not that hard (a quick search through the archives
> would have told you, too :-) Making all that work in a distributed
> environment with replication conflict detection and all is mighty hard.
> If you can come up with a nice and clean solution to make proper revision
> control work with CouchDB's replication, including all the weird edge
> cases I don't even know about (aren't I arrogant this morning? :), we are
> happy to hear about it.
>
> Cheers
> Jan
>
>> ~Ronny
>>
>> 2008/9/14 Jeremy Wall <[EMAIL PROTECTED]>
>>
>>> Two reasons.
>>> * First, as I understand it, the revisions are not changes between
>>> documents. They are actual full copies of the document.
>>> * Second, revisions get blown away when doing a database compact,
>>> something you will more than likely want to do since it eats up
>>> database space fairly quickly. (See above for the reason why.)
>>>
>>> That said, there is nothing preventing you from storing revisions in
>>> CouchDB. You could store a changeset for each document revision in a
>>> separate revision document that accompanies your main document. It
>>> would be really easy, and designing views to take advantage of them to
>>> show a revision history for your document would be really easy.
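Jeremy's suggestion above — a separate changeset document accompanying the main document, plus a view over them — might be sketched like this. The `type`, `doc_id`, `seq`, and `changes` field names are my own invention for illustration, not anything prescribed by CouchDB:

```javascript
// A hypothetical changeset document accompanying the main doc "bond-123".
// Only the fields that changed are stored, not a full copy of the doc.
var revisionDoc = {
  _id: "bond-123:rev:2",
  type: "revision",
  doc_id: "bond-123",      // the document this changeset belongs to
  seq: 2,                  // ordering of the changesets
  changes: { price: 101.25 }
};

// A CouchDB view map function that indexes changesets by document id and
// sequence number, so one document's revision history becomes a single
// range query over the view.
function map(doc) {
  if (doc.type === "revision") {
    emit([doc.doc_id, doc.seq], doc.changes);
  }
}
```

Querying the view with `startkey=["bond-123"]` and `endkey=["bond-123", {}]` would then return that document's changesets in order.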
>>> I suppose you could use the revisions that CouchDB stores, but that
>>> wouldn't be very efficient since each one is a complete copy of the
>>> document. And you couldn't depend on that "feature" not changing
>>> behaviour on you in later versions, since it's not intended for
>>> revision history as a feature.
>>>
>>> On Sat, Sep 13, 2008 at 7:24 PM, Ronny Hanssen <[EMAIL PROTECTED]> wrote:
>>>
>>>> Why is the revision control system in CouchDB inadequate for, well,
>>>> revision control? I thought that this feature indeed was a feature,
>>>> not just an internal mechanism for resolving conflicts?
>>>>
>>>> Ronny
>>>>
>>>> 2008/9/14 Calum Miller <[EMAIL PROTECTED]>
>>>>
>>>>> Hi Chris,
>>>>>
>>>>> Many thanks for your prompt response.
>>>>>
>>>>> Storing a complete new version of each bond/instrument every day
>>>>> seems a tad excessive. You can imagine how fast the database will
>>>>> grow over time if a unique version of each instrument must be saved,
>>>>> rather than just the individual changes. This must be a common
>>>>> pattern, not confined to investment banking. Any ideas how this
>>>>> pattern can be accommodated within CouchDB?
>>>>>
>>>>> Calum Miller
>>>>>
>>>>> Chris Anderson wrote:
>>>>>
>>>>>> Calum,
>>>>>>
>>>>>> CouchDB should easily be able to handle this load.
>>>>>>
>>>>>> Please note that the built-in revision system is not designed for
>>>>>> document history. Its sole purpose is to manage conflicting
>>>>>> documents that result from edits done in separate copies of the DB,
>>>>>> which are subsequently replicated into a single DB.
>>>>>>
>>>>>> If you allow CouchDB to create a new document for each daily import
>>>>>> of each security, and create a view which makes these documents
>>>>>> available by security and date, you should be able to access
>>>>>> securities history fairly simply.
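Chris's scheme — one new document per security per day, plus a view keyed by security and date — might look something like this. The `isin`, `date`, and `type` fields are assumptions on my part, purely for illustration:

```javascript
// Each daily import creates a fresh document per security, e.g. one of
// the ~500,000 bonds loaded on 2008-09-13.
var dailyDoc = {
  _id: "XS0123456789:2008-09-13",
  type: "security",
  isin: "XS0123456789",
  date: "2008-09-13",
  price: 99.87
  // ... plus the remaining ~100 fields per bond
};

// View map function: key by [security, date] so the full history of one
// security is a contiguous key range; query with descending=true to get
// the newest snapshot first.
function map(doc) {
  if (doc.type === "security") {
    emit([doc.isin, doc.date], null);
  }
}
```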
>>>>>> Chris
>>>>>>
>>>>>> On Sat, Sep 13, 2008 at 12:31 PM, Calum Miller <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm trying to evaluate CouchDB for use within investment banking;
>>>>>>> yes, some of these banks still exist. I want to load 500,000 bonds
>>>>>>> into the database, with each bond containing around 100 fields. I
>>>>>>> would be looking to bulk load a similar amount of these bonds every
>>>>>>> day whilst maintaining a history via the revision feature. Are
>>>>>>> there any bulk load features available for CouchDB, and any tips on
>>>>>>> how to manage regular loads of this volume?
>>>>>>>
>>>>>>> Many thanks in advance and best of luck with this project.
>>>>>>>
>>>>>>> Calum Miller
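On the bulk-load question in Calum's original mail: CouchDB does ship a bulk insert endpoint, `POST /db/_bulk_docs`, which takes a JSON body of the form `{"docs": [...]}`. A minimal payload builder might look like this (the batch size and helper name are illustrative choices of mine, not tuned recommendations):

```javascript
// Split a large array of documents into _bulk_docs payloads of a
// manageable size. Each returned payload is what you would POST as the
// JSON body to http://localhost:5984/<db>/_bulk_docs
function makeBulkPayloads(docs, batchSize) {
  var payloads = [];
  for (var i = 0; i < docs.length; i += batchSize) {
    payloads.push({ docs: docs.slice(i, i + batchSize) });
  }
  return payloads;
}
```

For 500,000 bonds, batches of a few thousand documents per request are a reasonable starting point to experiment with; each request returns the assigned `_id`/`_rev` pairs for the inserted documents.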
