Re: Erlang vs JavaScript

2013-08-19 Thread nicholas a. evans
It seems like there might be several simple internal speedups available,
even before tackling the view server protocol or the couchjs view
server, hinted at by Alexander's suggestion:

On Fri, Aug 16, 2013 at 3:58 PM, Alexander Shorin kxe...@gmail.com wrote:
 Idea: move document metadata into separate object.
...
 Case 2: Large docs. Profit when you have put the right fields into
 metadata (doc type, authorship, tags, etc.) and filter first by
 this metadata - you get a minimal memory footprint and less CPU
 load; the "fast accept - fast reject" rule works perfectly.

For the simple case of filtering which fields are passed to the map
fn, you don't need full-blown chained views; you only need a simple
way to define field filters (describing which fields are the relevant
metadata fields).

 Side effect: it's possible to autoindex metadata on the fly on document
 update without asking the user to write meta/by_type, meta/by_author,
 meta/by_update_time etc. views. Sure, the more metadata you have, the
 larger the base index will be. In 80% of cases it will be no more than 4KB.

Similarly to how couch's internals already optimize away the case
where multiple views in the same design doc share the same map
function (but different reduce functions), we should also be able to
optimize away the case where multiple views share the same fields
filter.

 Resume: probably, I've just described a chained views feature with
 autoindexing by certain fields (:

One lesson I learned when I looked into implementing chained
map/reduce views is that they will need to live in different design docs
from the parent views, in order to play nicely with BigCouch.  Keeping
them in the same design doc just doesn't work with parallel view
builds (at least, not without breaking normal design doc
assumptions).  So although I really like the simplicity of the
keep-chained-views-in-one-design-doc approach, it's probably a
dead-end.

 Remove the autoindexing feature, and we could make the view build
 process much faster if we set up the right view chain, using set
 algebra operations to calculate the target doc ids to pass to the
 final view - reduce the docs before the map step:

 {
   "views": {
     "posts": {"map": "...", "reduce": "..."},
     "chain": [
       ["by_type", {"key": "post"}],
       ["hidden", {"key": false}],
       ["by_domain", {"keys": ["public", "wiki"]}]
     ]
   }
 }
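For intuition, the set-algebra idea in the quoted chain can be sketched outside CouchDB. This is a hypothetical model with hard-coded index contents, not a real CouchDB API:

```javascript
// Hypothetical model of the proposed chain: each stage is an index
// lookup yielding a set of doc ids, and stages are intersected so the
// final "posts" map function only ever sees the surviving documents.

const byTypePost   = new Set(['a', 'b', 'c', 'd']); // by_type,   key "post"
const notHidden    = new Set(['a', 'b', 'c']);      // hidden,    key false
const domainPublic = new Set(['a']);                // by_domain, key "public"
const domainWiki   = new Set(['b']);                // by_domain, key "wiki"

function intersect(a, b) {
  return new Set([...a].filter((id) => b.has(id)));
}

const stage1 = byTypePost;
const stage2 = intersect(stage1, notHidden);
const stage3 = intersect(stage2, new Set([...domainPublic, ...domainWiki]));

console.log([...stage3].sort()); // doc ids handed to the final map fn
```

In CouchDB each of these sets would come from a btree range read on an already-built index, which is why the final map sees far fewer documents.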

I was inspired by your view syntax and thought I'd put forward my own
similar proposal:

{
  _id: "plain_old_views_for_comparison",
  views: {
    single_emit: {
      map: "function(doc) { if (!doc.foo) { emit([doc.bar, doc.baz], doc.quux); } }",
      reduce: "_count"
    },
    multiple_emits: {
      map: "function(doc) { if (!doc.foo) { emit([0, doc.bar], doc.quux); emit(['baz', doc.baz], doc.quux); } }",
      reduce: "_count"
    }
  }
}

{
  _id: "internalized",
  options: {
    filter: "!foo",
    fields: ["bar", "baz", "quux"]
  },
  views: {
    single_emit_1: {
      map: "function(doc) { emit([doc.bar, doc.baz], doc.quux); }",
      reduce: "_count"
    },
    single_emit_2: {
      map: { key: ["bar", "baz"], value: "quux" },
      reduce: "_count"
    },
    multiple_emits: {
      map: { emits: [[[0, "bar"], "quux"], [["'baz'", "baz"], "quux"]] },
      reduce: "_count"
    }
  }
}

The above views should behave the same way.  The view options
would support "filter" as a guard clause and "fields" to strip out all
but the relevant metadata.  These should be defined at the design
document level, to simplify working with the current view server
protocol.  And a view's "map" could optionally be an object describing
the emit values instead of a function string.

The filter string should be simple but powerful: I'd suggest
supporting !, &&, ||, (), foo.bar.baz, <, >, <=, >=, ==, !=,
numbers, and strings (for type == 'foo'). But even if all it
supported was "foo" and "!foo", it would still be useful.  In some
cases, this will prevent most docs from ever needing to be evaluated
by the view server.  The "fields" array might also support filtering
nested fields with foo.bar.baz.  The "filter" and internal map
(key, value, emits) should support the same values that "fields"
supports, plus numbers and strings; or they could support the same
syntax as "filter" to do things like key: ["!!deleted_at",
"deleted_at"].  The filter and internal map would be able to use all
of the fields, not just the ones defined in the options.
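As a sketch of how small that guard language could start, here is a hypothetical compiler for just the "foo" / "!foo" / dotted-path subset (all names invented; the full !, &&, ||, comparison grammar would need a real parser):

```javascript
// Minimal sketch of the proposed guard clause, supporting only the
// "foo", "!foo", and dotted-path subset. Names are illustrative, not
// a real CouchDB API.

function lookup(doc, path) {
  // Walk a dotted path like "foo.bar.baz", tolerating missing fields.
  return path.split('.').reduce((obj, key) => (obj == null ? obj : obj[key]), doc);
}

function compileFilter(expr) {
  const negated = expr.startsWith('!');
  const path = negated ? expr.slice(1) : expr;
  return (doc) => {
    const value = lookup(doc, path);
    return negated ? !value : !!value;
  };
}

const guard = compileFilter('!foo');
console.log(guard({ bar: 1 }));         // true  -> doc reaches the map fn
console.log(guard({ foo: 1, bar: 1 })); // false -> fast reject
```

A compiled guard like this could run before any document is handed to the view server, which is the point of the proposal.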

Another odd case where I've personally seen indexing speed get
immensely bogged down is when the reduce function merges the map
objects together.  I've seen views with this problem grow to 5GB
during the initial build and compact back down to 20MB.  I've documented
this problem and my workaround here:
https://gist.github.com/nevans/5512593.  The hideous reduce pattern
in that gist has resulted in 2-5x faster view builds for me (small DBs
infinitesimally slower, huge speedup for big DBs).  But it would be
*much* better to simply add a minimum_reduced_group_level option to
the view, and let Erlang handle that without doing unnecessary
view
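For illustration, the reduce-merging problem contrasts like this (an illustrative sketch of the anti-pattern, not the code from the gist):

```javascript
// A reduce that merges map values into one object produces intermediate
// reductions whose size grows with the number of rows under each btree
// node, which is what bloats the view file during the initial build.
// A _count-style reduce stays constant-size.

function mergingReduce(keys, values, rereduce) {
  // Anti-pattern: reduction size is proportional to rows reduced so far.
  return Object.assign({}, ...values);
}

function countReduce(keys, values, rereduce) {
  // Constant-size reduction: always a single number.
  return rereduce ? values.reduce((a, b) => a + b, 0) : values.length;
}

// Simulate reducing 1000 map rows, each a distinct one-key object.
const rows = Array.from({ length: 1000 }, (_, i) => ({ ['k' + i]: i }));

const merged = mergingReduce(null, rows, false);
console.log(Object.keys(merged).length); // 1000 keys stored in the node

const counted = countReduce(null, rows, false);
console.log(counted); // also 1000, but stored as one small number
```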

Re: Erlang vs JavaScript

2013-08-18 Thread Benoit Chesneau
On Fri, Aug 16, 2013 at 9:58 PM, Alexander Shorin kxe...@gmail.com wrote:

 On Fri, Aug 16, 2013 at 11:23 PM, Jason Smith j...@apache.org wrote:
  On Fri, Aug 16, 2013 at 4:49 PM, Volker Mische volker.mis...@gmail.com
  wrote:
 
  On 08/16/2013 11:32 AM, Alexander Shorin wrote:
   On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau bchesn...@gmail.com
   wrote:
   I agree, (modulo the fact that I would replace a string by a binary ;)
   but that would be only possible if we extract the metadata (_id, _rev)
   from the JSON so couchdb wouldn't have to decode the JSON to get them.
   Streaming json would also allow that but since there is no guarantee
   in the properties order of a JSON it would be less efficient.

   What if we split document metadata from document itself?
 
 
  I would like to hear a goal for this effort? What is the definition of
  success and failure?

 Idea: move document metadata into separate object.


How do you link the metadata to the separate object there? Do you let the
application set the internal links?

I'm +1 on such an idea anyway.



 Motivation:

 Case 1: Small docs. No profit at all. Moreover, it's probably better
 not to split things here, e.g. pass the full doc if its size is under
 some number of megabytes.
 Case 2: Large docs. Profit when you have put the right fields into
 metadata (doc type, authorship, tags, etc.) and filter first by
 this metadata - you get a minimal memory footprint and less CPU
 load; the "fast accept - fast reject" rule works perfectly.

 Side effect: it's possible to first filter by metadata and keep only
 the doc ids that still need processing. And if we know what and how
 many docs to process, we may make assumptions about parallel indexation.

 Side effect: it's possible to autoindex metadata on the fly on document
 update without asking the user to write meta/by_type, meta/by_author,
 meta/by_update_time etc. views. Sure, the more metadata you have, the
 larger the base index will be. In 80% of cases it will be no more than 4KB.

 Resume: probably, I've just described a chained views feature with
 autoindexing by certain fields (:
 Remove the autoindexing feature, and we could make the view build
 process much faster if we set up the right view chain, using set
 algebra operations to calculate the target doc ids to pass to the
 final view - reduce the docs before the map step:

 {
   "views": {
     "posts": {"map": "...", "reduce": "..."},
     "chain": [
       ["by_type", {"key": "post"}],
       ["hidden", {"key": false}],
       ["by_domain", {"keys": ["public", "wiki"]}]
     ]
   }
 }

 In the case of a 10 000 docs db with 1200 posts, where 200 are hidden
 and 400 are private, the resulting posts view has to process only 600
 docs instead of 10 000, and it's an index lookup operation to find the
 result docs to pass. Sure, calling such a view triggers all views in
 the chain. And I'm not thinking about cross dependencies and loops for now.

 --
 ,,,^..^,,,



Re: Erlang vs JavaScript

2013-08-18 Thread Alexander Shorin
On Sun, Aug 18, 2013 at 10:22 AM, Benoit Chesneau bchesn...@gmail.com wrote:
 On Fri, Aug 16, 2013 at 9:58 PM, Alexander Shorin kxe...@gmail.com wrote:

 On Fri, Aug 16, 2013 at 11:23 PM, Jason Smith j...@apache.org wrote:
  On Fri, Aug 16, 2013 at 4:49 PM, Volker Mische volker.mis...@gmail.com
  wrote:
 
  On 08/16/2013 11:32 AM, Alexander Shorin wrote:
   On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau bchesn...@gmail.com
   wrote:
   I agree, (modulo the fact that I would replace a string by a binary ;)
   but that would be only possible if we extract the metadata (_id, _rev)
   from the JSON so couchdb wouldn't have to decode the JSON to get them.
   Streaming json would also allow that but since there is no guarantee
   in the properties order of a JSON it would be less efficient.

   What if we split document metadata from document itself?
 
 
  I would like to hear a goal for this effort? What is the definition of
  success and failure?

 Idea: move document metadata into separate object.


 How do you link the metadata to the separate object there? Do you let the
 application set the internal links?

 I'm +1 with such idea anyway.

Mmm...how I imagine it (Disclaimer: I'm sure I'm wrong in details there!):

Btree:

        +
       / \
  --+-----+--
   / \   / \
  *   * *   *

At each node we have the doc object {...} for a specific revision. Instead
of this, we'll have a tuple ({...}, {...}) - the first element is the meta,
the second is the data.
So I think there would be no need for internal links, since the meta and
data would live within the same Btree node.
For regular doc requests they would be merged (do we still need the `_`
prefix to avoid collisions?) and returned as a single {...}, as always.
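A minimal sketch of that merge step, assuming the existing `_` prefix convention (_id, _rev) is kept to avoid collisions with user fields:

```javascript
// Sketch of merging the proposed (meta, data) tuple back into the
// single object a regular GET returns, prefixing meta fields with "_"
// exactly as the current _id/_rev convention does.
function mergeDoc(meta, data) {
  const merged = {};
  for (const [key, value] of Object.entries(meta)) {
    merged['_' + key] = value; // meta fields become _id, _rev, ...
  }
  return Object.assign(merged, data); // user fields layered on top
}

const doc = mergeDoc({ id: 'x', rev: '1-abc' }, { title: 'Hi' });
console.log(doc); // { _id: 'x', _rev: '1-abc', title: 'Hi' }
```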

--
,,,^..^,,,


Re: Erlang vs JavaScript

2013-08-18 Thread Volker Mische
On 08/18/2013 08:42 AM, Alexander Shorin wrote:
 On Sun, Aug 18, 2013 at 10:22 AM, Benoit Chesneau bchesn...@gmail.com wrote:
 On Fri, Aug 16, 2013 at 9:58 PM, Alexander Shorin kxe...@gmail.com wrote:

 On Fri, Aug 16, 2013 at 11:23 PM, Jason Smith j...@apache.org wrote:
 On Fri, Aug 16, 2013 at 4:49 PM, Volker Mische volker.mis...@gmail.com
 wrote:

 On 08/16/2013 11:32 AM, Alexander Shorin wrote:
  On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau bchesn...@gmail.com
  wrote:
  I agree, (modulo the fact that I would replace a string by a binary ;)
  but that would be only possible if we extract the metadata (_id, _rev)
  from the JSON so couchdb wouldn't have to decode the JSON to get them.
  Streaming json would also allow that but since there is no guarantee
  in the properties order of a JSON it would be less efficient.

  What if we split document metadata from document itself?


 I would like to hear a goal for this effort? What is the definition of
 success and failure?

 Idea: move document metadata into separate object.


 How do you link the metadata to the separate object there? Do you let the
 application set the internal links?

 I'm +1 with such idea anyway.
 
 Mmm...how I imagine it (Disclaimer: I'm sure I'm wrong in details there!):
 
 Btree:
 
         +
        / \
   --+-----+--
    / \   / \
   *   * *   *
 
 At the node we have doc object {...} for specific revision. Instead of
 this, we'll have a tuple ({...}, {...}) - first is a meta, second is a
 data.
 So I think there wouldn't be needed internal links since meta and data
 would live within same Btree node.
 For regular doc requesting, they will be merged (still need for `_`
 prefix to avoid collisions?) and returned as single {...} as always.

We could also return them as separate objects, so the view function
becomes: function(doc, meta) {}.

Couchbase does that, and from my experience it works well and feels right.
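A hedged sketch of what that signature could look like (the meta field names here are invented for illustration; `emit` is stubbed so the example is self-contained):

```javascript
// Hypothetical function(doc, meta) map, modeled on Couchbase's split:
// filter on the small metadata object first, and touch the (potentially
// large) document body only after the doc is accepted.
const rows = [];
function emit(key, value) { rows.push([key, value]); } // stand-in for couchjs emit

function map(doc, meta) {
  if (meta.type !== 'post') return; // fast reject using metadata alone
  emit([meta.author, meta.id], doc.title);
}

map({ title: 'Hello' }, { id: 'doc1', type: 'post', author: 'jan' });
map({ title: 'Draft' }, { id: 'doc2', type: 'note', author: 'jan' });
console.log(rows); // only the 'post' document produced a row
```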

Cheers,
  Volker



Re: Erlang vs JavaScript

2013-08-18 Thread Alexander Shorin
On Sun, Aug 18, 2013 at 3:54 PM, Volker Mische volker.mis...@gmail.com wrote:
 On 08/18/2013 08:42 AM, Alexander Shorin wrote:
 On Sun, Aug 18, 2013 at 10:22 AM, Benoit Chesneau bchesn...@gmail.com 
 wrote:
 On Fri, Aug 16, 2013 at 9:58 PM, Alexander Shorin kxe...@gmail.com wrote:

 On Fri, Aug 16, 2013 at 11:23 PM, Jason Smith j...@apache.org wrote:
 On Fri, Aug 16, 2013 at 4:49 PM, Volker Mische volker.mis...@gmail.com
 wrote:

 On 08/16/2013 11:32 AM, Alexander Shorin wrote:
  On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau bchesn...@gmail.com
  wrote:
  I agree, (modulo the fact that I would replace a string by a binary ;)
  but that would be only possible if we extract the metadata (_id, _rev)
  from the JSON so couchdb wouldn't have to decode the JSON to get them.
  Streaming json would also allow that but since there is no guarantee
  in the properties order of a JSON it would be less efficient.

  What if we split document metadata from document itself?


 I would like to hear a goal for this effort? What is the definition of
 success and failure?

 Idea: move document metadata into separate object.


 How do you link the metadata to the separate object there? Do you let the
 application set the internal links?

 I'm +1 with such idea anyway.

 Mmm...how I imagine it (Disclaimer: I'm sure I'm wrong in details there!):

 Btree:

         +
        / \
   --+-----+--
    / \   / \
   *   * *   *

 At the node we have doc object {...} for specific revision. Instead of
 this, we'll have a tuple ({...}, {...}) - first is a meta, second is a
 data.
 So I think there wouldn't be needed internal links since meta and data
 would live within same Btree node.
 For regular doc requesting, they will be merged (still need for `_`
 prefix to avoid collisions?) and returned as single {...} as always.

 We could also return them as separate objects, so the view function
 becomes: function(doc, meta) {}.

 Couchbase does that and from my experience it works well and feel right.

Oh, so this idea even works (:

However, the trick was about not passing the doc part (when it is big
enough) to the view server until the view server has processed its
metadata. Otherwise this is a good feature, but it wouldn't help speed
up indexing. To recap the trick: first process the meta part, and only
if it passes, load the doc. Later I sent another mail where I
eventually reinvented chained views; since the meta trick does exactly
the same thing, chained views are the more correct way to go. See the
quote at the end with the summary.

Anyway, I feel we should inherit Couchbase's experience with the
document metadata object (assuming they wouldn't sue us for it ((: ),
since everyone already has some preferred metadata fields (like type)
or uses a special object for them to avoid polluting the main document
body. I'd prefer a special '.meta' object at the document root which
holds document type info, authorship, timestamps, bindings, etc.
It's a good feature to have whether or not it optimizes indexing (:

Below is about chained views:

On Fri, Aug 16, 2013 at 11:58 PM, Alexander Shorin kxe...@gmail.com wrote:
 Resume: probably, I've just described a chained views feature with
 autoindexing by certain fields (:
 Remove the autoindexing feature, and we could make the view build
 process much faster if we set up the right view chain, using set
 algebra operations to calculate the target doc ids to pass to the
 final view - reduce the docs before the map step:

 {
   "views": {
     "posts": {"map": "...", "reduce": "..."},
     "chain": [
       ["by_type", {"key": "post"}],
       ["hidden", {"key": false}],
       ["by_domain", {"keys": ["public", "wiki"]}]
     ]
   }
 }

 In the case of a 10 000 docs db with 1200 posts, where 200 are hidden
 and 400 are private, the resulting posts view has to process only 600
 docs instead of 10 000, and it's an index lookup operation to find the
 result docs to pass. Sure, calling such a view triggers all views in
 the chain.

--
,,,^..^,,,


Re: Erlang vs JavaScript

2013-08-18 Thread Robert Keizer

On 13-08-18 09:33 AM, Alexander Shorin wrote:

On Sun, Aug 18, 2013 at 3:54 PM, Volker Mische volker.mis...@gmail.com wrote:

On 08/18/2013 08:42 AM, Alexander Shorin wrote:

On Sun, Aug 18, 2013 at 10:22 AM, Benoit Chesneau bchesn...@gmail.com wrote:

On Fri, Aug 16, 2013 at 9:58 PM, Alexander Shorin kxe...@gmail.com wrote:


On Fri, Aug 16, 2013 at 11:23 PM, Jason Smith j...@apache.org wrote:

On Fri, Aug 16, 2013 at 4:49 PM, Volker Mische volker.mis...@gmail.com
wrote:

On 08/16/2013 11:32 AM, Alexander Shorin wrote:

On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau bchesn...@gmail.com
wrote:

I agree, (modulo the fact that I would replace a string by a binary ;)
but that would be only possible if we extract the metadata (_id, _rev)
from the JSON so couchdb wouldn't have to decode the JSON to get them.
Streaming json would also allow that but since there is no guarantee in
the properties order of a JSON it would be less efficient.

What if we split document metadata from document itself?


I would like to hear a goal for this effort? What is the definition of
success and failure?

Idea: move document metadata into separate object.


How do you link the metadata to the separate object there? Do you let the
application set the internal links?

I'm +1 with such idea anyway.

Mmm...how I imagine it (Disclaimer: I'm sure I'm wrong in details there!):

Btree:

        +
       / \
  --+-----+--
   / \   / \
  *   * *   *

At the node we have doc object {...} for specific revision. Instead of
this, we'll have a tuple ({...}, {...}) - first is a meta, second is a
data.
So I think there wouldn't be needed internal links since meta and data
would live within same Btree node.
For regular doc requesting, they will be merged (still need for `_`
prefix to avoid collisions?) and returned as single {...} as always.

We could also return them as separate objects, so the view function
becomes: function(doc, meta) {}.

Couchbase does that and from my experience it works well and feel right.

Oh, so this idea even works (:

However, the trick was about to not pass doc part (in case if it big
enough) to the view server until view server wouldn't process his
metadata. Otherwise this is good feature, but it wouldn't help with
indexing speed up. I remind the trick: first process meta part and if
it passed - load the doc. Later I'd sent another mail where I'd
eventually reinvented chained views, because trick with meta does
exactly the same, chained views are more correct way to go. See quote
at the end with resume.

Anyway, I feel we need to inherit Couchbase experience with document's
metadata object (of course if they wouldn't sue us for that ((: )
since everyone already same some preferred metadata fields (like type)
or uses special object for that to not pollute main document body.
I'm prefer special '.meta' object at the document root which holds
document type info, authorship, timestamps, bindings, etc.
It's good feature to have no matter does it optimizes indexation
process or not (:


I would suggest either prefixing with an underscore, or using a
separate object passed to the view server.

If someone (such as myself) has many, many documents which happen to
contain a meta attribute, it would be non-trivial to upgrade/migrate.
A migration script could be written, of course, although it wouldn't
be ideal.

Something to consider: it may be worthwhile to simply use obj._meta
instead of .meta.




Below is about chained views:

On Fri, Aug 16, 2013 at 11:58 PM, Alexander Shorin kxe...@gmail.com wrote:

Resume: probably, I've just described a chained views feature with
autoindexing by certain fields (:
Remove the autoindexing feature, and we could make the view build
process much faster if we set up the right view chain, using set
algebra operations to calculate the target doc ids to pass to the
final view - reduce the docs before the map step:

{
  "views": {
    "posts": {"map": "...", "reduce": "..."},
    "chain": [
      ["by_type", {"key": "post"}],
      ["hidden", {"key": false}],
      ["by_domain", {"keys": ["public", "wiki"]}]
    ]
  }
}

In the case of a 10 000 docs db with 1200 posts, where 200 are hidden
and 400 are private, the resulting posts view has to process only 600
docs instead of 10 000, and it's an index lookup operation to find the
result docs to pass. Sure, calling such a view triggers all views in
the chain.




Chained views would be awesome! I'm sure I'm not alone in having solved 
this problem by using multiple queries and matching document IDs.
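That client-side workaround usually looks something like this (an illustrative sketch with stubbed view results, not a real client API):

```javascript
// Client-side stand-in for chained views: query several views (stubbed
// here as plain arrays of {id} rows) and intersect the doc ids by hand.

function idSet(rows) {
  return new Set(rows.map((row) => row.id));
}

function intersectIds(...rowLists) {
  // Reduce pairwise: keep only ids present in every result set.
  return rowLists
    .map(idSet)
    .reduce((acc, next) => new Set([...acc].filter((id) => next.has(id))));
}

// Pretend results from two separate view queries.
const byTypeRows    = [{ id: 'a' }, { id: 'b' }, { id: 'c' }];
const notHiddenRows = [{ id: 'a' }, { id: 'c' }];

console.log([...intersectIds(byTypeRows, notHiddenRows)]); // ['a', 'c']
```

Doing this on the server, against already-built indexes, is exactly what the chained-views proposal would save clients from reimplementing.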


Re: Erlang vs JavaScript

2013-08-17 Thread JFC Morfin

At 11:49 16/08/2013, Volker Mische wrote:

 What if we split document metadata from document itself? E.g. pass
 _id, _rev and other system or meta fields with separate object. Their
 size much lesser than whole document, so it will be possible to fast
 decode this metadata and decide is doc need to be processed or not
 without need to decode/encode megabytes of document's json. Sure, this
 adds additional communication roundtrip, but in case if it will be
 faster than json decode/encode - why not?

That would be the ultimate-ultimate goal.


This is a basic requirement for me: incrementally (i.e. metadata on 
metadata) and for syllodata (data between data) interlinks.
jfc 



Re: Erlang vs JavaScript

2013-08-16 Thread Volker Mische
On 08/15/2013 11:53 AM, Benoit Chesneau wrote:
 On Thu, Aug 15, 2013 at 11:38 AM, Jan Lehnardt j...@apache.org wrote:
 

 On Aug 15, 2013, at 10:09 , Robert Newson rnew...@apache.org wrote:

 A big +1 to Jason's clarification of erlang vs native. CouchDB
 could have shipped an erlang view server that worked in a separate
 process and had the stdio overhead, to combine the slowness of the
 protocol with the obtuseness of erlang. ;)

 Evaluating Javascript within the erlang VM process intrigues me, Jens,
 how is that done in your case? I've not previously found the assertion
 that V8 would be faster than SpiderMonkey for a view server compelling
 since the bottleneck is almost never in the code evaluation, but I do
 support CouchDB switching to it for the synergy effects of a closer
 binding with node.js, but if it's running in the same process, that
 would change (though I don't immediately see why the same couldn't be
 done for SpiderMonkey). Off the top of my head, I don't know a safe
 way to evaluate JS in the VM. A NIF-based approach would either be
 quite elaborate or would trip all the scheduling problems that
 long-running NIF's are now notorious for.

 At a step removed, the view server protocol itself seems like the
 thing to improve on, it feels like that's the principal bottleneck.

 The code is here:
 https://github.com/couchbase/couchdb/tree/master/src/mapreduce

 I’d love for someone to pick this up and give CouchDB, say, a ./configure
 --enable-native-v8 option or a plugin that allows people to opt into the
 speed improvements made there. :)

 The choice for V8 was made because of easier integration API and more
 reliable releases as a standalone project, which I think was a smart move.

 IIRC it relies on a change to CouchDB-y internals that has not made it
 back from Couchbase to CouchDB (Filipe will know, but I doubt he’s reading
 this thread), but we should look into that and get us “native JS views”, at
 least as an option or plugin.

 CCing dev@.

 Jan
 --


 Well, at first glance nifs look like a good idea, but they can be very
 problematic:
 
 - when the view computation takes time it would block the whole vm
 scheduling. It can be mitigated using a pool of threads to execute the work
 asynchronously, but that can create other problems like memory leaks etc.
 - nifs can't be upgraded easily during a hot upgrade
 - when a nif crashes, the whole vm crashes.
 
 (Note that we have the same problem when using a nif to decode/encode json;
 it only works well with medium-sized documents)
 
 One other way to improve the js handling would be to remove the main
 bottleneck, ie the serialization-deserialization we do on each step. Not
 sure if it exists, but if feasible, why not pass erlang terms from erlang
 to js and js to erlang? Then the deserialization would happen only
 on the JS side, ie instead of
 
 get erlang term
 encode to json
 send to js
 decode json
 process
 encode json
 send json
 decode json to erlang term
 store
 
 we would just have
 
 get erlang term
 send over STDIO
 decode erlang term to JS object
 process
 encode to erlang term
 send erlang term
 store
 
 Erlang serialization is also very optimised.

I think the ultimate goal should be to do as little
conversion/serialisation as possible, hence no conversion to Erlang
Terms at all.

Input as string
Parsing to get ID
Store as string

Send to JS as string
Process with JS
Store as string

Cheers,
  Volker




Re: Erlang vs JavaScript

2013-08-16 Thread Benoit Chesneau
On Fri, Aug 16, 2013 at 11:05 AM, Volker Mische volker.mis...@gmail.comwrote:

 On 08/15/2013 11:53 AM, Benoit Chesneau wrote:
  On Thu, Aug 15, 2013 at 11:38 AM, Jan Lehnardt j...@apache.org wrote:
 
 
  On Aug 15, 2013, at 10:09 , Robert Newson rnew...@apache.org wrote:
 
  A big +1 to Jason's clarification of erlang vs native. CouchDB
  could have shipped an erlang view server that worked in a separate
  process and had the stdio overhead, to combine the slowness of the
  protocol with the obtuseness of erlang. ;)
 
  Evaluating Javascript within the erlang VM process intrigues me, Jens,
  how is that done in your case? I've not previously found the assertion
  that V8 would be faster than SpiderMonkey for a view server compelling
  since the bottleneck is almost never in the code evaluation, but I do
  support CouchDB switching to it for the synergy effects of a closer
  binding with node.js, but if it's running in the same process, that
  would change (though I don't immediately see why the same couldn't be
  done for SpiderMonkey). Off the top of my head, I don't know a safe
  way to evaluate JS in the VM. A NIF-based approach would either be
  quite elaborate or would trip all the scheduling problems that
  long-running NIF's are now notorious for.
 
  At a step removed, the view server protocol itself seems like the
  thing to improve on, it feels like that's the principal bottleneck.
 
  The code is here:
  https://github.com/couchbase/couchdb/tree/master/src/mapreduce
 
  I’d love for someone to pick this up and give CouchDB, say, a
 ./configure
  --enable-native-v8 option or a plugin that allows people to opt into the
  speed improvements made there. :)
 
  The choice for V8 was made because of easier integration API and more
  reliable releases as a standalone project, which I think was a smart
 move.
 
  IIRC it relies on a change to CouchDB-y internals that has not made it
  back from Couchbase to CouchDB (Filipe will know, but I doubt he’s
 reading
  this thread), but we should look into that and get us “native JS
 views”, at
  least as an option or plugin.
 
  CCing dev@.
 
  Jan
  --
 
 
  Well on the first hand nifs look like a good idea but can be very
  problematic:
 
  - when the view computation take time it would block the full vm
  scheduling. It can be mitigated using a pool of threads to execute the
 work
  asynchronously but then can create other problems like memory leaking
 etc.
  - nifs can't be upgraded easily during hot upgrade
  - when a nif crash, all the vm crash.
 
  (Note that we have the same problem when using a nif to decode/encode
 json,
  it only works well with medium sized documents)
 
  One other way to improve the js handling would be removing the main
  bottleneck ie the serialization-deserialization we do on each step. Not
  sure if it exists but  feasible, why not passing erlang terms from erlang
  to js and js to erlang? So at the end the deserialization would happen
 only
  on the JS side ie instead of having
 
  get erlang term
  encode to json
  send to js
  decode json
  process
  encode json
  send json
  decode json to erlang term
  store
 
  we would just have
 
  get erlang term
  send over STDIO
  decode erlang term to JS object
  process
  encode to erlang term
  send erlang term
  store
 
  Erlang serialization is also very optimised.

 I think the ultimate goal should be to be as little
 conversion/serialisation as possible, hence no conversion to Erlang
 Terms at all.

 Input as string
 Parsing to get ID
 Store as string

 Send to JS as string
 Process with JS
 Store as string

 Cheers,
   Volker




I agree (modulo the fact that I would replace a string by a binary ;), but
that would be only possible if we extract the metadata (_id, _rev) from the
JSON so couchdb wouldn't have to decode the JSON to get them. Streaming
json would also allow that, but since there is no guarantee in the
property order of a JSON it would be less efficient.

- benoit


Re: Erlang vs JavaScript

2013-08-16 Thread Alexander Shorin
On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau bchesn...@gmail.com wrote:
 I agree, (modulo the fact that I would replace a string by a binary ;) but
 that would be only possible if we extract the metadata (_id, _rev) from the
 JSON so couchdb wouldn't have to decode the JSON to get them. Streaming
 json would also allows that but since there is no guaranty in the
 properties order of a JSON it would be less efficient.

What if we split document metadata from the document itself? E.g. pass
_id, _rev and other system or meta fields in a separate object. Their
size is much smaller than the whole document, so it would be possible
to quickly decode this metadata and decide whether the doc needs to be
processed at all, without having to decode/encode megabytes of the
document's json. Sure, this adds an additional communication roundtrip,
but if it turns out faster than the json decode/encode - why not?
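The proposed two-step exchange might be modeled like this (all names invented; `loadBody` stands in for the extra roundtrip and the expensive JSON decode):

```javascript
// Sketch of the meta-first exchange: the indexer hands the view server
// only the small metadata object, and the large body is decoded and
// transferred only when the guard accepts the document.

function wantsDoc(meta) {
  // Cheap accept/reject using metadata alone.
  return meta.type === 'post' && !meta.deleted;
}

function indexDoc(meta, loadBody) {
  if (!wantsDoc(meta)) return null; // fast reject: body is never parsed
  const doc = loadBody();           // extra roundtrip, paid only rarely
  return [meta.id, doc.title];
}

const rejected = indexDoc({ id: 'a', type: 'note' }, () => {
  throw new Error('body should not be loaded for rejected docs');
});
const accepted = indexDoc({ id: 'b', type: 'post' }, () => ({ title: 'Hi' }));
console.log(rejected, accepted); // null, then ['b', 'Hi']
```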

--
,,,^..^,,,


Re: Erlang vs JavaScript

2013-08-16 Thread Volker Mische
On 08/16/2013 11:32 AM, Alexander Shorin wrote:
 On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau bchesn...@gmail.com wrote:
 I agree, (modulo the fact that I would replace a string by a binary ;) but
 that would be only possible if we extract the metadata (_id, _rev) from the
 JSON so couchdb wouldn't have to decode the JSON to get them. Streaming
 json would also allows that but since there is no guaranty in the
 properties order of a JSON it would be less efficient.
 
 What if we split document metadata from document itself? E.g. pass
 _id, _rev and other system or meta fields with separate object. Their
 size much lesser than whole document, so it will be possible to fast
 decode this metadata and decide is doc need to be processed or not
 without need to decode/encode megabytes of document's json. Sure, this
 adds additional communication roundtrip, but in case if it will be
 faster than json decode/encode - why not?

That would be the ultimate-ultimate goal.

Cheers,
  Volker




Re: Erlang vs JavaScript

2013-08-16 Thread Jason Smith
On Fri, Aug 16, 2013 at 4:49 PM, Volker Mische volker.mis...@gmail.comwrote:

 On 08/16/2013 11:32 AM, Alexander Shorin wrote:
  On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau bchesn...@gmail.com
 wrote:
  I agree, (modulo the fact that I would replace a string by a binary ;)
 but
  that would be only possible if we extract the metadata (_id, _rev) from
 the
  JSON so couchdb wouldn't have to decode the JSON to get them. Streaming
  json would also allows that but since there is no guaranty in the
  properties order of a JSON it would be less efficient.
 
  What if we split document metadata from document itself?


I would like to hear a goal for this effort. What is the definition of
success and failure?

Jan makes a fine point on user@. I live with the pain. But really, life
is pain. Deny it if you must. Until we are delivered--finally!--our sweet
release, we will necessarily endure pain.

Facts:

* When you store a record, a machine must write that to storage
* If you have an index, a machine must update the index to storage

Building an index requires visiting every document. One way or another, the
entire .couch file is coming off the disk and going through the wringer. One
way or another, every row in the view will be written. I am not clear why
optimizing from N ms/doc to N/2 ms/doc will help, when you still have to
read 30GB from storage, and write 30GB back.

On one end, the computer scientist says we cannot avoid the necessary time
complexity. On the other end, the casual user says: if it is not
instantaneous, then it hardly matters.

That is, we have a problem of expectation management, not codec speed.
Nobody expects MySQL's CREATE INDEX to finish in a flash, and nobody should
expect that of a view.

If somebody does set out to accelerate views, you're welcome. But I would
ask: what is a successful optimization, and why?

(Also, Noah, if you are out there, this is an example of the sort of thing
I would put on the wiki, but past bad experiences make me say "can't be
bothered".)


Re: Erlang vs JavaScript

2013-08-16 Thread Alexander Shorin
On Fri, Aug 16, 2013 at 11:23 PM, Jason Smith j...@apache.org wrote:
 On Fri, Aug 16, 2013 at 4:49 PM, Volker Mische volker.mis...@gmail.com
 wrote:

 On 08/16/2013 11:32 AM, Alexander Shorin wrote:
  On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau bchesn...@gmail.com
  wrote:
   I agree (modulo the fact that I would replace a string by a binary ;)
   but that would be only possible if we extract the metadata (_id, _rev)
   from the JSON so couchdb wouldn't have to decode the JSON to get them.
   Streaming json would also allow that, but since there is no guarantee
   in the property order of a JSON it would be less efficient.
 
  What if we split document metadata from document itself?


 I would like to hear a goal for this effort? What is the definition of
 success and failure?

Idea: move document metadata into separate object.

Motivation:

Case 1: Small docs. No profit at all. Moreover, it's probably better
not to split things here, e.g. pass the full doc if its size is below
some threshold.
Case 2: Large docs. Profit when you have put the right fields into the
metadata (like doc type, authorship, tags etc.) and filter by this
metadata first - you get a minimal memory footprint and less CPU
load, and the "fast accept - fast reject" rule works perfectly.
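The "fast accept - fast reject" idea can be sketched roughly like this (a hypothetical illustration, not CouchDB code; the names `shouldProcess` and `processRow` are invented for the sketch): check a small metadata object against the filter first, and only decode the full document body on a match.

```javascript
// Hypothetical sketch: the indexer sees a small metadata object first and
// only parses the full (potentially huge) document JSON when the metadata
// matches the filter. All names here are illustrative, not CouchDB API.

function shouldProcess(meta, filter) {
  // Reject as soon as any filtered field mismatches -- no full-doc decode.
  return Object.keys(filter).every(function (field) {
    return meta[field] === filter[field];
  });
}

function processRow(metaJson, docJsonProvider, filter, mapFn) {
  var meta = JSON.parse(metaJson);             // a few hundred bytes
  if (!shouldProcess(meta, filter)) return []; // fast reject: skip megabytes
  var doc = JSON.parse(docJsonProvider());     // only now decode the full doc
  return mapFn(doc);
}

// Example: filter on doc type before decoding the body.
var rows = processRow(
  JSON.stringify({ type: 'post', author: 'alex' }),
  function () { return JSON.stringify({ type: 'post', title: 'Hi', body: '...' }); },
  { type: 'post' },
  function (doc) { return [[doc.title, 1]]; }
);
console.log(rows); // prints [ [ 'Hi', 1 ] ]
```

The point of the sketch is that a mismatching doc costs one small JSON.parse instead of two large ones.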

Side effect: it's possible to filter by metadata first and keep only
the document ids that need processing. And if we know what and how many
docs to process, we can make assumptions about parallel indexation.

Side effect: it's possible to autoindex metadata on the fly on document
update without asking the user to write (meta/by_type, meta/by_author,
meta/by_update_time etc.) views. Sure, the more metadata you have, the
larger the base index will be. In 80% of cases it will be no more than 4KB.

Summary: probably I've just described the chained views feature with
autoindexing by certain fields (:
Remove the autoindexing feature, and we could make the view building
process much faster with the right view chain, one that uses set
algebra operations to calculate the target doc ids to pass to the
final view - reduce docs before map results:

{
  "views": {
    "posts": {"map": "...", "reduce": "..."},
    "chain": [
      ["by_type", {"key": "post"}],
      ["hidden", {"key": false}],
      ["by_domain", {"keys": ["public", "wiki"]}]
    ]
  }
}

In the case of a db with 1200 posts, where 200 are hidden and 400
are private, the resulting posts view has to process only 600 docs
instead of every doc in the db, and finding the docs to pass is an
index lookup operation. Sure, calling such a view triggers all views
in the chain. And I haven't thought about cross dependencies and
loops for now.
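The set algebra step of the proposed chain could be sketched like this (an illustration only, not CouchDB code; the index objects and values are invented): each link in the chain is an index lookup returning a set of doc ids, and the candidate set for the final view is the intersection of all links.

```javascript
// Illustrative sketch of the proposed chain: each link is an index lookup
// (value -> set of doc ids) and the final candidate set is the intersection.
// The indexes and ids below are made up for the example.

function intersect(a, b) {
  return new Set([...a].filter(function (id) { return b.has(id); }));
}

// Hypothetical pre-built metadata indexes.
var byType   = { post: new Set(['d1', 'd2', 'd3', 'd4']) };
var hidden   = { false: new Set(['d1', 'd2', 'd3']) };
var byDomain = { public: new Set(['d1']), wiki: new Set(['d2']) };

// chain: [["by_type", {key: "post"}], ["hidden", {key: false}],
//         ["by_domain", {keys: ["public", "wiki"]}]]
var candidates = byType['post'];
candidates = intersect(candidates, hidden[false]);
var domainUnion = new Set([...byDomain['public'], ...byDomain['wiki']]);
candidates = intersect(candidates, domainUnion);

console.log([...candidates]); // only these ids reach the final "posts" map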

--
,,,^..^,,,


Re: Erlang vs JavaScript

2013-08-15 Thread Jan Lehnardt

On Aug 15, 2013, at 10:09 , Robert Newson rnew...@apache.org wrote:

 A big +1 to Jason's clarification of erlang vs native. CouchDB
 could have shipped an erlang view server that worked in a separate
 process and had the stdio overhead, to combine the slowness of the
 protocol with the obtuseness of erlang. ;)
 
 Evaluating Javascript within the erlang VM process intrigues me, Jens,
 how is that done in your case? I've not previously found the assertion
 that V8 would be faster than SpiderMonkey for a view server compelling
 since the bottleneck is almost never in the code evaluation, but I do
 support CouchDB switching to it for the synergy effects of a closer
 binding with node.js, but if it's running in the same process, that
 would change (though I don't immediately see why the same couldn't be
 done for SpiderMonkey). Off the top of my head, I don't know a safe
 way to evaluate JS in the VM. A NIF-based approach would either be
 quite elaborate or would trip all the scheduling problems that
 long-running NIF's are now notorious for.
 
 At a step removed, the view server protocol itself seems like the
 thing to improve on, it feels like that's the principal bottleneck.

The code is here: https://github.com/couchbase/couchdb/tree/master/src/mapreduce

I’d love for someone to pick this up and give CouchDB, say, a ./configure 
--enable-native-v8 option or a plugin that allows people to opt into the speed 
improvements made there. :)

The choice for V8 was made because of easier integration API and more reliable 
releases as a standalone project, which I think was a smart move.

IIRC it relies on a change to CouchDB-y internals that has not made it back 
from Couchbase to CouchDB (Filipe will know, but I doubt he’s reading this 
thread), but we should look into that and get us “native JS views”, at least as 
an option or plugin.

CCing dev@.

Jan
--





 
 B.
 
 
 On 15 August 2013 08:22, Jason Smith j...@apache.org wrote:
 Yes, to a first approximation, with a native view, CouchDB is basically
 running eval() on your code. In my example, I took advantage of this to
 build a nonstandard response to satisfy an application. (Instead of a 404,
 we sent a designated fallback document body.)
 
 But, if you accumulate the list in a native view, a JavaScript view, or a
 hypothetical Erlang view (i.e. a subprocess), from the operating system's
 perspective, the memory for that list will be allocated somewhere. Either
 the CouchDB process asks for X KB more memory, or its subprocess will ask
 for it. So I think the total system impact is probably low in practice.
 
 So I guess my point is not that native views are wrong, just they have a
 cost so you should weigh the cost/benefit for your own project. In the case
 of manage_couchdb, I wrote a JavaScript implementation; but since sometimes
 I have an emergency and I must find conflicts ASAP, I made an Erlang
 version because it is worth it.
 
 
 On Thu, Aug 15, 2013 at 2:05 PM, Stanley Iriele siriele...@gmail.com wrote:
 
 Whoa...OK...that I had no idea about...thanks for taking the time to go to
 that granularity, by the way.
 
 So does this mean that the process memory is shared? As opposed to living
 in its own space? So if someone accumulates a large json object in a list
 function, it's chewing up couchdb's memory?... I guess I'm a little confused
 about what's in the same process and what isn't now.
 On Aug 14, 2013 11:57 PM, Jason Smith j...@apache.org wrote:
 
 To me, an Erlang view is a view server which supports map, reduce, show,
 update, list, etc. functions in the Erlang language. (Basically it is
 implemented in Erlang.)
 
 A view server is a subprocess that runs beneath CouchDB which
 communicates
 with it over standard i/o. It is a different process in the operating
 system and only interfaces with the main server using the view server
 protocol (basically a bunch of JSON messages going back and forth).
 
 I do not know of an Erlang view server which works well and is currently
 maintained.
 
 A native view (shipped by CouchDB but disabled by default) is some
 corner-cutting. Code is evaluated directly by the primary CouchDB server.
 Since CouchDB is Erlang, the native query server is necessarily Erlang.
 The
 key difference is, your code is right there in the eye of the storm. You
 can call couch_server:open(some_db) and completely circumvent security
 and other invariants which CouchDB enforces. You can leak memory until
 the
 kernel OOM killer terminates CouchDB. It's not about the language, it's
 that it is running inside the CouchDB process.
 
 
 
 On Thu, Aug 15, 2013 at 1:36 PM, Stanley Iriele siriele...@gmail.com
 wrote:
 
 Wait... I'm a tad confused here... Jason, what is the difference between
 native views and Erlang views?...
 On Aug 14, 2013 11:16 PM, Jason Smith j...@apache.org wrote:
 
 Oh, also:
 
 They are **not** Erlang views. They are **native** views. We should
 emphasize the latter to remind ourselves about the security and
 

Re: Erlang vs JavaScript

2013-08-15 Thread Benoit Chesneau
On Thu, Aug 15, 2013 at 11:38 AM, Jan Lehnardt j...@apache.org wrote:


 On Aug 15, 2013, at 10:09 , Robert Newson rnew...@apache.org wrote:

  A big +1 to Jason's clarification of erlang vs native. CouchDB
  could have shipped an erlang view server that worked in a separate
  process and had the stdio overhead, to combine the slowness of the
  protocol with the obtuseness of erlang. ;)
 
  Evaluating Javascript within the erlang VM process intrigues me, Jens,
  how is that done in your case? I've not previously found the assertion
  that V8 would be faster than SpiderMonkey for a view server compelling
  since the bottleneck is almost never in the code evaluation, but I do
  support CouchDB switching to it for the synergy effects of a closer
  binding with node.js, but if it's running in the same process, that
  would change (though I don't immediately see why the same couldn't be
  done for SpiderMonkey). Off the top of my head, I don't know a safe
  way to evaluate JS in the VM. A NIF-based approach would either be
  quite elaborate or would trip all the scheduling problems that
  long-running NIF's are now notorious for.
 
  At a step removed, the view server protocol itself seems like the
  thing to improve on, it feels like that's the principal bottleneck.

 The code is here:
 https://github.com/couchbase/couchdb/tree/master/src/mapreduce

 I’d love for someone to pick this up and give CouchDB, say, a ./configure
 --enable-native-v8 option or a plugin that allows people to opt into the
 speed improvements made there. :)

 The choice for V8 was made because of easier integration API and more
 reliable releases as a standalone project, which I think was a smart move.

 IIRC it relies on a change to CouchDB-y internals that has not made it
 back from Couchbase to CouchDB (Filipe will know, but I doubt he’s reading
 this thread), but we should look into that and get us “native JS views”, at
 least as an option or plugin.

 CCing dev@.

 Jan
 --


Well, on the one hand NIFs look like a good idea, but they can be very
problematic:

- when the view computation takes time, it blocks the whole VM
scheduling. It can be mitigated using a pool of threads to execute the work
asynchronously, but that can create other problems like memory leaks etc.
- NIFs can't be upgraded easily during a hot upgrade
- when a NIF crashes, the whole VM crashes.

(Note that we have the same problem when using a nif to decode/encode json,
it only works well with medium sized documents)

One other way to improve the JS handling would be to remove the main
bottleneck, i.e. the serialization/deserialization we do on each step. Not
sure if it exists, but it seems feasible: why not pass Erlang terms from
Erlang to JS and from JS to Erlang? Then the deserialization would happen
only on the JS side, i.e. instead of having

get erlang term
encode to json
send to js
decode json
process
encode json
send json
decode json to erlang term
store

we would just have

get erlang term
send over STDIO
decode erlang term to JS object
process
encode to erlang term
send erlang term
store

Erlang serialization is also very optimised.
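To get a feel for the codec tax being described, here is a rough, machine-dependent sketch (not a CouchDB benchmark; the document shape and counts are invented) of the four JSON passes every document currently pays on its way through the view server and back:

```javascript
// Rough illustration of the per-document codec tax in the current protocol:
// every doc crosses the stdio boundary as JSON in both directions, so the
// view build pays four codec passes per doc. The timings are only
// indicative; the doc shape and counts are made up for the sketch.

function makeDoc(i) {
  return { _id: 'doc' + i, type: 'post', tags: ['a', 'b', 'c'], body: 'x'.repeat(1024) };
}

var docs = [];
for (var i = 0; i < 5000; i++) docs.push(makeDoc(i));

var start = Date.now();
var rows = 0;
docs.forEach(function (doc) {
  var wire = JSON.stringify(doc);       // couch -> view server
  var parsed = JSON.parse(wire);        // view server decodes
  var result = [[parsed._id, 1]];       // map function (trivial)
  var back = JSON.stringify(result);    // view server -> couch
  rows += JSON.parse(back).length;      // couch decodes the result
});
console.log(rows, 'rows in', Date.now() - start, 'ms');
```

With Erlang terms on the wire, the two passes on the Erlang side would collapse into (cheap, well-optimised) term serialization, leaving only the JS-side decode/encode.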


Both solutions could co-exist; that may be worth a try and a benchmark of each...


- benoit


Re: Erlang vs JavaScript

2013-08-15 Thread Jan Lehnardt

On Aug 15, 2013, at 11:53 , Benoit Chesneau bchesn...@gmail.com wrote:

 On Thu, Aug 15, 2013 at 11:38 AM, Jan Lehnardt j...@apache.org wrote:
 
 
 On Aug 15, 2013, at 10:09 , Robert Newson rnew...@apache.org wrote:
 
 A big +1 to Jason's clarification of erlang vs native. CouchDB
 could have shipped an erlang view server that worked in a separate
 process and had the stdio overhead, to combine the slowness of the
 protocol with the obtuseness of erlang. ;)
 
 Evaluating Javascript within the erlang VM process intrigues me, Jens,
 how is that done in your case? I've not previously found the assertion
 that V8 would be faster than SpiderMonkey for a view server compelling
 since the bottleneck is almost never in the code evaluation, but I do
 support CouchDB switching to it for the synergy effects of a closer
 binding with node.js, but if it's running in the same process, that
 would change (though I don't immediately see why the same couldn't be
 done for SpiderMonkey). Off the top of my head, I don't know a safe
 way to evaluate JS in the VM. A NIF-based approach would either be
 quite elaborate or would trip all the scheduling problems that
 long-running NIF's are now notorious for.
 
 At a step removed, the view server protocol itself seems like the
 thing to improve on, it feels like that's the principal bottleneck.
 
 The code is here:
 https://github.com/couchbase/couchdb/tree/master/src/mapreduce
 
 I’d love for someone to pick this up and give CouchDB, say, a ./configure
 --enable-native-v8 option or a plugin that allows people to opt into the
 speed improvements made there. :)
 
 The choice for V8 was made because of easier integration API and more
 reliable releases as a standalone project, which I think was a smart move.
 
 IIRC it relies on a change to CouchDB-y internals that has not made it
 back from Couchbase to CouchDB (Filipe will know, but I doubt he’s reading
 this thread), but we should look into that and get us “native JS views”, at
 least as an option or plugin.
 
 CCing dev@.
 
 Jan
 --
 
 
 Well, on the one hand NIFs look like a good idea, but they can be very
 problematic:
 
 - when the view computation takes time, it blocks the whole VM
 scheduling. It can be mitigated using a pool of threads to execute the work
 asynchronously, but that can create other problems like memory leaks etc.
 - NIFs can't be upgraded easily during a hot upgrade
 - when a NIF crashes, the whole VM crashes.

Yeah totally, hence making the whole thing an option.

 (Note that we have the same problem when using a nif to decode/encode json,
 it only works well with medium sized documents)




 One other way to improve the JS handling would be to remove the main
 bottleneck, i.e. the serialization/deserialization we do on each step. Not
 sure if it exists, but it seems feasible: why not pass Erlang terms from
 Erlang to JS and from JS to Erlang? Then the deserialization would happen
 only on the JS side, i.e. instead of having
 
 get erlang term
 encode to json
 send to js
 decode json
 process
 encode json
 send json
 decode json to erlang term
 store
 
 we would just have
 
 get erlang term
 send over STDIO
 decode erlang term to JS object
 process
 encode to erlang term
 send erlang term
 store
 
 Erlang serialization is also very optimised.
 
 
 Both solutions could co-exist; that may be worth a try and a benchmark of each...

I think we just want both solutions, period: the embedded one will still be
faster but potentially a little less stable, and the external view server one
will be slower but extremely robust. Users should be able to choose between
them :)

Best
Jan
--



