Re: Erlang vs JavaScript
It seems like there might be several simple internalizing speedups, even before tackling the view server protocol or the couchjs view server, hinted at by Alexander's suggestion:

On Fri, Aug 16, 2013 at 3:58 PM, Alexander Shorin kxe...@gmail.com wrote:

Idea: move document metadata into separate object. ... Case 2: Large docs. Profit in the case when you have set the right fields into metadata (like doc type, authorship, tags etc.) and filter first by this metadata - you have a minimal memory footprint, you have less CPU load, and the fast accept - fast reject rule works perfectly.

For the simple case of filtering which fields are passed to the map fn, you don't need full-blown chained views; you only need a simple way to define field filters (describing which fields are the relevant metadata fields).

Side effect: it's possible to autoindex metadata on the fly on document update without asking the user to write meta/by_type, meta/by_author, meta/by_update_time etc. views. Sure, the more metadata you have, the larger the base index will be. In 80% of cases it will be no more than 4KB.

Similarly to how the internals of couch already optimize away the case where multiple views in the same design doc share the same map function (but different reduce functions), we should also be able to optimize away the case where multiple views share the same fields filter.

Resume: probably, I'd just described chained views feature with autoindexing by certain fields (:

One lesson I learned when I looked into implementing chained map/reduce views is that they will need to be in different design docs from the parent views, in order to play nicely with BigCouch. Keeping them in the same design doc just doesn't work with parallel view builds (at least, not without breaking normal design doc considerations). So although I really like the simplicity of the keep-chained-views-in-one-design-doc approach, it's probably a dead end.
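The fields-filter idea above could be sketched roughly as follows. This is a minimal sketch, not real CouchDB internals; the `fieldsFilter` helper and field names are assumptions for illustration only.

```javascript
// Hypothetical sketch: strip a document down to its declared metadata
// fields before it ever reaches a map function. Views sharing the same
// fields filter could share the slimmed-down input.
function fieldsFilter(fields) {
  return function (doc) {
    var slim = { _id: doc._id, _rev: doc._rev };
    fields.forEach(function (f) {
      if (f in doc) slim[f] = doc[f]; // copy only declared fields
    });
    return slim;
  };
}

var filter = fieldsFilter(['type', 'author', 'tags']);
var doc = {
  _id: 'a', _rev: '1-x',
  type: 'post', author: 'bob',
  body: '...potentially megabytes of content...'
};
var slim = filter(doc);
// slim carries only _id, _rev, and the metadata fields; the large
// body never crosses into the view server.
```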
Removing the autoindexing feature, we could make the view building process much faster if we make the right views chain, which will use set algebra operations to calculate the target doc ids to pass to the final view: reduce docs before map results:

{
  "views": {
    "posts": {"map": "...", "reduce": "..."},
    "chain": [
      ["by_type", {"key": "post"}],
      ["hidden", {"key": false}],
      ["by_domain", {"keys": ["public", "wiki"]}]
    ]
  }
}

I was inspired by your view syntax and thought I'd put forward my own similar proposal:

{
  "_id": "plain_old_views_for_comparison",
  "views": {
    "single_emit": {
      "map": "function(doc) { if (!doc.foo) { emit([doc.bar, doc.baz], doc.quux); } }",
      "reduce": "_count"
    },
    "multiple_emits": {
      "map": "function(doc) { if (!doc.foo) { emit([0, doc.bar], doc.quux); emit(['baz', doc.baz], doc.quux); } }",
      "reduce": "_count"
    }
  }
}

{
  "_id": "internalized",
  "options": {
    "filter": "!foo",
    "fields": ["bar", "baz", "quux"]
  },
  "views": {
    "single_emit_1": {
      "map": "function(doc) { emit([doc.bar, doc.baz], doc.quux); }",
      "reduce": "_count"
    },
    "single_emit_2": {
      "map": {"key": ["bar", "baz"], "value": "quux"},
      "reduce": "_count"
    },
    "multiple_emits": {
      "map": {"emits": [[[0, "bar"], "quux"], [["baz", "baz"], "quux"]]},
      "reduce": "_count"
    }
  }
}

The above views should behave the same way. The view options would support filter as a guard clause and fields to strip out all but the relevant metadata. These should be defined at the design document level to simplify working with the current view server protocol. And the view map could optionally be an object describing the emit values instead of a function string.

The filter string should be simple but powerful: I'd suggest supporting !, &&, ||, (), foo.bar.baz, <, >, <=, >=, ==, !=, numbers, and strings (for type == 'foo'). But even if all it supported was foo and !foo, it would still be useful. In some cases, this will prevent most docs from ever needing to be evaluated by the view server. The fields array might also consider filtering nested fields with foo.bar.baz.
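A declarative map object like the one proposed above could be interpreted internally along these lines. This is a sketch of the proposal's semantics, not existing CouchDB code; `evalGuard` and `runDeclarativeMap` are made-up names, and only the minimal `foo` / `!foo` guard forms are handled.

```javascript
// Evaluate the trivial guard forms mentioned as the minimal useful subset.
function evalGuard(filter, doc) {
  if (filter[0] === '!') return !doc[filter.slice(1)];
  return !!doc[filter];
}

// Interpret a {key: [...], value: ...} map spec against a document.
function runDeclarativeMap(spec, doc, emit) {
  var key = spec.key.map(function (f) { return doc[f]; });
  emit(key, doc[spec.value]);
}

var rows = [];
var doc = { bar: 1, baz: 2, quux: 3 }; // note: no "foo", so the guard passes
if (evalGuard('!foo', doc)) {
  runDeclarativeMap({ key: ['bar', 'baz'], value: 'quux' }, doc,
    function (k, v) { rows.push([k, v]); });
}
// rows now holds [[[1, 2], 3]], matching what the single_emit
// JavaScript view function would have produced for this doc.
```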
The filter and internal map (key, value, emits) should support the same values that fields supports, plus numbers and strings; or they could support the same syntax as filter, to do things like key: [!!deleted_at, deleted_at]. The filter and internal map would be able to use all of the fields, not just the ones defined in the options.

Another odd case where I've personally noticed indexing speed get immensely bogged down is when the reduce function merges the map objects together. I've seen views with this problem grow to 5GB during initial load and compact back down to 20MB. I've documented this problem and my workaround here: https://gist.github.com/nevans/5512593. The hideous reduce pattern in that gist has resulted in 2-5x faster view builds for me (small DBs infinitesimally slower, huge speedup for big DBs). But it would be *much* better to simply add a minimum_reduced_group_level option to the view, and let the Erlang handle that without doing unnecessary view
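The object-merging reduce described above might look like this minimal sketch (this is an illustration of the pattern, not the gist's actual code): every internal btree node produces one ever-growing merged object, which is what bloats the intermediate reductions.

```javascript
// Illustrative reduce in the usual (keys, values, rereduce) shape that
// merges row values into one object. The returned value grows with the
// number of distinct keys seen, so intermediate reductions stored in
// the view btree can become very large before compaction.
function mergingReduce(keys, values, rereduce) {
  var out = {};
  values.forEach(function (v) {
    Object.keys(v).forEach(function (k) {
      out[k] = (out[k] || 0) + v[k];
    });
  });
  return out;
}

// Merging two small row values already yields a larger combined object:
var merged = mergingReduce(null, [{ a: 1 }, { a: 2, b: 1 }], false);
// merged is { a: 3, b: 1 }
```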
Re: Erlang vs JavaScript
On Fri, Aug 16, 2013 at 9:58 PM, Alexander Shorin kxe...@gmail.com wrote:

On Fri, Aug 16, 2013 at 11:23 PM, Jason Smith j...@apache.org wrote:

On Fri, Aug 16, 2013 at 4:49 PM, Volker Mische volker.mis...@gmail.com wrote:

On 08/16/2013 11:32 AM, Alexander Shorin wrote:

On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau bchesn...@gmail.com wrote:

I agree, (modulo the fact that I would replace a string by a binary ;) but that would be only possible if we extract the metadata (_id, _rev) from the JSON so couchdb wouldn't have to decode the JSON to get them. Streaming JSON would also allow that, but since there is no guarantee of the property order in a JSON object, it would be less efficient.

What if we split document metadata from the document itself?

I would like to hear a goal for this effort? What is the definition of success and failure?

Idea: move document metadata into a separate object.

How do you link the metadata to the separate object there? Do you let the application set the internal links? I'm +1 with such an idea anyway.

Motivation:

Case 1: Small docs. No profit at all. Moreover, it's probably better not to split things there, e.g. pass the full doc if its size is around some amount of megabytes.

Case 2: Large docs. Profit in the case when you have set the right fields into metadata (like doc type, authorship, tags etc.) and filter first by this metadata - you have a minimal memory footprint, you have less CPU load, and the fast accept - fast reject rule works perfectly.

Side effect: it's possible to first filter by metadata and leave only the document ids required to process. And if we know what and how many to process, we may make assumptions about parallel indexation.

Side effect: it's possible to autoindex metadata on the fly on document update without asking the user to write meta/by_type, meta/by_author, meta/by_update_time etc. views. Sure, the more metadata you have, the larger the base index will be. In 80% of cases it will be no more than 4KB.
Resume: probably, I'd just described a chained views feature with autoindexing by certain fields (:

Removing the autoindexing feature, we could make the view building process much faster if we make the right views chain, which will use set algebra operations to calculate the target doc ids to pass to the final view: reduce docs before map results:

{
  "views": {
    "posts": {"map": "...", "reduce": "..."},
    "chain": [
      ["by_type", {"key": "post"}],
      ["hidden", {"key": false}],
      ["by_domain", {"keys": ["public", "wiki"]}]
    ]
  }
}

In the case of a db with 1200 posts where 200 are hidden and 400 are private, the resulting posts view has to process only 600 docs instead of every doc in the db, and it's an index lookup operation to find out the result docs to pass. Sure, calling such a view triggers all views in the chain. And I don't think about cross dependencies and loops for now.

-- ,,,^..^,,,
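The set-algebra step the chain describes could be sketched like this. The index lookups are hypothetical stand-ins for real view queries; only the intersection logic is the point.

```javascript
// Intersect the doc-id sets produced by each link in the chain;
// only the surviving ids would ever reach the final posts view.
function intersect(sets) {
  return sets.reduce(function (acc, s) {
    return acc.filter(function (id) { return s.indexOf(id) !== -1; });
  });
}

// Hypothetical results of index lookups (stand-ins, not a real API):
var byType = ['a', 'b', 'c', 'd'];   // by_type?key="post"
var notHidden = ['a', 'b', 'd'];     // hidden?key=false
var byDomain = ['a', 'd'];           // by_domain?keys=["public","wiki"]

var candidates = intersect([byType, notHidden, byDomain]);
// candidates is ['a', 'd']: only these docs are passed to the map fn.
```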
Re: Erlang vs JavaScript
On Sun, Aug 18, 2013 at 10:22 AM, Benoit Chesneau bchesn...@gmail.com wrote:

On Fri, Aug 16, 2013 at 9:58 PM, Alexander Shorin kxe...@gmail.com wrote:

On Fri, Aug 16, 2013 at 11:23 PM, Jason Smith j...@apache.org wrote:

On Fri, Aug 16, 2013 at 4:49 PM, Volker Mische volker.mis...@gmail.com wrote:

On 08/16/2013 11:32 AM, Alexander Shorin wrote:

On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau bchesn...@gmail.com wrote:

I agree, (modulo the fact that I would replace a string by a binary ;) but that would be only possible if we extract the metadata (_id, _rev) from the JSON so couchdb wouldn't have to decode the JSON to get them. Streaming JSON would also allow that, but since there is no guarantee of the property order in a JSON object, it would be less efficient.

What if we split document metadata from the document itself?

I would like to hear a goal for this effort? What is the definition of success and failure?

Idea: move document metadata into a separate object.

How do you link the metadata to the separate object there? Do you let the application set the internal links? I'm +1 with such an idea anyway.

Mmm... how I imagine it (disclaimer: I'm sure I'm wrong in the details!):

(ASCII sketch of a btree, not preserved by the archive)

At the node we have the doc object {...} for a specific revision. Instead of this, we'll have a tuple ({...}, {...}) - the first is the meta, the second is the data. So I think there would be no need for internal links, since the meta and data would live within the same btree node. For regular doc requests, they will be merged (still a need for the `_` prefix to avoid collisions?) and returned as a single {...} as always.

-- ,,,^..^,,,
Re: Erlang vs JavaScript
On 08/18/2013 08:42 AM, Alexander Shorin wrote: On Sun, Aug 18, 2013 at 10:22 AM, Benoit Chesneau bchesn...@gmail.com wrote: On Fri, Aug 16, 2013 at 9:58 PM, Alexander Shorin kxe...@gmail.com wrote: On Fri, Aug 16, 2013 at 11:23 PM, Jason Smith j...@apache.org wrote: On Fri, Aug 16, 2013 at 4:49 PM, Volker Mische volker.mis...@gmail.com wrote: On 08/16/2013 11:32 AM, Alexander Shorin wrote: On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau bchesn...@gmail.com wrote: I agree, (modulo the fact that I would replace a string by a binary ;) but that would be only possible if we extract the metadata (_id, _rev) from the JSON so couchdb wouldn't have to decode the JSON to get them. Streaming json would also allows that but since there is no guaranty in the properties order of a JSON it would be less efficient. What if we split document metadata from document itself? I would like to hear a goal for this effort? What is the definition of success and failure? Idea: move document metadata into separate object. How do you link the metadata to the separate object there? Do you let the application set the internal links? I'm +1 with such idea anyway. Mmm...how I imagine it (Disclaimer: I'm sure I'm wrong in details there!): Btree: + || --+----+-- || || ** ** At the node we have doc object {...} for specific revision. Instead of this, we'll have a tuple ({...}, {...}) - first is a meta, second is a data. So I think there wouldn't be needed internal links since meta and data would live within same Btree node. For regular doc requesting, they will be merged (still need for `_` prefix to avoid collisions?) and returned as single {...} as always. We could also return them as separate objects, so the view function becomes: function(doc, meta) {}. Couchbase does that and from my experience it works well and feel right. Cheers, Volker
Re: Erlang vs JavaScript
On Sun, Aug 18, 2013 at 3:54 PM, Volker Mische volker.mis...@gmail.com wrote: On 08/18/2013 08:42 AM, Alexander Shorin wrote: On Sun, Aug 18, 2013 at 10:22 AM, Benoit Chesneau bchesn...@gmail.com wrote: On Fri, Aug 16, 2013 at 9:58 PM, Alexander Shorin kxe...@gmail.com wrote: On Fri, Aug 16, 2013 at 11:23 PM, Jason Smith j...@apache.org wrote: On Fri, Aug 16, 2013 at 4:49 PM, Volker Mische volker.mis...@gmail.com wrote: On 08/16/2013 11:32 AM, Alexander Shorin wrote: On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau bchesn...@gmail.com wrote: I agree, (modulo the fact that I would replace a string by a binary ;) but that would be only possible if we extract the metadata (_id, _rev) from the JSON so couchdb wouldn't have to decode the JSON to get them. Streaming json would also allows that but since there is no guaranty in the properties order of a JSON it would be less efficient. What if we split document metadata from document itself? I would like to hear a goal for this effort? What is the definition of success and failure? Idea: move document metadata into separate object. How do you link the metadata to the separate object there? Do you let the application set the internal links? I'm +1 with such idea anyway. Mmm...how I imagine it (Disclaimer: I'm sure I'm wrong in details there!): Btree: + || --+----+-- || || ** ** At the node we have doc object {...} for specific revision. Instead of this, we'll have a tuple ({...}, {...}) - first is a meta, second is a data. So I think there wouldn't be needed internal links since meta and data would live within same Btree node. For regular doc requesting, they will be merged (still need for `_` prefix to avoid collisions?) and returned as single {...} as always. We could also return them as separate objects, so the view function becomes: function(doc, meta) {}. Couchbase does that and from my experience it works well and feel right. 
Oh, so this idea even works (: However, the trick was about not passing the doc part (in case it is big enough) to the view server until the view server has processed its metadata. Otherwise this is a good feature, but it wouldn't help with the indexing speed-up. To restate the trick: first process the meta part and, if it passes, load the doc. Later I sent another mail where I eventually reinvented chained views; since the trick with meta does exactly the same thing, chained views are the more correct way to go. See the quote with the resume at the end.

Anyway, I feel we need to inherit Couchbase's experience with the document's metadata object (of course, if they wouldn't sue us for that ((: ), since everyone already has some preferred metadata fields (like type) or uses a special object for them so as not to pollute the main document body. I prefer a special '.meta' object at the document root which holds document type info, authorship, timestamps, bindings, etc. It's a good feature to have no matter whether it optimizes the indexing process or not (:

Below is about chained views:

On Fri, Aug 16, 2013 at 11:58 PM, Alexander Shorin kxe...@gmail.com wrote:

Resume: probably, I'd just described a chained views feature with autoindexing by certain fields (:

Removing the autoindexing feature, we could make the view building process much faster if we make the right views chain, which will use set algebra operations to calculate the target doc ids to pass to the final view: reduce docs before map results:

{
  "views": {
    "posts": {"map": "...", "reduce": "..."},
    "chain": [
      ["by_type", {"key": "post"}],
      ["hidden", {"key": false}],
      ["by_domain", {"keys": ["public", "wiki"]}]
    ]
  }
}

In the case of a db with 1200 posts where 200 are hidden and 400 are private, the resulting posts view has to process only 600 docs instead of every doc in the db, and it's an index lookup operation to find out the result docs to pass. Sure, calling such a view triggers all views in the chain.

-- ,,,^..^,,,
Re: Erlang vs JavaScript
On 13-08-18 09:33 AM, Alexander Shorin wrote: On Sun, Aug 18, 2013 at 3:54 PM, Volker Mische volker.mis...@gmail.com wrote: On 08/18/2013 08:42 AM, Alexander Shorin wrote: On Sun, Aug 18, 2013 at 10:22 AM, Benoit Chesneau bchesn...@gmail.com wrote: On Fri, Aug 16, 2013 at 9:58 PM, Alexander Shorin kxe...@gmail.com wrote: On Fri, Aug 16, 2013 at 11:23 PM, Jason Smith j...@apache.org wrote: On Fri, Aug 16, 2013 at 4:49 PM, Volker Mische volker.mis...@gmail.com wrote: On 08/16/2013 11:32 AM, Alexander Shorin wrote: On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau bchesn...@gmail.com wrote: I agree, (modulo the fact that I would replace a string by a binary ;) but that would be only possible if we extract the metadata (_id, _rev) from the JSON so couchdb wouldn't have to decode the JSON to get them. Streaming json would also allows that but since there is no guaranty in the properties order of a JSON it would be less efficient. What if we split document metadata from document itself? I would like to hear a goal for this effort? What is the definition of success and failure? Idea: move document metadata into separate object. How do you link the metadata to the separate object there? Do you let the application set the internal links? I'm +1 with such idea anyway. Mmm...how I imagine it (Disclaimer: I'm sure I'm wrong in details there!): Btree: + || --+----+-- || || ** ** At the node we have doc object {...} for specific revision. Instead of this, we'll have a tuple ({...}, {...}) - first is a meta, second is a data. So I think there wouldn't be needed internal links since meta and data would live within same Btree node. For regular doc requesting, they will be merged (still need for `_` prefix to avoid collisions?) and returned as single {...} as always. We could also return them as separate objects, so the view function becomes: function(doc, meta) {}. Couchbase does that and from my experience it works well and feel right. 
Oh, so this idea even works (: However, the trick was about not passing the doc part (in case it is big enough) to the view server until the view server has processed its metadata. Otherwise this is a good feature, but it wouldn't help with the indexing speed-up. To restate the trick: first process the meta part and, if it passes, load the doc. Later I sent another mail where I eventually reinvented chained views; since the trick with meta does exactly the same thing, chained views are the more correct way to go. See the quote with the resume at the end.

Anyway, I feel we need to inherit Couchbase's experience with the document's metadata object (of course, if they wouldn't sue us for that ((: ), since everyone already has some preferred metadata fields (like type) or uses a special object for them so as not to pollute the main document body. I prefer a special '.meta' object at the document root which holds document type info, authorship, timestamps, bindings, etc. It's a good feature to have no matter whether it optimizes the indexing process or not (:

I would suggest either prefixing with an underscore, or the use of a separate object passed to the view server. If someone (such as myself) has many, many documents which happen to contain a meta attribute, it would be non-trivial to upgrade/migrate. A migration script could be written of course, although it wouldn't be ideal; something to consider. It may be worthwhile to simply use obj._meta instead of .meta.
Below is about chained views:

On Fri, Aug 16, 2013 at 11:58 PM, Alexander Shorin kxe...@gmail.com wrote:

Resume: probably, I'd just described a chained views feature with autoindexing by certain fields (:

Removing the autoindexing feature, we could make the view building process much faster if we make the right views chain, which will use set algebra operations to calculate the target doc ids to pass to the final view: reduce docs before map results:

{
  "views": {
    "posts": {"map": "...", "reduce": "..."},
    "chain": [
      ["by_type", {"key": "post"}],
      ["hidden", {"key": false}],
      ["by_domain", {"keys": ["public", "wiki"]}]
    ]
  }
}

In the case of a db with 1200 posts where 200 are hidden and 400 are private, the resulting posts view has to process only 600 docs instead of every doc in the db, and it's an index lookup operation to find out the result docs to pass. Sure, calling such a view triggers all views in the chain.

Chained views would be awesome! I'm sure I'm not alone in having solved this problem by using multiple queries and matching document IDs.
Re: Erlang vs JavaScript
At 11:49 16/08/2013, Volker Mische wrote:

What if we split document metadata from the document itself? E.g. pass _id, _rev and other system or meta fields as a separate object. Their size is much smaller than the whole document's, so it will be possible to quickly decode this metadata and decide whether the doc needs to be processed, without having to decode/encode megabytes of the document's JSON. Sure, this adds an additional communication roundtrip, but if it is faster than the JSON decode/encode - why not?

That would be the ultimate-ultimate goal.

This is a basic requirement for me: incrementally (i.e. metadata on metadata) and for syllodata (data between data) interlinks.

jfc
Re: Erlang vs JavaScript
On 08/15/2013 11:53 AM, Benoit Chesneau wrote: On Thu, Aug 15, 2013 at 11:38 AM, Jan Lehnardt j...@apache.org wrote: On Aug 15, 2013, at 10:09 , Robert Newson rnew...@apache.org wrote: A big +1 to Jason's clarification of erlang vs native. CouchDB could have shipped an erlang view server that worked in a separate process and had the stdio overhead, to combine the slowness of the protocol with the obtuseness of erlang. ;) Evaluating Javascript within the erlang VM process intrigues me, Jens, how is that done in your case? I've not previously found the assertion that V8 would be faster than SpiderMonkey for a view server compelling since the bottleneck is almost never in the code evaluation, but I do support CouchDB switching to it for the synergy effects of a closer binding with node.js, but if it's running in the same process, that would change (though I don't immediately see why the same couldn't be done for SpiderMonkey). Off the top of my head, I don't know a safe way to evaluate JS in the VM. A NIF-based approach would either be quite elaborate or would trip all the scheduling problems that long-running NIF's are now notorious for. At a step removed, the view server protocol itself seems like the thing to improve on, it feels like that's the principal bottleneck. The code is here: https://github.com/couchbase/couchdb/tree/master/src/mapreduce I’d love for someone to pick this up and give CouchDB, say, a ./configure --enable-native-v8 option or a plugin that allows people to opt into the speed improvements made there. :) The choice for V8 was made because of easier integration API and more reliable releases as a standalone project, which I think was a smart move. IIRC it relies on a change to CouchDB-y internals that has not made it back from Couchbase to CouchDB (Filipe will know, but I doubt he’s reading this thread), but we should look into that and get us “native JS views”, at least as an option or plugin. CCing dev@. 
Jan

--

Well, at first glance NIFs look like a good idea, but they can be very problematic:

- when the view computation takes time it would block the full VM scheduling. It can be mitigated using a pool of threads to execute the work asynchronously, but that can then create other problems like memory leaking etc.
- NIFs can't be upgraded easily during a hot upgrade
- when a NIF crashes, the whole VM crashes.

(Note that we have the same problem when using a NIF to decode/encode JSON; it only works well with medium-sized documents.)

One other way to improve the JS handling would be removing the main bottleneck, i.e. the serialization/deserialization we do on each step. Not sure if it exists, but it seems feasible: why not pass Erlang terms from Erlang to JS and JS to Erlang? So at the end the deserialization would happen only on the JS side, i.e. instead of having:

get erlang term
encode to json
send to js
decode json
process
encode json
send json
decode json to erlang term
store

we should just have:

get erlang term
send over STDIO
decode erlang term to JS object
process
encode to erlang term
send erlang term
store

Erlang serialization is also very optimised.

I think the ultimate goal should be to do as little conversion/serialisation as possible, hence no conversion to Erlang terms at all:

Input as string
Parsing to get ID
Store as string
Send to JS as string
Process with JS
Store as string

Cheers, Volker
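The string pass-through pipeline Volker describes can be contrasted with today's path in miniature. This is only an illustrative sketch of the two pipelines' shapes, not actual view server code; the `process` function is a made-up stand-in for the map step.

```javascript
// A document as it would travel over the wire: an opaque string.
var raw = '{"_id":"doc1","body":"...potentially megabytes..."}';

// Current path (per the thread): the doc is decoded and re-encoded
// on both sides of the protocol. Two full codec passes shown here:
var decoded = JSON.parse(raw);
var reencoded = JSON.stringify(decoded);

// Proposed path: ship the string through untouched and decode exactly
// once, on the JS side, where the map function actually needs it.
function process(docString) {
  var doc = JSON.parse(docString); // the only decode in the pipeline
  return doc._id;                  // ...map logic would run here
}
```

As Benoit notes later in the thread, the catch is that CouchDB still needs _id and _rev, and JSON gives no property-order guarantee, so extracting them without a full decode is not free.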
Re: Erlang vs JavaScript
On Fri, Aug 16, 2013 at 11:05 AM, Volker Mische volker.mis...@gmail.comwrote: On 08/15/2013 11:53 AM, Benoit Chesneau wrote: On Thu, Aug 15, 2013 at 11:38 AM, Jan Lehnardt j...@apache.org wrote: On Aug 15, 2013, at 10:09 , Robert Newson rnew...@apache.org wrote: A big +1 to Jason's clarification of erlang vs native. CouchDB could have shipped an erlang view server that worked in a separate process and had the stdio overhead, to combine the slowness of the protocol with the obtuseness of erlang. ;) Evaluating Javascript within the erlang VM process intrigues me, Jens, how is that done in your case? I've not previously found the assertion that V8 would be faster than SpiderMonkey for a view server compelling since the bottleneck is almost never in the code evaluation, but I do support CouchDB switching to it for the synergy effects of a closer binding with node.js, but if it's running in the same process, that would change (though I don't immediately see why the same couldn't be done for SpiderMonkey). Off the top of my head, I don't know a safe way to evaluate JS in the VM. A NIF-based approach would either be quite elaborate or would trip all the scheduling problems that long-running NIF's are now notorious for. At a step removed, the view server protocol itself seems like the thing to improve on, it feels like that's the principal bottleneck. The code is here: https://github.com/couchbase/couchdb/tree/master/src/mapreduce I’d love for someone to pick this up and give CouchDB, say, a ./configure --enable-native-v8 option or a plugin that allows people to opt into the speed improvements made there. :) The choice for V8 was made because of easier integration API and more reliable releases as a standalone project, which I think was a smart move. 
IIRC it relies on a change to CouchDB-y internals that has not made it back from Couchbase to CouchDB (Filipe will know, but I doubt he’s reading this thread), but we should look into that and get us “native JS views”, at least as an option or plugin. CCing dev@.

Jan

--

Well, at first glance NIFs look like a good idea, but they can be very problematic:

- when the view computation takes time it would block the full VM scheduling. It can be mitigated using a pool of threads to execute the work asynchronously, but that can then create other problems like memory leaking etc.
- NIFs can't be upgraded easily during a hot upgrade
- when a NIF crashes, the whole VM crashes.

(Note that we have the same problem when using a NIF to decode/encode JSON; it only works well with medium-sized documents.)

One other way to improve the JS handling would be removing the main bottleneck, i.e. the serialization/deserialization we do on each step. Not sure if it exists, but it seems feasible: why not pass Erlang terms from Erlang to JS and JS to Erlang? So at the end the deserialization would happen only on the JS side, i.e. instead of having:

get erlang term
encode to json
send to js
decode json
process
encode json
send json
decode json to erlang term
store

we should just have:

get erlang term
send over STDIO
decode erlang term to JS object
process
encode to erlang term
send erlang term
store

Erlang serialization is also very optimised.

I think the ultimate goal should be to do as little conversion/serialisation as possible, hence no conversion to Erlang terms at all:

Input as string
Parsing to get ID
Store as string
Send to JS as string
Process with JS
Store as string

Cheers, Volker

I agree, (modulo the fact that I would replace a string by a binary ;) but that would be only possible if we extract the metadata (_id, _rev) from the JSON so couchdb wouldn't have to decode the JSON to get them. Streaming JSON would also allow that, but since there is no guarantee of the property order in a JSON object, it would be less efficient.
- benoit
Re: Erlang vs JavaScript
On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau bchesn...@gmail.com wrote:

I agree, (modulo the fact that I would replace a string by a binary ;) but that would be only possible if we extract the metadata (_id, _rev) from the JSON so couchdb wouldn't have to decode the JSON to get them. Streaming JSON would also allow that, but since there is no guarantee of the property order in a JSON object, it would be less efficient.

What if we split document metadata from the document itself? E.g. pass _id, _rev and other system or meta fields as a separate object. Their size is much smaller than the whole document's, so it will be possible to quickly decode this metadata and decide whether the doc needs to be processed, without having to decode/encode megabytes of the document's JSON. Sure, this adds an additional communication roundtrip, but if it is faster than the JSON decode/encode - why not?

-- ,,,^..^,,,
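The two-phase exchange this implies might look like the following sketch. `metaGuard`, `indexDoc`, `loadDoc`, and `emitRow` are illustrative names only, not part of any real view server protocol; the point is that the full document is fetched and decoded only after the metadata passes the guard.

```javascript
// Fast accept / fast reject on the small metadata object alone.
function metaGuard(meta) {
  return meta.type === 'post';
}

// The doc body is requested (the extra roundtrip) only on acceptance.
function indexDoc(meta, loadDoc, emitRow) {
  if (!metaGuard(meta)) return; // reject: the large body is never sent
  var doc = loadDoc();          // accept: fetch and decode on demand
  emitRow(doc._id, doc.title);
}

var loads = 0;
var rows = [];
function loadDoc() {
  loads += 1;
  return { _id: 'd1', title: 'hi', body: '...megabytes...' };
}

indexDoc({ type: 'comment' }, loadDoc, function () {});                      // rejected
indexDoc({ type: 'post' }, loadDoc, function (k, v) { rows.push([k, v]); }); // accepted
// loads is 1: only the accepted document's body ever crossed the protocol.
```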
Re: Erlang vs JavaScript
On 08/16/2013 11:32 AM, Alexander Shorin wrote: On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau bchesn...@gmail.com wrote: I agree, (modulo the fact that I would replace a string by a binary ;) but that would be only possible if we extract the metadata (_id, _rev) from the JSON so couchdb wouldn't have to decode the JSON to get them. Streaming json would also allows that but since there is no guaranty in the properties order of a JSON it would be less efficient. What if we split document metadata from document itself? E.g. pass _id, _rev and other system or meta fields with separate object. Their size much lesser than whole document, so it will be possible to fast decode this metadata and decide is doc need to be processed or not without need to decode/encode megabytes of document's json. Sure, this adds additional communication roundtrip, but in case if it will be faster than json decode/encode - why not? That would be the ultimate-ultimate goal. Cheers, Volker
Re: Erlang vs JavaScript
On Fri, Aug 16, 2013 at 4:49 PM, Volker Mische volker.mis...@gmail.com wrote:

On 08/16/2013 11:32 AM, Alexander Shorin wrote:

On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau bchesn...@gmail.com wrote:

I agree, (modulo the fact that I would replace a string by a binary ;) but that would be only possible if we extract the metadata (_id, _rev) from the JSON so couchdb wouldn't have to decode the JSON to get them. Streaming JSON would also allow that, but since there is no guarantee of the property order in a JSON object, it would be less efficient.

What if we split document metadata from the document itself?

I would like to hear a goal for this effort? What is the definition of success and failure?

Jan makes a fine point on user@. I live with the pain. But really, life is pain. Deny it if you must. Until we are delivered--finally!--our sweet release, we will necessarily endure pain.

Facts:

* When you store a record, a machine must write that to storage
* If you have an index, a machine must update the index on storage

Building an index requires visiting every document. One way or another, the entire .couch file is coming off the disk and going through the wringer. One way or another, every row in the view will be written. I am not clear why optimizing from N ms/doc to 1/2 N ms/doc will help, when you still have to read 30GB from storage and write 30GB back.

On one end, the computer scientist says we cannot avoid the necessary time complexity. On the other end, the casual user says: if it is not instantaneous, then it hardly matters. That is, we have a problem of expectation management, not codec speed. Nobody expects MySQL's CREATE INDEX to finish in a flash, and nobody should expect that of a view.

If somebody does set out to accelerate views, you're welcome. But I would ask: what is a successful optimization, and why?

(Also, Noah, if you are out there, this is an example of the sort of thing I would put on the wiki but past bad experiences make me say can't be bothered.)
Re: Erlang vs JavaScript
On Fri, Aug 16, 2013 at 11:23 PM, Jason Smith j...@apache.org wrote: On Fri, Aug 16, 2013 at 4:49 PM, Volker Mische volker.mis...@gmail.com wrote: On 08/16/2013 11:32 AM, Alexander Shorin wrote: On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau bchesn...@gmail.com wrote: I agree, (modulo the fact that I would replace a string by a binary ;) but that would be only possible if we extract the metadata (_id, _rev) from the JSON so couchdb wouldn't have to decode the JSON to get them. Streaming json would also allows that but since there is no guaranty in the properties order of a JSON it would be less efficient. What if we split document metadata from document itself? I would like to hear a goal for this effort? What is the definition of success and failure? Idea: move document metadata into separate object. Motivation: Case 1: Small docs. No profit at all. More over, probably it's better to not split things there e.g. pass full doc if his size around some amount of megabytes. Case 2: Large docs. Profit in case when you have set right fields into metadata (like doc type, authorship, tags etc.) and filter first by this metadata - you have minimal memory footprint, you have less CPU load, rule fast accept - fast reject works perfectly. Side effect: it's possible to first filter by metadata and leave only required to process document ids. And if we known what and how many to process, we may make assumptions about parallel indexation. Side effect: it's possible to autoindex metadata on fly on document update without asking user to write (meta/by_type, meta/by_author, meta/by_update_time etc. viiews) . Sure, as much metadata you have as large base index will be. In 80% cases it will be no more than 4KB. 
Resume: probably I'd just described a chained views feature with autoindexing by certain fields (:

Drop the autoindexing feature, and we could make the view building process much faster if we build the right view chain, using set algebra operations to calculate the target doc ids to pass to the final view - reduce the docs before the map results:

  {
    "views": {
      "posts": {"map": "...", "reduce": "..."},
      "chain": [
        ["by_type", {"key": "post"}],
        ["hidden", {"key": false}],
        ["by_domain", {"keys": ["public", "wiki"]}]
      ]
    }
  }

In the case of a db with 1200 posts where 200 are hidden and 400 are private, the resulting posts view has to process only 600 docs instead of every doc in the db, and it's an index lookup operation to find the result docs to pass. Sure, calling such a view triggers all views in the chain. And I'm not thinking about cross dependencies and loops for now.

-- ,,,^..^,,,
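The chain above amounts to set algebra over doc ids: each link is an index lookup returning a set of ids, `key` restricts to one key, `keys` takes a union, and successive links intersect. The index contents below are made up purely to show the mechanics:

```javascript
// Sketch of the proposed chain as set operations on doc ids. The three
// "indexes" here are hypothetical in-memory stand-ins for view b-trees.
const byType = { post: new Set(['d1', 'd2', 'd3', 'd4']) };
const hidden = { false: new Set(['d1', 'd2', 'd4']) };
const byDomain = { public: new Set(['d1']), wiki: new Set(['d2', 'd3']) };

const intersect = (a, b) => new Set([...a].filter((x) => b.has(x)));
const union = (a, b) => new Set([...a, ...b]);

// chain: [["by_type", {key: "post"}], ["hidden", {key: false}],
//         ["by_domain", {keys: ["public", "wiki"]}]]
let ids = byType.post;
ids = intersect(ids, hidden.false);
ids = intersect(ids, union(byDomain.public, byDomain.wiki));
console.log([...ids]); // ['d1', 'd2']
```

Only the surviving ids are fetched and passed to the final posts view's map function; everything before that is index lookups.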
Re: Erlang vs JavaScript
On Aug 15, 2013, at 10:09, Robert Newson rnew...@apache.org wrote:

A big +1 to Jason's clarification of erlang vs native. CouchDB could have shipped an erlang view server that worked in a separate process and had the stdio overhead, to combine the slowness of the protocol with the obtuseness of erlang. ;)

Evaluating Javascript within the erlang VM process intrigues me. Jens, how is that done in your case? I've not previously found compelling the assertion that V8 would be faster than SpiderMonkey for a view server, since the bottleneck is almost never in the code evaluation (though I do support CouchDB switching to it for the synergy effects of a closer binding with node.js); but if it's running in the same process, that would change (though I don't immediately see why the same couldn't be done for SpiderMonkey).

Off the top of my head, I don't know a safe way to evaluate JS in the VM. A NIF-based approach would either be quite elaborate or would trip all the scheduling problems that long-running NIFs are now notorious for. At a step removed, the view server protocol itself seems like the thing to improve; it feels like that's the principal bottleneck.

The code is here: https://github.com/couchbase/couchdb/tree/master/src/mapreduce

I'd love for someone to pick this up and give CouchDB, say, a ./configure --enable-native-v8 option or a plugin that allows people to opt into the speed improvements made there. :) The choice of V8 was made because of its easier integration API and more reliable releases as a standalone project, which I think was a smart move. IIRC it relies on a change to CouchDB-y internals that has not made it back from Couchbase to CouchDB (Filipe will know, but I doubt he's reading this thread), but we should look into that and get us "native JS views", at least as an option or plugin. CCing dev@.

Jan
--
On 15 August 2013 08:22, Jason Smith j...@apache.org wrote:

Yes, to a first approximation, with a native view, CouchDB is basically running eval() on your code. In my example, I took advantage of this to build a nonstandard response to satisfy an application. (Instead of a 404, we sent a designated fallback document body.)

But if you accumulate the list in a native view, a JavaScript view, or a hypothetical Erlang view (i.e. a subprocess), from the operating system's perspective, the memory for that list will be allocated somewhere. Either the CouchDB process asks for X KB more memory, or its subprocess will ask for it. So I think the total system impact is probably low in practice.

So I guess my point is not that native views are wrong, just that they have a cost, so you should weigh the cost/benefit for your own project. In the case of manage_couchdb, I wrote a JavaScript implementation; but since sometimes I have an emergency and I must find conflicts ASAP, I made an Erlang version because it is worth it.

On Thu, Aug 15, 2013 at 2:05 PM, Stanley Iriele siriele...@gmail.com wrote:

Whoa... OK... that I had no idea about... thanks for taking the time to go to that granularity, by the way. So does this mean that the process memory is shared, as opposed to living in its own space? So if someone accumulates a large JSON object in a list function, it's chewing up CouchDB's memory? I guess I'm a little confused about what's in the same process and what isn't now.

On Aug 14, 2013 11:57 PM, Jason Smith j...@apache.org wrote:

To me, an Erlang view is a view server which supports map, reduce, show, update, list, etc. functions in the Erlang language. (Basically, it is implemented in Erlang.) A view server is a subprocess that runs beneath CouchDB and communicates with it over standard i/o. It is a different process in the operating system and only interfaces with the main server using the view server protocol (basically a bunch of JSON messages going back and forth).
I do not know of an Erlang view server which works well and is currently maintained.

A native view (shipped with CouchDB but disabled by default) is some corner-cutting. The code is evaluated directly by the primary CouchDB server. Since CouchDB is Erlang, the native query server is necessarily Erlang. The key difference is, your code is right there in the eye of the storm. You can call couch_server:open(some_db) and completely circumvent security and other invariants which CouchDB enforces. You can leak memory until the kernel OOM killer terminates CouchDB. It's not about the language; it's that it is running inside the CouchDB process.

On Thu, Aug 15, 2013 at 1:36 PM, Stanley Iriele siriele...@gmail.com wrote:

Wait... I'm a tad confused here. Jason, what is the difference between native views and Erlang views?

On Aug 14, 2013 11:16 PM, Jason Smith j...@apache.org wrote:

Oh, also: They are **not** Erlang views. They are **native** views. We should emphasize the latter to remind ourselves about the security and
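The view server protocol mentioned throughout this thread is, at heart, request/response pairs of newline-delimited JSON over stdio. The toy in-process model below shows the rough message shapes only; it is not the real couchjs code, and the compile step is simplified to a function body rather than full function source:

```javascript
// Toy stand-in for a view server loop: each handle() call models one
// line of JSON in and one line of JSON out over stdio.
const funs = [];
let rows;
const emit = (key, value) => rows.push([key, value]);

function handle(line) {
  const [cmd, arg] = JSON.parse(line);
  if (cmd === 'add_fun') {       // compile and remember a map function
    funs.push(new Function('doc', 'emit', arg));
    return JSON.stringify(true);
  }
  if (cmd === 'map_doc') {       // run every registered map fn over one doc
    const out = funs.map((f) => { rows = []; f(arg, emit); return rows; });
    return JSON.stringify(out);
  }
}

handle(JSON.stringify(['add_fun', 'if (doc.type === "post") emit(doc._id, 1);']));
const reply = handle(JSON.stringify(['map_doc', { _id: 'd1', type: 'post' }]));
console.log(reply); // [[["d1",1]]]
```

Every document makes this round trip during a view build, which is why both Robert and Benoit point at the protocol's serialization, rather than JS evaluation speed, as the principal bottleneck.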
Re: Erlang vs JavaScript
On Thu, Aug 15, 2013 at 11:38 AM, Jan Lehnardt j...@apache.org wrote:

[...]

Well, on the one hand NIFs look like a good idea, but they can be very problematic:

- when the view computation takes time, it blocks scheduling for the whole VM. This can be mitigated by using a pool of threads to execute the work asynchronously, but that can create other problems, like memory leaks etc.
- NIFs can't be upgraded easily during a hot upgrade
- when a NIF crashes, the whole VM crashes

(Note that we have the same problem when using a NIF to decode/encode JSON; it only works well with medium-sized documents.)

One other way to improve the JS handling would be to remove the main bottleneck, i.e. the serialization/deserialization we do at each step. Not sure if such a thing exists, but it seems feasible: why not pass Erlang terms from Erlang to JS and from JS to Erlang? Then the deserialization would happen only on the JS side, i.e. instead of having:

  get erlang term -> encode to json -> send to js -> decode json -> process -> encode json -> send json -> decode json to erlang term -> store

we would just have:

  get erlang term -> send over STDIO -> decode erlang term to JS object -> process -> encode to erlang term -> send erlang term -> store

Erlang serialization is also very optimised. Both solutions could co-exist; it may be worth a try to benchmark each...

- benoit
Re: Erlang vs JavaScript
On Aug 15, 2013, at 11:53, Benoit Chesneau bchesn...@gmail.com wrote:

[...]

Well, on the one hand NIFs look like a good idea, but they can be very problematic: [...]

Yeah, totally, hence making the whole thing an option.

One other way to improve the JS handling would be to remove the main bottleneck, i.e. the serialization/deserialization we do at each step. [...] Erlang serialization is also very optimised. Both solutions could co-exist; it may be worth a try to benchmark each...

I think we just want both solutions, period. The embedded one will still be faster but potentially a little less stable, and the external view server one will be slower but extremely robust. Users should be able to choose between them :)

Best
Jan
--