Re: PostgreSQL VS MongoDB: a use case comparison
On Wed, Nov 21, 2018 at 9:48 AM Thomas Kellerer wrote:
> Stephen Frost schrieb am 20.11.2018 um 18:28:
> > Oh yes, having a dictionary would be a great start to reducing the size of the jsonb data, though it could then become a contention point if there's a lot of new values being inserted and such. Naturally there would also be a cost to pulling that data back out as well but likely it would be well worth the benefit of not having to store the field names repeatedly.
>
> There is an extension for dictionary-based JSONB compression:
>
> https://github.com/postgrespro/zson

That was a 'toy' experiment. We did several experiments on jsonb compression and presented the results, for example:
http://www.sai.msu.su/~megera/postgres/talks/jsonb-pgconf.us-2017.pdf

Also, check this thread on custom compression:
https://www.postgresql.org/message-id/flat/CAF4Au4xop7FqhCKgabYWymUS0yUk9i%3DbonPnmVUBbpoKsFYnLA%40mail.gmail.com#d01913d3b939b472ea5b38912bf3cbe4

There is now a YCSB-JSON benchmark available, and it would be worth running it against Postgres:
https://dzone.com/articles/ycsb-json-implementation-for-couchbase-and-mongodb
We are pretty busy, so you may contribute.

For better indexing we are working on parameters for opclasses, and I would really like to have that in PG12:
http://www.sai.msu.su/~megera/postgres/talks/opclass_pgcon-2018.pdf

My recommendation for performance testing: always run concurrent queries, and in most cases the distribution of queries should be zipfian (we added support for this to pgbench in PG11).

--
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
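To illustrate that last recommendation, a minimal pgbench custom script along these lines drives a zipfian key distribution under concurrency. This is only a sketch: the table name, key range, and skew parameter are placeholders, not anything from the benchmark discussed in this thread.

    -- zipf.sql: look up documents by a zipfian-distributed key (random_zipfian is available in PG11+ pgbench)
    \set id random_zipfian(1, 1000000, 1.07)
    SELECT doc FROM test_docs WHERE id = :id;

    -- run with concurrent clients, e.g.:
    --   pgbench -n -c 16 -j 4 -T 300 -f zipf.sql mydb

With a zipfian distribution the hot keys are hit far more often than the cold ones, which is usually closer to real workloads than pgbench's default uniform distribution.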
Re: PostgreSQL VS MongoDB: a use case comparison
Stephen Frost schrieb am 20.11.2018 um 18:28:
> Oh yes, having a dictionary would be a great start to reducing the size of the jsonb data, though it could then become a contention point if there's a lot of new values being inserted and such. Naturally there would also be a cost to pulling that data back out as well but likely it would be well worth the benefit of not having to store the field names repeatedly.

There is an extension for dictionary-based JSONB compression:

https://github.com/postgrespro/zson
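For anyone curious what using it looks like: roughly, zson learns a dictionary of common strings from existing documents and then stores documents in a zson column instead of jsonb. The sketch below is from memory of the project's README, with made-up table and column names; the exact zson_learn argument format should be verified against the version you install.

    CREATE EXTENSION zson;

    -- Build the shared dictionary from documents already stored in mytable.mycolumn
    -- (argument format as I recall it from the zson README; please verify)
    SELECT zson_learn('{{"mytable", "mycolumn"}}');

    -- Store documents as zson; casts to and from jsonb are provided by the extension
    ALTER TABLE mytable ALTER COLUMN mycolumn TYPE zson USING mycolumn::zson;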
Re: PostgreSQL VS MongoDB: a use case comparison
Greetings,

* Merlin Moncure (mmonc...@gmail.com) wrote:
> On Tue, Nov 20, 2018 at 11:28 AM Stephen Frost wrote:
> > Oh yes, having a dictionary would be a great start to reducing the size of the jsonb data, though it could then become a contention point if there's a lot of new values being inserted and such. Naturally there would also be a cost to pulling that data back out as well but likely it would be well worth the benefit of not having to store the field names repeatedly.
>
> Yes, the biggest concern with a shared dictionary ought to be concurrency type problems.

Hmmm, I wonder if we could do something like have a dictionary per page... or perhaps one based on some hash of the TOAST ID. Not sure. :)

Thanks!

Stephen
Re: PostgreSQL VS MongoDB: a use case comparison
On Tue, Nov 20, 2018 at 11:28 AM Stephen Frost wrote:
>
> Greetings,
>
> * Merlin Moncure (mmonc...@gmail.com) wrote:
> > On Tue, Nov 20, 2018 at 10:43 AM Stephen Frost wrote:
> > > * Merlin Moncure (mmonc...@gmail.com) wrote:
> > > > On Mon, Nov 19, 2018 at 11:26 AM Stephen Frost wrote:
> > > > > Looks like a lot of the difference being seen and the comments made about one being faster than the other are because one system is compressing *everything*, while PG (quite intentionally...) only compresses the data sometimes- once it hits the TOAST limit. That likely also contributes to why you're seeing the on-disk size differences that you are.
> > > >
> > > > Hm. It may be intentional, but is it ideal? Employing datum compression in the 1kb-8kb range with a faster but less compressing algorithm could give benefits.
> > >
> > > Well, pglz is actually pretty fast and not as good at compression as other things. I could certainly see an argument for allowing a column to always be (or at least attempted to be) compressed.
> > >
> > > There's been a lot of discussion around supporting alternative compression algorithms but making that happen is a pretty big task.
> >
> > Yeah; pglz is closer to zlib. There's much faster stuff out there... Andres summed it up pretty well:
> > https://www.postgresql.org/message-id/20130605150144.GD28067%40alap2.anarazel.de
> >
> > There are also some interesting discussions on jsonb-specific approaches.
>
> Oh yes, having a dictionary would be a great start to reducing the size of the jsonb data, though it could then become a contention point if there's a lot of new values being inserted and such. Naturally there would also be a cost to pulling that data back out as well but likely it would be well worth the benefit of not having to store the field names repeatedly.

Yes, the biggest concern with a shared dictionary ought to be concurrency type problems.

merlin
Re: PostgreSQL VS MongoDB: a use case comparison
Greetings,

* Merlin Moncure (mmonc...@gmail.com) wrote:
> On Tue, Nov 20, 2018 at 10:43 AM Stephen Frost wrote:
> > * Merlin Moncure (mmonc...@gmail.com) wrote:
> > > On Mon, Nov 19, 2018 at 11:26 AM Stephen Frost wrote:
> > > > Looks like a lot of the difference being seen and the comments made about one being faster than the other are because one system is compressing *everything*, while PG (quite intentionally...) only compresses the data sometimes- once it hits the TOAST limit. That likely also contributes to why you're seeing the on-disk size differences that you are.
> > >
> > > Hm. It may be intentional, but is it ideal? Employing datum compression in the 1kb-8kb range with a faster but less compressing algorithm could give benefits.
> >
> > Well, pglz is actually pretty fast and not as good at compression as other things. I could certainly see an argument for allowing a column to always be (or at least attempted to be) compressed.
> >
> > There's been a lot of discussion around supporting alternative compression algorithms but making that happen is a pretty big task.
>
> Yeah; pglz is closer to zlib. There's much faster stuff out there... Andres summed it up pretty well:
> https://www.postgresql.org/message-id/20130605150144.GD28067%40alap2.anarazel.de
>
> There are also some interesting discussions on jsonb-specific approaches.

Oh yes, having a dictionary would be a great start to reducing the size of the jsonb data, though it could then become a contention point if there's a lot of new values being inserted and such. Naturally there would also be a cost to pulling that data back out as well but likely it would be well worth the benefit of not having to store the field names repeatedly.

Then again, taken far enough, what you end up with are tables... :)

Thanks!

Stephen
Re: PostgreSQL VS MongoDB: a use case comparison
On Tue, Nov 20, 2018 at 10:43 AM Stephen Frost wrote:
>
> Greetings,
>
> * Merlin Moncure (mmonc...@gmail.com) wrote:
> > On Mon, Nov 19, 2018 at 11:26 AM Stephen Frost wrote:
> > > Looks like a lot of the difference being seen and the comments made about one being faster than the other are because one system is compressing *everything*, while PG (quite intentionally...) only compresses the data sometimes- once it hits the TOAST limit. That likely also contributes to why you're seeing the on-disk size differences that you are.
> >
> > Hm. It may be intentional, but is it ideal? Employing datum compression in the 1kb-8kb range with a faster but less compressing algorithm could give benefits.
>
> Well, pglz is actually pretty fast and not as good at compression as other things. I could certainly see an argument for allowing a column to always be (or at least attempted to be) compressed.
>
> There's been a lot of discussion around supporting alternative compression algorithms but making that happen is a pretty big task.

Yeah; pglz is closer to zlib. There's much faster stuff out there... Andres summed it up pretty well:
https://www.postgresql.org/message-id/20130605150144.GD28067%40alap2.anarazel.de

There are also some interesting discussions on jsonb-specific approaches.

merlin
Re: PostgreSQL VS MongoDB: a use case comparison
Greetings,

* Merlin Moncure (mmonc...@gmail.com) wrote:
> On Mon, Nov 19, 2018 at 11:26 AM Stephen Frost wrote:
> > Looks like a lot of the difference being seen and the comments made about one being faster than the other are because one system is compressing *everything*, while PG (quite intentionally...) only compresses the data sometimes- once it hits the TOAST limit. That likely also contributes to why you're seeing the on-disk size differences that you are.
>
> Hm. It may be intentional, but is it ideal? Employing datum compression in the 1kb-8kb range with a faster but less compressing algorithm could give benefits.

Well, pglz is actually pretty fast and not as good at compression as other things. I could certainly see an argument for allowing a column to always be (or at least attempted to be) compressed.

There's been a lot of discussion around supporting alternative compression algorithms but making that happen is a pretty big task.

Thanks!

Stephen
Re: PostgreSQL VS MongoDB: a use case comparison
On Mon, Nov 19, 2018 at 11:26 AM Stephen Frost wrote:
> Looks like a lot of the difference being seen and the comments made about one being faster than the other are because one system is compressing *everything*, while PG (quite intentionally...) only compresses the data sometimes- once it hits the TOAST limit. That likely also contributes to why you're seeing the on-disk size differences that you are.

Hm. It may be intentional, but is it ideal? Employing datum compression in the 1kb-8kb range with a faster but less compressing algorithm could give benefits.

merlin
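For reference, PG11 does expose one knob in this direction already: the toast_tuple_target storage parameter lowers the tuple size at which the TOAST code starts trying to compress and/or move columns out of line (still with pglz, so it does not address the "faster algorithm" part). A sketch, with a made-up table and column name:

    -- Try to compress large-ish jsonb values earlier than the default ~2kB threshold
    ALTER TABLE events SET (toast_tuple_target = 256);

    -- Prefer keeping the column inline in the main table, compressed, rather than out of line
    ALTER TABLE events ALTER COLUMN doc SET STORAGE MAIN;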
Re: PostgreSQL VS MongoDB: a use case comparison
Hi again,

On 11/20/18 2:34 PM, Stephen Frost wrote:
>> I agree with you that compression is playing a role in the comparison. Probably there is a toll to pay when the load is high and the CPU is stressed from de/compressing data. If we are able to take our studies further, this is definitely something we would like to measure.
>
> I was actually thinking of the compression as having more of an impact with regard to the 'cold' cases because you're pulling fewer blocks when it's compressed. The decompression cost on CPU is typically much, much less than the cost to pull the data off of the storage medium. When things are 'hot' and in cache then it might be interesting to question if the compression/decompression is worth the cost.

True. When data is present in RAM, Postgres is then faster, because, as you say, the compression no longer gives a benefit in retrieving data from disk. In my statement above about the CPU I was speculating whether the speed Mongo gains thanks to block compression would act as a double-edged sword under warm-cache scenarios and heavy load.

>> I also agree with you that at the moment Postgres really shines on relational data. To be honest, after seeing the outcome of our research, we are actually considering decoupling some (or all) fields from their JSON structure. There will be a toll to be paid there too, since we are receiving data in JSON format.
>
> PostgreSQL has tools to help with this, you might look into 'json_to_record' and friends.

It might turn out useful to us if we normalize our data, thanks.

>> Anyway, bringing data from JSON to a relational model is off topic for the current discussion, since we are actually questioning whether Postgres is a good replacement for Mongo when handling JSON data.
>
> This narrow viewpoint isn't really sensible though- what you should be thinking about is what's appropriate for your *data*. JSON is just a data format, and while it's alright as a system inter-exchange format, it's rather terrible as a storage format.

I did not want to narrow the viewpoint; I'm exploring possibilities. Since Postgres supports JSON, it would have been nice to know how far one can go in storing data without transforming it. When we started our research the only question was: is it possible to replace Postgres with Mongo 1 to 1? All other considerations came after and, as a matter of fact, as already mentioned, we are actually considering to (maybe partially) transform the data to a relational model. Maybe we did not look around enough, but we did not find on the internet all the answers to our questions, therefore we initiated something ourselves.

>> As for sharing the dataset, as mentioned in the post we are handling medical data. Even though the content is anonymized, we are not keen to share the data structure either, for security reasons.
>
> If you really want people to take your analysis seriously, others must be able to reproduce your results. I certainly appreciate that there are very good reasons that you can't share this actual data, but your testing could be done with completely generated data which happens to be similar in structure to your data and have similar frequency of values.
>
> The way to approach generating such a data set would be to aggregate up the actual data to a point where the appropriate committee/board agree that it can be shared publicly, and then you build a randomly generated set of data which aggregates to the same result and then use that for testing.

Looking backward, I would probably generate data that is sharable with everybody, to give people the opportunity to play with it and get them more involved. The fact is that we started very small and ended up with quite a bunch of information we felt like sharing. Time is a tyrant and at the moment we cannot re-run everything with sharable data, so we all have to live with it. It is not optimal, and not perfectly academic, but it is still better than not sharing at all, in my opinion.

One good thing is that while testing and learning I found a similar investigation which led to similar results (unfortunately, one can argue that it too does not share the dataset, the scripts, and all the rest). In the jsquery section of the blog post there is a link pointing to:

https://github.com/postgrespro/jsquery/blob/master/README.md

which in turn points to:

http://www.sai.msu.su/~megera/postgres/talks/pgconfeu-2014-jsquery.pdf

On page 18 there are some results which are close to what we obtained, even though the paper is from 2014 and a lot has changed in the landscape since. This is to say that I suspect that if we generate random JSON data, we will probably draw the same conclusions.

>> That's a pity, I know, but I cannot do anything about it. The queries we ran and the commands we used are mentioned in the blog post, but if you see gaps, feel free to ask.
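For what it's worth, a completely synthetic data set with a vaguely similar shape is easy to generate in plain SQL. A sketch follows; all field names and value distributions here are invented for illustration and are not Portavita's actual document structure.

    CREATE TABLE docs (id bigserial PRIMARY KEY, doc jsonb);

    -- One million random documents with a fixed structure but varying values
    INSERT INTO docs (doc)
    SELECT jsonb_build_object(
             'patientId',    md5(g::text),
             'observation',  jsonb_build_object(
                 'code',      (ARRAY['8480-6','8462-4','8867-4'])[1 + g % 3],
                 'value',     round((random() * 180)::numeric, 1),
                 'effective', (timestamp '2015-01-01' + random() * interval '3 years')::text),
             'MarriageDate', (date '1950-01-01' + (random() * 20000)::int)::text)
    FROM generate_series(1, 1000000) AS g;

Indexing and querying such a generated set should at least show the same broad patterns (GIN versus btree-on-expression, TOAST compression kicking in on larger documents), even if the absolute numbers differ from the real data.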
Re: PostgreSQL VS MongoDB: a use case comparison
Greetings,

* Fabio Pardi (f.pa...@portavita.eu) wrote:
> thanks for your feedback.

We prefer on these mailing lists to not top-post but instead to reply inline, as I'm doing here. This helps the conversation by eliminating unnecessary dialogue and making it clear which specific points are being commented on.

> I agree with you that compression is playing a role in the comparison. Probably there is a toll to pay when the load is high and the CPU is stressed from de/compressing data. If we are able to take our studies further, this is definitely something we would like to measure.

I was actually thinking of the compression as having more of an impact with regard to the 'cold' cases because you're pulling fewer blocks when it's compressed. The decompression cost on CPU is typically much, much less than the cost to pull the data off of the storage medium. When things are 'hot' and in cache then it might be interesting to question if the compression/decompression is worth the cost.

> I also agree with you that at the moment Postgres really shines on relational data. To be honest, after seeing the outcome of our research, we are actually considering decoupling some (or all) fields from their JSON structure. There will be a toll to be paid there too, since we are receiving data in JSON format.

PostgreSQL has tools to help with this, you might look into 'json_to_record' and friends.

> And the toll will be in time spent to deliver such a solution, and indeed time spent by the engine in doing the conversion. It might not be that convenient after all.

Oh, the kind of reduction you'd see in space from both an on-disk and in-memory footprint would almost certainly be worth the tiny amount of CPU overhead from this.

> Anyway, bringing data from JSON to a relational model is off topic for the current discussion, since we are actually questioning whether Postgres is a good replacement for Mongo when handling JSON data.

This narrow viewpoint isn't really sensible though- what you should be thinking about is what's appropriate for your *data*. JSON is just a data format, and while it's alright as a system inter-exchange format, it's rather terrible as a storage format.

> As for sharing the dataset, as mentioned in the post we are handling medical data. Even though the content is anonymized, we are not keen to share the data structure either, for security reasons.

If you really want people to take your analysis seriously, others must be able to reproduce your results. I certainly appreciate that there are very good reasons that you can't share this actual data, but your testing could be done with completely generated data which happens to be similar in structure to your data and have similar frequency of values.

The way to approach generating such a data set would be to aggregate up the actual data to a point where the appropriate committee/board agree that it can be shared publicly, and then you build a randomly generated set of data which aggregates to the same result and then use that for testing.

> That's a pity, I know, but I cannot do anything about it. The queries we ran and the commands we used are mentioned in the blog post, but if you see gaps, feel free to ask.

There were a lot of gaps that I saw when I looked through the article- starting with things like the actual CREATE TABLE command you used and the complete size/structure of the JSON object, but really what a paper like this should include is a full script which creates all the tables, loads all the data, runs the analysis, calculates the results, etc.

Thanks!

Stephen
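As a concrete illustration of the json_to_record suggestion above: jsonb_to_record (one of the 'friends') turns a document into a typed row whose columns you declare in the call. The table, column, and key names below are invented for the example and are not the poster's actual schema.

    -- Flatten selected fields out of each jsonb document into typed columns
    SELECT d.id, r."patientId", r."MarriageDate"
    FROM docs AS d
    CROSS JOIN LATERAL jsonb_to_record(d.doc)
           AS r("patientId" text, "MarriageDate" date);

From there it is a short step to INSERT ... SELECT-ing those typed columns into a properly relational table.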
Re: PostgreSQL VS MongoDB: a use case comparison
Hi Stephen,

thanks for your feedback.

I agree with you that compression is playing a role in the comparison. Probably there is a toll to pay when the load is high and the CPU is stressed from de/compressing data. If we are able to take our studies further, this is definitely something we would like to measure.

I also agree with you that at the moment Postgres really shines on relational data. To be honest, after seeing the outcome of our research, we are actually considering decoupling some (or all) fields from their JSON structure. There will be a toll to be paid there too, since we are receiving data in JSON format. And the toll will be in time spent to deliver such a solution, and indeed time spent by the engine in doing the conversion. It might not be that convenient after all.

Anyway, bringing data from JSON to a relational model is off topic for the current discussion, since we are actually questioning whether Postgres is a good replacement for Mongo when handling JSON data.

As for sharing the dataset, as mentioned in the post we are handling medical data. Even though the content is anonymized, we are not keen to share the data structure either, for security reasons. That's a pity, I know, but I cannot do anything about it. The queries we ran and the commands we used are mentioned in the blog post, but if you see gaps, feel free to ask.

regards,

fabio pardi

On 11/19/18 6:26 PM, Stephen Frost wrote:
> Greetings,
>
> * Fabio Pardi (f.pa...@portavita.eu) wrote:
>> We are open to any kind of feedback and we hope you enjoy the reading.
>
> Looks like a lot of the difference being seen and the comments made about one being faster than the other are because one system is compressing *everything*, while PG (quite intentionally...) only compresses the data sometimes- once it hits the TOAST limit. That likely also contributes to why you're seeing the on-disk size differences that you are.
>
> Of course, if you want to see where PG will really shine, you'd stop thinking of data as just blobs of JSON and actually define individual fields in PG instead of just one 'jsonb' column, especially when you know that field will always exist (which is obviously the case if you're building an index on it, such as your MarriageDate) and then remove those fields from the jsonb and just reconstruct the JSON when you query. Doing that you'll get the size down dramatically.
>
> And that's without even going to that next-level stuff of actual normalization where you pull out duplicate data from across the JSON and have just one instance of that data in another, smaller, table and use a JOIN to bring it all back together. Even better is when you realize that then you only have to update one row in this other table when something changes in that subset of data, unlike when you repeatedly store that data in individual JSON entries all across the system and such a change requires rewriting every single JSON object in the entire system...
>
> Lastly, as with any performance benchmark, please include full details- all scripts used, all commands run, all data used, so that others can reproduce your results. I'm sure it'd be fun to take your json data and create actual tables out of it and see what it'd be like then.
>
> Thanks!
>
> Stephen
Re: PostgreSQL VS MongoDB: a use case comparison
Greetings,

* Fabio Pardi (f.pa...@portavita.eu) wrote:
> We are open to any kind of feedback and we hope you enjoy the reading.

Looks like a lot of the difference being seen and the comments made about one being faster than the other are because one system is compressing *everything*, while PG (quite intentionally...) only compresses the data sometimes- once it hits the TOAST limit. That likely also contributes to why you're seeing the on-disk size differences that you are.

Of course, if you want to see where PG will really shine, you'd stop thinking of data as just blobs of JSON and actually define individual fields in PG instead of just one 'jsonb' column, especially when you know that field will always exist (which is obviously the case if you're building an index on it, such as your MarriageDate) and then remove those fields from the jsonb and just reconstruct the JSON when you query. Doing that you'll get the size down dramatically.

And that's without even going to that next-level stuff of actual normalization where you pull out duplicate data from across the JSON and have just one instance of that data in another, smaller, table and use a JOIN to bring it all back together. Even better is when you realize that then you only have to update one row in this other table when something changes in that subset of data, unlike when you repeatedly store that data in individual JSON entries all across the system and such a change requires rewriting every single JSON object in the entire system...

Lastly, as with any performance benchmark, please include full details- all scripts used, all commands run, all data used, so that others can reproduce your results. I'm sure it'd be fun to take your json data and create actual tables out of it and see what it'd be like then.

Thanks!

Stephen
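To make that suggestion concrete, here is a minimal sketch of the kind of layout being described (table, column, and key names are invented for illustration): the field you index lives in a real column, everything else stays in jsonb, and the original document is reassembled on the way out.

    CREATE TABLE person (
        id            bigint PRIMARY KEY,
        marriage_date date,    -- extracted from the document, plain btree-indexable
        doc           jsonb    -- the rest of the document, with MarriageDate removed
    );

    CREATE INDEX person_marriage_date_idx ON person (marriage_date);

    -- Reconstruct the full JSON document at query time
    SELECT doc || jsonb_build_object('MarriageDate', marriage_date) AS full_doc
    FROM person
    WHERE marriage_date BETWEEN '1970-01-01' AND '1979-12-31';

The range predicate then uses an ordinary btree index instead of an expression or GIN index over the jsonb, and the repeated field name is no longer stored in every document.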