I wrote some passably reasonable code to persist JSON to disk (using its hash as a key) as per an off-list suggestion.
You can see it here:
https://github.com/assimilation/assimilation-official/blob/rel_2_dev/cma/invariant_data.py

The base code is very straightforward. It has lots of comments, and a
little extra sugar to make it more Pythonic - which more than doubled the
number of lines. Fortunately, it didn't make it much more complex - just
more bulky.

The code to write the data is shown below:

    try:
        os_openmode = os.O_EXCL+os.O_CREAT+os.O_WRONLY
        with os.fdopen(os.open(pathname, os_openmode, self.filemode), 'w') as file_obj:
            file_obj.write(value)
            file_obj.flush()
            if not self.delayed_sync:
                os.fsync(file_obj.fileno())  # Make sure the bits get saved for real...
    except OSError as oopsie:
        if oopsie.errno == errno.EEXIST:  # Already exists
            pass
        elif oopsie.errno == errno.ENOENT:  # Directory is missing...
            self._create_missing_directories(os.path.join(self.root_directory,
                                                          key[:self.hash_chars]))
            # Try again (recursively)...
            self.put(value, key)
        else:
            raise oopsie
    return key

If delayed_sync is enabled, then you should call the object.sync() to do
an "end transaction" type thing. It does a syncfs() on the filesystem.
It's a bit faster to do a syncfs() on the filesystem once than several
separate fsync() calls.

We ignore EEXIST because if the data exists, it's the same data - so we
don't need to write it again.

The search code isn't right yet. Searching will be slower than Postgres
for most cases, but faster than the current code.

-- 
Alan Robertson
al...@unix.sh

On Mon, Feb 12, 2018, at 7:12 PM, Alan Robertson wrote:
> Two less heavyweight ways to store the JSON data from some off-list
> discussions:
>
> 1) Just store each JSON value in a flat file.
> Since the "keys" are already hash values, one could imagine a
> directory structure that looks a bit like this:
>
>     <top-level>
>         <discovery-type>
>             <first-3-hex characters-of-key>   # That's 16^3 (4096) possible subdirectories
>                 <full-key-filename>
>                     [JSON-string-in-full-key-filename]
>
>    This supports up to 16M different values for each discovery type
>    while not putting more than 4096 entries in any directory, and
>    only two layers of directories.
>    Since they're hash keys, they should spread across the
>    subdirectories nicely.
>    It will generate a _lot_ of inodes for big sites.
>    Since the contents are invariant (which makes them idempotent),
>    a simple fsync is likely enough to ensure data integrity and
>    substitute for transactions.
>    I wrote a bare-bones version of this code last night... It's as
>    simple as it sounds...
>
> 2) Store them as S3 objects (if you're using AWS). Slightly concerned
>    about the transaction-like semantics but I suspect AWS/S3 is
>    pretty reliable...
>
> So, that's a total of 4 possible methods of storing the data...
>    (a) in Neo4j, (b) in Postgres, (c) in flat files, (d) in S3.
> Of those four, only Postgres will provide indexing.
> If on the other hand, we break out big JSON blobs into individual nodes
> in Neo4j, we can search that instead (as Michael Hunger suggested).
>
> Even if we break it out into separate nodes in Neo4j for searching, we
> still need to keep the original JSON around.
>
> The flat file code I wrote is a subclass of a class that's intended to
> hold 3 or 4 of these alternative implementations.
>
> This way if we change our mind, it will be comparatively straightforward
> to convert from one to another. Given the incredibly simple semantics of
> the data, even doing the conversion on a live system wouldn't be that
> hard...
>
> Right now, I'm leaning towards flat files - understanding that it has
> costs and potential complexities (particularly the number of inodes).
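
The flat-file layout described above can be sketched roughly as follows. This is a minimal illustration, not the code in invariant_data.py: it assumes Python 3 and SHA-256 as the hash (the thread does not say which hash is used), and the names key_to_path and put_json are hypothetical.

```python
import errno
import hashlib
import os


def key_to_path(root, discovery_type, key, hash_chars=3):
    """Map a hex key to <root>/<discovery_type>/<first hash_chars of key>/<key>."""
    return os.path.join(root, discovery_type, key[:hash_chars], key)


def put_json(root, discovery_type, json_text):
    """Store json_text in a file named after its own hash and return the key.

    If a file for that key already exists, it holds the same data,
    so it is left untouched.
    """
    # Assumption: SHA-256 hex digest as the key; any stable hash would do.
    key = hashlib.sha256(json_text.encode("utf-8")).hexdigest()
    path = key_to_path(root, discovery_type, key)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    try:
        # O_EXCL makes the create fail with EEXIST if the file is already there.
        fd = os.open(path, os.O_EXCL | os.O_CREAT | os.O_WRONLY, 0o644)
    except OSError as err:
        if err.errno == errno.EEXIST:
            return key
        raise
    with os.fdopen(fd, "w") as file_obj:
        file_obj.write(json_text)
        file_obj.flush()
        os.fsync(file_obj.fileno())  # push the data to stable storage
    return key
```

With three hex characters per subdirectory, a key such as "abcdef..." ends up at <root>/<discovery-type>/abc/abcdef..., and storing the same JSON text twice returns the same key without rewriting the file.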
>
> -- 
> Alan Robertson
> al...@unix.sh
>
> On Fri, Feb 9, 2018, at 2:27 PM, Welch, Bill wrote:
> > Just intuition here, but embedding postgres inside Assimilation seems
> > heavy for just JSON. Of course, you already embed neo4j so you know how
> > to handle having a dependency on a large subsystem.
> >
> > Then, there's mongodb vs postgres vs ...:
> > https://www.sisense.com/blog/postgres-vs-mongodb-for-storing-json-data/
> > https://www.quora.com/Which-is-better-storing-json-objects-in-json-files-in-Redis-or-MongoDB-RethinkDB
> >
> > On 9/2/18, 10:05, "Assimilation on behalf of Alan Robertson"
> > <assimilation-boun...@lists.community.tummy.com on behalf of
> > al...@unix.sh> wrote:
> >
> > EXTERNAL EMAIL – Use caution with any links or file attachments.
> >
> > This data is generated by a variety of shell scripts that do
> > discovery - potentially dozens of them - and each is different. Some of
> > the most critical data is decomposed to attributes - but not most of it.
> >
> > --
> > Alan Robertson
> > al...@unix.sh
> >
> > On Fri, Feb 9, 2018, at 2:58 AM, Michael Hunger wrote:
> > > I think this is ok.
> > > I wish we had full document support already.
> > >
> > > I know that pg has really good jsonb support, so go for it.
> > >
> > > Did you ever try to destructure the data into properties? Not sure
> > > how deeply nested it is. And leave off all that are just defaults.
> > >
> > > Sent from my iPhone
> > >
> > > > On 09.02.2018 at 04:10, Alan Robertson <al...@unix.sh> wrote:
> > > >
> > > > Hi,
> > > >
> > > > There is one set of data that when I insert it into Neo4j - it's
> > > > really, really slow. It's discovery data - which is JSON - and sometimes
> > > > very large - a few megabytes. Many of them are smallish, but having
> > > > items a few kilobytes is common, and dozens of kilobytes is also common,
> > > > and some few things are in the megabyte+ range. [Because of compression,
> > > > I can send up to 3 megabytes of this JSON over UDP].
> > > >
> > > > There are a few things I can do with Neo4j to make inserting it
> > > > faster, but I don't think a lot -- and when I get done, the data is very
> > > > hard to query against (it involves regexes against unindexed data, and
> > > > is a performance nightmare).
> > > >
> > > > Postgres has JSON support, and it has real transactions and a
> > > > reputation for being very solid. I did some benchmarking and it is a
> > > > couple of orders of magnitude faster than Neo4j with both of them
> > > > untuned. In addition, Postgres JSON (jsonb) can have indexes over the
> > > > JSON information - greatly improving the query capabilities over what
> > > > Neo4j can do for this same data.
> > > >
> > > > I'm not thinking about doing anything except moving this one
> > > > class of data to Postgres. This particular class of data is also
> > > > idempotent, which has advantages when you have multiple databases
> > > > involved...
> > > >
> > > > Since this particular type of data is its own object in the
> > > > Python, having it be in Postgres wouldn't likely be horrible to
> > > > implement.
> > > >
> > > > If I'm going to do this in the next year or two, it makes sense
> > > > to couple it with the rest of the backwards-incompatible changes I'm
> > > > already putting into release 2.
> > > >
> > > > Does anyone think this is a show-stopper to use two databases?
> > > >
> > > > --
> > > > Alan Robertson
> > > > al...@unix.sh
> > _______________________________________________
> > Assimilation mailing list - Discovery-Driven Monitoring
> > Assimilation@lists.community.tummy.com
> > http://lists.community.tummy.com/cgi-bin/mailman/listinfo/assimilation
> > http://assimmon.org/
> >
> > Notice: This e-mail message, together with any attachments, contains
> > information of Merck & Co., Inc.
> > (2000 Galloping Hill Road, Kenilworth,
> > New Jersey, USA 07033), and/or its affiliates Direct contact information
> > for affiliates is available at
> > http://www.merck.com/contact/contacts.html) that may be confidential,
> > proprietary copyrighted and/or legally privileged. It is intended solely
> > for the use of the individual or entity named on this message. If you are
> > not the intended recipient, and have received this message in error,
> > please notify us immediately by reply e-mail and then delete it from
> > your system.