I wrote some passably reasonable code to persist JSON to disk (using its hash 
as a key) as per an off-list suggestion.

You can see it here: 
https://github.com/assimilation/assimilation-official/blob/rel_2_dev/cma/invariant_data.py

The base code is very straightforward. It has lots of comments, and a little 
extra sugar to make it more Pythonic - which more than doubled the number of 
lines. Fortunately, that didn't make it much more complex - just bulkier.

The code to write the data is shown below:

        try:
            os_openmode = os.O_EXCL + os.O_CREAT + os.O_WRONLY
            with os.fdopen(os.open(pathname, os_openmode, self.filemode), 'w') as file_obj:
                file_obj.write(value)
                file_obj.flush()
                if not self.delayed_sync:
                    os.fsync(file_obj.fileno())  # Make sure the bits get saved for real...
        except OSError as oopsie:
            if oopsie.errno == errno.EEXIST:  # Already exists
                pass
            elif oopsie.errno == errno.ENOENT:  # Directory is missing...
                self._create_missing_directories(os.path.join(self.root_directory,
                                                              key[:self.hash_chars]))
                # Try again (recursively)...
                self.put(value, key)
            else:
                raise oopsie
        return key

If delayed_sync is enabled, then you should call the object's sync() method to 
do an "end transaction" type of thing. It does a syncfs() on the filesystem. 
Doing one syncfs() on the whole filesystem is a bit faster than several 
separate fsync() calls.
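
For reference, Python's standard library has fsync() (and a whole-system 
os.sync()), but no syncfs() wrapper, so a sync() method like that has to reach 
syncfs(2) through libc. Here's a minimal sketch of the idea - Linux-only, and 
the function and variable names are mine, not necessarily what 
invariant_data.py actually does:

    import ctypes
    import ctypes.util
    import os

    def sync_filesystem(directory):
        """Flush all dirty data on the filesystem that contains 'directory'."""
        libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
        fd = os.open(directory, os.O_RDONLY)  # syncfs(2) just needs any fd on that filesystem
        try:
            if libc.syncfs(fd) != 0:
                err = ctypes.get_errno()
                raise OSError(err, os.strerror(err))
        finally:
            os.close(fd)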

We ignore EEXIST because the key is the hash of the data: if the file already 
exists, it already holds the same data - so we don't need to write it again.
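
To make that concrete: the key is just a hash of the JSON text, so writing the 
same value twice always computes the same key and hence the same pathname. 
(sha224 and the literal value below are purely illustrative - the real code may 
use a different hash.)

    import hashlib

    value = '{"device": "eth0", "mtu": 1500}'   # some discovery JSON (made-up example)
    key = hashlib.sha224(value.encode('utf-8')).hexdigest()
    # Re-hashing the same value yields the same key, so a second put() hits the
    # same filename - EEXIST then means "this exact data is already stored".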

The search code isn't right yet. Searching will be slower than Postgres in 
most cases, but faster than the current code.

-- 
  Alan Robertson
  al...@unix.sh

On Mon, Feb 12, 2018, at 7:12 PM, Alan Robertson wrote:
> Two less heavyweight ways to store the JSON data, suggested in some off-list 
> discussions:
> 
> 1) Just store each JSON value in a flat file. Since the "keys" are 
> already hash values, one could imagine a directory structure that looks 
> a bit like this:
> 
>      <top-level>
>          <discovery-type>
>              <first-3-hex-characters-of-key>      # 16^3 (4096) possible subdirectories
>                  <full-key-filename>
>                      [JSON-string-in-full-key-filename]
>
>      This supports up to 16M different values for each discovery type while 
>      putting no more than 4096 entries in any directory, with only two layers 
>      of directories. Since they're hash keys, they should spread across the 
>      subdirectories nicely. It will generate a _lot_ of inodes for big sites. 
>      Since the contents are invariant (which makes them idempotent), a simple 
>      fsync is likely enough to ensure data integrity and substitute for 
>      transactions. I wrote a bare-bones version of this code last night... 
>      It's as simple as it sounds... (a path sketch follows item 2 below)
> 
> 2) Store them as S3 objects (if you're using AWS). Slightly concerned about 
>      the transaction-like semantics, but I suspect AWS/S3 is pretty reliable... 
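>
> (Path sketch for option 1, as mentioned above. The directory names, hash 
> choice, and 3-character prefix below are illustrative assumptions, not the 
> exact code:)
>
>     import hashlib
>     import os
>
>     root_directory = "/var/lib/assimilation/json"   # stands in for <top-level>
>     discovery_type = "netconfig"                    # stands in for <discovery-type>
>     value = '{"device": "eth0", "mtu": 1500}'
>     key = hashlib.sha224(value.encode("utf-8")).hexdigest()
>     # key[:3] fans the files out over 16^3 = 4096 subdirectories, so 16M values
>     # per discovery type works out to roughly 4096 entries per directory.
>     pathname = os.path.join(root_directory, discovery_type, key[:3], key)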
> 
> So, that's a total of 4 possible methods of storing the data:
>     (a) in Neo4j, (b) in Postgres, (c) in flat files, (d) in S3.
> Of those four, only Postgres will provide indexing. If, on the other hand, 
> we break out big JSON blobs into individual nodes in Neo4j, we can search 
> that instead (as Michael Hunger suggested).
> 
> Even if we break it out into separate nodes in Neo4j for searching, we 
> still need to keep the original JSON around.
> 
> The flat file code I wrote is a subclass of a class that's intended to 
> hold 3 or 4 of these alternative implementations.
> 
> This way if we change our mind, it will be comparatively straightforward 
> to convert from one to another. Given the incredibly simple semantics of 
> the data, even doing the conversion on a live system wouldn't be that 
> hard...
> 
> Right now, I'm leaning towards flat files - understanding that it has 
> costs and potential complexities (particularly the number of inodes).
> 
> -- 
>   Alan Robertson
>   al...@unix.sh
> 
> On Fri, Feb 9, 2018, at 2:27 PM, Welch, Bill wrote:
> > Just intuition here, but embedding Postgres inside Assimilation seems 
> > heavy for just JSON. Of course, you already embed neo4j so you know how 
> > to handle having a dependency on a large subsystem.
> > 
> > Then, there's mongodb vs postgres vs ...:
> > https://www.sisense.com/blog/postgres-vs-mongodb-for-storing-json-data/
> > https://www.quora.com/Which-is-better-storing-json-objects-in-json-files-in-Redis-or-MongoDB-RethinkDB
> > 
> > On 9/2/18, 10:05, "Assimilation on behalf of Alan Robertson" 
> > <assimilation-boun...@lists.community.tummy.com on behalf of 
> > al...@unix.sh> wrote:
> > 
> >     This data is generated by a variety of shell scripts that do 
> > discovery -  potentially dozens of them - and each is different. Some of 
> > the most critical data is decomposed to attributes - but not most of it.
> >     
> >     -- 
> >       Alan Robertson
> >       al...@unix.sh
> >     
> >     On Fri, Feb 9, 2018, at 2:58 AM, Michael Hunger wrote:
> >     > I think this is ok. 
> >     > I wish we had full document support already.
> >     > 
> >     > I know that pg has really good jsonb support, so go for it. 
> >     > 
> >     > Did you ever try to destructure the data into properties? Not sure 
> > how deeply nested it is. And leave off everything that is just a default. 
> >     > 
> >     > Sent from my iPhone
> >     > 
> >     > > Am 09.02.2018 um 04:10 schrieb Alan Robertson <al...@unix.sh>:
> >     > > 
> >     > > Hi,
> >     > > 
> >     > > There is one set of data that is really, really slow when I insert 
> > it into Neo4j. It's discovery data - which is JSON - and sometimes very 
> > large - a few megabytes. Many of the items are smallish, but items of a few 
> > kilobytes are common, dozens of kilobytes are also common, and a few things 
> > are in the megabyte+ range. [Because of compression, I can send up to 3 
> > megabytes of this JSON over UDP.]
> >     > > 
> >     > > There are a few things I can do with Neo4j to make inserting it 
> > faster, but I don't think a lot -- and when I get done, the data is very 
> > hard to query against (it involves regexes against unindexed data, and 
> > is a performance nightmare).
> >     > > 
> >     > > Postgres has JSON support, and it has real transactions and a 
> > reputation for being very solid. I did some benchmarking and it is a 
> > couple of orders of magnitude faster than Neo4j with both of them 
> > untuned. In addition, Postgres JSON (jsonb) can have indexes over the 
> > JSON information - greatly improving the query capabilities over what 
> > Neo4j can do for this same data.
> >     > > 
> >     > > I'm not thinking about doing anything except moving this one 
> > class of data to Postgres. This particular class of data is also 
> > idempotent, which has advantages when you have multiple databases 
> > involved...
> >     > > 
> >     > > Since this particular type of data is its own object in the 
> > Python code, having it be in Postgres wouldn't likely be horrible to 
> > implement.
> >     > > 
> >     > > If I'm going to do this in the next year or two, it makes sense 
> > to couple it with the rest of the backwards-incompatible changes I'm 
> > already putting into release 2.
> >     > > 
> >     > > Does anyone think this is a show-stopper to use two databases?
> >     > > 
> >     > > -- 
> >     > >  Alan Robertson
> >     > >  al...@unix.sh
_______________________________________________
Assimilation mailing list - Discovery-Driven Monitoring
Assimilation@lists.community.tummy.com
http://lists.community.tummy.com/cgi-bin/mailman/listinfo/assimilation
http://assimmon.org/
