In my case one document is 2 million triples. I set a default batch size of 1000 (I think -- I don't have the code in front of me), but that is overridable as a constructor parameter. More work is needed to determine what the proper default batch size is.
Internally I send the triples/quads to a dataset and, after the batch size is reached (or on finish()), send the dataset to the RDFConnection. It is a simplistic implementation, but one that seems to work for my case.

Claude

On Tue, Jul 9, 2019 at 11:09 AM Andy Seaborne <[email protected]> wrote:

> Claude,
>
> How many triples does processing one XML document produce? There seem
> to be several ways to get a batching/buffering effect, including current
> code, e.g. send the StreamRDF to a graph, then send the graph over the
> RDFConnection.
>
> One of the nuisances of HTTP is the need to have payloads that are
> correct for both request and response. Otherwise streaming directly to
> the Fuseki server would be nice, but it needs to allow for request-side
> abort. In fact, if you do a GSP request and stream the body and the
> request has a parse error, it will abort; but forcing a parse error
> because the request side found a higher-level condition that means it
> wants to stop (e.g. the user presses cancel) is pretty ugly.
>
> For SPARQL 1.2, I've suggested developing a websockets protocol so that
> interactions with the server can be more sophisticated, but that's a
> long way off yet.
>
>     Andy
>
> On 08/07/2019 17:56, Claude Warren wrote:
> > The case I was trying to solve was reading a largish XML document and
> > converting it to an RDF graph. After a few iterations I ended up
> > writing a custom SAX parser that calls the RDFStream triple/quad
> > methods. But I wanted a way to update a Fuseki server, so
> > RDFConnection seemed like the natural choice.
> >
> > In some recent work for my employer I found that I like the
> > RDFConnection, as the same code can work against a local dataset or a
> > remote one.
> >
> > Claude
> >
> > On Mon, Jul 8, 2019 at 4:34 PM ajs6f <[email protected]> wrote:
> >
> >> This "replay" buffer approach was the direction I first went in for
> >> TIM, until turning to MVCC (speaking of MVCC, that code is probably
> >> somewhere, since we don't squash when we merge). Looking back, one
> >> thing that helped me move on was the potential effect of very large
> >> transactions. But in a controlled situation like Claude's, that
> >> problem wouldn't arise.
> >>
> >> ajs6f
> >>
> >>> On Jul 8, 2019, at 11:07 AM, Andy Seaborne <[email protected]> wrote:
> >>>
> >>> Claude,
> >>>
> >>> Good timing!
> >>>
> >>> This is what RDF Delta does, and for updates rather than just
> >>> StreamRDF additions, though it's not to an RDFConnection - it's to a
> >>> patch service.
> >>>
> >>> With hindsight, I wonder if that would have been better as a
> >>> BufferingDatasetGraph - a DSG that keeps changes and makes the view
> >>> of the buffer and underlying DatasetGraph behave correctly (find*
> >>> works and has the right cardinality of results). It's a bit fiddly
> >>> to get it all right, but once it works it is a building block that
> >>> has a lot of re-usability.
> >>>
> >>> I came across this with the SHACL work for a BufferingGraph (with
> >>> prefixes) to give "abort" of transactions to simple graphs which
> >>> aren't transactional.
> >>>
> >>> But it occurs in Fuseki with complex dataset set-ups like rules.
> >>>
> >>>     Andy
> >>>
> >>> On 08/07/2019 11:09, Claude Warren wrote:
> >>>> I have written an RDFStream to RDFConnection with caching.
> >>>> Basically, the stream caches triples/quads until a limit is reached
> >>>> and then it writes them to the RDFConnection. At finish it writes
> >>>> any triples/quads in the cache to the RDFConnection.
> >>>>
> >>>> Internally I cache the stream in a dataset. I write triples to the
> >>>> default dataset and quads as appropriate.
> >>>>
> >>>> I have a couple of questions:
> >>>>
> >>>> 1) In this arrangement what does the "base" tell me? I currently
> >>>> ignore it and want to make sure I haven't missed something.
> >>>
> >>> The parser saw a BASE statement.
> >>>
> >>> Like PREFIX, in Turtle, it can happen mid-file (e.g. when files are
> >>> concatenated).
> >>>
> >>> It's not necessary, because the data stream should have resolved
> >>> IRIs in it, so base is used in a stream.
> >>>
> >>>> 2) I capture all the prefix calls in a PrefixMapping that is
> >>>> accessible from the RDFConnectionStream class. They are not passed
> >>>> into the dataset in any way. I didn't see any method to do so and
> >>>> don't really think it is needed. Does anyone see a problem with
> >>>> this?
> >>>>
> >>>> 3) Does anyone have a use for this class? If so I am happy to
> >>>> contribute it, though the next question becomes what module to put
> >>>> it in? Perhaps we should have an extras package for RDFStream
> >>>> implementations?
> >>>>
> >>>> Claude

-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
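[Editor's note: the batching pattern discussed in this thread -- cache triples/quads until a limit is reached, then flush, and flush the remainder on finish() -- can be sketched as below. This is a minimal, hypothetical illustration, not the code Claude describes: the class name BatchingSink and its methods are invented, and a plain Consumer stands in for the RDFConnection upload. A real implementation would implement Jena's StreamRDF, buffer into an in-memory Dataset, and flush via the RDFConnection.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of the batching described in the thread: items
// (standing in for triples/quads) accumulate in a buffer; when the batch
// size is reached, or on finish(), the whole batch is handed to a sink
// (standing in for sending the buffered dataset over the RDFConnection).
class BatchingSink<T> {
    private final int batchSize;              // overridable via constructor, as with the default of 1000
    private final Consumer<List<T>> flushTo;  // stand-in for the RDFConnection upload
    private final List<T> buffer = new ArrayList<>();

    BatchingSink(int batchSize, Consumer<List<T>> flushTo) {
        this.batchSize = batchSize;
        this.flushTo = flushTo;
    }

    // Analogous to StreamRDF.triple()/quad(): buffer, flush when full.
    void item(T t) {
        buffer.add(t);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Analogous to StreamRDF.finish(): flush whatever remains.
    void finish() {
        if (!buffer.isEmpty()) {
            flush();
        }
    }

    private void flush() {
        flushTo.accept(new ArrayList<>(buffer));  // hand over a copy of the batch
        buffer.clear();
    }
}
```

In the Jena setting, the buffer would be a Dataset (triples going to the default graph, quads to their named graphs), and the flush step would send that dataset over the connection and then clear it.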
