In my case one document is 2 million triples.  I set a default batch size
of 1000 (I think -- I don't have the code in front of me) but that is
overridable as a constructor parameter.  More work is needed to determine
what the proper default batch size is.

Internally I send the triples/quads to a dataset and, once the batch size
is reached (or on finish()), send the dataset to the RDFConnection.  It is
a simplistic implementation, but one that seems to work for my case.
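For readers following along, the shape of that implementation is roughly the
following sketch.  The class name, constructor, and exact flush policy here are
illustrative assumptions, not the actual code; only the Jena interfaces
(StreamRDF, Dataset, RDFConnection) are real API.

```java
import org.apache.jena.graph.Triple;
import org.apache.jena.query.Dataset;
import org.apache.jena.query.DatasetFactory;
import org.apache.jena.rdfconnection.RDFConnection;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.sparql.core.Quad;

/** Sketch: buffer stream items in a Dataset; flush every batchSize items and on finish(). */
public class BatchingStreamRDF implements StreamRDF {
    private final RDFConnection conn;
    private final int batchSize;          // overridable via the constructor
    private Dataset buffer = DatasetFactory.create();
    private int count = 0;

    public BatchingStreamRDF(RDFConnection conn, int batchSize) {
        this.conn = conn;
        this.batchSize = batchSize;
    }

    @Override public void start() {}

    @Override public void triple(Triple triple) {
        // Triples go to the default graph of the buffering dataset.
        buffer.asDatasetGraph().getDefaultGraph().add(triple);
        flushIfFull();
    }

    @Override public void quad(Quad quad) {
        buffer.asDatasetGraph().add(quad);
        flushIfFull();
    }

    @Override public void base(String base)   { /* IRIs arrive resolved; ignored */ }
    @Override public void prefix(String p, String iri) { /* could be kept in a PrefixMapping */ }

    @Override public void finish() { flush(); }  // push whatever remains

    private void flushIfFull() {
        if (++count >= batchSize) flush();
    }

    private void flush() {
        if (count == 0) return;
        conn.loadDataset(buffer);          // append the batch to the remote dataset
        buffer = DatasetFactory.create();  // start a fresh batch
        count = 0;
    }
}
```

The same code works against a local dataset or a remote Fuseki endpoint,
since RDFConnection abstracts over both.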

Claude



On Tue, Jul 9, 2019 at 11:09 AM Andy Seaborne <[email protected]> wrote:

> Claude,
>
> How many triples does processing one XML document produce?  There seem
> to be several ways to get a batching/buffering effect, including with
> the current code, e.g. send the StreamRDF to a graph, then send the
> graph over the RDFConnection?
>
> One of the nuisances of HTTP is the need to have payloads that are
> correct for both request and response.  Otherwise, streaming directly to
> the Fuseki server would be nice, but it needs to allow for request-side
> abort.  If you do a GSP request and stream the body, and the request has
> a parse error, it will abort; but forcing a parse error because the
> request side found a higher-level condition that means it wants to stop
> (e.g. the user presses cancel) is pretty ugly.
>
> For SPARQL 1.2, I've suggested developing websockets protocol so that
> interactions with the server can be more sophisticated but that's a long
> way off yet.
>
>      Andy
>
> On 08/07/2019 17:56, Claude Warren wrote:
> > The case I was trying to solve was reading a largish XML document and
> > converting it to an RDF graph.  After a few iterations I ended up
> > writing a custom SAX parser that calls the RDFStream triple/quad
> > methods.  But I wanted a way to update a Fuseki server, so
> > RDFConnection seemed like the natural choice.
> >
> > In some recent work for my employer I found that I like RDFConnection,
> > as the same code can work against a local dataset or a remote one.
> >
> > Claude
> >
> > On Mon, Jul 8, 2019 at 4:34 PM ajs6f <[email protected]> wrote:
> >
> >> This "replay" buffer approach was the direction I first went in for
> >> TIM, until turning to MVCC (speaking of MVCC, that code is probably
> >> somewhere, since we don't squash when we merge). Looking back, one
> >> thing that helped me move on was the potential effect of very large
> >> transactions. But in a controlled situation like Claude's, that
> >> problem wouldn't arise.
> >>
> >> ajs6f
> >>
> >>> On Jul 8, 2019, at 11:07 AM, Andy Seaborne <[email protected]> wrote:
> >>>
> >>> Claude,
> >>>
> >>> Good timing!
> >>>
> >>> This is what RDF Delta does, and for updates rather than just
> >>> StreamRDF additions, though it's not to an RDFConnection - it's to a
> >>> patch service.
> >>>
> >>> With hindsight, I wonder if that would have been better as
> >>> BufferingDatasetGraph - a DSG that keeps changes and makes the view
> >>> of the buffer and underlying DatasetGraph behave correctly (find*
> >>> works and has the right cardinality of results). It's a bit fiddly
> >>> to get it all right, but once it works it is a building block that
> >>> has a lot of reusability.
> >>>
> >>> I came across this with the SHACL work, where a BufferingGraph (with
> >>> prefixes) gives "abort" of transactions to simple graphs which
> >>> aren't transactional.
> >>>
> >>> But it occurs in Fuseki with complex dataset setups like rules.
> >>>
> >>>     Andy
> >>>
> >>> On 08/07/2019 11:09, Claude Warren wrote:
> >>>> I have written an RDFStream to RDFConnection with caching.
> >>>> Basically, the stream caches triples/quads until a limit is reached
> >>>> and then it writes them to the RDFConnection.  At finish() it
> >>>> writes any triples/quads in the cache to the RDFConnection.
> >>>> Internally I cache the stream in a dataset.  I write triples to the
> >>>> default graph and quads as appropriate.
> >>>> I have a couple of questions:
> >>>> 1) In this arrangement, what does the "base" tell me? I currently
> >>>> ignore it and want to make sure I haven't missed something.
> >>>
> >>> The parser saw a BASE statement.
> >>>
> >>> Like PREFIX, in Turtle, it can happen mid-file (e.g. when files are
> >>> concatenated).
> >>>
> >>> It's not necessary, because the data stream should have resolved
> >>> IRIs in it, so base is rarely needed in a stream.
> >>>
> >>>> 2) I capture all the prefix calls in a PrefixMapping that is
> >>>> accessible from the RDFConnectionStream class.  They are not passed
> >>>> into the dataset in any way.  I didn't see any method to do so and
> >>>> don't really think it is needed.  Does anyone see a problem with
> >>>> this?
> >>>> 3) Does anyone have a use for this class?  If so, I am happy to
> >>>> contribute it, though the next question becomes what module to put
> >>>> it in?  Perhaps we should have an extras package for RDFStream
> >>>> implementations?
> >>>> Claude
> >>
> >>
> >
>
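For comparison, Andy's suggestion upthread - buffer the stream into a graph,
then send the graph over the RDFConnection in one go - might look roughly like
this sketch.  The class name, the Fuseki URL, and the placement of the parser
call are illustrative assumptions; StreamRDFLib.graph(), RDFConnectionFactory,
and load(Model) are real Jena API.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdfconnection.RDFConnection;
import org.apache.jena.rdfconnection.RDFConnectionFactory;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFLib;

public class GraphBufferExample {
    public static void main(String[] args) {
        // Buffer the whole stream into an in-memory model ...
        Model model = ModelFactory.createDefaultModel();
        StreamRDF sink = StreamRDFLib.graph(model.getGraph());
        // ... drive a parser at 'sink' here (e.g. the custom SAX parser) ...

        // ... then send the buffered graph in a single request.
        // "http://localhost:3030/ds" is a placeholder Fuseki endpoint.
        try (RDFConnection conn =
                 RDFConnectionFactory.connect("http://localhost:3030/ds")) {
            conn.load(model);
        }
    }
}
```

The trade-off versus the batching stream above is memory: this holds the
entire document's triples before the first request, which is why very large
documents push toward a batch-and-flush design.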


-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
