If the counts are different purely because we are failing in a different (but predictable) way, then I see no reason not to change them.
Rob

On 09/10/2015 16:42, "Andy Seaborne" <[email protected]> wrote:

>Rob - Would changing the count results be acceptable?
>
>    Andy
>
>On 05/10/15 13:22, Andy Seaborne wrote:
>> On 05/10/15 09:31, Rob Vesse wrote:
>>> Yes the tests are designed to be pragmatic
>>>
>>> If you are processing large amounts of data on Hadoop there are two
>>> cases:
>>>
>>> - You want to skip/ignore bad data
>>> - You want to fail fast on bad data
>>>
>>> The failing tests are presumably the ones testing the second case.
>>
>> The failing tests are:
>>
>> org.apache.jena.hadoop.rdf.io.input.jsonld.JsonLDTripleInputTest
>>
>> single_input_05
>> java.lang.AssertionError: expected:<50> but was:<0>
>>
>> multiple_inputs_02
>> java.lang.AssertionError: expected:<10150> but was:<10100>
>>
>> org.apache.jena.hadoop.rdf.io.input.jsonld.JsonLDQuadInputTest
>>
>> single_input_05
>> java.lang.AssertionError: expected:<50> but was:<0>
>>
>> multiple_inputs_02
>> java.lang.AssertionError: expected:<10150> but was:<10100>
>>
>> so 2 tests, repeated.
>>
>> See also JENA-1013 which was previous work done in this area - JSON-LD
>> Elephas tests were not failing when they were supposed to.
>>
>>> My general hacky approach to testing that is simply to generate some
>>> valid data followed by some junk data.  If we change to the JSON-LD
>>> behaviour then those tests in Elephas that cover JSON-LD will need to
>>> change to generate a valid JSON object that happens to be invalid wrt.
>>> JSON-LD but since I don't know JSON-LD (and have zero desire to learn)
>>> I don't know what we'd need to generate to do that
>>
>> No need to learn anything about JSON-LD.  My knowledge of how Hadoop
>> processing works in the presence of failures isn't very strong.
>>
>> The tests already generate bad data by adding the trailing text "junk
>> data" to a valid document - same for all formats.  JSON-LD does not have
>> (and never has) the partial set of triples case that other formats have.
>> But the Elephas tests don't test for that anyway - the only bad data is
>> with the trailing string "junk data".
>>
>> So the issue is that the JSON-LD processor we use has a particular
>> failure mode (which is correct for JSON-LD according to that community)
>> that makes those two abstract tests need different answers for JSON-LD.
>> Would changing the count results be acceptable?
>>
>> This looks like the long-term solution that leads to the least
>> maintenance.  We can retain our own code with its different
>> characteristics but then we have to maintain it and probably get the
>> occasional question as to why Jena is different in behaviour to other
>> systems.
>>
>>     Andy
>>
>>>
>>> Rob
>>>
>>> On 04/10/2015 10:02, "Andy Seaborne" <[email protected]> wrote:
>>>
>>>> Claude,
>>>>
>>>> The point is more on the pragmatic side than the ideal design with a
>>>> tradeoff between maintaining our own code vs using a maintained
>>>> library.
>>>>
>>>> The jsonld-java parsing process isn't streaming in either use case so
>>>> it's not a case of some triples read from the input.  The jsonld-java
>>>> process is layered, not streamed - all the JSON parsing is done, then
>>>> the conversion to RDF happens.
>>>>
>>>> The two processes are:
>>>>
>>>> (Jena calling low level, non-API calls of jsonld-java):
>>>> 1a/ Parse JSON
>>>> 2a/ Do all triples
>>>> 3a/ Check for trailing junk
>>>>
>>>> vs
>>>>
>>>> (jsonld-java API)
>>>> 1b/ Parse JSON
>>>> 2b/ Check for trailing junk
>>>> 3b/ Do all triples
>>>>
>>>> I am wondering if the Elephas tests are tuned to the way Jena works in
>>>> these error cases, rather than relying on a feature of it.
>>>>
>>>>     Andy
>>>>
>>>> AbstractWholeFileQuadInputFormatTests
>>>>
>>>> On 04/10/15 09:19, Claude Warren wrote:
>>>>> not Rob but my 2 cents.....
>>>>>
>>>>> I think that when we read turtle documents, if there is an error the
>>>>> triples we have already read are left in the graph/model (yes,
>>>>> transactions can change this).  Shouldn't all parsers follow the same
>>>>> pattern?
>>>>>
>>>>> Currently that pattern seems to be: read until eof or error and
>>>>> process what was read.
>>>>>
>>>>> Unless I am wrong about the above, I think that the JSON parser
>>>>> should return the json object that was parsed before the junk.
>>>>>
>>>>>
>>>>> Claude
>>>>>
>>>>> On Sat, Oct 3, 2015 at 7:21 PM, Andy Seaborne <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Upgrading the dependency for jsonld-java to 0.7.0 picks up a bug fix
>>>>>> (jsonld-java issue 144) that Jena has a workaround for.
>>>>>>
>>>>>> The issue is that the Jackson JSON parser does not flag trailing
>>>>>> junk.  It reads the JSON object and stops there.  Worse, it creates
>>>>>> a buffered reader so the caller can't handle the stream afterwards.
>>>>>>
>>>>>> ---------------
>>>>>> {
>>>>>>    "@id" : "http://example/s",
>>>>>>    "http://example/p" : "str"
>>>>>> }
>>>>>> xxxxxxxxxxxxxxx
>>>>>> ---------------
>>>>>>
>>>>>> Jena (JsonLdReader) contains code taken from jsonld-java and
>>>>>> modified to run the Jackson JSON parser, produce triples and then
>>>>>> check for trailing junk.  The detection of trailing junk was
>>>>>> contributed back to the project.  PR 145.
>>>>>>
>>>>>> jsonld-java treats it more systematically.
>>>>>>
>>>>>> If the JSON is syntactically bad in the {}, no triples merge.  The
>>>>>> process is: completely read the JSON object, then let the RDF
>>>>>> conversion run.  Bad object -> no RDF at all.
>>>>>>
>>>>>> If there is trailing junk, it is detected before passing up the JSON
>>>>>> object so trailing junk, no triples unlike Jena currently.
>>>>>>
>>>>>> I had hoped to remove the workaround and not duplicate jsonld-java
>>>>>> code.
>>>>>>
>>>>>> Elephas testing is impacted.  It is sensitive to the "JSON object,
>>>>>> trailing junk, triples" vs "JSON object, triples, trailing junk"
>>>>>> differences.
>>>>>>
>>>>>> Unless there is a specific reason to support that behaviour, I'd
>>>>>> like to switch to jsonld-java behaviour.
>>>>>>
>>>>>> (Rob) Thoughts?
>>>>>>
>>>>>>     Andy
>>>>>>
>>>>>> [1] https://github.com/jsonld-java/jsonld-java/issues/144
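
The difference between the two orderings discussed in the thread (1a-3a vs
1b-3b) can be sketched in a few lines of Java. This is a toy model only, not
Jena or jsonld-java code: the class name TrailingJunkOrder, all of its
methods, and the regex-based "parser" are hypothetical stand-ins (the real
libraries use Jackson), and it assumes no braces or escaped quotes inside
string values. It only aims to show why the same bad input can leave a
different number of triples behind depending on when the trailing-junk check
runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrailingJunkOrder {
    // Toy matcher for "key" : "value" pairs; a stand-in for a real JSON parser.
    private static final Pattern PAIR =
        Pattern.compile("\"([^\"]+)\"\\s*:\\s*\"([^\"]*)\"");

    /** Index just past the closing brace of the first top-level object. */
    static int endOfObject(String input) {
        int depth = 0;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '{') depth++;
            else if (c == '}' && --depth == 0) return i + 1;
        }
        throw new IllegalStateException("no complete JSON object");
    }

    /** Emit one "triple" string per property other than "@id". */
    static void emitTriples(String object, List<String> sink) {
        String subject = null;
        List<String[]> props = new ArrayList<>();
        Matcher m = PAIR.matcher(object);
        while (m.find()) {
            if (m.group(1).equals("@id")) subject = m.group(2);
            else props.add(new String[] { m.group(1), m.group(2) });
        }
        for (String[] p : props) {
            sink.add(subject + " " + p[0] + " \"" + p[1] + "\"");
        }
    }

    /** Fail if anything other than whitespace follows the object. */
    static void checkTrailingJunk(String input, int end) {
        if (!input.substring(end).trim().isEmpty()) {
            throw new IllegalStateException("trailing junk after JSON object");
        }
    }

    /** Order 1a/2a/3a: parse, do all triples, then check for junk. */
    static void triplesThenCheck(String input, List<String> sink) {
        int end = endOfObject(input);
        emitTriples(input.substring(0, end), sink);
        checkTrailingJunk(input, end);
    }

    /** Order 1b/2b/3b: parse, check for junk, then do all triples. */
    static void checkThenTriples(String input, List<String> sink) {
        int end = endOfObject(input);
        checkTrailingJunk(input, end);
        emitTriples(input.substring(0, end), sink);
    }

    /** Run one ordering and report how many triples reached the sink. */
    public static int countSurviving(String input, boolean checkFirst) {
        List<String> sink = new ArrayList<>();
        try {
            if (checkFirst) checkThenTriples(input, sink);
            else triplesThenCheck(input, sink);
        } catch (IllegalStateException e) {
            // Both orderings fail on junk; they differ in what was already emitted.
        }
        return sink.size();
    }

    public static void main(String[] args) {
        String doc = "{ \"@id\" : \"http://example/s\", "
                   + "\"http://example/p\" : \"str\" }\nxxxxxxxxxxxxxxx";
        System.out.println(countSurviving(doc, false)); // prints 1
        System.out.println(countSurviving(doc, true));  // prints 0
    }
}
```

Both orderings reject the document; the only difference is whether any
triples were produced before the failure, which is the kind of difference a
count-based test would observe.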
