On 05/10/15 09:31, Rob Vesse wrote:
Yes, the tests are designed to be pragmatic.

If you are processing large amounts of data on Hadoop there are two cases:

- You want to skip/ignore bad data
- You want to fail fast on bad data

The failing tests are presumably the ones testing the second case.

The failing tests are:

org.apache.jena.hadoop.rdf.io.input.jsonld.JsonLDTripleInputTest

single_input_05
java.lang.AssertionError: expected:<50> but was:<0>

multiple_inputs_02
java.lang.AssertionError: expected:<10150> but was:<10100>

org.apache.jena.hadoop.rdf.io.input.jsonld.JsonLDQuadInputTest

single_input_05
java.lang.AssertionError: expected:<50> but was:<0>

multiple_inputs_02
java.lang.AssertionError: expected:<10150> but was:<10100>

so 2 tests, repeated.

See also JENA-1013, which was previous work in this area: the JSON-LD Elephas tests were not failing when they were supposed to.

My general hacky approach to testing that is simply to generate some valid data followed by some junk data. If we change to the jsonld-java behaviour then those tests in Elephas that cover JSON-LD will need to change to generate a valid JSON object that happens to be invalid wrt. JSON-LD, but since I don't know JSON-LD (and have zero desire to learn it) I don't know what we'd need to generate to do that.

No need to learn anything about JSON-LD. My knowledge of how Hadoop processing works in the presence of failures isn't very strong.

The tests already generate bad data by adding the trailing text "junk data" to a valid document, the same for all formats. JSON-LD does not have (and never has had) the partial-set-of-triples case that other formats have, but the Elephas tests don't test for that anyway: the only bad data is the trailing string "junk data".
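
For concreteness, a minimal sketch of that bad-data pattern - this is not the actual Elephas test code; the class, the helper shape and the N-Triples output are just illustrative:

---------------
import java.io.IOException;
import java.io.Writer;

// Illustrative only: write some valid data, then append trailing junk.
final class BadDataSketch {
    static void writeBadDocument(Writer out, int validTuples) throws IOException {
        // Valid tuples in the format under test (N-Triples used here for brevity).
        for (int i = 0; i < validTuples; i++) {
            out.write("<http://example/s" + i + "> <http://example/p> \"o\" .\n");
        }
        // Trailing text that is not legal in any of the RDF syntaxes under test.
        out.write("junk data");
    }
}
---------------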

So the issue is that the JSON-LD processor we use has a particular failure mode (which is correct for JSON-LD according to that community), and that means those two abstract tests need different answers for JSON-LD. Would changing the count results be acceptable?
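
As a purely hypothetical illustration (this is not the actual Elephas test API), "different answers for JSON-LD" could just mean the JSON-LD subclasses override the expected counts for the bad-data cases, e.g. single_input_05 expecting 0 rather than 50:

---------------
// Hypothetical sketch: an expected-count hook in an abstract test, overridden
// for JSON-LD because a document with trailing junk yields no triples at all.
class AbstractBadDocumentCounts {
    // Other syntaxes keep the tuples parsed before the junk.
    protected int expectedForBadDocument(int validTuples) {
        return validTuples;
    }
}

class JsonLdBadDocumentCounts extends AbstractBadDocumentCounts {
    @Override
    protected int expectedForBadDocument(int validTuples) {
        return 0;   // the whole document is rejected, nothing is kept
    }
}
---------------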

This looks like the long-term solution that leads to the least maintenance. We can retain our own code, with its different characteristics, but then we have to maintain it and will probably get the occasional question as to why Jena behaves differently from other systems.

        Andy


Rob

On 04/10/2015 10:02, "Andy Seaborne" <[email protected]> wrote:

Claude,

The point is more about the pragmatic side than the ideal design: a tradeoff between maintaining our own code and using a maintained library.

The jsonld-java parsing process isn't streaming in either use case, so it's not a case of some triples being read from the input. The jsonld-java process is layered, not streamed: all the JSON parsing is done, then the conversion to RDF happens.

The two processes are (a rough sketch in code follows the list):

(Jena calling low-level, non-API code in jsonld-java):
1a/ Parse JSON
2a/ Do all triples
3a/ Check for trailing junk

vs

(jsonld-java API)
1b/ Parse JSON
2b/ Check for trailing junk
3b/ Do all triples
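
A rough sketch of the two orderings, using only public Jackson and jsonld-java calls - this is not the actual Jena or jsonld-java source, just an illustration of the difference:

---------------
import java.io.IOException;
import java.io.Reader;

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.github.jsonldjava.core.JsonLdError;
import com.github.jsonldjava.core.JsonLdProcessor;

public class JsonLdOrderings {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Ordering (a): parse JSON, do all triples, then check for trailing junk.
    static Object orderingA(Reader in) throws IOException, JsonLdError {
        JsonParser jp = MAPPER.getFactory().createParser(in);
        Object json = MAPPER.readValue(jp, Object.class);            // 1a
        Object rdf = JsonLdProcessor.toRDF(json);                    // 2a
        if (jp.nextToken() != null)                                  // 3a
            throw new IOException("Content follows the JSON document");
        return rdf;
    }

    // Ordering (b): parse JSON, check for trailing junk, then do all triples.
    static Object orderingB(Reader in) throws IOException, JsonLdError {
        JsonParser jp = MAPPER.getFactory().createParser(in);
        Object json = MAPPER.readValue(jp, Object.class);            // 1b
        if (jp.nextToken() != null)                                  // 2b
            throw new IOException("Content follows the JSON document");
        return JsonLdProcessor.toRDF(json);                          // 3b
    }
}
---------------

Either way the trailing junk is an error; the difference is whether the RDF conversion has already run by the time the junk is noticed.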

I am wondering whether the Elephas tests are tuned to the way Jena happens to work in these error cases, rather than relying on a deliberate feature of it.

        Andy

AbstractWholeFileQuadInputFormatTests

On 04/10/15 09:19, Claude Warren wrote:
Not Rob, but my 2 cents...

I think that when we read Turtle documents, if there is an error, the triples we have already read are left in the graph/model (yes, transactions can change this). Shouldn't all parsers follow the same pattern?

Currently that pattern seems to be: read until EOF or error, and process what was read.
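
A rough sketch of that pattern (assuming the Jena 3 RIOT API and no transaction): parse Turtle with trailing junk, catch the error, and the triples read before the error are in the model:

---------------
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.riot.RiotException;

public class PartialTurtleRead {
    public static void main(String[] args) {
        String data = "<http://example/s> <http://example/p> \"str\" .\n"
                    + "junk data\n";
        Model model = ModelFactory.createDefaultModel();
        InputStream in = new ByteArrayInputStream(data.getBytes(StandardCharsets.UTF_8));
        try {
            RDFDataMgr.read(model, in, Lang.TURTLE);
        } catch (RiotException ex) {
            // The junk is a parse error; the triple read before it stays in the model.
        }
        System.out.println("Triples kept: " + model.size());   // 1
    }
}
---------------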

Unless I am wrong about the above, I think that the JSON parser should return the JSON object that was parsed before the junk.


Claude

On Sat, Oct 3, 2015 at 7:21 PM, Andy Seaborne <[email protected]> wrote:

Upgrading the dependency for jsonld-java to 0.7.0 picks up a bug fix
(jsonld-java issue 144) that Jena has a workaround for.

The issue is that the Jackson JSON parser does not flag trailing junk. It reads the JSON object and stops there. Worse, it creates a buffered reader so the caller can't handle the stream afterwards.

---------------
{
    "@id" : "http://example/s";,
    "http://example/p"; : "str"
}
xxxxxxxxxxxxxxx
---------------
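
A small sketch of that behaviour, assuming Jackson 2.x defaults: readValue stops after the first JSON object and does not complain about the junk.

---------------
import java.util.Map;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JacksonTrailingJunk {
    public static void main(String[] args) throws Exception {
        String doc = "{ \"@id\" : \"http://example/s\", \"http://example/p\" : \"str\" }\n"
                   + "xxxxxxxxxxxxxxx\n";
        // Parses the object and stops; the trailing junk is silently ignored.
        Map<?, ?> json = new ObjectMapper().readValue(doc, Map.class);
        System.out.println(json);
    }
}
---------------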

Jena (JsonLdReader) contains code taken from jsonld-java and modified to run the Jackson JSON parser, produce triples and then check for trailing junk. The end-of-junk detection was contributed back to the project (PR 145).

jsonld-java treats it more systematically.

If the JSON is syntactically bad inside the {}, no triples are produced. The process is: completely read the JSON object, then let the RDF conversion run. Bad object -> no RDF at all.

If there is trailing junk, it is detected before the JSON object is passed up, so trailing junk means no triples, unlike Jena currently.
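
A minimal sketch of that, assuming the trailing-content check from the 0.7.0 fix sits in JsonUtils' reading path: the JSON reading step fails on the junk, so toRDF never runs and no triples are produced.

---------------
import com.github.jsonldjava.core.JsonLdProcessor;
import com.github.jsonldjava.utils.JsonUtils;

public class JsonLdTrailingJunk {
    public static void main(String[] args) throws Exception {
        String doc = "{\n"
                   + "  \"@id\" : \"http://example/s\",\n"
                   + "  \"http://example/p\" : \"str\"\n"
                   + "}\n"
                   + "xxxxxxxxxxxxxxx\n";
        Object json = JsonUtils.fromString(doc);    // expected to fail here on the junk
        Object rdf = JsonLdProcessor.toRDF(json);   // never reached for this input
        System.out.println(rdf);
    }
}
---------------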

I had hoped to remove the workaround and not duplicate jsonld-java code.

Elephas testing is impacted. It is sensitive to the difference between "JSON object, trailing junk, triples" and "JSON object, triples, trailing junk".

Unless there is a specific reason to support that behaviour, I'd like to switch to the jsonld-java behaviour.

(Rob) Thoughts?

          Andy

[1] https://github.com/jsonld-java/jsonld-java/issues/144









