Re: JSON-LD upgrade - impact on Elephas

Rob Vesse Tue, 13 Oct 2015 06:23:15 -0700

If the counts are different purely because we are failing in a different
(but predictable) way then I see no reason not to change them


Rob

On 09/10/2015 16:42, "Andy Seaborne" <[email protected]> wrote:

>Rob - Would changing the count results be acceptable?
>
>       Andy
>
>On 05/10/15 13:22, Andy Seaborne wrote:
>> On 05/10/15 09:31, Rob Vesse wrote:
>>> Yes the tests are designed to be pragmatic
>>>
>>> If you are processing large amounts of data on Hadoop there are two
>>> cases:
>>>
>>> - You want to skip/ignore bad data
>>> - You want to fail fast on bad data
>>>
>>> The failing tests are presumably the ones testing the second case.
>>
>> The failing tests are:
>>
>> org.apache.jena.hadoop.rdf.io.input.jsonld.JsonLDTripleInputTest
>>
>> single_input_05
>> java.lang.AssertionError: expected:<50> but was:<0>
>>
>> multiple_inputs_02
>> java.lang.AssertionError: expected:<10150> but was:<10100>
>>
>> org.apache.jena.hadoop.rdf.io.input.jsonld.JsonLDQuadInputTest
>>
>> single_input_05
>> java.lang.AssertionError: expected:<50> but was:<0>
>>
>> multiple_inputs_02
>> java.lang.AssertionError: expected:<10150> but was:<10100>
>>
>> so 2 tests, repeated.
>>
>> See also JENA-1013 which was previous work done in this area - JSON-LD
>> Elephas tests were not failing when they were supposed to.
>>
>>> My
>>> general hacky approach to testing that is simply to generate some valid
>>> data followed by some junk data.  If we change to the JSON-LD behaviour
>>> then those tests in Elephas that cover JSON-LD will need to change to
>>> generate a valid JSON object that happens to be invalid wrt. JSON-LD,
>>> but since I don't know JSON-LD (and have zero desire to learn) I don't
>>> know what we'd need to generate to do that
>>
>> No need to learn anything about JSON-LD.  My knowledge of how Hadoop
>> processing works in the presence of failures isn't very strong.
>>
>> The tests already generate bad data by adding the trailing text "junk
>> data" to a valid document - same for all formats.  JSON-LD does not have
>> (and never has) the partial set of triples case that other formats have.
>> But the Elephas tests don't test for that anyway - the only bad data is
>> with the trailing string "junk data".
>>
>> So the issue is that the JSON-LD processor we use has a particular
>> failure mode (which is correct for JSON-LD according to that community)
>> that makes those two abstract tests need different answers for JSON-LD.
>>   Would changing the count results be acceptable?
>>
>> This looks like the long-term solution that leads to the least
>> maintenance.  We can retain our own code with its different
>> characteristics but then we have to maintain it and probably get the
>> occasional question as to why Jena is different in behaviour to other
>> systems.
>>
>>      Andy
>>
>>>
>>> Rob
>>>
>>> On 04/10/2015 10:02, "Andy Seaborne" <[email protected]> wrote:
>>>
>>>> Claude,
>>>>
>>>> The point is more on the pragmatic side than the ideal design with a
>>>> tradeoff between maintaining our own code vs using a maintained
>>>> library.
>>>>
>>>> The jsonld-java parsing process isn't streaming in either use case so
>>>> it's not a case of some triples read from the input.  The jsonld-java
>>>> process is layered, not streamed - all the JSON parsing is done, then
>>>> the conversion to RDF happens.
>>>>
>>>> The two processes are:
>>>>
>>>> (Jena calling low level, non-API calls of jsonld-java):
>>>> 1a/ Parse JSON
>>>> 2a/ Do all triples
>>>> 3a/ Check for trailing junk
>>>>
>>>> vs
>>>>
>>>> (jsonld-java API)
>>>> 1b/ Parse JSON
>>>> 2b/ Check for trailing junk
>>>> 3b/ Do all triples
>>>>
>>>> I am wondering if the Elephas tests are tuned to the way Jena works in
>>>> these error cases, rather than relying on a feature of it.
>>>>
>>>>     Andy
>>>>
>>>> AbstractWholeFileQuadInputFormatTests
>>>>
>>>> On 04/10/15 09:19, Claude Warren wrote:
>>>>> not Rob but my 2 cents.....
>>>>>
>>>>> I think that when we read Turtle documents, if there is an error the
>>>>> triples we have already read are left in the graph/model (yes,
>>>>> transactions can change this).  Shouldn't all parsers follow the same
>>>>> pattern?
>>>>>
>>>>> Currently that pattern seems to be: read until EOF or error and
>>>>> process what was read.
>>>>>
>>>>> Unless I am wrong about the above, I think that the JSON parser
>>>>> should return the JSON object that was parsed before the junk.
>>>>>
>>>>>
>>>>> Claude
>>>>>
>>>>> On Sat, Oct 3, 2015 at 7:21 PM, Andy Seaborne <[email protected]> wrote:
>>>>>
>>>>>> Upgrading the dependency for jsonld-java to 0.7.0 picks up a bug fix
>>>>>> (jsonld-java issue 144) that Jena has a workaround for.
>>>>>>
>>>>>> The issue is that the Jackson JSON parser does not flag trailing junk.
>>>>>> It reads the JSON object and stops there.  Worse, it creates a buffered
>>>>>> reader so the caller can't handle the stream afterwards.
>>>>>>
>>>>>> ---------------
>>>>>> {
>>>>>>     "@id" : "http://example/s",
>>>>>>     "http://example/p" : "str"
>>>>>> }
>>>>>> xxxxxxxxxxxxxxx
>>>>>> ---------------
>>>>>>
>>>>>> Jena (JsonLdReader) contains code taken from jsonld-java and modified
>>>>>> to run the Jackson JSON parser, produce triples and then check for
>>>>>> trailing junk.  The trailing-junk detection was contributed back to
>>>>>> the project (PR 145).
>>>>>>
>>>>>> jsonld-java treats it more systematically.
>>>>>>
>>>>>> If the JSON is syntactically bad in the {}, no triples emerge.  The
>>>>>> process is: completely read the JSON object, then let the RDF
>>>>>> conversion run.  Bad object -> no RDF at all.
>>>>>>
>>>>>> If there is trailing junk, it is detected before passing up the JSON
>>>>>> object, so with trailing junk there are no triples, unlike Jena
>>>>>> currently.
>>>>>>
>>>>>> I had hoped to remove the workaround and not duplicate jsonld-java
>>>>>> code.
>>>>>>
>>>>>> Elephas testing is impacted. It is sensitive to the "JSON object,
>>>>>> trailing junk, triples" vs "JSON object, triples, trailing junk"
>>>>>> differences.
>>>>>>
>>>>>> Unless there is a specific reason to support that behaviour, I'd like
>>>>>> to switch to jsonld-java behaviour.
>>>>>>
>>>>>> (Rob) Thoughts?
>>>>>>
>>>>>>           Andy
>>>>>>
>>>>>> [1] https://github.com/jsonld-java/jsonld-java/issues/144
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>
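
For reference, a minimal Java sketch of why the counts move between the two
orderings quoted above (1a-3a vs 1b-3b).  It is only a model, not Elephas or
Jena code: the class, the Input type and the method names are hypothetical,
and the 10100 figure is simply the combined size of the clean inputs implied
by the reported assertion values (10150 - 50).

```java
import java.util.Arrays;
import java.util.List;

public class TrailingJunkOrdering {

    /** Hypothetical model of one input file: its triple count and whether junk follows the JSON object. */
    public static final class Input {
        final int triples;
        final boolean trailingJunk;

        public Input(int triples, boolean trailingJunk) {
            this.triples = triples;
            this.trailingJunk = trailingJunk;
        }
    }

    /** Order 1a-3a: parse JSON, emit all triples, then check for trailing junk. */
    public static int triplesThenCheck(Input in) {
        // The triples are emitted before the junk check fails, so they are still counted.
        return in.triples;
    }

    /** Order 1b-3b: parse JSON, check for trailing junk, then emit triples. */
    public static int checkThenTriples(Input in) {
        // The junk check fails first, so a file with trailing junk contributes nothing.
        return in.trailingJunk ? 0 : in.triples;
    }

    /** Sums the triples seen across all inputs under the chosen ordering. */
    public static int total(List<Input> inputs, boolean checkFirst) {
        int sum = 0;
        for (Input in : inputs) {
            sum += checkFirst ? checkThenTriples(in) : triplesThenCheck(in);
        }
        return sum;
    }

    public static void main(String[] args) {
        Input bad = new Input(50, true);       // valid document followed by "junk data"
        Input good = new Input(10100, false);  // assumed combined size of the clean inputs

        System.out.println(total(Arrays.asList(bad), false));        // 50
        System.out.println(total(Arrays.asList(bad), true));         // 0
        System.out.println(total(Arrays.asList(good, bad), false));  // 10150
        System.out.println(total(Arrays.asList(good, bad), true));   // 10100
    }
}
```

Under this model the single-input case drops from 50 to 0 and the
multiple-input case from 10150 to 10100, which matches the four assertion
failures listed in the thread.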
