Re: [PR] GH-2557: RDF term normalization [jena]

via GitHub Thu, 04 Jul 2024 02:00:42 -0700


rvesse commented on code in PR #2564:
URL: https://github.com/apache/jena/pull/2564#discussion_r1665351208



##########
jena-arq/src/main/java/org/apache/jena/riot/process/normalize/NormalizeRDFTerms.java:
##########
@@ -0,0 +1,278 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.jena.riot.process.normalize;
+
+import java.util.HashMap ;
+import java.util.Map ;
+
+import org.apache.jena.datatypes.RDFDatatype ;
+import org.apache.jena.datatypes.xsd.XSDDatatype ;
+import org.apache.jena.graph.Node ;
+import org.apache.jena.graph.NodeFactory ;
+import org.apache.jena.riot.web.LangTag ;
+import org.apache.jena.sparql.util.NodeUtils ;
+import org.apache.jena.vocabulary.RDF ;
+
+/**
+ * Convert literals to normalized forms. Sometimes called canonicalization. 
There is
+ * one preferred RDFTerm for a given RDFterm "value" ("value" generalized to 
include
+ * URis and blank nodes). Mostly, this affects literals. Only certain 
datatypes are
+ * supported but applications can add normalization for other datatypes.
+ * <p>
+ * Various policies are provided:
+ * <ul>
+ * <li>{@link #get() General} (close to Turtle, use the turtle form for long 
term contract</li>
+ * <li>{@link #getTTL() Turtle}</li>
+ *  * <li>{@link #getXSD() XSD}</li>
+ * <li>XSD - follows XSD 1.1. Mantissa/exponents are adjusted. xsd;decimal of 
an
+ * integer value does not have a decimal point. and is not suitable for Turtle 
as a

Review Comment:
   ```suggestion
    * integer value does not have a decimal point, and is not suitable for Turtle as a
   ```



##########
jena-arq/src/main/java/org/apache/jena/riot/process/normalize/NormalizeRDFTerms.java:
##########
@@ -0,0 +1,278 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.jena.riot.process.normalize;
+
+import java.util.HashMap ;
+import java.util.Map ;
+
+import org.apache.jena.datatypes.RDFDatatype ;
+import org.apache.jena.datatypes.xsd.XSDDatatype ;
+import org.apache.jena.graph.Node ;
+import org.apache.jena.graph.NodeFactory ;
+import org.apache.jena.riot.web.LangTag ;
+import org.apache.jena.sparql.util.NodeUtils ;
+import org.apache.jena.vocabulary.RDF ;
+
+/**
+ * Convert literals to normalized forms. Sometimes called canonicalization. 
There is
+ * one preferred RDFTerm for a given RDFterm "value" ("value" generalized to 
include
+ * URis and blank nodes). Mostly, this affects literals. Only certain 
datatypes are
+ * supported but applications can add normalization for other datatypes.
+ * <p>
+ * Various policies are provided:
+ * <ul>
+ * <li>{@link #get() General} (close to Turtle, use the turtle form for long 
term contract</li>

Review Comment:
   ```suggestion
    * <li>{@link #get() General} (close to Turtle, use the turtle form for long term contract)</li>
   ```



##########
jena-tdb2/src/main/java/org/apache/jena/tdb2/sys/NormalizeTermsTDB2.java:
##########
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.jena.tdb2.sys;
+
+import org.apache.jena.graph.Node;
+import org.apache.jena.riot.process.StreamRDFApply;
+import org.apache.jena.riot.process.normalize.NormalizeRDFTerms;
+import org.apache.jena.riot.system.StreamRDF;
+import org.apache.jena.tdb2.store.NodeId;
+import org.apache.jena.tdb2.store.NodeIdInline;
+
+/**
+ * A variation on {@link NormalizeRDFTerms} that attempts to generate TDB2 
{@link NodeIdInline inline} NodeIds,
+ * then convert back to {@link Node Nodes}.
+ * This is used to ensure that TDB2 NodInlining is compatible with {@link 
NormalizeRDFTerms}

Review Comment:
   ```suggestion
    * This is used to ensure that TDB2 Node Inlining is compatible with {@link NormalizeRDFTerms}
   ```



##########
jena-arq/src/main/java/org/apache/jena/riot/process/normalize/NormalizeValue.java:
##########
@@ -128,80 +128,112 @@ class NormalizeValue
 
         if ( lex2.equals(lexicalForm) )
             return node ;
-        return NodeFactory.createLiteral(lex2, datatype) ;
+        return NodeFactory.createLiteralDT(lex2, datatype) ;
     } ;
 
-    static DatatypeHandler dtDecimal = (Node node, String lexicalForm, 
RDFDatatype datatype) -> {
+    static DatatypeHandler dtDecimalTTL = (Node node, String lexicalForm, 
RDFDatatype datatype) -> {
+
+
+
+
         BigDecimal bd = new BigDecimal(lexicalForm).stripTrailingZeros() ;
         String lex2 = bd.toPlainString() ;
 
         // XSD canonical is "1"
         // but in Turtle the ".0" is need for short print form.
 
         // Ensure there is a "."
-        //if ( bd.scale() <= 0 )
         if ( lex2.indexOf('.') == -1 )
             // Must contain .0
             lex2 = lex2+".0" ;
         if ( lex2.equals(lexicalForm) )
             return node ;
-        return NodeFactory.createLiteral(lex2, datatype) ;
+        return NodeFactory.createLiteralDT(lex2, datatype) ;
+    } ;
 
+    static DatatypeHandler dtDoubleTTL = (Node node, String lexicalForm, 
RDFDatatype datatype) -> {
+        double d = XSDNumUtils.xsdParseDouble(lexicalForm) ;
+        String lex2 = XSDNumUtils.stringForm(d);
+        if ( lex2.equals(lexicalForm) )
+            return node ;
+        return NodeFactory.createLiteralDT(lex2, datatype) ;
     } ;
 
-    static private DecimalFormatSymbols decimalNumberSymbols = new 
DecimalFormatSymbols(Locale.ROOT) ;
-    static private NumberFormat fmtFloatingPoint = new 
DecimalFormat("0.0#################E0", decimalNumberSymbols) ;
+    static DatatypeHandler dtFloatTTL = (Node node, String lexicalForm, 
RDFDatatype datatype) -> {
+        float f = XSDNumUtils.xsdParseFloat(lexicalForm) ;
+        String lex2 = XSDNumUtils.stringForm(f);
+        if ( lex2.equals(lexicalForm) )
+            return node ;
+        return NodeFactory.createLiteralDT(lex2, datatype) ;
+    } ;
 
-    /* http://www.w3.org/TR/xmlschema-2/#double-canonical-representation */
+    // --- XSD, more closely.
     /*
-     * The canonical representation for double is defined by prohibiting 
certain
-     * options from the Lexical representation (§3.2.5.1). Specifically, the
-     * exponent must be indicated by "E". Leading zeroes and the preceding
-     * optional "+" sign are prohibited in the exponent. If the exponent is
-     * zero, it must be indicated by "E0". For the mantissa, the preceding
-     * optional "+" sign is prohibited and the decimal point is required.
-     * Leading and trailing zeroes are prohibited subject to the following:
-     * number representations must be normalized such that there is a single
-     * digit which is non-zero to the left of the decimal point and at least a
-     * single digit to the right of the decimal point unless the value being
-     * represented is zero. The canonical representation for zero is 0.0E0.
+     * Format floats and double by using {@linkDecimalFormat}.
+     * This can move the decimal point and change the exponent value.
+     * All numbers are "n.nnnExxx".
+     * For "smaller"floats and double, Java formating as used by
+     * {@link XSDNumUtils#stringForm(double)} or {@link 
XSDNumUtils#stringForm(float)}
+     * leaves the number is "common" form, with the mantissa (significand) 
having the decimal point

Review Comment:
   ```suggestion
        * leaves the number in "common" form, with the mantissa (significand) having the decimal point
   ```



##########
jena-arq/src/main/java/org/apache/jena/riot/process/normalize/NormalizeRDFTerms.java:
##########
@@ -0,0 +1,278 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.jena.riot.process.normalize;
+
+import java.util.HashMap ;
+import java.util.Map ;
+
+import org.apache.jena.datatypes.RDFDatatype ;
+import org.apache.jena.datatypes.xsd.XSDDatatype ;
+import org.apache.jena.graph.Node ;
+import org.apache.jena.graph.NodeFactory ;
+import org.apache.jena.riot.web.LangTag ;
+import org.apache.jena.sparql.util.NodeUtils ;
+import org.apache.jena.vocabulary.RDF ;
+
+/**
+ * Convert literals to normalized forms. Sometimes called canonicalization. 
There is
+ * one preferred RDFTerm for a given RDFterm "value" ("value" generalized to 
include
+ * URis and blank nodes). Mostly, this affects literals. Only certain 
datatypes are
+ * supported but applications can add normalization for other datatypes.
+ * <p>
+ * Various policies are provided:
+ * <ul>
+ * <li>{@link #get() General} (close to Turtle, use the turtle form for long 
term contract</li>
+ * <li>{@link #getTTL() Turtle}</li>
+ *  * <li>{@link #getXSD() XSD}</li>
+ * <li>XSD - follows XSD 1.1. Mantissa/exponents are adjusted. xsd;decimal of 
an
+ * integer value does not have a decimal point. and is not suitable for Turtle 
as a
+ * xsd;decimal short form.</li>
+ * <li>Turtle - produces decimals, floats and double suitable for Turtle 
short-form syntax.</li>
+ * </ul>
+ * <p>
+ * XSD Schema 1.1 does not define a canonical form for all cases.
+ * <p>
+ */
+
+public class NormalizeRDFTerms implements NormalizeTerm {
+
+    enum Style { General, XSD, XSD11, XSD10 }
+
+    /**
+     * Convert literals to normalized forms. A number of different policies are
+     * <p>
+     * Strictly, this is "normalization" - XSD Schema 1.1 does not define a 
canonical form for all cases.
+     * <p>
+     * <p>
+     * N.B. The normalization does produce forms for decimals and doubles that 
are correct as Turtle syntactic forms.
+     * For doubles, but not floats, zero is "0.0e0", whereas Java produces 
"0.0".
+     * For floats, the Java is returned for values with low precision.
+     *
+     */
+    private static final NormalizeRDFTerms mapGeneral   = mappingGeneral();
+    private static final NormalizeRDFTerms mapTTL       = mappingTTL();
+    private static final NormalizeRDFTerms mapXSD11     = mappingXSD11();
+    private static final NormalizeRDFTerms mapXSD10     = mappingXSD10();
+
+    /** General normalization. */
+    public static NormalizeRDFTerms get() { return mapGeneral ; }
+
+    /**
+     * Normalization for use in Turtle output syntax.
+     * <ul>
+     * <li>xsd:decimals always have a decimal point.</li>
+     * <li>xsd:doubles always have an exponent. For ones that are less that 
10E7, add
+     *     "e0", otherwise normalize the mantissa and have an exponent ('E').
+     * <li>xsd:floats For ones that are less that 10E7, just the decimal, no 
expoent.
+     *     Otherwise normalize the mantissa and have an expoent ('E').
+     * </ul>
+     The normalization does produce forms for decimals and doubles that are
+     * correct as Turtle syntactic forms. For doubles, but not floats, zero is 
"0.0e0",
+     * whereas Java produces "0.0". For floats, the Java is returned for 
values with low
+     * precision.
+     */
+    public static NormalizeRDFTerms getTTL() { return mapGeneral ; }
+
+    /**
+     * Normalization by XSD 1.1
+     * <ul>
+     * <li>xsd:double and xsd:float - the mantissa and exponent are adjusted 
based on
+     * value. The Exponent is 'E'.</li>
+     * <li>xsd;decimal - an integer value does not have a decimal point and 
may not be
+     * suitable for Turtle as a xsd:decimal short form.</li>
+     * </ul>
+     */
+    public static NormalizeRDFTerms getXSD() { return mapXSD11 ; }
+
+    /** Normalize based on XSD 1.1. */
+    public static NormalizeRDFTerms getXSD11() { return mapXSD11 ; }
+
+    /** Normalize based on XSD 1.0 where decimals always have  decimal point. 
*/
+    public static NormalizeRDFTerms getXSD10() { return mapXSD10 ; }
+
+    private static NormalizeRDFTerms mappingGeneral() {
+       Map<RDFDatatype, DatatypeHandler> mapping = baseMap();
+       return new NormalizeRDFTerms(mapping);
+    }
+
+    private static NormalizeRDFTerms mappingTTL() {
+        Map<RDFDatatype, DatatypeHandler> mapping = baseMap();
+        mapping.put(XSDDatatype.XSDdecimal, NormalizeValue.dtDecimalTTL);
+        mapping.put(XSDDatatype.XSDdouble, NormalizeValue.dtDoubleTTL);
+        mapping.put(XSDDatatype.XSDfloat, NormalizeValue.dtFloatTTL);
+        return new NormalizeRDFTerms(mapping);
+     }
+
+    private static NormalizeRDFTerms mappingXSD11() {
+        Map<RDFDatatype, DatatypeHandler> mapping = baseMap();
+        mapping.put(XSDDatatype.XSDdecimal, NormalizeValue.dtDecimalXSD);
+        mapping.put(XSDDatatype.XSDdouble, NormalizeValue.dtDoubleXSD);
+        mapping.put(XSDDatatype.XSDfloat, NormalizeValue.dtFloatXSD);
+        return new NormalizeRDFTerms(mapping);
+     }
+
+    private static NormalizeRDFTerms mappingXSD10() {
+        Map<RDFDatatype, DatatypeHandler> mapping = baseMap();
+        mapping.put(XSDDatatype.XSDdecimal, NormalizeValue.dtDecimalXSD10);
+        mapping.put(XSDDatatype.XSDdouble, NormalizeValue.dtDoubleXSD);
+        mapping.put(XSDDatatype.XSDfloat, NormalizeValue.dtFloatXSD);
+        return new NormalizeRDFTerms(mapping);
+     }
+
+    private final Map<RDFDatatype, DatatypeHandler> dispatchMapping;
+
+    private NormalizeRDFTerms(Map<RDFDatatype, DatatypeHandler> mapping) {
+        this.dispatchMapping = Map.copyOf(mapping);
+    }
+
+
+    /**
+     * Canonicalize a literal, both lexical form and language tag
+     */
+    public static Node normalizeValue(Node node) {
+        return get().normalize(node);
+    }
+
+    /**
+     * Canonicalize a literal, both lexical form and language tag
+     */
+    @Override
+    public Node normalize(Node node) {
+        return normalizeTerm(dispatchMapping, node);
+    }
+
+
+    /** Convert the lexical form to a canonical form if one of the known 
datatypes,
+     * otherwise return the node argument. (same object :: {@code ==})
+     */
+    static Node normalizeTerm(Map<RDFDatatype, DatatypeHandler> dispatchMap, 
Node node) {
+        if ( ! node.isLiteral() )
+            return node ;
+        if ( NodeUtils.isLangString(node) )
+            return canonicalLangtag(node);
+        if ( NodeUtils.isSimpleString(node) )
+            return node;
+        // Is it a valid value?
+        // (Can we do this in the normal case code?)
+        if ( ! node.getLiteralDatatype().isValid(node.getLiteralLexicalForm()) 
)
+            // Invalid lexical form for the datatype - do nothing.
+            return node;
+
+        RDFDatatype dt = node.getLiteralDatatype() ;
+        DatatypeHandler handler = dispatchMap.get(dt) ;
+        if ( handler == null )
+            return node ;
+        Node n2 = handler.handle(node, node.getLiteralLexicalForm(), dt) ;
+        if ( n2 == null )
+            return node ;
+        return n2 ;
+    }
+
+    /** Convert the language tag of a lexical form to a canonical form if one 
of the known datatypes,
+     * otherwise return the node argument. (same object; compare by {@code ==})
+     */
+    private static Node canonicalLangtag(Node node) {
+        String langTag = node.getLiteralLanguage();
+        String langTag2 = LangTag.canonical(langTag);
+        if ( langTag2.equals(langTag) )
+            return node;
+        //String textDir = n.getLiteralTextDirection();
+        String lexicalForm = node.getLiteralLexicalForm();
+        return NodeFactory.createLiteralLang(lexicalForm, langTag2);
+    }
+
+    private static final RDFDatatype dtPlainLiteral = 
NodeFactory.getType(RDF.PlainLiteral.getURI());
+
+    private static Map<RDFDatatype, DatatypeHandler> baseMap() {
+        // Nulls are not allowed in this map.
+        Map<RDFDatatype, DatatypeHandler> map = new HashMap<>();
+        addBaseAll(map);
+        return map;
+    }
+
+    private static Map<RDFDatatype, DatatypeHandler> turtleMap() {
+        Map<RDFDatatype, DatatypeHandler> map = new HashMap<>();
+        addBaseAll(map);
+        map.put(XSDDatatype.XSDdecimal,     NormalizeValue.dtDecimalTTL ) ;
+        map.put(XSDDatatype.XSDfloat,       NormalizeValue.dtFloatTTL ) ;
+        map.put(XSDDatatype.XSDdouble,      NormalizeValue.dtDoubleTTL ) ;
+        return map;
+    }
+
+    private static Map<RDFDatatype, DatatypeHandler> xsdMap() {
+        Map<RDFDatatype, DatatypeHandler> map = new HashMap<>();
+        addBaseAll(map);
+        map.put(XSDDatatype.XSDdecimal,     NormalizeValue.dtDecimalXSD ) ;
+        map.put(XSDDatatype.XSDfloat,       NormalizeValue.dtFloatXSD ) ;
+        map.put(XSDDatatype.XSDdouble,      NormalizeValue.dtDoubleXSD ) ;
+        return map;
+    }
+
+
+    /*
+     * Add the standard set of datatype handlers.
+     * This is the general a policy.

Review Comment:
   ```suggestion
        * This is the general policy.
   ```



##########
jena-arq/src/main/java/org/apache/jena/riot/process/normalize/NormalizeValueByCode.java:
##########
@@ -0,0 +1,137 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.jena.riot.process.normalize;
+
+import static org.apache.jena.atlas.lib.Chars.CH_DOT;
+import static org.apache.jena.atlas.lib.Chars.CH_MINUS;
+import static org.apache.jena.atlas.lib.Chars.CH_PLUS;
+import org.apache.jena.datatypes.RDFDatatype;
+import org.apache.jena.graph.Node;
+import org.apache.jena.graph.NodeFactory;
+
+/**
+ * Code to format integers an decimals by manipulating the string form

Review Comment:
   ```suggestion
    * Code to format integers and decimals by manipulating the string form
   ```



##########
jena-arq/src/main/java/org/apache/jena/sparql/util/XSDNumUtils.java:
##########
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.jena.sparql.util;
+
+import java.math.BigDecimal;
+
+public class XSDNumUtils {
+
+    /**
+     * Parser an XSD double lexical form.
+     * Adds in the cases not covered by {@link Double#parseDouble}.
+     * {@code INF} is strictly upper case, but we accept lower case.
+     * {@code -NaN} and {@code +NaN} are not accepted.
+     */
+    public static double xsdParseDouble(String lexicalForm) {
+        // Generalized the lexical space. Java's Double.parseDouble does not 
cover everything.
+        return switch(lexicalForm ) {
+            case "INF",  "inf"  -> Double.POSITIVE_INFINITY;
+            case "+INF", "+inf" -> Double.POSITIVE_INFINITY;
+            case "-INF", "-inf" -> Double.NEGATIVE_INFINITY;
+            case "NaN"          -> Double.NaN ;
+            // Acceptable as Java doubles (value is "NaN" but not as xsd:double
+            case "-NaN"-> throw new NumberFormatException("-NaN is not valid 
as an xsd:double");
+            case "+NaN"-> throw new NumberFormatException("+NaN is not valid 
as an xsd:double");
+            // Includes +0 and -0.
+            default-> Double.parseDouble(lexicalForm);
+        };
+    }
+
+    /**
+     * Parser an XSD float lexical form.

Review Comment:
   ```suggestion
        * Parse an XSD float lexical form.
   ```



##########
jena-arq/src/main/java/org/apache/jena/riot/process/normalize/NormalizeValue.java:
##########
@@ -128,80 +128,112 @@ class NormalizeValue
 
         if ( lex2.equals(lexicalForm) )
             return node ;
-        return NodeFactory.createLiteral(lex2, datatype) ;
+        return NodeFactory.createLiteralDT(lex2, datatype) ;
     } ;
 
-    static DatatypeHandler dtDecimal = (Node node, String lexicalForm, 
RDFDatatype datatype) -> {
+    static DatatypeHandler dtDecimalTTL = (Node node, String lexicalForm, 
RDFDatatype datatype) -> {
+
+
+
+
         BigDecimal bd = new BigDecimal(lexicalForm).stripTrailingZeros() ;
         String lex2 = bd.toPlainString() ;
 
         // XSD canonical is "1"
         // but in Turtle the ".0" is need for short print form.
 
         // Ensure there is a "."
-        //if ( bd.scale() <= 0 )
         if ( lex2.indexOf('.') == -1 )
             // Must contain .0
             lex2 = lex2+".0" ;
         if ( lex2.equals(lexicalForm) )
             return node ;
-        return NodeFactory.createLiteral(lex2, datatype) ;
+        return NodeFactory.createLiteralDT(lex2, datatype) ;
+    } ;
 
+    static DatatypeHandler dtDoubleTTL = (Node node, String lexicalForm, 
RDFDatatype datatype) -> {
+        double d = XSDNumUtils.xsdParseDouble(lexicalForm) ;
+        String lex2 = XSDNumUtils.stringForm(d);
+        if ( lex2.equals(lexicalForm) )
+            return node ;
+        return NodeFactory.createLiteralDT(lex2, datatype) ;
     } ;
 
-    static private DecimalFormatSymbols decimalNumberSymbols = new 
DecimalFormatSymbols(Locale.ROOT) ;
-    static private NumberFormat fmtFloatingPoint = new 
DecimalFormat("0.0#################E0", decimalNumberSymbols) ;
+    static DatatypeHandler dtFloatTTL = (Node node, String lexicalForm, 
RDFDatatype datatype) -> {
+        float f = XSDNumUtils.xsdParseFloat(lexicalForm) ;
+        String lex2 = XSDNumUtils.stringForm(f);
+        if ( lex2.equals(lexicalForm) )
+            return node ;
+        return NodeFactory.createLiteralDT(lex2, datatype) ;
+    } ;
 
-    /* http://www.w3.org/TR/xmlschema-2/#double-canonical-representation */
+    // --- XSD, more closely.
     /*
-     * The canonical representation for double is defined by prohibiting 
certain
-     * options from the Lexical representation (§3.2.5.1). Specifically, the
-     * exponent must be indicated by "E". Leading zeroes and the preceding
-     * optional "+" sign are prohibited in the exponent. If the exponent is
-     * zero, it must be indicated by "E0". For the mantissa, the preceding
-     * optional "+" sign is prohibited and the decimal point is required.
-     * Leading and trailing zeroes are prohibited subject to the following:
-     * number representations must be normalized such that there is a single
-     * digit which is non-zero to the left of the decimal point and at least a
-     * single digit to the right of the decimal point unless the value being
-     * represented is zero. The canonical representation for zero is 0.0E0.
+     * Format floats and double by using {@linkDecimalFormat}.

Review Comment:
   ```suggestion
        * Format floats and double by using {@link DecimalFormat}.
   ```
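
For readers following the hunk above: the removed `fmtFloatingPoint` field shows the kind of `DecimalFormat` pattern the comment refers to. A minimal sketch of how that pattern moves the decimal point and sets the exponent, assuming the new code keeps the same pattern and `Locale.ROOT` symbols (the class and method names here are hypothetical, not part of the PR):

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Hypothetical demo class; not part of the PR.
final class DecimalFormatSketch {
    // Symbols and pattern copied from the fields removed in the quoted hunk.
    private static final DecimalFormatSymbols SYMBOLS = new DecimalFormatSymbols(Locale.ROOT);
    private static final String PATTERN = "0.0#################E0";

    /** Format a double as "n.nnnExxx": one integer digit, at least one fraction digit. */
    static String format(double value) {
        // DecimalFormat is not thread-safe, so this sketch builds one per call.
        return new DecimalFormat(PATTERN, SYMBOLS).format(value);
    }

    public static void main(String[] args) {
        System.out.println(format(1000.0)); // 1.0E3
        System.out.println(format(0.0));    // 0.0E0
        System.out.println(format(-2.5));   // -2.5E0
    }
}
```

The `0.0E0` result for zero matches the canonical zero mentioned in the removed XSD comment.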



##########
jena-arq/src/main/java/org/apache/jena/riot/process/normalize/NormalizeRDFTerms.java:
##########
@@ -0,0 +1,278 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.jena.riot.process.normalize;
+
+import java.util.HashMap ;
+import java.util.Map ;
+
+import org.apache.jena.datatypes.RDFDatatype ;
+import org.apache.jena.datatypes.xsd.XSDDatatype ;
+import org.apache.jena.graph.Node ;
+import org.apache.jena.graph.NodeFactory ;
+import org.apache.jena.riot.web.LangTag ;
+import org.apache.jena.sparql.util.NodeUtils ;
+import org.apache.jena.vocabulary.RDF ;
+
+/**
+ * Convert literals to normalized forms. Sometimes called canonicalization. 
There is
+ * one preferred RDFTerm for a given RDFterm "value" ("value" generalized to 
include
+ * URis and blank nodes). Mostly, this affects literals. Only certain 
datatypes are
+ * supported but applications can add normalization for other datatypes.
+ * <p>
+ * Various policies are provided:
+ * <ul>
+ * <li>{@link #get() General} (close to Turtle, use the turtle form for long 
term contract</li>
+ * <li>{@link #getTTL() Turtle}</li>
+ *  * <li>{@link #getXSD() XSD}</li>
+ * <li>XSD - follows XSD 1.1. Mantissa/exponents are adjusted. xsd;decimal of 
an
+ * integer value does not have a decimal point. and is not suitable for Turtle 
as a
+ * xsd;decimal short form.</li>

Review Comment:
   ```suggestion
    * xsd:decimal short form.</li>
   ```
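
As background to this Javadoc item: in Turtle a bare `1` is read as an `xsd:integer`, so an integer-valued `xsd:decimal` needs the `.0` to keep its short form. A minimal sketch of the two styles, following the quoted `dtDecimalTTL` handler; the XSD variant is an assumption about what `dtDecimalXSD` does, and the class and method names are hypothetical:

```java
import java.math.BigDecimal;

// Hypothetical demo class; not part of the PR.
final class DecimalFormsSketch {
    /** Assumed XSD 1.1 style: an integer-valued decimal has no decimal point. */
    static String xsdForm(String lexicalForm) {
        return new BigDecimal(lexicalForm).stripTrailingZeros().toPlainString();
    }

    /** Turtle style, as in the quoted dtDecimalTTL: always keep a decimal point. */
    static String turtleForm(String lexicalForm) {
        String lex = xsdForm(lexicalForm);
        return lex.indexOf('.') == -1 ? lex + ".0" : lex;
    }

    public static void main(String[] args) {
        System.out.println(xsdForm("1.00"));    // 1
        System.out.println(turtleForm("1.00")); // 1.0
        System.out.println(turtleForm("1.50")); // 1.5
        System.out.println(turtleForm("100"));  // 100.0
    }
}
```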



##########
jena-arq/src/main/java/org/apache/jena/riot/process/normalize/NormalizeRDFTerms.java:
##########
@@ -0,0 +1,278 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.jena.riot.process.normalize;
+
+import java.util.HashMap ;
+import java.util.Map ;
+
+import org.apache.jena.datatypes.RDFDatatype ;
+import org.apache.jena.datatypes.xsd.XSDDatatype ;
+import org.apache.jena.graph.Node ;
+import org.apache.jena.graph.NodeFactory ;
+import org.apache.jena.riot.web.LangTag ;
+import org.apache.jena.sparql.util.NodeUtils ;
+import org.apache.jena.vocabulary.RDF ;
+
+/**
+ * Convert literals to normalized forms. Sometimes called canonicalization. 
There is
+ * one preferred RDFTerm for a given RDFterm "value" ("value" generalized to 
include
+ * URis and blank nodes). Mostly, this affects literals. Only certain 
datatypes are
+ * supported but applications can add normalization for other datatypes.
+ * <p>
+ * Various policies are provided:
+ * <ul>
+ * <li>{@link #get() General} (close to Turtle, use the turtle form for long 
term contract</li>
+ * <li>{@link #getTTL() Turtle}</li>
+ *  * <li>{@link #getXSD() XSD}</li>
+ * <li>XSD - follows XSD 1.1. Mantissa/exponents are adjusted. xsd;decimal of 
an

Review Comment:
   ```suggestion
    * <li>XSD - follows XSD 1.1. Mantissa/exponents are adjusted. xsd:decimal of an
   ```



##########
jena-arq/src/main/java/org/apache/jena/sparql/util/XSDNumUtils.java:
##########
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.jena.sparql.util;
+
+import java.math.BigDecimal;
+
+public class XSDNumUtils {
+
+    /**
+     * Parser an XSD double lexical form.

Review Comment:
   ```suggestion
        * Parse an XSD double lexical form.
   ```



##########
jena-arq/src/main/java/org/apache/jena/riot/process/normalize/NormalizeValue.java:
##########
@@ -128,80 +128,112 @@ class NormalizeValue
 
         if ( lex2.equals(lexicalForm) )
             return node ;
-        return NodeFactory.createLiteral(lex2, datatype) ;
+        return NodeFactory.createLiteralDT(lex2, datatype) ;
     } ;
 
-    static DatatypeHandler dtDecimal = (Node node, String lexicalForm, RDFDatatype datatype) -> {
+    static DatatypeHandler dtDecimalTTL = (Node node, String lexicalForm, RDFDatatype datatype) -> {
+
+
+
+
         BigDecimal bd = new BigDecimal(lexicalForm).stripTrailingZeros() ;
         String lex2 = bd.toPlainString() ;
 
         // XSD canonical is "1"
         // but in Turtle the ".0" is need for short print form.
 
         // Ensure there is a "."
-        //if ( bd.scale() <= 0 )
         if ( lex2.indexOf('.') == -1 )
             // Must contain .0
             lex2 = lex2+".0" ;
         if ( lex2.equals(lexicalForm) )
             return node ;
-        return NodeFactory.createLiteral(lex2, datatype) ;
+        return NodeFactory.createLiteralDT(lex2, datatype) ;
+    } ;
 
+    static DatatypeHandler dtDoubleTTL = (Node node, String lexicalForm, RDFDatatype datatype) -> {
+        double d = XSDNumUtils.xsdParseDouble(lexicalForm) ;
+        String lex2 = XSDNumUtils.stringForm(d);
+        if ( lex2.equals(lexicalForm) )
+            return node ;
+        return NodeFactory.createLiteralDT(lex2, datatype) ;
     } ;
 
-    static private DecimalFormatSymbols decimalNumberSymbols = new DecimalFormatSymbols(Locale.ROOT) ;
-    static private NumberFormat fmtFloatingPoint = new DecimalFormat("0.0#################E0", decimalNumberSymbols) ;
+    static DatatypeHandler dtFloatTTL = (Node node, String lexicalForm, RDFDatatype datatype) -> {
+        float f = XSDNumUtils.xsdParseFloat(lexicalForm) ;
+        String lex2 = XSDNumUtils.stringForm(f);
+        if ( lex2.equals(lexicalForm) )
+            return node ;
+        return NodeFactory.createLiteralDT(lex2, datatype) ;
+    } ;
 
-    /* http://www.w3.org/TR/xmlschema-2/#double-canonical-representation */
+    // --- XSD, more closely.
     /*
-     * The canonical representation for double is defined by prohibiting certain
-     * options from the Lexical representation (§3.2.5.1). Specifically, the
-     * exponent must be indicated by "E". Leading zeroes and the preceding
-     * optional "+" sign are prohibited in the exponent. If the exponent is
-     * zero, it must be indicated by "E0". For the mantissa, the preceding
-     * optional "+" sign is prohibited and the decimal point is required.
-     * Leading and trailing zeroes are prohibited subject to the following:
-     * number representations must be normalized such that there is a single
-     * digit which is non-zero to the left of the decimal point and at least a
-     * single digit to the right of the decimal point unless the value being
-     * represented is zero. The canonical representation for zero is 0.0E0.
+     * Format floats and double by using {@linkDecimalFormat}.
+     * This can move the decimal point and change the exponent value.
+     * All numbers are "n.nnnExxx".
+     * For "smaller"floats and double, Java formating as used by

Review Comment:
   ```suggestion
        * For "smaller" floats and double, Java formatting as used by
   ```
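
For readers following the `dtDecimalTTL` hunk above, here is a minimal standalone sketch of the string handling it performs, with the Jena `Node` and `NodeFactory` parts left out. The class name `DecimalTurtleSketch` and the method `normalize` are hypothetical and not part of the PR; only the `BigDecimal` calls are taken from the diff.

```java
import java.math.BigDecimal;

// Hypothetical illustration class, not part of the PR.
// It mirrors only the lexical-form handling of dtDecimalTTL.
final class DecimalTurtleSketch {

    // Returns the Turtle-friendly lexical form of an xsd:decimal:
    // trailing zeros removed, but always with a decimal point.
    static String normalize(String lexicalForm) {
        BigDecimal bd = new BigDecimal(lexicalForm).stripTrailingZeros();
        // toPlainString avoids exponent notation such as "1E+2".
        String lex2 = bd.toPlainString();
        // Integer values need ".0" for the short Turtle print form.
        if (lex2.indexOf('.') == -1)
            lex2 = lex2 + ".0";
        return lex2;
    }

    public static void main(String[] args) {
        System.out.println(normalize("1.500"));
        System.out.println(normalize("100"));
        System.out.println(normalize("-0.50"));
    }
}
```

On a Java 8 or later runtime this should turn `1.500` into `1.5` and `100` into `100.0`, which is the short print form the code comment in the hunk refers to.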



##########
jena-arq/src/main/java/org/apache/jena/sparql/util/XSDNumUtils.java:
##########
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.jena.sparql.util;
+
+import java.math.BigDecimal;
+
+public class XSDNumUtils {
+
+    /**
+     * Parser an XSD double lexical form.
+     * Adds in the cases not covered by {@link Double#parseDouble}.
+     * {@code INF} is strictly upper case, but we accept lower case.
+     * {@code -NaN} and {@code +NaN} are not accepted.
+     */
+    public static double xsdParseDouble(String lexicalForm) {
+        // Generalized the lexical space. Java's Double.parseDouble does not cover everything.
+        return switch(lexicalForm ) {
+            case "INF",  "inf"  -> Double.POSITIVE_INFINITY;
+            case "+INF", "+inf" -> Double.POSITIVE_INFINITY;
+            case "-INF", "-inf" -> Double.NEGATIVE_INFINITY;
+            case "NaN"          -> Double.NaN ;
+            // Acceptable as Java doubles (value is "NaN" but not as xsd:double
+            case "-NaN"-> throw new NumberFormatException("-NaN is not valid as an xsd:double");
+            case "+NaN"-> throw new NumberFormatException("+NaN is not valid as an xsd:double");
+            // Includes +0 and -0.
+            default-> Double.parseDouble(lexicalForm);
+        };
+    }
+
+    /**
+     * Parser an XSD float lexical form.
+     * Adds in the cases not covered by {@link Float#parseFloat}.
+     * {@code INF} is strictly upper case, but we accept lower case.
+     * {@code -NaN} and {@code +NaN} are not accepted.
+     */
+    public static float xsdParseFloat(String lexicalForm) {
+        // Generalized the lexical space. Java's Float.parseFloat does not cover everything.
+        return switch(lexicalForm ) {
+            case "INF",  "inf"  -> Float.POSITIVE_INFINITY;
+            case "+INF", "+inf" -> Float.POSITIVE_INFINITY;
+            case "-INF", "-inf" -> Float.NEGATIVE_INFINITY;
+            case "NaN"          -> Float.NaN ;
+            // Acceptable as Java floats (value is "NaN" but not as xsd:float
+            case "-NaN"-> throw new NumberFormatException("-NaN is not valid as an xsd:float");
+            case "+NaN"-> throw new NumberFormatException("+NaN is not valid as an xsd:float");
+            // Includes +0 and -0.
+            default-> Float.parseFloat(lexicalForm);
+        };
+    }
+
+    /** Parse an XSD decimal. */
+    public static BigDecimal xsdParseDecimal(String lexicalForm) {
+        return new BigDecimal(lexicalForm);
+    }
+
+    /**
+     * Produce a lexical form for {@link BigDecimal} that is compatible with
+     * Turtle syntax (i.e it has a decimal point).
+     */
+    public static String stringForm(BigDecimal decimal) {
+        return XSDNumUtils.canonicalDecimalStrWithDot(decimal);
+    }
+
+    public static String stringForm(double d) {
+        if ( Double.isInfinite(d) ) {
+            if ( d < 0 )
+                return "-INF" ;
+            return "INF" ;
+        }
+
+        if ( Double.isNaN(d) )
+            return "NaN" ;
+
+        // Otherwise, SPARQL form always has exponent.
+        String x = Double.toString(d) ;
+        if ( (x.indexOf('e') != -1) || (x.indexOf('E') != -1) )
+            return x ;
+        // Must be 'e' to agree with TDB2 previous behaviour.
+        return x + "e0" ;
+    }
+
+    public static String stringForm(float f) {
+        if ( Float.isInfinite(f) ) {
+            if ( f < 0 )
+                return "-INF" ;
+            return "INF" ;
+        }
+
+        if ( Float.isNaN(f) )
+            return "NaN" ;
+
+        // No SPARQL short form
+        String x = Float.toString(f) ;
+        return x;
+    }
+
+    /**
+     * The format of {@code xsd:decimal} used in ARQ expression evaluation. This is
+     * XSD 1.0 for long-term consistency (integer values for {@code xsd;decimal} have

Review Comment:
   ```suggestion
        * XSD 1.0 for long-term consistency (integer values for {@code xsd:decimal} have
   ```
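
As an illustration of the `xsdParseDouble` and `stringForm(double)` pair quoted above, the following sketch combines the two so the round trip can be tried without Jena on the classpath. `DoubleFormSketch`, `parse` and `form` are hypothetical names, and a classic `switch` is used in place of the switch expression in the PR; the behaviour is intended to match the quoted code but has not been checked against the PR itself.

```java
// Hypothetical illustration class, not part of the PR.
final class DoubleFormSketch {

    // Parse an xsd:double lexical form, covering the INF cases that
    // Double.parseDouble rejects and refusing signed NaN.
    static double parse(String lexicalForm) {
        switch (lexicalForm) {
            case "INF": case "inf": case "+INF": case "+inf":
                return Double.POSITIVE_INFINITY;
            case "-INF": case "-inf":
                return Double.NEGATIVE_INFINITY;
            case "NaN":
                return Double.NaN;
            case "-NaN": case "+NaN":
                // Java would accept these, xsd:double does not.
                throw new NumberFormatException(lexicalForm + " is not valid as an xsd:double");
            default:
                return Double.parseDouble(lexicalForm);
        }
    }

    // Produce a lexical form that always carries an exponent,
    // except for the special values INF, -INF and NaN.
    static String form(double d) {
        if (Double.isInfinite(d))
            return d < 0 ? "-INF" : "INF";
        if (Double.isNaN(d))
            return "NaN";
        String x = Double.toString(d);
        if (x.indexOf('e') != -1 || x.indexOf('E') != -1)
            return x;
        // Lower-case 'e', as in the quoted code.
        return x + "e0";
    }

    public static void main(String[] args) {
        System.out.println(form(parse("1")));
        System.out.println(form(parse("-inf")));
        System.out.println(form(parse("1e10")));
    }
}
```

Note that the exponent marker is not uniform: values that `Double.toString` already prints with an exponent keep its upper-case `E`, while the appended exponent is a lower-case `e0`.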



##########
jena-arq/src/main/java/org/apache/jena/sparql/util/XSDNumUtils.java:
##########
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.jena.sparql.util;
+
+import java.math.BigDecimal;
+
+public class XSDNumUtils {
+
+    /**
+     * Parser an XSD double lexical form.
+     * Adds in the cases not covered by {@link Double#parseDouble}.
+     * {@code INF} is strictly upper case, but we accept lower case.
+     * {@code -NaN} and {@code +NaN} are not accepted.
+     */
+    public static double xsdParseDouble(String lexicalForm) {
+        // Generalized the lexical space. Java's Double.parseDouble does not cover everything.
+        return switch(lexicalForm ) {
+            case "INF",  "inf"  -> Double.POSITIVE_INFINITY;
+            case "+INF", "+inf" -> Double.POSITIVE_INFINITY;
+            case "-INF", "-inf" -> Double.NEGATIVE_INFINITY;
+            case "NaN"          -> Double.NaN ;
+            // Acceptable as Java doubles (value is "NaN" but not as xsd:double
+            case "-NaN"-> throw new NumberFormatException("-NaN is not valid as an xsd:double");
+            case "+NaN"-> throw new NumberFormatException("+NaN is not valid as an xsd:double");
+            // Includes +0 and -0.
+            default-> Double.parseDouble(lexicalForm);
+        };
+    }
+
+    /**
+     * Parser an XSD float lexical form.
+     * Adds in the cases not covered by {@link Float#parseFloat}.
+     * {@code INF} is strictly upper case, but we accept lower case.
+     * {@code -NaN} and {@code +NaN} are not accepted.
+     */
+    public static float xsdParseFloat(String lexicalForm) {
+        // Generalized the lexical space. Java's Float.parseFloat does not cover everything.
+        return switch(lexicalForm ) {
+            case "INF",  "inf"  -> Float.POSITIVE_INFINITY;
+            case "+INF", "+inf" -> Float.POSITIVE_INFINITY;
+            case "-INF", "-inf" -> Float.NEGATIVE_INFINITY;
+            case "NaN"          -> Float.NaN ;
+            // Acceptable as Java floats (value is "NaN" but not as xsd:float
+            case "-NaN"-> throw new NumberFormatException("-NaN is not valid as an xsd:float");
+            case "+NaN"-> throw new NumberFormatException("+NaN is not valid as an xsd:float");
+            // Includes +0 and -0.
+            default-> Float.parseFloat(lexicalForm);
+        };
+    }
+
+    /** Parse an XSD decimal. */
+    public static BigDecimal xsdParseDecimal(String lexicalForm) {
+        return new BigDecimal(lexicalForm);
+    }
+
+    /**
+     * Produce a lexical form for {@link BigDecimal} that is compatible with
+     * Turtle syntax (i.e it has a decimal point).
+     */
+    public static String stringForm(BigDecimal decimal) {
+        return XSDNumUtils.canonicalDecimalStrWithDot(decimal);
+    }
+
+    public static String stringForm(double d) {
+        if ( Double.isInfinite(d) ) {
+            if ( d < 0 )
+                return "-INF" ;
+            return "INF" ;
+        }
+
+        if ( Double.isNaN(d) )
+            return "NaN" ;
+
+        // Otherwise, SPARQL form always has exponent.
+        String x = Double.toString(d) ;
+        if ( (x.indexOf('e') != -1) || (x.indexOf('E') != -1) )
+            return x ;
+        // Must be 'e' to agree with TDB2 previous behaviour.
+        return x + "e0" ;
+    }
+
+    public static String stringForm(float f) {
+        if ( Float.isInfinite(f) ) {
+            if ( f < 0 )
+                return "-INF" ;
+            return "INF" ;
+        }
+
+        if ( Float.isNaN(f) )
+            return "NaN" ;
+
+        // No SPARQL short form
+        String x = Float.toString(f) ;
+        return x;
+    }
+
+    /**
+     * The format of {@code xsd:decimal} used in ARQ expression evaluation. This is
+     * XSD 1.0 for long-term consistency (integer values for {@code xsd;decimal} have
+     * ".0").
+     */
+    public static String stringFormatARQ(BigDecimal bd) {
+        return canonicalDecimalStrWithDot(bd);
+    }
+
+    /** Strict XSD 1.0 format for {@code xsd:decimal}. */
+    public static String stringFormatXSD10(BigDecimal bd) {
+        return canonicalDecimalStrWithDot(bd);
+    }
+
+    /** Strict XSD 1.1 format for {@code xsd:decimal}. */
+    public static String stringFormatXSD11(BigDecimal bd) {
+        return canonicalDecimalStrNoIntegerDot(bd);
+    }
+
+    /**
+     * Decimal format, cast-to-string.
+     * <p>
+     * Decimal canonical form where integer values have no ".0" (as in XSD 1.1).
+     * <p>
+     * In XSD 1.1, canonical integer-valued decimal has a trailing ".0".
+     * In F&amp;O v 3.1, xs:string cast of a decimal which is integer valued, does
+     * not have the trailing ".0".
+     */
+    public static String canonicalDecimalStrNoIntegerDot(BigDecimal bd) {
+        if ( bd.signum() == 0 )
+            return "0";
+        if ( bd.scale() <= 0 )
+            // No decimal part.
+            return bd.toPlainString();
+        return bd.stripTrailingZeros().toPlainString();
+    }
+
+    /**
+     * Integer-valued decimals have a trailing ".0".
+     * (In XML Schema Datatype 1.1 they did not have a ".0".)
+     * <p>

Review Comment:
   ```suggestion
        *
   ```


