stevedlawrence commented on code in PR #768:
URL: https://github.com/apache/daffodil/pull/768#discussion_r847274034
##########
daffodil-io/src/main/scala/org/apache/daffodil/processors/charset/CharsetUtils.scala:
##########
@@ -33,11 +33,13 @@ object CharsetUtils {
* encodings as well as the standard ones.
*/
def getCharset(name: String): BitsCharset = {
- val cs = DaffodilCharsetProvider.charsetForName(name)
- cs
+ val cs = BitsCharsetDefinitionRegistry.find(name).getOrElse(null)
+ if (cs == null)
+ SDE("The encoding '%s' was not found. Available choices are: %s", name,
BitsCharsetDefinitionRegistry.supportedEncodingsString)
Review Comment:
This changes the behavior of this function. In the past it returned null if
the encoding wasn't found and the caller was expected to handle it correctly.
I'd guess we want to keep that behavior, since some callers might just want to
know if a charset exists or not, and do something other than error if not.
Maybe something like this instead?
```scala
val cs =
BitsCharsetDefinitionRegistry.find(name).map(_.newInstance).getOrElse(null)
cs
```
##########
daffodil-runtime1/src/main/resources/META-INF/services/org.apache.daffodil.charsets.CharsetCompiler:
##########
@@ -0,0 +1,31 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# PLEASE NOTE:
+#
+# These are in src/test/resources not src/main/resources because they're not
+# "real" in the sense of being fully featured. E.g., AIS only does the payload
+# armoring, not everything else about AIS. IPv4 just does the IPv4 header, not
+# TCP, UDP, ICMP, etc. all of which have related checksum algorithms that
should
+# share code. A real CheckDigits has to have different layer length kinds
supported
+# so that one can compute check digits from elements with different
+# of length kinds.
+#
+# These are partial/subsets of "real" functionality that give us test coverage
+# of the layering features of Daffodil. They can be reused to create "real"
+# layering, but one should expect to have to adapt their code substantially.
+#
+org.apache.daffodil.processors.charset.CustomNonByteSizeCharsetCompiler
+org.apache.daffodil.processors.charset.CustomJavaCharsetCompiler
Review Comment:
The comments mention that these charsets are defined in src/test/resources
since they are only used for tests. But this services file is in
`src/main/resources`. I think this service file also needs to be in
`src/test/resources` so that these charsets are only loaded during tests.
##########
daffodil-io/src/main/scala/org/apache/daffodil/processors/charset/CharsetUtils.scala:
##########
@@ -69,6 +71,10 @@ object CharsetUtils {
}
val unicodeReplacementChar = '\uFFFD'
+
+ private def SDE(id: String, args: Any*): Nothing = {
+ throw new Exception(id.format(args: _*))
+ }
Review Comment:
We usually avoid throwing generic Exception's. It often makes it hard to
know how to best handle the error. There's also no context associated with this
SDE, like schema file or line number, which most SDE's have. I think that's
because it just doesn't exist here. Which makes me think more that CharsetUtils
should not be throwing any exceptions/SDE's but should instead rely on callers
to handle that.
##########
daffodil-io/src/main/scala/org/apache/daffodil/processors/charset/BitsCharsetDefinitionRegistry.scala:
##########
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.daffodil.processors.charset
+
+import org.apache.daffodil.util.SimpleNamedServiceLoader
+
+/*
+ * Finds all pluggable BitCharsets and makes them available to Daffodil after
they have been
+ * setup as described in BitsCharsetDefinition.scala
+ */
+object BitsCharsetDefinitionRegistry {
+
+ private lazy val bitsCharsetDefinitionMap: Map[String,
BitsCharsetDefinition] =
+
SimpleNamedServiceLoader.loadClass[BitsCharsetDefinition](classOf[BitsCharsetDefinition])
+
+ /**
+ * Given name, finds the BitsCharsetDefinition or null if not found
+ */
+ def find(name: String): Option[BitsCharsetDefinition] =
bitsCharsetDefinitionMap.get(name)
Review Comment:
This doesn't do any case insensitive stuff. I thought we supported lower or
upper case charset names? But maybe I'm wrong? Should we, or should all
encodings match exactly what is defined by the `name` function returned by the
definition?
##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/processors/EvEncoding.scala:
##########
@@ -93,10 +93,11 @@ abstract class CharsetEvBase(encodingEv: EncodingEvBase,
tci: DPathCompileInfo)
override def compute(state: ParseOrUnparseState) = {
val encString = encodingEv.evaluate(state)
- val cs = CharsetUtils.getCharset(encString)
Review Comment:
Should this still use `CharsetUtils`, and `CharsetUtils` should be the only
thing that uses the registry? This way we can easily change how the registry
works if we ever need to since it's hidden behind `CharsetUtils`
##########
daffodil-io/src/main/scala/org/apache/daffodil/processors/charset/BitsCharsetDefinitionRegistry.scala:
##########
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.daffodil.processors.charset
+
+import org.apache.daffodil.util.SimpleNamedServiceLoader
+
+/*
+ * Finds all pluggable BitCharsets and makes them available to Daffodil after
they have been
+ * setup as described in BitsCharsetDefinition.scala
+ */
+object BitsCharsetDefinitionRegistry {
+
+ private lazy val bitsCharsetDefinitionMap: Map[String,
BitsCharsetDefinition] =
+
SimpleNamedServiceLoader.loadClass[BitsCharsetDefinition](classOf[BitsCharsetDefinition])
Review Comment:
All charsets have a name, but some also have aliases. For example,
`US-ASCII` is the official name, but we also support `ASCII`. This
SimpleNamedServiceLoader only knows about the name, which I think will break
aliases?
Does our `SimpleNamedServiceLoader` need to change so that it can somehow
support aliases?
##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/processors/EvEncoding.scala:
##########
@@ -93,10 +93,11 @@ abstract class CharsetEvBase(encodingEv: EncodingEvBase,
tci: DPathCompileInfo)
override def compute(state: ParseOrUnparseState) = {
val encString = encodingEv.evaluate(state)
- val cs = CharsetUtils.getCharset(encString)
- if (cs == null) {
- tci.schemaDefinitionError("Unsupported encoding: %s. Supported
encodings: %s", encString, CharsetUtils.supportedEncodingsString)
+ val bcd = BitsCharsetDefinitionRegistry.find(encString).getOrElse(null)
+ if (bcd == null) {
+ tci.schemaDefinitionError("Unsupported encoding: %s. Supported
encodings: %s", encString,
BitsCharsetDefinitionRegistry.supportedEncodingsString)
Review Comment:
This isn't your change, but supportedEncodingString is potentially very
large (we have a lot of encodings) so this could be very verbose. I wonder if
we should instead just output the unsuported encoding? We can modify the
registry so that it does something like `log.debug` to output the encodings so
that if a user really can't figure it out they can enable debug logs to get
that information.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]