[GitHub] mbeckerle commented on a change in pull request #61: Base64, gzip, and line-folding layering

GitBox Thu, 03 May 2018 10:44:46 -0700

URL: https://github.com/apache/incubator-daffodil/pull/61#discussion_r185883290


 ##########
 File path: daffodil-runtime1/src/main/scala/org/apache/daffodil/layers/LineFoldedTransformer.scala
 ##########
 @@ -0,0 +1,474 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.daffodil.layers
+
+import org.apache.daffodil.schema.annotation.props.gen.LayerLengthKind
+import org.apache.daffodil.schema.annotation.props.gen.LayerLengthUnits
+import org.apache.daffodil.util.Maybe
+import org.apache.daffodil.processors.TermRuntimeData
+import org.apache.daffodil.processors.LayerLengthInBytesEv
+import org.apache.daffodil.processors.LayerBoundaryMarkEv
+import org.apache.daffodil.processors.LayerCharsetEv
+import org.apache.daffodil.processors.parsers.PState
+import java.nio.charset.StandardCharsets
+import org.apache.daffodil.exceptions.Assert
+import org.apache.daffodil.processors.unparsers.UState
+import org.apache.daffodil.io.LayerBoundaryMarkInsertingJavaOutputStream
+import java.io.OutputStream
+import java.io.InputStream
+import org.apache.daffodil.exceptions.ThrowsSDE
+import org.apache.daffodil.schema.annotation.props.Enum
+import org.apache.daffodil.io.RegexLimitingStream
+
+/*
+ * This and related classes implement so called "line folding" from
+ * IETF RFC 2822 Internet Message Format (IMF), and IETF RFC 5545 iCalendar.
+ *
+ * There are multiple varieties of line folding, and it is important to
+ * be specific about which algorithm.
+ *
+ * For IMF, unfolding simply removes CRLFs if they are followed by a space or tab.
+ * Folding is more complex, however, as CRLFs can only be inserted before
+ * a space/tab that appears in the data. If the data has no spaces/tabs, then no
+ * folding is possible.
+ * If there are spaces/tabs, the one closest to position 78 is used unless it is
+ * followed by punctuation, in which case a prior space/tab (if it exists) is used.
+ * (This preference for spaces not followed by punctuation is optional. It is
+ * not required, but is preferred in the IMF RFC.)
+ *
+ * Note: folding is done by some systems in a manner that does not respect
+ * character boundaries - i.e., in utf-8, a multi-byte character sequence may be
+ * broken in the middle by insertion of a CRLF. Hence, unfolding initially treats
+ * the text as iso-8859-1, i.e., just bytes, and removes CRLFs, then subsequently
+ * re-interprets the bytes as the expected charset such as utf-8.
+ *
+ * IMF is supposed to be US-ASCII, but implementations have gone to 8-bit characters
+ * being preserved, so the above problem can occur.
+ *
+ * IMF has a maximum line length of 998 characters per line excluding the CRLF.
+ * The layer will fail (cause a parse error) if a line longer than this is encountered
+ * or constructed after unfolding. When unparsing, if a line longer than 998 cannot be
+ * folded due to no spaces/tabs being present in it, then it is an unparse error.
+ *
+ * Note that i/vCalendar, vCard, and MIME payloads held by IMF do not run into
+ * the IMF line length issues, in that they have their own line length limits that
+ * are smaller than those of IMF, and which do not require accommodation by having
+ * pre-existing spaces/tabs in the data. So such data will *always* have short
+ * enough lines.
+ *
+ * For vCard, iCalendar, and vCalendar, the maximum is 75 bytes plus the CRLF, for
+ * a total of 77. Folding is done by inserting CRLF + a space or tab. The
+ * CRLF and the following space or tab are removed to unfold. If data happened to
+ * contain a CRLF followed by a space or tab initially, then that will be lost when
+ * the data is parsed.
+ *
+ * For MIME, the maximum line length is 76.
+ */
+sealed trait LineFoldMode extends LineFoldMode.Value
+object LineFoldMode extends Enum[LineFoldMode] {
+
+  case object IMF extends LineFoldMode; forceConstruction(IMF)
+  case object iCalendar extends LineFoldMode; forceConstruction(iCalendar)
+
+  override def apply(name: String, context: ThrowsSDE): LineFoldMode = stringToEnum("lineFoldMode", name, context)
+}
+
+/**
+ * For line folded, the notion of "delimited" means that the element is a "line"
+ * that ends with CRLF, except that if it is long, it will be folded, which involves
+ * inserting/removing CRLF+Space (or CRLF+TAB). A CRLF not followed by space or tab
+ * is ALWAYS the actual "delimiter". There's no means of supplying a specific delimiter.
+ */
+class LineFoldedTransformerDelimited(mode: LineFoldMode)
+  extends LayerTransformer {
+
+  override protected def wrapLimitingStream(jis: java.io.InputStream, state: PState) = {
+    // regex means CRLF not followed by space or tab.
+    // NOTE: this regex cannot contain ANY capturing groups (per scaladoc on RegexLimitingStream)
+    val s = new RegexLimitingStream(jis, "\\r\\n(?!(?:\\t|\\ ))", "\r\n", StandardCharsets.ISO_8859_1)
+    s
+  }
+
+  override protected def wrapLimitingStream(jos: java.io.OutputStream, state: UState): java.io.OutputStream = {
+    //
+    // Q: How do we insert a CRLF "not followed by tab or space" when we don't
+    // control what follows?
+    // A: We don't. This is the nature of the format. If what follows could begin
+    // with a space or tab, then the format can't use a line-folded layer.
+    //
+    val newJOS = new LayerBoundaryMarkInsertingJavaOutputStream(jos, "\r\n", StandardCharsets.ISO_8859_1)
+    newJOS
+  }
+
+  override protected def wrapLayerDecoder(jis: java.io.InputStream): java.io.InputStream = {
+    val s = new LineFoldedInputStream(mode, jis)
+    s
+  }
+  override protected def wrapLayerEncoder(jos: java.io.OutputStream): java.io.OutputStream = {
+    val s = new LineFoldedOutputStream(mode, jos)
+    s
+  }
+}
+
+/**
+ * For line folded, the 'implicit' length kind means that the region continues
+ * to end of data. At top level this would be the "whole file/stream" but this can
+ * also be used with a specified length enclosing element. This code cannot tell
+ * the difference.
+ */
+class LineFoldedTransformerImplicit(mode: LineFoldMode)
+  extends LayerTransformer {
+
+  override protected def wrapLimitingStream(jis: java.io.InputStream, state: PState) = {
+    jis // no limiting - just pull input until EOF.
+  }
+
+  override protected def wrapLimitingStream(jos: java.io.OutputStream, state: UState): java.io.OutputStream = {
+    jos // no limiting - just write output until EOF.
+  }
+
+  override protected def wrapLayerDecoder(jis: java.io.InputStream): java.io.InputStream = {
+    val s = new LineFoldedInputStream(mode, jis)
+    s
+  }
+  override protected def wrapLayerEncoder(jos: java.io.OutputStream): java.io.OutputStream = {
+    val s = new LineFoldedOutputStream(mode, jos)
+    s
+  }
+}
+
+sealed abstract class LineFoldedTransformerFactory(mode: LineFoldMode, name: String)
+  extends LayerTransformerFactory(name) {
+
+  override def newInstance(maybeLayerCharsetEv: Maybe[LayerCharsetEv],
+    maybeLayerLengthKind: Maybe[LayerLengthKind],
+    maybeLayerLengthInBytesEv: Maybe[LayerLengthInBytesEv],
+    maybeLayerLengthUnits: Maybe[LayerLengthUnits],
+    maybeLayerBoundaryMarkEv: Maybe[LayerBoundaryMarkEv],
+    trd: TermRuntimeData): LayerTransformer = {
+
+    trd.schemaDefinitionUnless(maybeLayerLengthKind.isDefined,
+      "The property dfdl:layerLengthKind must be defined.")
+
+    val xformer =
+      maybeLayerLengthKind.get match {
+        case LayerLengthKind.BoundaryMark => {
+          new LineFoldedTransformerDelimited(mode)
+        }
+        case LayerLengthKind.Implicit => {
+          new LineFoldedTransformerImplicit(mode)
+        }
+        case x =>
+          trd.SDE("Property dfdl:layerLengthKind can only be 'implicit' or 'boundaryMark', but was '%s'",
+            x.toString)
+      }
+    xformer
+  }
+}
+
+object IMFLineFoldedTransformerFactory
+  extends LineFoldedTransformerFactory(LineFoldMode.IMF, "lineFolded_IMF")
+
+object ICalendarLineFoldedTransformerFactory
+  extends LineFoldedTransformerFactory(LineFoldMode.iCalendar, "lineFolded_iCalendar")
+
+/**
+ * Doesn't enforce 998 max line length limit.
+ *
+ * This is a state machine, so of course must be used only on a single thread.
+ */
+class LineFoldedInputStream(mode: LineFoldMode, jis: InputStream)
+  extends InputStream {
+
+  object State extends org.apache.daffodil.util.Enum {
+    abstract sealed trait Type extends EnumValueType
+
+    /**
+     * No state. Read a character, and if CR, go to state GotCR.
+     */
+    case object Start extends Type
+
+    /**
+     * Read another character and if LF go to state GotCRLF.
+     */
+    case object GotCR extends Type
+
+    /**
+     * Read another character and if SP/TAB then what we do depends on
+     * IMF or iCalendar mode.
+     *
+     * In iCalendar mode we just go to Start and iterate
+     * again, effectively absorbing the CR, the LF, and the sp/tab.
+     *
+     * In IMF mode we change state to Start, but we return the sp/tab so that
+     * we've effectively absorbed the CRLF, but not the space/tab character.
+     */
+    case object GotCRLF extends Type
+
+    /**
+     * We have a single saved character. Return it, go to Start state
+     */
+    case object Buf1 extends Type
+
+    /**
+     * We have 2 saved characters. They must be a LF, then the next character.
+     * Return the LF and go to state Buf1.
+     */
+    case object Buf2 extends Type
+
+    /**
+     * Done. Always return -1, stay in state Done
+     */
+    case object Done extends Type
+  }
+
+  private var c: Int = -2
+  private var state: State.Type = State.Start
+
+  /**
+   * Assumes an ascii-family encoding, but reads it byte at a time regardless
+   * of the encoding. This enables it to handle data where a CRLF was inserted
+   * to limit line length, and that insertion broke up a multi-byte character.
+   *
+   * Does not detect errors such as isolated \r or isolated \n. Leaves those
+   * alone. Does not care if lines are in fact less than any limit in length.
+   *
+   * All this does is remove \r\n[\ \t], replacing it with just the space or tab (IMF)
+   * or with nothing (iCalendar).
+   *
+   */
+  override def read(): Int = {
+    import State._
+    while (state != Done) {
+      state match {
+        case Start => {
+          c = jis.read()
+          c match {
+            case -1 => {
+              state = Done
+              return -1
+            }
+            case '\r' => {
+              state = GotCR
+            }
+            case _ => {
+              // state stays Start
+              return c
+            }
+          }
+        }
+        case GotCR => {
+          c = jis.read()
+          c match {
+            case -1 => {
+              state = Done
+              return '\r'
+            }
+            case '\n' => {
+              state = GotCRLF
+            }
+            case _ => {
+              state = Buf1
+              return c
 
 Review comment:
   Good catch. 
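
For reference, here is a minimal sketch of how the isolated-CR case might be handled. It assumes the point raised on this line is that the `GotCR` fallthrough returns the newly read byte rather than the pending `\r`, so the `\r` itself appears never to be returned. `UnfoldingInputStream`, `keepWhitespace`, and `unfold` are hypothetical names used only for illustration. This is not the code in the PR, and it swaps the explicit state enum for a one-byte `PushbackInputStream`.

```scala
import java.io.{ ByteArrayInputStream, InputStream, PushbackInputStream }
import java.nio.charset.StandardCharsets
import scala.annotation.tailrec

// Hypothetical, simplified stand-in for LineFoldedInputStream (not the PR's code).
// keepWhitespace = true  behaves like IMF unfolding (CRLF removed, SP/TAB kept);
// keepWhitespace = false behaves like iCalendar unfolding (CRLF and SP/TAB removed).
class UnfoldingInputStream(in: InputStream, keepWhitespace: Boolean) extends InputStream {
  private val CR = 0x0D
  private val LF = 0x0A
  private val SP = 0x20
  private val TAB = 0x09

  // One byte of lookahead can be pushed back so it is re-examined on the next read.
  private val src = new PushbackInputStream(in, 1)
  // True when a CRLF turned out not to be a fold and the LF is still owed to the caller.
  private var pendingLF = false

  @tailrec
  final override def read(): Int = {
    if (pendingLF) {
      pendingLF = false
      LF
    } else {
      val c1 = src.read()
      if (c1 != CR) c1 // ordinary byte, or -1 at end of data
      else {
        val c2 = src.read()
        if (c2 != LF) {
          // Isolated CR: push back whatever followed and return the CR itself.
          if (c2 != -1) src.unread(c2)
          CR
        } else {
          val c3 = src.read()
          if (c3 == SP || c3 == TAB) {
            // A fold: the CRLF is absorbed; the whitespace is kept or dropped by mode.
            if (keepWhitespace) c3 else read()
          } else {
            // A real line ending: return CR now, LF next, then re-examine c3.
            if (c3 != -1) src.unread(c3)
            pendingLF = true
            CR
          }
        }
      }
    }
  }
}

object UnfoldingInputStream {
  // Convenience helper for trying the stream on a string, byte-for-byte via ISO-8859-1.
  def unfold(s: String, keepWhitespace: Boolean): String = {
    val in = new UnfoldingInputStream(
      new ByteArrayInputStream(s.getBytes(StandardCharsets.ISO_8859_1)), keepWhitespace)
    val sb = new StringBuilder
    var b = in.read()
    while (b != -1) { sb.append(b.toChar); b = in.read() }
    sb.toString
  }
}
```

With this sketch, `UnfoldingInputStream.unfold("a\rb", keepWhitespace = true)` leaves the isolated `\r` in place, and a CRLF that is not followed by a space or tab passes through unchanged in either mode. Pushing the lookahead byte back, rather than buffering it for direct return, means a `\r` that follows an isolated `\r` is still examined as a possible start of a fold.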

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
