Re: [PR] CSV-196-TrackBytePositions [commons-csv]

via GitHub Mon, 16 Dec 2024 08:07:19 -0800


garydgregory commented on code in PR #502:
URL: https://github.com/apache/commons-csv/pull/502#discussion_r1887045959



##########
src/main/java/org/apache/commons/csv/CSVParser.java:
##########
@@ -200,6 +201,17 @@ public Builder setRecordNumber(final long recordNumber) {
             return asThis();
         }
 
+        /**
+         * Sets whether to enable byte tracking for the parser.
+         *
+         * @param enableByteTracking {@code true} to enable byte tracking; 
{@code false} to disable it.
+         * @return this instance.
+         */

Review Comment:
   New `public` and `protected` elements should have a Javadoc tag of `@since 
1.13.0`.



##########
src/main/java/org/apache/commons/csv/CSVParser.java:
##########
@@ -506,11 +518,42 @@ public CSVParser(final Reader reader, final CSVFormat 
format) throws IOException
     @Deprecated
     @SuppressWarnings("resource")
     public CSVParser(final Reader reader, final CSVFormat format, final long 
characterOffset, final long recordNumber)
+        throws IOException {
+            this(reader, format, characterOffset, recordNumber, null, false);
+        }
+
+    /**
+     * Constructs a new instance using the given {@link CSVFormat}
+     *
+     * <p>
+     * If you do not read all records from the given {@code reader}, you 
should call {@link #close()} on the parser,
+     * unless you close the {@code reader}.
+     * </p>
+     *
+     * @param reader
+     *            a Reader containing CSV-formatted input. Must not be null.
+     * @param format
+     *            the CSVFormat used for CSV parsing. Must not be null.
+     * @param characterOffset
+     *            Lexer offset when the parser does not start parsing at the 
beginning of the source.
+     * @param recordNumber
+     *            The next record number to assign.
+     * @param charset
+     *            The character encoding to be used for the reader.
+     * @throws IllegalArgumentException
+     *             If the parameters of the format are inconsistent or if 
either the reader or format is null.
+     * @throws IOException
+     *             If there is a problem reading the header or skipping the 
first record.
+     * @throws CSVException Thrown on invalid input.
+     * @since 1.13.0.

Review Comment:
   New `private` elements  do not need a Javadoc `since` tag.



##########
src/main/java/org/apache/commons/csv/ExtendedBufferedReader.java:
##########
@@ -118,11 +146,57 @@ public int read() throws IOException {
             current == EOF && lastChar != CR && lastChar != LF && lastChar != 
EOF) {
             lineNumber++;
         }
+        if (encoder != null) {
+            this.bytesRead += getCharBytes(current);
+        }
         lastChar = current;
         position++;
         return lastChar;
     }
 
+    /**
+     * Gets the byte length of the given character based on the the original 
Unicode
+     * specification, which defined characters as fixed-width 16-bit entities.
+     * <p>
+     * The Unicode characters are divided into two main ranges:
+     * <ul>
+     *   <li><b>U+0000 to U+FFFF (Basic Multilingual Plane, BMP):</b>
+     *     <ul>
+     *       <li>Represented using a single 16-bit {@code char}.</li>
+     *       <li>Includes UTF-8 encodings of 1-byte, 2-byte, and some 3-byte 
characters.</li>
+     *     </ul>
+     *   </li>
+     *   <li><b>U+10000 to U+10FFFF (Supplementary Characters):</b>
+     *     <ul>
+     *       <li>Represented as a pair of {@code char}s:</li>
+     *       <li>The first {@code char} is from the high-surrogates range 
(\uD800-\uDBFF).</li>
+     *       <li>The second {@code char} is from the low-surrogates range 
(\uDC00-\uDFFF).</li>
+     *       <li>Includes UTF-8 encodings of some 3-byte characters and all 
4-byte characters.</li>
+     *     </ul>
+     *   </li>
+     * </ul>
+     *
+     * @param current the current character to process.
+     * @return the byte length of the character.
+     * @throws CharacterCodingException if the character cannot be encoded.
+     */
+    private long getCharBytes(int current) throws CharacterCodingException {

Review Comment:
   The return type here should be an `int` because this API returns `0` or 
`CharBuffer.limit()` which itself returns an `int`.



##########
src/main/java/org/apache/commons/csv/CSVParser.java:
##########
@@ -506,11 +518,42 @@ public CSVParser(final Reader reader, final CSVFormat 
format) throws IOException
     @Deprecated
     @SuppressWarnings("resource")
     public CSVParser(final Reader reader, final CSVFormat format, final long 
characterOffset, final long recordNumber)
+        throws IOException {
+            this(reader, format, characterOffset, recordNumber, null, false);
+        }
+
+    /**
+     * Constructs a new instance using the given {@link CSVFormat}
+     *
+     * <p>
+     * If you do not read all records from the given {@code reader}, you 
should call {@link #close()} on the parser,
+     * unless you close the {@code reader}.
+     * </p>
+     *
+     * @param reader
+     *            a Reader containing CSV-formatted input. Must not be null.
+     * @param format
+     *            the CSVFormat used for CSV parsing. Must not be null.
+     * @param characterOffset
+     *            Lexer offset when the parser does not start parsing at the 
beginning of the source.
+     * @param recordNumber
+     *            The next record number to assign.
+     * @param charset
+     *            The character encoding to be used for the reader.

Review Comment:
   "The character encoding to be used for the reader."
   ->
   "The character encoding to be used for the reader when enableByteTracking is 
true."



##########
src/main/java/org/apache/commons/csv/CSVRecord.java:
##########
@@ -48,6 +48,11 @@ public final class CSVRecord implements Serializable, 
Iterable<String> {
      */
     private final long characterPosition;
 
+    /**
+     * The start byte of this record as a character byte in the source stream.
+     */
+    private final long characterByte;

Review Comment:
   This name is too confusing IMO, please rename to `bytePosition` to mirror 
the existing `characterPosition`.



##########
src/main/java/org/apache/commons/csv/CSVParser.java:
##########
@@ -506,11 +518,42 @@ public CSVParser(final Reader reader, final CSVFormat 
format) throws IOException
     @Deprecated
     @SuppressWarnings("resource")
     public CSVParser(final Reader reader, final CSVFormat format, final long 
characterOffset, final long recordNumber)
+        throws IOException {
+            this(reader, format, characterOffset, recordNumber, null, false);
+        }
+
+    /**
+     * Constructs a new instance using the given {@link CSVFormat}
+     *
+     * <p>
+     * If you do not read all records from the given {@code reader}, you 
should call {@link #close()} on the parser,
+     * unless you close the {@code reader}.
+     * </p>
+     *
+     * @param reader
+     *            a Reader containing CSV-formatted input. Must not be null.
+     * @param format
+     *            the CSVFormat used for CSV parsing. Must not be null.
+     * @param characterOffset
+     *            Lexer offset when the parser does not start parsing at the 
beginning of the source.
+     * @param recordNumber
+     *            The next record number to assign.
+     * @param charset
+     *            The character encoding to be used for the reader.
+     * @throws IllegalArgumentException
+     *             If the parameters of the format are inconsistent or if 
either the reader or format is null.
+     * @throws IOException
+     *             If there is a problem reading the header or skipping the 
first record.
+     * @throws CSVException Thrown on invalid input.
+     * @since 1.13.0.
+     */
+    private CSVParser(final Reader reader, final CSVFormat format, final long 
characterOffset, final long recordNumber,
+        final Charset charset, final boolean enableByteTracking)

Review Comment:
   Javadoc `@param` for `enableByteTracking` is missing.



##########
src/main/java/org/apache/commons/csv/ExtendedBufferedReader.java:
##########
@@ -118,11 +146,57 @@ public int read() throws IOException {
             current == EOF && lastChar != CR && lastChar != LF && lastChar != 
EOF) {
             lineNumber++;
         }
+        if (encoder != null) {
+            this.bytesRead += getCharBytes(current);
+        }
         lastChar = current;
         position++;
         return lastChar;
     }
 
+    /**
+     * Gets the byte length of the given character based on the the original 
Unicode
+     * specification, which defined characters as fixed-width 16-bit entities.
+     * <p>
+     * The Unicode characters are divided into two main ranges:
+     * <ul>
+     *   <li><b>U+0000 to U+FFFF (Basic Multilingual Plane, BMP):</b>
+     *     <ul>
+     *       <li>Represented using a single 16-bit {@code char}.</li>
+     *       <li>Includes UTF-8 encodings of 1-byte, 2-byte, and some 3-byte 
characters.</li>
+     *     </ul>
+     *   </li>
+     *   <li><b>U+10000 to U+10FFFF (Supplementary Characters):</b>
+     *     <ul>
+     *       <li>Represented as a pair of {@code char}s:</li>
+     *       <li>The first {@code char} is from the high-surrogates range 
(\uD800-\uDBFF).</li>
+     *       <li>The second {@code char} is from the low-surrogates range 
(\uDC00-\uDFFF).</li>
+     *       <li>Includes UTF-8 encodings of some 3-byte characters and all 
4-byte characters.</li>
+     *     </ul>
+     *   </li>
+     * </ul>
+     *
+     * @param current the current character to process.
+     * @return the byte length of the character.
+     * @throws CharacterCodingException if the character cannot be encoded.
+     */
+    private long getCharBytes(int current) throws CharacterCodingException {

Review Comment:
   I think a better method name here is `getEncodedCharLength()`.



##########
src/main/java/org/apache/commons/csv/CSVRecord.java:
##########
@@ -67,8 +72,18 @@ public final class CSVRecord implements Serializable, 
Iterable<String> {
         this.parser = parser;
         this.comment = comment;
         this.characterPosition = characterPosition;
+        this.characterByte = 0L;
     }
 
+    CSVRecord(final CSVParser parser, final String[] values,  final String 
comment, final long recordNumber,

Review Comment:
   You only need one of the two package-private constructors, you can remove 
the old one and adapt the existing test.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] CSV-196-TrackBytePositions [commons-csv]

Reply via email to