Re: [PR] CSV-196-TrackBytePositions [commons-csv]

via GitHub Thu, 07 Nov 2024 06:38:43 -0800


garydgregory commented on code in PR #502:
URL: https://github.com/apache/commons-csv/pull/502#discussion_r1832607476



##########
src/main/java/org/apache/commons/csv/CSVFormat.java:
##########
@@ -2097,6 +2097,30 @@ public CSVParser parse(final Reader reader) throws 
IOException {
         return new CSVParser(reader, this);
     }
 
+    /**
+     * Parses the specified content.
+     *
+     * <p>
+     * This method provides a way to parse CSV data from an input stream, 
starting at a specified character offset and record number,
+     * using a specified encoding. It returns a {@link CSVParser} that can be 
used to iterate over the parsed {@link CSVRecord}s.
+     * </p>
+     *
+     * <p>
+     * For additional parsing options, see the various static parse methods 
available on {@link CSVParser}.
+     * </p>
+     *
+     * @param reader the input stream
+     * @param characterOffset the character offset to start parsing from
+     * @param recordNumber the initial record number to start counting from
+     * @param encoding the character encoding of the input stream
+     * @return a parser over a stream of {@link CSVRecord}s.
+     * @throws IOException If an I/O error occurs
+     * @throws CSVException Thrown on invalid input.
+     */
+    public CSVParser parse(final Reader reader, final long characterOffset, 
final long recordNumber, String encoding) throws IOException {

Review Comment:
   No new public constructors please. You can augment the builder and add a 
private constructor.
   
   All new and protected elements should have a Java `@since 1.13.0`.



##########
src/main/java/org/apache/commons/csv/ExtendedBufferedReader.java:
##########
@@ -49,13 +53,27 @@ final class ExtendedBufferedReader extends 
UnsynchronizedBufferedReader {
     private long position;
     private long positionMark;
 
+    /** The number of bytes read so far */
+    private long bytesRead;
+    private long bytesReadMark;
+
+    /** Encoder used to calculate the bytes of characters */
+    CharsetEncoder encoder;
+
     /**
      * Constructs a new instance using the default buffer size.
      */
     ExtendedBufferedReader(final Reader reader) {
         super(reader);
     }
 
+    ExtendedBufferedReader(final Reader reader, String encoding) {

Review Comment:
   Use a `Charset` instead of a Charset name.
   



##########
src/main/java/org/apache/commons/csv/CSVRecord.java:
##########
@@ -144,6 +159,15 @@ public long getCharacterPosition() {
         return characterPosition;
     }
 
+    /**
+     * Returns the start byte of this record as a character byte in the source 
stream.
+     *
+     * @return the start byte of this record as a character byte in the source 
stream.

Review Comment:
   Either the code or the comment is wrong. The return value is a `long` but 
the comment talks about a `byte`. It would help to make the comment clear as to 
why the return type is a `long` if that is indeed correct.



##########
src/test/java/org/apache/commons/csv/JiraCsv196Test.java:
##########
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.csv;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.io.Reader;
+
+
+import org.junit.jupiter.api.Test;
+
+
+public class JiraCsv196Test {
+    @Test
+    public void parseThreeBytes() throws IOException {
+

Review Comment:
   No need for extra blank line.



##########
src/main/java/org/apache/commons/csv/CSVRecord.java:
##########
@@ -48,6 +48,11 @@ public final class CSVRecord implements Serializable, 
Iterable<String> {
      */
     private final long characterPosition;
 
+    /**
+     * The start byte of this record as a character byte in the source stream.

Review Comment:
   This comment is confusing because it talks about a `byte` but the type is a 
`long`. It would be better to explain the mismatch, if that is indeed the 
intent.
   



##########
src/main/java/org/apache/commons/csv/CSVRecord.java:
##########
@@ -144,6 +159,15 @@ public long getCharacterPosition() {
         return characterPosition;
     }
 
+    /**
+     * Returns the start byte of this record as a character byte in the source 
stream.

Review Comment:
   Please follow the same Javadoc patterns as in other getter methods: A getter 
"Gets...".
   



##########
src/main/java/org/apache/commons/csv/ExtendedBufferedReader.java:
##########
@@ -118,11 +137,43 @@ public int read() throws IOException {
             current == EOF && lastChar != CR && lastChar != LF && lastChar != 
EOF) {
             lineNumber++;
         }
+        if (encoder != null) {
+            this.bytesRead += getCharBytes(current);
+        }
         lastChar = current;
         position++;
         return lastChar;
     }
 
+    /**
+     *  In Java, a char data type are based on the original Unicode
+     *  specification, which defined characters as fixed-width 16-bit entities.
+     *   U+0000 to U+FFFF:
+     *     - BMP, represented using 1 16-bit char

Review Comment:
   Use Javadoc lists `<ol>` or `<ul>` if you expect the formatting to be 
maintained.
   



##########
src/test/java/org/apache/commons/csv/CSVParserTest.java:
##########
@@ -701,6 +701,84 @@ public void testGetHeaderComment_NoComment3() throws 
IOException {
         }
     }
 
+    @Test
+    public void testGetRecordThreeBytesRead() throws Exception {
+        String code = "id,date,val5,val4\n" +
+            "11111111111111,'4017-09-01',きちんと節分近くには咲いてる～,v4\n" +
+            "22222222222222,'4017-01-01',おはよう私の友人～,v4\n" +
+            "33333333333333,'4017-01-01',きる自然の力ってすごいな～,v4\n";
+        // String code = "'1',4";
+        // final CSVFormat format = CSVFormat.newFormat(',').withQuote('\'');

Review Comment:
   Are these comments needed?



##########
src/main/java/org/apache/commons/csv/CSVParser.java:
##########
@@ -511,10 +511,39 @@ public CSVParser(final Reader reader, final CSVFormat 
format) throws IOException
     @SuppressWarnings("resource")
     public CSVParser(final Reader reader, final CSVFormat format, final long 
characterOffset, final long recordNumber)
         throws IOException {
+            this(reader, format, characterOffset, recordNumber, null);
+        }
+
+        /**
+     * Constructs a new instance using the given {@link CSVFormat}
+     *
+     * <p>
+     * If you do not read all records from the given {@code reader}, you 
should call {@link #close()} on the parser,
+     * unless you close the {@code reader}.
+     * </p>
+     *
+     * @param reader
+     *            a Reader containing CSV-formatted input. Must not be null.
+     * @param format
+     *            the CSVFormat used for CSV parsing. Must not be null.
+     * @param characterOffset
+     *            Lexer offset when the parser does not start parsing at the 
beginning of the source.
+     * @param recordNumber
+     *            The next record number to assign
+     * @param encoding
+     *            The encoding to use for the reader
+     * @throws IllegalArgumentException
+     *             If the parameters of the format are inconsistent or if 
either the reader or format is null.
+     * @throws IOException
+     *             If there is a problem reading the header or skipping the 
first record
+     * @throws CSVException Thrown on invalid input.
+     */
+    public CSVParser(final Reader reader, final CSVFormat format, final long 
characterOffset, final long recordNumber,

Review Comment:
   See previous comment.



##########
src/main/java/org/apache/commons/csv/ExtendedBufferedReader.java:
##########
@@ -49,13 +53,27 @@ final class ExtendedBufferedReader extends 
UnsynchronizedBufferedReader {
     private long position;
     private long positionMark;
 
+    /** The number of bytes read so far */
+    private long bytesRead;
+    private long bytesReadMark;
+
+    /** Encoder used to calculate the bytes of characters */
+    CharsetEncoder encoder;

Review Comment:
   Make this new instance variable private.



##########
src/test/java/org/apache/commons/csv/CSVParserTest.java:
##########
@@ -701,6 +701,84 @@ public void testGetHeaderComment_NoComment3() throws 
IOException {
         }
     }
 
+    @Test
+    public void testGetRecordThreeBytesRead() throws Exception {
+        String code = "id,date,val5,val4\n" +
+            "11111111111111,'4017-09-01',きちんと節分近くには咲いてる～,v4\n" +
+            "22222222222222,'4017-01-01',おはよう私の友人～,v4\n" +
+            "33333333333333,'4017-01-01',きる自然の力ってすごいな～,v4\n";
+        // String code = "'1',4";
+        // final CSVFormat format = CSVFormat.newFormat(',').withQuote('\'');
+        final CSVFormat format = CSVFormat.Builder.create()
+                               .setDelimiter(',')
+                               .setQuote('\'')
+                               .build();
+        // CSVParser parser = new CSVParser(new StringReader(code), format, 
0L, 1L, "UTF-8");
+        CSVParser parser =  format.parse(new StringReader(code), 0L, 1L, 
"UTF-8");
+
+        CSVRecord record = new CSVRecord(parser, null, null, 1L, 0L, 0L);
+        assertEquals(0, parser.getRecordNumber());
+        assertNotNull(record = parser.nextRecord());
+        assertEquals(1, record.getRecordNumber());
+        assertEquals(code.indexOf('i'), record.getCharacterPosition());
+        assertEquals(record.getCharacterByte(), record.getCharacterPosition());
+
+        assertNotNull(record = parser.nextRecord());
+        assertEquals(2, record.getRecordNumber());
+        assertEquals(code.indexOf('1'), record.getCharacterPosition());
+        assertEquals(record.getCharacterByte(), record.getCharacterPosition());
+
+        assertNotNull(record = parser.nextRecord());
+        assertEquals(3, record.getRecordNumber());
+        assertEquals(code.indexOf('2'), record.getCharacterPosition());
+        assertEquals(record.getCharacterByte(), 95);
+
+        assertNotNull(record = parser.nextRecord());
+        assertEquals(4, record.getRecordNumber());
+        assertEquals(code.indexOf('3'), record.getCharacterPosition());
+        assertEquals(record.getCharacterByte(), 154);
+
+        parser.close();

Review Comment:
   Use a try-with-resources block instead of manually closing a resource.
   



##########
src/main/java/org/apache/commons/csv/ExtendedBufferedReader.java:
##########
@@ -118,11 +137,43 @@ public int read() throws IOException {
             current == EOF && lastChar != CR && lastChar != LF && lastChar != 
EOF) {
             lineNumber++;
         }
+        if (encoder != null) {
+            this.bytesRead += getCharBytes(current);
+        }
         lastChar = current;
         position++;
         return lastChar;
     }
 
+    /**
+     *  In Java, a char data type are based on the original Unicode
+     *  specification, which defined characters as fixed-width 16-bit entities.
+     *   U+0000 to U+FFFF:
+     *     - BMP, represented using 1 16-bit char
+     *     - Consists of UTF-8 1-byte, 2-byte, some 3-byte chars
+     *   U+10000 to U+10FFFF:
+     *     - Supplementary characters, represented as a pair of characters,
+     *     the first char from the high-surrogates range (\uD800-\uDBFF),
+     *     and the second char from the low-surrogates range (uDC00-\uDFFF).
+     *     - Consists of UTF-8 some 3-byte chars and 4-byte chars
+     */
+    private long getCharBytes(int current) throws CharacterCodingException {
+        char cChar = (char) current;

Review Comment:
   Use `final` where possible.



##########
src/main/java/org/apache/commons/csv/ExtendedBufferedReader.java:
##########
@@ -49,13 +53,27 @@ final class ExtendedBufferedReader extends 
UnsynchronizedBufferedReader {
     private long position;
     private long positionMark;
 
+    /** The number of bytes read so far */

Review Comment:
   A sentence ends in a period `.`.



##########
src/main/java/org/apache/commons/csv/CSVParser.java:
##########
@@ -511,10 +511,39 @@ public CSVParser(final Reader reader, final CSVFormat 
format) throws IOException
     @SuppressWarnings("resource")
     public CSVParser(final Reader reader, final CSVFormat format, final long 
characterOffset, final long recordNumber)
         throws IOException {
+            this(reader, format, characterOffset, recordNumber, null);
+        }
+
+        /**
+     * Constructs a new instance using the given {@link CSVFormat}
+     *
+     * <p>
+     * If you do not read all records from the given {@code reader}, you 
should call {@link #close()} on the parser,
+     * unless you close the {@code reader}.
+     * </p>
+     *
+     * @param reader
+     *            a Reader containing CSV-formatted input. Must not be null.
+     * @param format
+     *            the CSVFormat used for CSV parsing. Must not be null.
+     * @param characterOffset
+     *            Lexer offset when the parser does not start parsing at the 
beginning of the source.
+     * @param recordNumber
+     *            The next record number to assign
+     * @param encoding
+     *            The encoding to use for the reader
+     * @throws IllegalArgumentException
+     *             If the parameters of the format are inconsistent or if 
either the reader or format is null.
+     * @throws IOException
+     *             If there is a problem reading the header or skipping the 
first record
+     * @throws CSVException Thrown on invalid input.
+     */
+    public CSVParser(final Reader reader, final CSVFormat format, final long 
characterOffset, final long recordNumber,
+        String encoding) throws IOException {

Review Comment:
   Use a `Charset`, not a charset name.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] CSV-196-TrackBytePositions [commons-csv]

Reply via email to