JingGe commented on a change in pull request #17501:
URL: https://github.com/apache/flink/pull/17501#discussion_r740224352
##########
File path: flink-formats/flink-parquet/pom.xml
##########
@@ -45,6 +45,14 @@ under the License.
<scope>provided</scope>
</dependency>
+ <!-- Flink-avro -->
+
+ <dependency>
+ <groupId>org.apache.flink</groupId>
+ <artifactId>flink-avro</artifactId>
+ <version>${project.version}</version>
Review comment:
sure, thanks @echauchot for the hint.
##########
File path: flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/reader/RecordFormat.java
##########
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.connector.file.src.reader;
+
+import org.apache.flink.annotation.PublicEvolving;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.connector.file.src.util.CheckpointedPosition;
+import org.apache.flink.core.fs.FSDataInputStream;
+import org.apache.flink.core.fs.FileStatus;
+import org.apache.flink.core.fs.FileSystem;
+import org.apache.flink.core.fs.Path;
+
+import java.io.IOException;
+
+import static org.apache.flink.util.Preconditions.checkArgument;
+import static org.apache.flink.util.Preconditions.checkNotNull;
+
+/**
+ * A reader format that reads individual records from a file via {@link Path}.
+ *
+ * <p>This interface teams up with its superinterface {@link StreamFormat} to build a
+ * two-level API: {@link StreamFormat} focuses on the abstract input stream, while
+ * {@link RecordFormat} pays attention to the concrete {@link FileSystem}. This format
+ * is for cases where the readers need access to the file directly or need to create a
+ * custom stream. For readers that can work directly on input streams, consider using
+ * the superinterface {@link StreamFormat}.
+ *
+ * <p>Please refer to the javadoc of {@link StreamFormat} for details.
+ *
+ * @param <T> - The type of records created by this format reader.
+ */
+@PublicEvolving
+public interface RecordFormat<T> extends StreamFormat<T> {
+
+ /**
+ * Creates a new reader to read in this format. This method is called when a fresh
+ * reader is created for a split that was assigned from the enumerator. This method
+ * may also be called on recovery from a checkpoint, if the reader never stored an
+ * offset in the checkpoint (see {@link #restoreReader(Configuration, Path, long,
+ * long, long)} for details).
+ *
+ * <p>A default implementation is provided, so subclasses are not forced to implement
+ * it. Compared to {@link #createReader(Configuration, FSDataInputStream, long, long)},
+ * this method puts the focus on the {@link Path}. The default implementation adapts
+ * the information given by the method arguments to an {@link FSDataInputStream} and
+ * calls {@link #createReader(Configuration, FSDataInputStream, long, long)}.
+ *
+ * <p>If the format is {@link #isSplittable() splittable}, then the {@code inputStream}
+ * is positioned to the beginning of the file split, otherwise it will be at position
+ * zero.
+ */
+ default StreamFormat.Reader<T> createReader(
+ Configuration config, Path filePath, long splitOffset, long splitLength)
+ throws IOException {
+
+ checkNotNull(filePath, "filePath");
+
+ final FileSystem fileSystem = filePath.getFileSystem();
+ final FileStatus fileStatus = fileSystem.getFileStatus(filePath);
+ final FSDataInputStream inputStream = fileSystem.open(filePath);
+
+ if (isSplittable()) {
+ inputStream.seek(splitOffset);
+ }
Review comment:
It should actually be checked in `createReader(Configuration config, FSDataInputStream stream, long fileLen, long splitEnd)`; I will add it there to make it more robust.
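For illustration, a minimal sketch of such a guard (the helper name `ensureWholeFileSplit` is mine, not from this PR; it reuses the `checkArgument` that `RecordFormat` already statically imports):
```java
// Hypothetical helper for RecordFormat; called at the top of
// createReader(Configuration, FSDataInputStream, long fileLen, long splitEnd).
static void ensureWholeFileSplit(boolean splittable, long fileLen, long splitEnd) {
    if (!splittable) {
        // a non-splittable format must always be handed the whole file as one split
        checkArgument(
                splitEnd == fileLen,
                "Non-splittable format, but split end (%s) differs from file length (%s).",
                splitEnd,
                fileLen);
    }
}
```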
##########
File path: flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/avro/AvroParquetRecordFormat.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.formats.parquet.avro;
+
+import org.apache.flink.api.common.typeinfo.TypeInformation;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.connector.file.src.reader.RecordFormat;
+import org.apache.flink.core.fs.FSDataInputStream;
+import org.apache.flink.formats.avro.typeutils.GenericRecordAvroTypeInfo;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.parquet.avro.AvroParquetReader;
+import org.apache.parquet.hadoop.ParquetReader;
+import org.apache.parquet.io.DelegatingSeekableInputStream;
+import org.apache.parquet.io.InputFile;
+import org.apache.parquet.io.SeekableInputStream;
+
+import javax.annotation.Nullable;
+
+import java.io.IOException;
+
+/** A {@link RecordFormat} that reads avro {@link GenericRecord} records from Parquet files. */
+public class AvroParquetRecordFormat implements RecordFormat<GenericRecord> {
+
+ private final transient Schema schema;
+
+ public AvroParquetRecordFormat(Schema schema) {
+ this.schema = schema;
+ }
+
+ /**
+ * Creates a new reader to read avro {@link GenericRecord} records from a Parquet input
+ * stream.
+ *
+ * <p>Several wrapper classes have been created so that the Flink abstractions become
+ * compatible with the Parquet abstractions. Please refer to the inner classes {@link
+ * GenericRecordReader}, {@link ParquetInputFile}, and {@link FSDataInputStreamAdapter}
+ * for details.
+ */
+ @Override
+ public Reader<GenericRecord> createReader(
+ Configuration config, FSDataInputStream stream, long fileLen, long splitEnd)
+ throws IOException {
+
+ // current version does not support splitting.
+ checkNotSplit(fileLen, splitEnd);
+
+ return new GenericRecordReader(
+ AvroParquetReader.<GenericRecord>builder(new ParquetInputFile(stream, fileLen))
+ .withDataModel(GenericData.get())
+ .build());
+ }
+
+ /**
+ * Restores the reader from a checkpointed position. Since the current version does
+ * not support splitting, the restored offset is used directly to seek the stream.
+ */
+ @Override
+ public Reader<GenericRecord> restoreReader(
+ Configuration config,
+ FSDataInputStream stream,
+ long restoredOffset,
+ long fileLen,
+ long splitEnd)
+ throws IOException {
+
+ // current version does not support splitting.
+ checkNotSplit(fileLen, splitEnd);
+
+ // the current version just ignores the splitOffset and uses the restoredOffset
+ stream.seek(restoredOffset);
+
+ return createReader(config, stream, fileLen, splitEnd);
+ }
+
+ /** Current version does not support splitting. */
+ @Override
+ public boolean isSplittable() {
+ return false;
+ }
+
+ /**
+ * Gets the type produced by this format. This type will be the type produced by the
+ * file source as a whole.
+ */
+ @Override
+ public TypeInformation<GenericRecord> getProducedType() {
+ return new GenericRecordAvroTypeInfo(schema);
+ }
+
+ private static void checkNotSplit(long fileLen, long splitEnd) {
+ if (splitEnd != fileLen) {
+ throw new IllegalArgumentException(
+ String.format(
+ "Current version of AvroParquetRecordFormat is not
splittable, "
+ + "but found split end (%d) different from
file length (%d)",
+ splitEnd, fileLen));
+ }
+ }
+
+ /**
+ * A {@link RecordFormat.Reader} implementation that uses {@link ParquetReader}
+ * internally to read avro {@link GenericRecord} records from a parquet {@link InputFile}.
+ */
+ private static class GenericRecordReader implements RecordFormat.Reader<GenericRecord> {
Review comment:
That is because of the generic type `GenericRecord`. I will upgrade the reader to support custom types and rename the class.
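Roughly, the generified reader could look like the following sketch (the class name `ParquetReaderWrapper` is only a placeholder, not the final name):
```java
// Sketch of the reader generified over the record type T (name is a placeholder).
private static class ParquetReaderWrapper<T> implements RecordFormat.Reader<T> {

    private final ParquetReader<T> parquetReader;

    private ParquetReaderWrapper(ParquetReader<T> parquetReader) {
        this.parquetReader = parquetReader;
    }

    @Nullable
    @Override
    public T read() throws IOException {
        // ParquetReader returns null once the input is exhausted, which
        // matches the Reader contract for signaling end of input
        return parquetReader.read();
    }

    @Override
    public void close() throws IOException {
        parquetReader.close();
    }
}
```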
##########
File path: flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/avro/AvroParquetRecordFormat.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.formats.parquet.avro;
+
+import org.apache.flink.api.common.typeinfo.TypeInformation;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.connector.file.src.reader.RecordFormat;
+import org.apache.flink.core.fs.FSDataInputStream;
+import org.apache.flink.formats.avro.typeutils.GenericRecordAvroTypeInfo;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.parquet.avro.AvroParquetReader;
+import org.apache.parquet.hadoop.ParquetReader;
+import org.apache.parquet.io.DelegatingSeekableInputStream;
+import org.apache.parquet.io.InputFile;
+import org.apache.parquet.io.SeekableInputStream;
+
+import javax.annotation.Nullable;
+
+import java.io.IOException;
+
+/** A {@link RecordFormat} that reads avro {@link GenericRecord} records from Parquet files. */
+public class AvroParquetRecordFormat implements RecordFormat<GenericRecord> {
+
+ private final transient Schema schema;
+
+ public AvroParquetRecordFormat(Schema schema) {
+ this.schema = schema;
+ }
+
+ /**
+ * Creates a new reader to read avro {@link GenericRecord} records from a Parquet input
+ * stream.
+ *
+ * <p>Several wrapper classes have been created so that the Flink abstractions become
+ * compatible with the Parquet abstractions. Please refer to the inner classes {@link
+ * GenericRecordReader}, {@link ParquetInputFile}, and {@link FSDataInputStreamAdapter}
+ * for details.
+ */
+ @Override
+ public Reader<GenericRecord> createReader(
+ Configuration config, FSDataInputStream stream, long fileLen, long splitEnd)
+ throws IOException {
+
+ // current version does not support splitting.
+ checkNotSplit(fileLen, splitEnd);
+
+ return new GenericRecordReader(
+ AvroParquetReader.<GenericRecord>builder(new ParquetInputFile(stream, fileLen))
+ .withDataModel(GenericData.get())
+ .build());
+ }
+
+ /**
+ * Restores the reader from a checkpointed position. Since the current version does
+ * not support splitting, the restored offset is used directly to seek the stream.
+ */
+ @Override
+ public Reader<GenericRecord> restoreReader(
+ Configuration config,
+ FSDataInputStream stream,
+ long restoredOffset,
+ long fileLen,
+ long splitEnd)
+ throws IOException {
+
+ // current version does not support splitting.
+ checkNotSplit(fileLen, splitEnd);
+
+ // the current version just ignores the splitOffset and uses the restoredOffset
+ stream.seek(restoredOffset);
+
+ return createReader(config, stream, fileLen, splitEnd);
+ }
+
+ /** Current version does not support splitting. */
+ @Override
+ public boolean isSplittable() {
+ return false;
Review comment:
Splitting will be supported in a follow-up ticket.
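For reference, a minimal usage sketch of the current non-splittable version with `FileSource` (the wrapper class and method names here are mine, for illustration only):
```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.core.fs.Path;

// Example wiring only; AvroParquetRecordFormat is the format added in this PR.
public final class AvroParquetSourceExample {

    public static FileSource<GenericRecord> buildSource(Schema schema, Path parquetFile) {
        // forRecordStreamFormat accepts any StreamFormat; RecordFormat extends it.
        // Since isSplittable() returns false, each file is read as a single split.
        return FileSource.forRecordStreamFormat(new AvroParquetRecordFormat(schema), parquetFile)
                .build();
    }
}
```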
##########
File path: flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/reader/RecordFormat.java
##########
@@ -0,0 +1,130 @@
+/*
Review comment:
sure, I will merge the methods into the `StreamFormat`
##########
File path: flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/avro/AvroParquetRecordFormat.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.formats.parquet.avro;
+
+import org.apache.flink.api.common.typeinfo.TypeInformation;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.connector.file.src.reader.RecordFormat;
+import org.apache.flink.core.fs.FSDataInputStream;
+import org.apache.flink.formats.avro.typeutils.GenericRecordAvroTypeInfo;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.parquet.avro.AvroParquetReader;
+import org.apache.parquet.hadoop.ParquetReader;
+import org.apache.parquet.io.DelegatingSeekableInputStream;
+import org.apache.parquet.io.InputFile;
+import org.apache.parquet.io.SeekableInputStream;
+
+import javax.annotation.Nullable;
+
+import java.io.IOException;
+
+/** A {@link RecordFormat} that reads avro {@link GenericRecord} records from Parquet files. */
+public class AvroParquetRecordFormat implements RecordFormat<GenericRecord> {
+
+ private final transient Schema schema;
Review comment:
We are using Avro 1.10.0 now, and `Schema` is serializable there. Just out of curiosity, since the `transient` keyword has been used, I'd like to understand the Flink concept and therefore why we care about serializability here. Will the format object be created locally on each TM, or will it be created once and transferred to the TMs over the network, which would require ser/de? Thanks.
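As far as I understand, the format is created on the client, serialized as part of the source, and shipped to the TMs, so a `transient` field would arrive there as `null`. If the `Schema` ever needs to stay out of Java serialization, a common workaround is to ship its JSON string and re-parse it lazily, e.g. (sketch only, names are mine):
```java
import java.io.Serializable;

import org.apache.avro.Schema;

// Hypothetical holder: ships the schema as its JSON string and rebuilds the
// (possibly non-serializable) Schema lazily after deserialization on the TM.
public final class LazySchemaHolder implements Serializable {

    private static final long serialVersionUID = 1L;

    private final String schemaString;   // serializable representation
    private transient Schema schema;     // rebuilt on first access

    public LazySchemaHolder(Schema schema) {
        this.schemaString = schema.toString();
    }

    public Schema getSchema() {
        if (schema == null) {
            schema = new Schema.Parser().parse(schemaString);
        }
        return schema;
    }
}
```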
##########
File path: flink-formats/flink-avro/pom.xml
##########
@@ -26,7 +26,7 @@ under the License.
<groupId>org.apache.flink</groupId>
<artifactId>flink-formats</artifactId>
<version>1.15-SNAPSHOT</version>
- <relativePath>..</relativePath>
Review comment:
ok, I will create a new ticket for it.
##########
File path: flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/reader/RecordFormat.java
##########
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.connector.file.src.reader;
+
+import org.apache.flink.annotation.PublicEvolving;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.connector.file.src.util.CheckpointedPosition;
+import org.apache.flink.core.fs.FSDataInputStream;
+import org.apache.flink.core.fs.FileStatus;
+import org.apache.flink.core.fs.FileSystem;
+import org.apache.flink.core.fs.Path;
+
+import java.io.IOException;
+
+import static org.apache.flink.util.Preconditions.checkArgument;
+import static org.apache.flink.util.Preconditions.checkNotNull;
+
+/**
+ * A reader format that reads individual records from a file via {@link Path}.
+ *
+ * <p>This interface teams up with its superinterface {@link StreamFormat} to build a
+ * two-level API: {@link StreamFormat} focuses on the abstract input stream, while
+ * {@link RecordFormat} pays attention to the concrete {@link FileSystem}. This format
+ * is for cases where the readers need access to the file directly or need to create a
+ * custom stream. For readers that can work directly on input streams, consider using
+ * the superinterface {@link StreamFormat}.
+ *
+ * <p>Please refer to the javadoc of {@link StreamFormat} for details.
+ *
+ * @param <T> - The type of records created by this format reader.
+ */
+@PublicEvolving
+public interface RecordFormat<T> extends StreamFormat<T> {
+
+ /**
+ * Creates a new reader to read in this format. This method is called when a fresh
+ * reader is created for a split that was assigned from the enumerator. This method
+ * may also be called on recovery from a checkpoint, if the reader never stored an
+ * offset in the checkpoint (see {@link #restoreReader(Configuration, Path, long,
+ * long, long)} for details).
+ *
+ * <p>A default implementation is provided, so subclasses are not forced to implement
+ * it. Compared to {@link #createReader(Configuration, FSDataInputStream, long, long)},
+ * this method puts the focus on the {@link Path}. The default implementation adapts
+ * the information given by the method arguments to an {@link FSDataInputStream} and
+ * calls {@link #createReader(Configuration, FSDataInputStream, long, long)}.
+ *
+ * <p>If the format is {@link #isSplittable() splittable}, then the {@code inputStream}
+ * is positioned to the beginning of the file split, otherwise it will be at position
+ * zero.
+ */
+ default StreamFormat.Reader<T> createReader(
Review comment:
The original idea was to replace the `FileRecordFormat` with this
`RecordFormat`. Merging it into the `StreamFormat` is also fine.
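For the record, a rough sketch of what the merged `Path`-based default on `StreamFormat` could look like (an assumption only, mirroring the implementation shown above; not the final API):
```java
// Sketch: Path-based default that could move from RecordFormat into StreamFormat.
default StreamFormat.Reader<T> createReader(
        Configuration config, Path filePath, long splitOffset, long splitLength)
        throws IOException {

    checkNotNull(filePath, "filePath");

    final FileSystem fileSystem = filePath.getFileSystem();
    final FileStatus fileStatus = fileSystem.getFileStatus(filePath);
    final FSDataInputStream inputStream = fileSystem.open(filePath);

    if (isSplittable()) {
        // position the stream at the beginning of this split
        inputStream.seek(splitOffset);
    }

    // delegate to the stream-based variant; the split ends at offset + length
    return createReader(config, inputStream, fileStatus.getLen(), splitOffset + splitLength);
}
```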
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]